| Model Type | omni-interactive, multimodal, text generation, speech-to-speech |

| Use Cases |
| Areas: | research, interactive applications, voice assistants |
| Applications: | multimodal interaction, speech-to-speech conversations |
| Primary Use Cases: | real-time speech output; understanding images, audio, and text |

| Additional Notes | Uses Whisper for audio encoding, CLIP for image encoding, SNAC for audio decoding, and CosyVoice for generating synthetic speech |
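
The note above names the full encode/decode chain. Below is a minimal sketch of wiring those components together, assuming the Hugging Face `transformers` library and the `snac` package; the checkpoint names are illustrative stand-ins, not this model's own weights.

```python
# Sketch of the encode/decode pipeline named in Additional Notes.
# Assumes `transformers` and `snac` are installed; checkpoints below
# are generic public ones chosen for illustration only.
import torch
from transformers import WhisperModel, CLIPVisionModel
from snac import SNAC

# Whisper encoder: raw audio (log-mel features) -> audio features.
audio_encoder = WhisperModel.from_pretrained("openai/whisper-small").encoder

# CLIP vision tower: images -> image features.
image_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

# SNAC decoder: generated audio codes -> waveform.
snac_decoder = SNAC.from_pretrained("hubertsiuzdak/snac_24khz")

@torch.no_grad()
def encode_audio(input_features):      # (batch, n_mels, frames)
    return audio_encoder(input_features).last_hidden_state

@torch.no_grad()
def encode_image(pixel_values):        # (batch, 3, H, W)
    return image_encoder(pixel_values).last_hidden_state

@torch.no_grad()
def decode_speech(codes):              # list of SNAC code tensors
    return snac_decoder.decode(codes)  # (batch, 1, samples) waveform
```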
| Supported Languages | |

| Training Details |
| Data Sources: | OpenOrca datasets, MOSS, Whisper |
| Methodology: | Three-stage training: encoder adaptation, modal alignment, and multimodal fine-tuning (see the sketch after this table) |
| Model Architecture: | Uses multiple sequences for input and output to perform comprehensive multimodal tasks |
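
One plausible way to express that three-stage recipe is as stage-wise parameter freezing; this is a hedged sketch, and the module names (`audio_adapter`, `vision_adapter`, `backbone`) are hypothetical stand-ins, not the model's actual attribute names.

```python
# Hypothetical stage configuration for the three-stage recipe above.
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    # Freeze everything, then unfreeze per stage.
    set_trainable(model, False)
    if stage == 1:    # encoder adaptation: train only the modality adapters
        set_trainable(model.audio_adapter, True)
        set_trainable(model.vision_adapter, True)
    elif stage == 2:  # modal alignment: train the language backbone
        set_trainable(model.backbone, True)
    elif stage == 3:  # multimodal fine-tuning: train end to end
        set_trainable(model, True)
```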

| Input Output |
| Input Format: | Concatenated image, audio, and text features |
| Accepted Modalities: | image, audio, text |
| Output Format: | Real-time speech responses guided by text |
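
An illustrative sketch of how "concatenated image, audio, and text features" can form a single input sequence for the language backbone; all dimensions and projection layers here are assumptions for demonstration, not the model's actual configuration.

```python
# Illustrative only: assembling one [image | audio | text] sequence.
# Feature widths, vocab size, and projections are made-up values.
import torch
import torch.nn as nn

hidden = 768
proj_audio = nn.Linear(512, hidden)   # map audio features to model width
proj_image = nn.Linear(1024, hidden)  # map image features to model width
text_embed = nn.Embedding(32000, hidden)

audio_feats = torch.randn(1, 200, 512)   # (batch, audio_frames, dim)
image_feats = torch.randn(1, 50, 1024)   # (batch, patches, dim)
text_ids = torch.randint(0, 32000, (1, 16))

# One concatenated sequence fed to the backbone.
inputs = torch.cat(
    [proj_image(image_feats), proj_audio(audio_feats), text_embed(text_ids)],
    dim=1,
)  # shape: (1, 50 + 200 + 16, 768)
```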

| Release Notes |
| Version: | |
| Notes: | Release of the model, technical report, inference code, and chat demo code |