| Model Type | | text generation, multimodal |
|
| Use Cases |
| Areas: | | Research, Commercial applications |
|
| Applications: | | Visual understanding, Video-based question answering, Dialog, Content creation |
|
| Primary Use Cases: | | Operating mobile devices, Interacting with robots |
|
| Limitations: | | Lack of Audio Support, Limited Capacity for Complex Instructions, Weak Spatial Reasoning Skills |
|
| Considerations: | | Take the limitations listed above into account when choosing a use case. |
|
|
| Supported Languages | | English (Full proficiency), Chinese (Full proficiency), European languages (Supported), Japanese (Supported), Korean (Supported), Arabic (Supported), Vietnamese (Supported) |
|
| Input Output |
| Input Format: | | Messages with roles and content (image, video, text) |
|
| Accepted Modalities: | |
| Output Format: | |
| Performance Tips: | | Enable flash_attention_2 for faster inference and lower memory use. |
|
|
| Release Notes |
| Version: | |
| Notes: | | Includes multimodal capabilities and supports various visual understanding tasks. |
|
|
|
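
The Input Format and Performance Tips rows above can be illustrated with a minimal sketch. The card does not spell out the message schema, so the key names used here (`type`, `image`, `video`, `text`), the `build_message` helper, and the file path are assumptions modeled on common chat-style conventions for vision-language models, not a confirmed API for this model. The `attn_implementation` keyword is the one Hugging Face Transformers accepts in `from_pretrained`; whether this model is loaded that way is also an assumption.

```python
from __future__ import annotations

from typing import Any, Optional


def build_message(
    role: str,
    text: str,
    image: Optional[str] = None,
    video: Optional[str] = None,
) -> dict[str, Any]:
    """Build one chat message whose content mixes image, video, and text parts.

    The part layout ({"type": ..., "<type>": value}) is an assumed schema.
    """
    content: list[dict[str, str]] = []
    if image is not None:
        content.append({"type": "image", "image": image})
    if video is not None:
        content.append({"type": "video", "video": video})
    # The text part goes last so the question follows the visual input.
    content.append({"type": "text", "text": text})
    return {"role": role, "content": content}


# A single-turn conversation with one image and one question.
# The path is a placeholder, not a real file.
messages = [
    build_message("user", "Describe this image.", image="file:///path/to/image.jpg"),
]

# Keyword arguments one might pass to a Transformers `from_pretrained` call
# to request FlashAttention-2 (assumes the flash-attn package is installed
# and the GPU supports it).
load_kwargs = {"attn_implementation": "flash_attention_2"}
```

The message list would then be handed to whatever processor or chat template the model ships with, and `load_kwargs` unpacked into the model-loading call.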