Model Type | text generation, multimodal |
|
Use Cases |
Areas: | Research, Commercial applications |
|
Applications: | Visual understanding, Video-based question answering, Dialog, Content creation |
|
Primary Use Cases: | Operates mobile devices, Interacts with robots |
|
Limitations: | Lack of Audio Support, Limited Capacity for Complex Instruction, Weak Spatial Reasoning Skills |
|
Considerations: | The model's limitations mentioned above. |
|
|
Supported Languages | English (Full proficiency), Chinese (Full proficiency), European languages (Supported), Japanese (Supported), Korean (Supported), Arabic (Supported), Vietnamese (Supported) |
|
Input Output |
Input Format: | Messages with roles and content (image, video, text) |
|
Accepted Modalities: | |
Output Format: | |
Performance Tips: | Enable flash_attention_2 for better acceleration and memory saving. |
|
|
Release Notes |
Version: | |
Notes: | Includes multimodal capabilities and supports various visual understanding tasks. |
|
|
|