| Model Type | | text generation, multimodal |
|
| Use Cases |
| Areas: | | research, commercial applications |
|
| Applications: | | question answering, dialog, content creation, mobile device operation, robot operation |
|
| Primary Use Cases: | | video-based question answering, multimodal analytics, language understanding in various languages |
|
| Limitations: | | no audio support, updated until June 2023, limited recognition of individuals and IP, weak spatial reasoning skills |
|
|
| Additional Notes | | The model supports local files, base64, and URLs for input images. Limitations in spatial reasoning and complex instruction handling are noted. |
|
| Supported Languages | | en (high), zh (high), fr (medium), de (medium), es (medium), ja (medium), ko (medium), ar (medium), vi (medium) |
|
| Training Details |
| Data Sources: | | MathVista, DocVQA, RealWorldQA, MTVQA |
|
| Methodology: | | Naive Dynamic Resolution, Multimodal Rotary Position Embedding |
|
| Model Architecture: | | Multimodal architecture supporting images, video processing |
|
|
| Input Output |
| Input Format: | |
| Accepted Modalities: | |
| Output Format: | |
| Performance Tips: | | Use flash_attention_2 for better acceleration and memory saving. |
|
|