| Model Type | | text generation, multimodal | 
 | 
| Use Cases | 
| Areas: | | research, commercial applications | 
 |  | Applications: | | question answering, dialog, content creation, mobile device operation, robot operation | 
 |  | Primary Use Cases: | | video-based question answering, multimodal analytics, language understanding in various languages | 
 |  | Limitations: | | no audio support, updated until June 2023, limited recognition of individuals and IP, weak spatial reasoning skills | 
 |  | 
| Additional Notes | | The model supports local files, base64, and URLs for input images. Limitations in spatial reasoning and complex instruction handling are noted. | 
 | 
| Supported Languages | | en (high), zh (high), fr (medium), de (medium), es (medium), ja (medium), ko (medium), ar (medium), vi (medium) | 
 | 
| Training Details | 
| Data Sources: | | MathVista, DocVQA, RealWorldQA, MTVQA | 
 |  | Methodology: | | Naive Dynamic Resolution, Multimodal Rotary Position Embedding | 
 |  | Model Architecture: | | Multimodal architecture supporting images, video processing | 
 |  | 
| Input Output | 
| Input Format: |  |  | Accepted Modalities: |  |  | Output Format: |  |  | Performance Tips: | | Use flash_attention_2 for better acceleration and memory saving. | 
 |  |