| Model Type | | text generation, multimodal | 
 | 
| Use Cases | 
| Areas: | | commercial applications, research | 
 |  | Applications: | | visual question answering, dialog systems, content creation, device integration | 
 |  | Primary Use Cases: | | Visual QA, Dialog, Content Creation | 
 |  | Limitations: | | Lack of Audio Support, Data timeliness until June 2023, Recognition of specific individuals or IPs, Limited handling of complex instructions, Insufficient counting accuracy, Weak spatial reasoning skills | 
 |  | 
| Additional Notes | | Available in different quantizations for broad hardware compatibility. | 
 | 
| Supported Languages | | en (>=0.8), zh (>=0.8), fr (>=0.8), es (>=0.8), de (>=0.8), ru (>=0.8), ja (>=0.8), ko (>=0.8), ar (>=0.8), vi (>=0.8) | 
 | 
| Training Details | 
| Data Sources: | | MathVista, DocVQA, RealWorldQA, MTVQA | 
 |  | Methodology: | | Instruction-tuning, GPTQ quantization | 
 |  | Model Architecture: | | Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-ROPE) | 
 |  | 
| Input Output | 
| Input Format: | | Message-based input with role specification | 
 |  | Accepted Modalities: |  |  | Output Format: | | Textual descriptions or responses | 
 |  | Performance Tips: | | Use flash_attention_2 for acceleration in multi-image and video scenarios | 
 |  | 
| Release Notes | | 
| Version: |  |  | Notes: | | Quantized relay in 2B format, instruction-tuned. | 
 |  | 
 |