| Model Type | | text generation, multimodal |
|
| Use Cases |
| Areas: | | commercial applications, research |
|
| Applications: | | visual question answering, dialog systems, content creation, device integration |
|
| Primary Use Cases: | | Visual QA, Dialog, Content Creation |
|
| Limitations: | | No audio support, Data current only through June 2023, Limited recognition of specific individuals or IPs, Limited handling of complex instructions, Insufficient counting accuracy, Weak spatial reasoning |
|
|
| Additional Notes | | Available in different quantizations for broad hardware compatibility. |
|
| Supported Languages | | en (>=0.8), zh (>=0.8), fr (>=0.8), es (>=0.8), de (>=0.8), ru (>=0.8), ja (>=0.8), ko (>=0.8), ar (>=0.8), vi (>=0.8) |
|
| Training Details |
| Data Sources: | | MathVista, DocVQA, RealWorldQA, MTVQA |
|
| Methodology: | | Instruction-tuning, GPTQ quantization |
|
| Model Architecture: | | Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-RoPE) |
|
|
| Input Output |
| Input Format: | | Message-based input with role specification |
|
| Accepted Modalities: | |
| Output Format: | | Textual descriptions or responses |
|
| Performance Tips: | | Use flash_attention_2 for acceleration in multi-image and video scenarios |
|
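The "message-based input with role specification" noted under Input Format can be illustrated with a small sketch. The schema below is an assumption, not something this card specifies: the role names, the `type`/`text`/`image` keys, the `build_message` helper, and the `example.jpg` path are all hypothetical, modeled on the convention common to chat-style multimodal interfaces.

```python
from typing import Any, Dict, List, Optional

# Assumed role names; the card only says that roles are specified.
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_message(role: str, text: str, image: Optional[str] = None) -> Dict[str, Any]:
    """Build one chat message; an optional image reference precedes the text."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role!r}")
    content: List[Dict[str, str]] = []
    if image is not None:
        # Hypothetical content item pointing at an image file or URL.
        content.append({"type": "image", "image": image})
    content.append({"type": "text", "text": text})
    return {"role": role, "content": content}


# A two-turn conversation: a system prompt, then a user question about an image.
messages = [
    build_message("system", "You are a helpful assistant."),
    build_message("user", "What is shown in this picture?", image="example.jpg"),
]
```

A list like `messages` would typically be handed to the model's processor or chat template, which turns it into the actual model input; that step depends on the specific release and is not shown here.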
|
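The Performance Tips entry recommends `flash_attention_2`. A minimal sketch of how a loader might choose it, assuming the model is loaded through Hugging Face Transformers, where `attn_implementation` is a keyword argument of `from_pretrained`. The helper names `pick_attention_backend` and `loader_kwargs` are hypothetical, and the fallback to `"sdpa"` is a choice made here, not something the card states.

```python
import importlib.util
from typing import Dict


def pick_attention_backend() -> str:
    """Return "flash_attention_2" if the flash_attn package is installed, else "sdpa"."""
    if importlib.util.find_spec("flash_attn") is not None:
        return "flash_attention_2"
    # Assumed fallback: PyTorch scaled-dot-product attention.
    return "sdpa"


def loader_kwargs() -> Dict[str, str]:
    """Keyword arguments intended for a Transformers from_pretrained call (assumption)."""
    return {"attn_implementation": pick_attention_backend()}


kwargs = loader_kwargs()
```

The resulting `kwargs` would be unpacked into the model's `from_pretrained` call. FlashAttention-2 only supports half-precision dtypes (fp16 or bf16), so the model would also need to be loaded in one of those for the setting to take effect.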
| Release Notes |
| Version: | |
| Notes: | | Quantized, instruction-tuned release of the 2B model. |
|
|
|