| Model Type | | text generation, multimodal | 
 | 
| Use Cases | 
| Areas: | | research, commercial applications, agent integration | 
 |  | Applications: | | visual understanding, video-based QA, mobile and robotic integrations | 
 |  | Primary Use Cases: | | visual question answering, dialog, content creation, multilingual support | 
 |  | Limitations: | | Lack of audio support, Updates up to June 2023, Limited individual/IP recognition, Limited complex instruction handling, Low counting accuracy, Weak spatial reasoning | 
 |  | 
| Additional Notes | | Reports quantized model performance across various tasks, highlighting strengths in multimodal integration and weaknesses, such as limitations in audio and complex reasoning. | 
 | 
| Supported Languages | | English (native), Chinese (native), European languages (high), Japanese (high), Korean (high), Arabic (high), Vietnamese (high) | 
 | 
| Training Details | 
| Data Volume: |  |  | Model Architecture: | | Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-ROPE) | 
 |  | 
| Input Output | 
| Input Format: | | Images, Videos (local files, base64, URLs) | 
 |  | Accepted Modalities: |  |  | Output Format: |  |  | Performance Tips: | | Enabling flash_attention_2 recommended for better acceleration and memory saving. | 
 |  | 
| Release Notes | | 
| Version: | | Qwen2-VL-2B-Instruct-GPTQ-Int4 | 
 |  | Date: |  |  | Notes: | | Quantized model version with multi-language support and enhanced image and video processing capabilities. | 
 |  | 
 |