| Model Type | | text generation, multimodal |
|
| Use Cases |
| Areas: | | research, commercial applications, agent integration |
|
| Applications: | | visual understanding, video-based QA, mobile and robotic integrations |
|
| Primary Use Cases: | | visual question answering, dialog, content creation, multilingual support |
|
| Limitations: | | Lack of audio support, Updates up to June 2023, Limited individual/IP recognition, Limited complex instruction handling, Low counting accuracy, Weak spatial reasoning |
|
|
| Additional Notes | | Reports quantized model performance across various tasks, highlighting strengths in multimodal integration and weaknesses, such as limitations in audio and complex reasoning. |
|
| Supported Languages | | English (native), Chinese (native), European languages (high), Japanese (high), Korean (high), Arabic (high), Vietnamese (high) |
|
| Training Details |
| Data Volume: | |
| Model Architecture: | | Naive Dynamic Resolution, Multimodal Rotary Position Embedding (M-ROPE) |
|
|
| Input Output |
| Input Format: | | Images, Videos (local files, base64, URLs) |
|
| Accepted Modalities: | |
| Output Format: | |
| Performance Tips: | | Enabling flash_attention_2 recommended for better acceleration and memory saving. |
|
|
| Release Notes |
| Version: | | Qwen2-VL-2B-Instruct-GPTQ-Int4 |
|
| Date: | |
| Notes: | | Quantized model version with multi-language support and enhanced image and video processing capabilities. |
|
|
|