| Model Type | | multimodal, text-generation |
|
| Use Cases |
| Areas: | | research, commercial applications |
|
| Limitations: | | No audio support; training data not current beyond June 2023; limited ability to recognize specific individuals and intellectual property (IP); weak at following complex instructions; difficulty with object counting and spatial reasoning |
|
|
| Additional Notes | | Supports understanding of videos up to 20 minutes long and of multilingual text within images. |
|
| Supported Languages | | Primary: English, Chinese. Additional: most European languages, Japanese, Korean, Arabic, Vietnamese. Multilingual support covers understanding text in images. |
|
| Training Details |
| Data Volume: | | Data updated until June 2023 |
|
| Model Architecture: | | Naive Dynamic Resolution and Multimodal Rotary Position Embedding (M-RoPE) |
|
|
| Input Output |
| Input Format: | |
| Accepted Modalities: | |
| Output Format: | |
| Performance Tips: | | Set minimum and maximum pixel limits for input images to balance speed and memory usage |
|
|
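The performance tip about minimum and maximum pixel limits can be illustrated with a small sketch. It assumes the limits act as a budget on an image's total pixel count, and that each side is kept at a multiple of a patch-aligned factor. The function name `fit_to_pixel_budget`, the factor of 28, and the budget values are illustrative assumptions, not taken from the model's documentation.

```python
import math


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=28):
    """Return (new_height, new_width) whose area fits the pixel budget.

    Hypothetical helper: both sides are snapped to multiples of `factor`
    and the aspect ratio is preserved only approximately.
    """
    if min_pixels > max_pixels:
        raise ValueError("min_pixels must not exceed max_pixels")

    # Snap each side to the nearest multiple of `factor` (at least one unit).
    new_h = max(factor, round(height / factor) * factor)
    new_w = max(factor, round(width / factor) * factor)

    if new_h * new_w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        new_h = max(factor, math.floor(height / scale / factor) * factor)
        new_w = max(factor, math.floor(width / scale / factor) * factor)
    elif new_h * new_w < min_pixels:
        # Too small: enlarge both sides by the same ratio, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        new_h = math.ceil(height * scale / factor) * factor
        new_w = math.ceil(width * scale / factor) * factor

    return new_h, new_w


# Assumed budget: between 256 and 1280 patches of 28 x 28 pixels.
MIN_PIXELS = 256 * 28 * 28
MAX_PIXELS = 1280 * 28 * 28

print(fit_to_pixel_budget(1080, 1920, MIN_PIXELS, MAX_PIXELS))
```

A lower maximum reduces the number of visual tokens per image, and with it memory use and latency. A higher minimum keeps small images from being processed at too little detail. This sketch does not handle extreme aspect ratios, where snapping a very short side up to one factor unit can push the area over the budget.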