| Model Type | | Multimodal, Vision-Language |
|
| Use Cases |
| Areas: | | Research, Commercial applications |
|
| Applications: | | Visual question answering, Image comprehension |
|
| Primary Use Cases: | | Multi-round text-image conversations (see the chat-loop sketch below) |
|
| Limitations: | | Supports text-image conversations but not image-to-video; may hallucinate content not present in images; input resolution is limited to 448×448. |
|
| Considerations: | | Evaluate potential risks before adopting. |
|
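To make the multi-round conversation use case concrete, here is a minimal chat-loop sketch. The message structure and the `generate_reply` stub are illustrative assumptions, not Yi-VL's actual inference API.

```python
from typing import Dict, List, Optional

def generate_reply(history: List[Dict], image_path: Optional[str]) -> str:
    # Stub: replace with the deployment's real inference call.
    return "(model reply)"

def chat(image_path: str, questions: List[str]) -> List[Dict]:
    """Accumulate a text-image conversation across rounds."""
    history: List[Dict] = []
    for i, question in enumerate(questions):
        # Attach the image on the first round only; later rounds
        # refer back to it through the accumulated history.
        turn_image = image_path if i == 0 else None
        history.append({"role": "user", "content": question, "image": turn_image})
        reply = generate_reply(history, turn_image)
        history.append({"role": "assistant", "content": reply})
    return history
```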
|
| Additional Notes | | Supports image understanding at 448×448 resolution; the limitations listed above may affect certain applications. |
|
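Because the input resolution is fixed at 448×448, images generally need resizing before inference. Below is a minimal Pillow sketch assuming an aspect-preserving resize with letterbox padding; the padding color and interpolation are illustrative choices, not documented preprocessing.

```python
from PIL import Image

TARGET = 448  # Yi-VL's fixed input resolution

def to_448(path: str) -> Image.Image:
    """Resize an image to fit within 448x448 and pad to square.

    Aspect-preserving resize plus centered padding is one common
    choice; the model's actual preprocessing may differ.
    """
    img = Image.open(path).convert("RGB")
    scale = TARGET / max(img.size)
    resized = img.resize(
        (round(img.width * scale), round(img.height * scale)),
        Image.BICUBIC,
    )
    canvas = Image.new("RGB", (TARGET, TARGET), (127, 127, 127))
    canvas.paste(
        resized,
        ((TARGET - resized.width) // 2, (TARGET - resized.height) // 2),
    )
    return canvas
```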
| Supported Languages | | English (proficient), Chinese (proficient) |
|
| Training Details |
| Data Sources: | | LAION-400M, CLLaVA, Flickr, VQAv2, RefCOCO, Visual7w, GQA, VizWiz VQA, TextCaps, OCR-VQA, Visual Genome, LAION GPT4V |
|
| Data Volume: | |
| Methodology: | | Three-stage training that progressively aligns image and text representations using the ViT and the Yi LLM, with datasets such as LAION-400M and Visual Genome (see the staging sketch below). |
|
| Training Time: | | 10 days for Yi-VL-34B, 3 days for Yi-VL-6B |
|
| Hardware Used: | | 128 NVIDIA A800 (80 GB) GPUs |
|
| Model Architecture: | | A Vision Transformer (ViT) initialized with CLIP ViT-H/14 weights, a projection module, and a Yi LLM (composition sketched below). |
|
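The three-stage methodology above typically corresponds to toggling which components are trainable at each stage. The PyTorch sketch below shows that freezing pattern; the stage-to-component mapping here is an assumption for illustration, not Yi-VL's documented schedule.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for param in module.parameters():
        param.requires_grad = flag

def configure_stage(stage: int, vit: nn.Module,
                    projection: nn.Module, llm: nn.Module) -> None:
    # Hypothetical mapping of training stages to trainable parts:
    #   stage 1: projection only (align image features to LLM space)
    #   stage 2: projection + ViT (adapt the vision encoder)
    #   stage 3: everything (end-to-end fine-tuning)
    set_trainable(projection, True)
    set_trainable(vit, stage >= 2)
    set_trainable(llm, stage >= 3)
```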
|
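The architecture row describes a standard encoder-projection-LLM composition, sketched below in PyTorch. The feature dimensions (1280 for CLIP ViT-H/14, 7168 for Yi-34B) and the two-layer MLP projection are assumptions for illustration, not confirmed details from the model card.

```python
import torch
import torch.nn as nn

class YiVLSketch(nn.Module):
    """Illustrative ViT -> projection -> LLM composition.

    `vit` is assumed to map pixels to patch features of width
    `vit_dim`; `llm` is assumed to accept a sequence of input
    embeddings. Dimensions and the MLP projection are assumptions.
    """

    def __init__(self, vit: nn.Module, llm: nn.Module,
                 vit_dim: int = 1280, llm_dim: int = 7168):
        super().__init__()
        self.vit = vit
        self.projection = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm

    def forward(self, pixels: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        patch_feats = self.vit(pixels)             # (B, n_patches, vit_dim)
        img_embeds = self.projection(patch_feats)  # (B, n_patches, llm_dim)
        # Prepend image tokens so the LLM attends over both
        # modalities in a single sequence.
        sequence = torch.cat([img_embeds, text_embeds], dim=1)
        return self.llm(sequence)
```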
| Input Output |
| Input Format: | | Text prompts with images |
| Accepted Modalities: | | Text, image |
| Output Format: | | Text |
|