| Section | Field | Details |
| --- | --- | --- |
| Model Type | | Multimodal, chatbot, table understanding |
| Use Cases | Areas | Research, Computer Vision, NLP, AI |
| | Primary Use Cases | Multimodal table understanding; table question answering; table cell description |
| | Limitations | Accepts only a single table image per input; low input image resolution (336x336) |
| Supported Languages | | |
| Training Details | Data Sources | |
| | Data Volume | 708K pre-training samples; 898K fine-tuning samples |
| | Methodology | Two-stage pipeline: pre-training on image-caption and table recognition data, followed by instruction tuning on tabular and non-tabular tasks (see the staging sketch below) |
| | Model Architecture | CLIP-ViT-L-336px visual encoder; Vicuna-v1.5-7B base LLM; two-layer MLP vision-language connector (see the connector sketch below) |
| Input / Output | Input Format | Single table image plus a natural-language text prompt (see the inference sketch below) |
| | Accepted Modalities | Image, text |
| | Output Format | Text |
| Release Notes | Version | |
| | Date | |
| | Notes | First release for multimodal table understanding, built on the LLaVA-v1.5 architecture |
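The vision-language connection in the architecture row is easy to picture in code. Below is a minimal PyTorch sketch of a LLaVA-v1.5-style Linear-GELU-Linear connector; the class name and feature widths (1024-d CLIP-ViT-L patch features, 4096-d Vicuna-7B token embeddings) are illustrative assumptions, not taken from the released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Two-layer MLP that projects visual patch features into the
    LLM's token-embedding space (LLaVA-v1.5-style connector).

    Assumed dimensions: CLIP-ViT-L-336px emits 1024-d patch features;
    Vicuna-v1.5-7B uses 4096-d token embeddings.
    """

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the visual
        # encoder; at 336x336 with 14x14 patches, num_patches = 576.
        # Returns (batch, num_patches, llm_dim) visual "tokens" that are
        # fed to the LLM alongside the text embeddings.
        return self.proj(patch_features)

# Quick shape check: one image, 576 patches, CLIP-L feature width.
connector = VisionLanguageConnector()
visual_tokens = connector(torch.randn(1, 576, 1024))
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```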
|
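The two-stage methodology amounts to a freeze/unfreeze schedule over the three components. The sketch below assumes the LLaVA-v1.5 recipe (stage 1 trains only the connector; stage 2 also unfreezes the LLM; the vision encoder stays frozen throughout); the `vision_encoder`, `connector`, and `llm` attribute names are hypothetical.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int) -> None:
    """Set trainable parameters for the two-stage pipeline.

    Assumed recipe (LLaVA-v1.5 style, not stated on this card):
      stage 1 (pre-training):       train the connector only
      stage 2 (instruction tuning): train connector + LLM
    The visual encoder stays frozen in both stages.
    """
    for p in model.vision_encoder.parameters():  # hypothetical attribute
        p.requires_grad = False
    for p in model.connector.parameters():       # hypothetical attribute
        p.requires_grad = True
    for p in model.llm.parameters():             # hypothetical attribute
        p.requires_grad = (stage == 2)
```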
|
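For input/output, a typical call passes one table image and a text question and decodes generated text. Here is a minimal inference sketch, assuming the checkpoint is published in LLaVA-v1.5 format loadable with Hugging Face transformers; the model id and prompt are placeholders.

```python
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Placeholder id: substitute the actual released checkpoint.
MODEL_ID = "your-org/table-llava-v1.5-7b"

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = LlavaForConditionalGeneration.from_pretrained(MODEL_ID)

# One table image per request (see Limitations); the processor resizes the
# image to the encoder's native 336x336, so fine print in large tables may
# be lost at this resolution.
table_image = Image.open("table.png")

# Vicuna-style prompt with the <image> placeholder token used by LLaVA-v1.5.
prompt = "USER: <image>\nWhich row has the highest revenue? ASSISTANT:"

inputs = processor(images=table_image, text=prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```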
|