| Model Type | |
| Use Cases | |
| Areas: | Research on large multimodal models and chatbots |
| Applications: | Multimodal table understanding |
| Primary Use Cases: | Research in computer vision, NLP, ML, and AI |
| Limitations: | Limited to one table image as input; the low input resolution may limit capacity. |

| Additional Notes | Evaluated on 17 held-in and 7 held-out tabular benchmarks, and 2 non-tabular benchmarks: TextVQA and llava-bench-in-the-wild. |

| Supported Languages | |
| Training Details | |
| Data Sources: | SpursgoZmy/MMTab, liuhaotian/LLaVA-Instruct-150K, liuhaotian/LLaVA-Pretrain |
| Data Volume: | Approximately 1.5 million instances |
| Methodology: | Two-stage training: pre-training with image-caption and table recognition data, followed by instruction tuning with multimodal data. |
| Model Architecture: | Follows LLaVA-v1.5, with CLIP-ViT-L-336px as the visual encoder, Vicuna-v1.5-13B as the base LLM, and a two-layer MLP as the vision-language connector (see the sketch below). |

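The architecture row above describes a standard LLaVA-v1.5-style composition. The following is a minimal, illustrative sketch of that layout, not the released implementation: it assumes PyTorch and the Hugging Face `transformers` package, and the checkpoint IDs (`openai/clip-vit-large-patch14-336`, `lmsys/vicuna-13b-v1.5`) and class name are placeholders chosen for illustration.

```python
# Minimal, illustrative sketch of the LLaVA-v1.5-style layout described above.
# Checkpoint IDs, class name, and projector details are assumptions, not released code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel


class TableLLaVASketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Visual encoder: CLIP-ViT-L with 336x336 input resolution.
        self.vision_tower = CLIPVisionModel.from_pretrained(
            "openai/clip-vit-large-patch14-336"
        )
        # Base LLM: Vicuna-v1.5-13B.
        self.llm = AutoModelForCausalLM.from_pretrained("lmsys/vicuna-13b-v1.5")
        vision_dim = self.vision_tower.config.hidden_size  # 1024 for ViT-L
        llm_dim = self.llm.config.hidden_size              # 5120 for a 13B model
        # Two-layer MLP vision-language connector, as in LLaVA-v1.5.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def encode_image(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # Project patch features (CLS token dropped) into the LLM embedding space,
        # where they can be combined with the text token embeddings.
        patch_features = self.vision_tower(pixel_values).last_hidden_state[:, 1:, :]
        return self.connector(patch_features)
```

The two-layer GELU MLP plays the same role as the LLaVA-v1.5 projector: it maps visual patch features into the language model's embedding space so table images and text can be processed in a single sequence.
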
| Input / Output | |
| Input Format: | A single table image at 336×336 resolution (see the example below) |
| Accepted Modalities: | |

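As a companion to the input format above, here is a small, hypothetical preprocessing snippet showing how a single table image could be brought to the 336×336 resolution noted in this card; the image path and the CLIP image-processor checkpoint are placeholders, not part of the card.

```python
# Hypothetical preprocessing for a single table image; the file path and the
# CLIP image-processor checkpoint are placeholders.
from PIL import Image
from transformers import CLIPImageProcessor

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14-336")
table_image = Image.open("table.png").convert("RGB")

# Resize/crop to 336x336 and normalize with CLIP statistics.
inputs = processor(images=table_image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # torch.Size([1, 3, 336, 336])
```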