| Model Type | | Transformer, Visual Question Answering |
|
| Use Cases |
| Areas: | | Visual Question Answering, Image and Video Captioning, Image Classification |
|
| Applications: | | Research, Commercial applications |
|
| Primary Use Cases: | | Visual question answering on the TextVQA dataset |
|
| Additional Notes | | The checkpoint described here is 'GIT-base', a smaller variant of the GIT model fine-tuned specifically for TextVQA. |
|
| Supported Languages | |
| Training Details |
| Data Sources: | | COCO, Conceptual Captions (CC3M), SBU, Visual Genome (VG), Conceptual 12M (CC12M), ALT200M, and an additional 0.6B image-text pairs |
|
| Data Volume: | | 10 million image-text pairs for the GIT-base variant |
|
| Methodology: | |
| Model Architecture: | | Transformer decoder conditioned on CLIP image tokens and text tokens. |
|
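
The architecture row above says the decoder is conditioned on both image tokens and text tokens. One way to picture this is through the attention mask over the concatenated sequence. The sketch below is illustrative only: it assumes the masking scheme usually described for GIT, where image tokens attend to one another bidirectionally and each text token attends to all image tokens plus itself and earlier text tokens. The function name `build_git_style_mask` is hypothetical and is not part of any library.

```python
from typing import List


def build_git_style_mask(num_image_tokens: int, num_text_tokens: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumed layout (an illustration, not the released implementation):
    positions [0, num_image_tokens) are image tokens, the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position, image or text, may attend to all image tokens.
                mask[i][j] = True
            elif i >= num_image_tokens and j <= i:
                # A text token may attend to itself and to earlier text tokens.
                mask[i][j] = True
            # Otherwise the entry stays False: image tokens never see text,
            # and text tokens never see later text.
    return mask


if __name__ == "__main__":
    # Small example: 2 image tokens followed by 3 text tokens.
    for row in build_git_style_mask(2, 3):
        print(" ".join("1" if allowed else "0" for allowed in row))
```

Under this assumed scheme, the answer to a TextVQA question is generated one token at a time: the question tokens act as a text prefix, and each newly generated token can look back at the full image and at all text produced so far.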