image and video captioning, visual question answering, image classification
Primary Use Cases:
image captioning
Additional Notes
The model is trained with the "teacher forcing" method, using a bidirectional attention mask for image tokens and a causal attention mask for text tokens.
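The combined mask can be sketched as follows. This is an illustrative reconstruction, not the model's actual implementation: it assumes image tokens precede text tokens in the sequence, image tokens attend bidirectionally among themselves (and not to text), and text tokens attend to all image tokens plus earlier text tokens.

```python
import numpy as np

def git_style_attention_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Sketch of a GIT-style attention mask (1 = may attend, 0 = masked).

    Rows are query positions, columns are key positions. Image tokens come
    first and attend bidirectionally to each other; text tokens attend to
    every image token and causally to preceding text tokens.
    """
    n = num_image_tokens + num_text_tokens
    mask = np.zeros((n, n), dtype=int)
    # Image tokens: full bidirectional attention among themselves.
    mask[:num_image_tokens, :num_image_tokens] = 1
    # Text tokens: attend to all image tokens...
    mask[num_image_tokens:, :num_image_tokens] = 1
    # ...and causally to text tokens up to and including themselves.
    mask[num_image_tokens:, num_image_tokens:] = np.tril(
        np.ones((num_text_tokens, num_text_tokens), dtype=int)
    )
    return mask
```

For example, with 2 image tokens and 3 text tokens, the first text token (row 2) can attend to both image tokens but not to the text tokens after it.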
Training Details
Data Sources:
COCO, Conceptual Captions (CC3M), SBU, Visual Genome (VG), Conceptual Captions (CC12M), ALT200M, extra data following Hu et al. (2021a)
Data Volume:
10 million image-text pairs for GIT-base
Methodology:
Teacher forcing (next-token language-modeling loss) on large-scale (image, text) pairs
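Concretely, teacher forcing means the decoder is fed the ground-truth caption prefix at each step and trained to predict the next token. A minimal sketch of how the (input, target) pairs are built, assuming hypothetical BOS/EOS token ids (not taken from the model's actual tokenizer):

```python
def teacher_forcing_pairs(caption_ids, bos_id, eos_id):
    """Build decoder inputs and next-token targets for teacher forcing.

    The decoder input is the ground-truth caption shifted right (prefixed
    with BOS); the target at each position is the next caption token,
    ending with EOS.
    """
    inputs = [bos_id] + list(caption_ids)    # what the decoder sees
    targets = list(caption_ids) + [eos_id]   # what it must predict
    return inputs, targets
```

For a caption tokenized as `[5, 9, 7]` with `bos_id=1`, `eos_id=2`, the decoder input is `[1, 5, 9, 7]` and the targets are `[5, 9, 7, 2]`; a cross-entropy loss is then applied position-wise.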
Model Architecture:
Transformer decoder conditioned on CLIP image tokens and text tokens