| Model Type | | multimodal, text generation, vision | 
 | 
| Use Cases | 
| Areas: |  |  | Applications: | | general image understanding, OCR, chart and table understanding | 
 |  | Primary Use Cases: | | memory/compute constrained environments, latency bound scenarios | 
 |  | Limitations: | | Not evaluated for all downstream purposes, Potential quality degradation in non-English language use cases | 
 |  | Considerations: | | Developers should mitigate biases and misuses. | 
 |  | 
| Supported Languages | | languages_supported (multilingual), proficiency_level (English focused, other languages might perform worse.) | 
 | 
| Training Details | 
| Data Sources: | | publicly available documents, high-quality educational data, selected high-quality image-text interleave, synthetic data | 
 |  | Data Volume: | | 500B vision and text tokens | 
 |  | Methodology: | | Supervised fine-tuning, direct preference optimization | 
 |  | Context Length: |  |  | Training Time: |  |  | Hardware Used: |  |  | Model Architecture: | | Image encoder, connector, projector, and Phi-3 Mini language model | 
 |  | 
| Safety Evaluation | 
| Ethical Considerations: | | Potential for producing inappropriate, unreliable, or biased content due to training data limitations. | 
 |  | 
| Responsible Ai Considerations | 
| Fairness: | | Potential biases due to varying representation of groups in data. | 
 |  | Transparency: | | Developers responsible for evaluating model output for fairness and accuracy. | 
 |  | Accountability: | | Developers should ensure model applications adhere to laws and regulations. | 
 |  | Mitigation Strategies: | | Safety post-training, adherence to use-case laws. | 
 |  | 
| Input Output | 
| Input Format: | | Single image with chat format prompts | 
 |  | Accepted Modalities: |  |  | Output Format: | | Text generated in response to input | 
 |  |