**Model Type:** multimodal, text generation, vision
|
**Use Cases**

| Field | Details |
| --- | --- |
| Areas | |
| Applications | general image understanding, OCR, chart and table understanding |
| Primary Use Cases | memory/compute-constrained environments, latency-bound scenarios |
| Limitations | Not evaluated for all downstream purposes; potential quality degradation in non-English use cases |
| Considerations | Developers should mitigate biases and misuse. |
|
|
**Supported Languages:** multilingual; primarily English-focused, so other languages may perform worse.
|
**Training Details**

| Field | Details |
| --- | --- |
| Data Sources | publicly available documents, high-quality educational data, selected high-quality interleaved image-text data, synthetic data |
| Data Volume | 500B vision and text tokens |
| Methodology | supervised fine-tuning, direct preference optimization |
| Context Length | |
| Training Time | |
| Hardware Used | |
| Model Architecture | image encoder, connector, projector, and Phi-3 Mini language model |
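The architecture row above implies a standard vision-language dataflow: image patches pass through the image encoder, a connector/projector maps the resulting features into the language model's embedding space, and the projected vision tokens are merged with text embeddings before being fed to the Phi-3 Mini language model. The sketch below illustrates only that dataflow; all dimensions, function names, and weights are illustrative stand-ins, not the real model's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the real model's).
D_PATCH, D_VISION, D_MODEL = 8, 32, 16
N_PATCHES, N_TEXT = 4, 6

def encode_image(patches: np.ndarray) -> np.ndarray:
    """Stand-in for the image encoder: patches -> vision features."""
    W = rng.normal(size=(patches.shape[-1], D_VISION))
    return patches @ W  # (N_PATCHES, D_VISION)

def project(vision_features: np.ndarray) -> np.ndarray:
    """Stand-in for the connector + projector: vision features -> LM embedding space."""
    W = rng.normal(size=(D_VISION, D_MODEL))
    return vision_features @ W  # (N_PATCHES, D_MODEL)

patches = rng.normal(size=(N_PATCHES, D_PATCH))
text_embeddings = rng.normal(size=(N_TEXT, D_MODEL))

# Projected vision tokens are concatenated with text embeddings to form
# the sequence the language model consumes.
vision_tokens = project(encode_image(patches))
lm_input = np.concatenate([vision_tokens, text_embeddings], axis=0)
print(lm_input.shape)  # (10, 16)
```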
|
|
**Safety Evaluation**

| Field | Details |
| --- | --- |
| Ethical Considerations | Potential for producing inappropriate, unreliable, or biased content due to training-data limitations. |
|
|
**Responsible AI Considerations**

| Field | Details |
| --- | --- |
| Fairness | Potential biases due to varying representation of groups in the training data. |
| Transparency | Developers are responsible for evaluating model output for fairness and accuracy. |
| Accountability | Developers should ensure model applications adhere to applicable laws and regulations. |
| Mitigation Strategies | Safety post-training; adherence to use-case-relevant laws. |
|
|
**Input / Output**

| Field | Details |
| --- | --- |
| Input Format | a single image with chat-format prompts |
| Accepted Modalities | image, text |
| Output Format | text generated in response to the input |
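Since the model takes a single image alongside chat-format prompts, a prompt typically carries only a placeholder token for the image, while the image itself is handed to the processor separately. The helper below sketches how such a prompt could be assembled; the exact role and placeholder tokens (`<|user|>`, `<|image_1|>`, `<|end|>`, `<|assistant|>`) are assumptions based on common Phi-3-style chat templates, so verify them against the model's actual chat template before use.

```python
def build_prompt(user_text: str, image_index: int = 1) -> str:
    """Compose a chat-format prompt that references one inline image.

    The image itself is passed to the model's processor separately; the
    prompt only carries a placeholder such as <|image_1|>. Token names
    here are illustrative, not confirmed by the model card.
    """
    image_tag = f"<|image_{image_index}|>"
    return (
        "<|user|>\n"
        f"{image_tag}\n{user_text}<|end|>\n"
        "<|assistant|>\n"
    )

prompt = build_prompt("What trend does this chart show?")
print(prompt)
```

In practice a multimodal processor would pair this string with the actual image tensor; the placeholder tells the model where the projected image tokens belong in the sequence.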
|
|