| Model Type | text generation, multimodal |
|
| Use Cases |
| Areas | research, commercial applications |
|
| Applications | general AI systems and applications with visual and text input capabilities, memory/compute-constrained environments, OCR, image understanding |
|
| Primary Use Cases | general image understanding, text generation, language understanding |
|
| Limitations | not evaluated for all downstream purposes; inappropriate for high-risk scenarios without additional safeguards |
|
| Considerations | Developers should ensure accuracy, safety, and fairness for their use cases. |
|
|
| Additional Notes | Phi-3-Vision-128K-Instruct is designed for use in latency-constrained scenarios and comes with rights for commercial use. Developers should apply responsible AI best practices. |
|
| Supported Languages | multilingual (high-quality, reasoning-dense data) |
|
| Training Details |
| Data Sources | publicly available documents; high-quality educational data and code; selected high-quality image-text interleaved data; synthetic data for teaching math, coding, and reasoning; newly created image data (charts, tables, diagrams); high-quality chat-format supervised data |
|
| Data Volume | 500B vision and text tokens |
|
| Methodology | supervised fine-tuning and direct preference optimization for instruction adherence |
|
| Context Length | 128K tokens |
| Training Time | |
| Hardware Used | |
| Model Architecture | Includes an image encoder, connector, projector, and the Phi-3 Mini language model (see the sketch below) |
|
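For orientation, here is a minimal sketch of how the components listed under Model Architecture could compose: an image encoder turns pixels into patch features, a projector (connector) maps them into the language model's embedding space, and the fused sequence is passed to the language model. The module choices, dimensions, and layer counts are illustrative stand-ins, not the actual Phi-3-Vision implementation.

```python
# Illustrative composition only: a stand-in image encoder, a linear
# projector/connector, and a stand-in language model. All dimensions and
# module choices are placeholders, not the real Phi-3-Vision ones.
import torch
import torch.nn as nn

class VisionLanguageSketch(nn.Module):
    def __init__(self, img_feat_dim=256, lm_hidden_dim=512, vocab_size=32000):
        super().__init__()
        # Image encoder stand-in: patchify pixels into a sequence of features.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, img_feat_dim, kernel_size=14, stride=14),
            nn.Flatten(2),  # (batch, img_feat_dim, num_patches)
        )
        # Connector/projector: map image features into the LM embedding space.
        self.projector = nn.Linear(img_feat_dim, lm_hidden_dim)
        # Language model stand-in: token embeddings + stacked transformer
        # layers (no causal mask here, purely for shape illustration) + head.
        self.token_embedding = nn.Embedding(vocab_size, lm_hidden_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=lm_hidden_dim, nhead=8, batch_first=True
        )
        self.language_model = nn.TransformerEncoder(layer, num_layers=2)
        self.lm_head = nn.Linear(lm_hidden_dim, vocab_size)

    def forward(self, pixel_values, input_ids):
        img_feats = self.image_encoder(pixel_values).transpose(1, 2)  # (B, N, C)
        img_embeds = self.projector(img_feats)                        # (B, N, H)
        txt_embeds = self.token_embedding(input_ids)                  # (B, T, H)
        # Prepend projected image embeddings to the text token embeddings.
        fused = torch.cat([img_embeds, txt_embeds], dim=1)
        return self.lm_head(self.language_model(fused))

# Smoke test with random data: one 224x224 image and 16 text tokens.
model = VisionLanguageSketch()
logits = model(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 256 + 16, 32000])
```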
|
| Safety Evaluation |
| Risk Categories | misinformation, offensive content, bias |
|
| Ethical Considerations | Models may produce inappropriate or offensive content. Developers should implement necessary safeguards. |
|
|
| Responsible AI Considerations |
| Fairness | Models can over- or under-represent groups of people and reinforce stereotypes. |
|
| Transparency | Developers should inform users that they are interacting with an AI system. |
|
| Accountability | Developers are responsible for ensuring their use cases comply with applicable laws. |
|
| Mitigation Strategies | Apply additional debiasing techniques, and use retrieval-augmented generation (RAG) to mitigate misinformation (see the sketch below). |
|
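To make the RAG suggestion above concrete, here is a minimal, dependency-free sketch of grounding a prompt in a small trusted corpus before generation. The corpus contents and the naive keyword scorer are placeholders; a production system would use a proper retriever and vector index.

```python
# Minimal retrieval-augmented generation (RAG) sketch: ground the prompt in a
# small trusted corpus before generation. The corpus and keyword scorer are
# illustrative placeholders only.
from collections import Counter

TRUSTED_CORPUS = [
    "Phi-3-Vision accepts text and images and produces text.",
    "The model supports a 128K-token context window.",
    "Developers should add safeguards for high-risk scenarios.",
]

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive keyword overlap with the query."""
    q_tokens = Counter(query.lower().split())
    scored = sorted(
        corpus,
        key=lambda p: sum((Counter(p.lower().split()) & q_tokens).values()),
        reverse=True,
    )
    return scored[:k]

def build_grounded_prompt(question: str) -> str:
    """Prepend retrieved passages so answers are grounded in trusted sources."""
    context = "\n".join(f"- {p}" for p in retrieve(question, TRUSTED_CORPUS))
    return (
        "Answer using only the context below. If the context is insufficient, "
        "say so.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# The grounded prompt would then be sent to the model instead of the raw question.
print(build_grounded_prompt("What context length does the model support?"))
```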
|
| Input Output |
| Input Format | Text and image inputs using the chat template format (see the example below) |
|
| Accepted Modalities | text, image |
| Output Format | Generated text in response to the input |
|
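As a usage example of the input/output format above, the sketch below follows the common Hugging Face transformers pattern for this family of checkpoints (AutoProcessor plus AutoModelForCausalLM with trust_remote_code). The model ID, image URL, and generation settings are illustrative assumptions, not requirements stated in this card.

```python
# Minimal sketch of the chat-template input format with an image placeholder,
# using the Hugging Face transformers "remote code" path. The model ID, image
# URL, and generation settings are illustrative assumptions.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto"
)

# The user turn references the image via a numbered placeholder token.
messages = [
    {"role": "user", "content": "<|image_1|>\nSummarize the chart in this image."},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

image = Image.open(
    requests.get("https://example.com/chart.png", stream=True).raw  # placeholder URL
)
inputs = processor(prompt, [image], return_tensors="pt")

output_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens and decode only the newly generated text.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```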
|