| Model Type | |
| Use Cases |
| Areas: | Research, Commercial applications |
|
| Applications: | Natural language understanding and generation, Mechanistic interpretability, Sentiment analysis, Summarization |
|
| Primary Use Cases: | Arabic NLP research, Commercial chat applications, Sentiment analysis, Academic research |
|
| Limitations: | Generating harmful content (prohibited), Handling sensitive information, Generalization to unsupported languages, High-stakes decision making |
|
| Considerations: | Fine-tuning datasets were curated for cultural adaptation and a diverse range of topics. |
|
|
| Additional Notes | The techniques used to augment the model for Arabic are applicable to other low-resource languages. |
|
| Supported Languages | Arabic (Modern Standard Arabic; strong capabilities), English (strong capabilities) |
|
| Training Details |
| Data Sources: | Web pages, Wikipedia articles, News articles, Social network content, Code data, Books, Scientific papers, Synthetic data (English-to-Arabic translations) |
|
| Data Volume: | Up to 1.6 trillion tokens |
|
| Methodology: | For pre-training, documents are packed into fixed-length sequences separated by EOS tokens; during adapted pre-training, the backbone is kept frozen (see the sketches after this table). Chat models are produced via instruction fine-tuning. |
|
| Context Length: | |
| Hardware Used: | Condor Galaxy supercomputer with 64 Cerebras CS-2 Wafer-Scale Engines |
|
| Model Architecture: | Auto-regressive, decoder-only Transformer architecture with support for long context lengths. |
|
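To make the packing step concrete, the following is a minimal sketch of EOS-delimited document packing for pre-training. It assumes already-tokenized documents; the function name, the default sequence length, and the drop-remainder behavior are illustrative assumptions, not details taken from this card.

```python
from typing import Iterable, Iterator

def pack_documents(
    token_streams: Iterable[list[int]],
    eos_token_id: int,
    seq_len: int = 2048,  # illustrative; the card does not state the context length
) -> Iterator[list[int]]:
    """Concatenate tokenized documents separated by EOS, then slice the
    continuous stream into fixed-length pre-training sequences."""
    buffer: list[int] = []
    for tokens in token_streams:
        buffer.extend(tokens)
        buffer.append(eos_token_id)  # EOS marks each document boundary
        while len(buffer) >= seq_len:
            yield buffer[:seq_len]
            buffer = buffer[seq_len:]
    # Any trailing remainder shorter than seq_len is dropped in this sketch.
```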
|
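Similarly, a minimal PyTorch sketch of the "frozen backbone" step in adapted pre-training, assuming a generic causal LM: every parameter is frozen except those matching a given name prefix (for example, embeddings added for a new vocabulary). The prefix names are illustrative assumptions.

```python
import torch.nn as nn

def freeze_backbone(
    model: nn.Module,
    trainable_prefixes: tuple[str, ...] = ("model.embed",),  # assumed prefix
) -> None:
    """Freeze all parameters except those whose names start with one of
    the given prefixes, so only the new components receive gradients."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
```
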
| Responsible AI Considerations |
| Mitigation Strategies: | Bias minimization in training data; the AI-assistant role of the fine-tuned models is restricted to Arabic and English. |
|
|
| Input/Output |
| Input Format: | |
| Accepted Modalities: | |
| Output Format: | |
|
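As a decoder-only causal language model, the interface is plain text in, plain text out. Below is a hypothetical usage sketch with the Hugging Face transformers API; the checkpoint ID and the prompt are placeholders, not values from this card.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "org/arabic-english-chat-model"  # placeholder, not the real checkpoint ID
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "لخص الفقرة التالية:"  # e.g., an Arabic summarization instruction
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```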