| Model Type | |
| Use Cases |
| Areas: | Basque language technology and research |
| Primary Use Cases: | Pre-trained LLMs for specific tasks or further fine-tuning for specific use cases (a loading sketch follows this section). |
| Limitations: | Not fine-tuned to follow instructions or to work as a chat assistant. |
| Considerations: | Use with Basque data; performance in other languages is not guaranteed. |
| Additional Notes | Models range from 7 to 70 billion parameters; evaluation resources and datasets are publicly available under open licenses. |
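
Since the released checkpoints are plain pre-trained language models, a minimal usage sketch looks like ordinary text completion rather than chat. The snippet below assumes a Hugging Face Hub repository id (`HiTZ/latxa-7b-v1.1`) and generation settings that are illustrative only and not specified in this card.

```python
# Minimal sketch: loading a pre-trained (non-instruction-tuned) checkpoint for
# plain Basque text completion. The repository id, dtype and generation
# settings below are illustrative assumptions, not values from this card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "HiTZ/latxa-7b-v1.1"  # assumed hub id; substitute the actual checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Raw Basque text to be continued; no chat template is applied because the
# model is not fine-tuned to follow instructions.
prompt = "Euskal Herriko hiriburuak honako hauek dira:"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
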
| Supported Languages | |
| Training Details |
| Data Sources: | HiTZ/latxa-corpus-v1.1, EleutherAI/pile (a streaming sketch follows this section) |
| Data Volume: | |
| Methodology: | Prioritizes high-quality data sources with deduplication and filtering; trained using GPT-NeoX on HPC infrastructure. |
| Context Length: | |
| Training Time: | 10k steps over 20B total tokens, around 4 epochs |
| Hardware Used: | CINECA Leonardo HPC cluster, 3456 nodes each containing 4x custom A100 64GB GPUs |
| Model Architecture: | Follows Meta's LLaMA architecture, further trained on a Basque corpus |
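
As a rough illustration of how the listed corpora might be inspected, the sketch below streams the Basque corpus with the `datasets` library; the default configuration and a `text` column are assumptions, and EleutherAI/pile is referenced only in a comment since its availability and schema may differ.

```python
# Minimal sketch: streaming one of the listed data sources with the `datasets`
# library. The configuration name and the "text" column are assumptions; check
# the dataset card for the actual schema. EleutherAI/pile, the other listed
# source, would be loaded the same way if it is accessible in your environment.
from datasets import load_dataset

corpus = load_dataset("HiTZ/latxa-corpus-v1.1", split="train", streaming=True)

for example in corpus.take(3):
    # Each record is assumed to carry a raw-text field named "text".
    print(example.get("text", "")[:200])
```
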
| Responsible AI Considerations |
| Fairness: | Trained on carefully selected and processed data to minimize disturbing or harmful content. |
| Mitigation Strategies: | A thorough deduplication and filtering process was applied to the training data (an illustrative deduplication sketch follows this section). |
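
As a generic illustration of the deduplication step mentioned above, the sketch below drops exact duplicates by hashing whitespace- and case-normalized text. It is a minimal example of the technique, not a description of the actual filtering pipeline used for this corpus.

```python
# Illustrative sketch of exact-duplicate removal by hashing normalized text.
# A generic example of the technique; not the pipeline used to build the corpus.
import hashlib

def dedup_exact(documents):
    """Yield documents whose normalized text has not been seen before."""
    seen = set()
    for doc in documents:
        normalized = " ".join(doc.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield doc

docs = ["Kaixo mundua!", "kaixo   mundua!", "Agur!"]
print(list(dedup_exact(docs)))  # the near-duplicate second document is dropped
```
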
| Input Output |
| Accepted Modalities: | |
| Output Format: | |
|
| Release Notes | |