| Model Type | | causal language model, text generation |
|
| Use Cases |
| Areas: | | research, commercial applications |
|
| Primary Use Cases: | | causal language modeling, text-generation tasks |
|
| Limitations: | | Inherits biases present in its training data; no bias or toxicity estimates are currently available. |
|
| Considerations: | | The model is trained on data that may contain biases and should be used with caution. |
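Text generation with a causal language model reduces to iterated next-token prediction. A minimal, self-contained sketch of greedy decoding is shown below; `logits_fn` stands in for a real model's forward pass, and all names here are illustrative, not part of this model's actual API:

```python
def greedy_generate(logits_fn, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy decoding loop for a causal LM.

    logits_fn maps the token-id sequence so far to next-token
    logits; here it is a stand-in for a real model's forward pass.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)
        # Pick the highest-scoring token (argmax) and append it.
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids
```

In practice one would use the model's own generation utilities (with sampling, temperature, etc.) rather than this bare greedy loop.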
|
|
| Additional Notes | | A small amount of English data was retained to prevent catastrophic forgetting. |
|
| Supported Languages | | en (High), es (High), ca (High) |
|
| Training Details |
| Data Sources: | | Wikipedia, C4_es, Biomedical, Legal, Gutenberg, C4_ca, RacoCatalà Noticias, RacoCatalà Forums, CaWaC, Vilaweb |
|
| Data Volume: | |
| Methodology: | | Adapted by swapping the tokenizer and adjusting the embedding layer. |
|
| Training Time: | |
| Hardware Used: | | 8 NVIDIA H100 GPUs, each with 80 GB of memory |
|
| Model Architecture: | | Uses a Byte-Pair Encoding (BPE) tokenizer with a vocabulary of 50,257 tokens. |
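The adaptation methodology above (swapping the tokenizer and adjusting the embedding layer) requires initializing embedding rows for the new vocabulary. One common scheme initializes each new token's embedding as the mean of the old embeddings of its sub-tokens under the original tokenizer; the card does not specify the exact scheme used, so the sketch below, including `adapt_embeddings` and the mapping format, is purely illustrative:

```python
import numpy as np

def adapt_embeddings(old_emb, new_to_old):
    """Build an embedding matrix for a new vocabulary.

    old_emb:    (old_vocab, d) embedding matrix of the source model.
    new_to_old: dict mapping each new-token id to the list of old-token
                ids its surface string splits into under the old
                tokenizer. New rows are initialized as the mean of the
                corresponding old embeddings; tokens with no mapping
                fall back to the mean of the whole old matrix.
    """
    d = old_emb.shape[1]
    new_emb = np.empty((len(new_to_old), d), dtype=old_emb.dtype)
    fallback = old_emb.mean(axis=0)
    for new_id, old_ids in new_to_old.items():
        new_emb[new_id] = old_emb[old_ids].mean(axis=0) if old_ids else fallback
    return new_emb
```

After building the matrix, the model's input (and, if tied, output) embedding layer would be resized to the new vocabulary size and loaded with these values before continued pretraining on the target-language data.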
|
|
| Responsible AI Considerations |
| Fairness: | | No measures have been taken to estimate bias and toxicity at the time of submission. |
|
| Transparency: | | The model is provided with documentation of its creation and intended use. |
|
| Accountability: | | Accountability lies with the users deploying the model. |
|
| Mitigation Strategies: | | Users of the model should aim to mitigate the risks associated with bias and toxicity. |
|
|
| Input Output |
| Input Format: | |
| Accepted Modalities: | |
| Output Format: | |
|
| Release Notes | |