Model Type | Decoder-only causal language model
Use Cases |
Limitations: | Not tuned to any specific task; fine-tuning is recommended before downstream use
Additional Notes | A slightly larger 101M-parameter GQA pretrained version is available. For a chat version of this model, refer to the provided YouTube link.
Supported Languages | English
Training Details |
Data Sources: | JeanKaddour/minipile, pszemraj/simple_wikipedia_LM, BEE-spoke-data/wikipedia-20230901.en-deduped, mattymchen/refinedweb-3m |
Methodology: | Standard multi-head attention with tied input/output embeddings
Context Length: | |
Model Architecture: | Decoder-only transformer: hidden size 768, 6 layers, 24 attention heads (see the configuration sketch below)
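As a rough illustration of the training details above, the following is a minimal configuration sketch using Hugging Face transformers. The use of LlamaConfig and the intermediate size are assumptions for illustration only; the card does not state the exact architecture family, and the context length and vocabulary size are left at library defaults because they are not given above.

```python
# Minimal sketch of a decoder-only config matching the stated dimensions:
# hidden size 768, 6 layers, 24 attention heads, tied input/output embeddings.
# NOTE: LlamaConfig is an assumption; the card does not name the config class.
#       intermediate_size is a placeholder, and vocab_size / context length are
#       left at the library defaults because the card does not specify them.
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    hidden_size=768,
    num_hidden_layers=6,
    num_attention_heads=24,
    num_key_value_heads=24,      # standard multi-head attention (no GQA), per the Methodology field
    intermediate_size=3072,      # assumed 4x hidden size; not stated on the card
    tie_word_embeddings=True,    # input/output embedding tying, per the Methodology field
)

model = LlamaForCausalLM(config)
print(f"Parameter count: {model.num_parameters():,}")
```

Instantiating the config this way is only a shape check; the actual pretrained weights come from the released checkpoint, not from this sketch.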
Input Output | Text input; text output (autoregressive next-token generation)
Release Notes |
Version: | This is the first version of the model |
Notes: | This checkpoint is the 'raw' pre-trained model and has not been tuned to a more specific task. In most cases it should be fine-tuned before use (see the loading sketch below).
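Because the checkpoint is released as a raw pre-trained base model, a typical workflow is to load it with transformers and fine-tune it before deployment. The repository id below is a placeholder rather than the model's actual hub name, and the prompt is arbitrary; this is a sketch of the loading step only.

```python
# Sketch: load the raw pretrained checkpoint and run a quick generation check.
# "your-org/your-pretrained-checkpoint" is a placeholder repo id; substitute the
# actual model name. Expect generic continuations until the model is fine-tuned.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-pretrained-checkpoint"  # placeholder, not the real hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("The history of the printing press", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# For downstream use, continue with task-specific fine-tuning (e.g. via
# transformers.Trainer on a tokenized dataset) rather than relying on the base
# model's generations directly.
```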