| Model Type | generative code model, text generation, code synthesis |
|
| Use Cases |
| Areas: | |
| Primary Use Cases: | |
| Limitations: | Potential for misuse in generating vulnerable/malicious code |
| Considerations: | Model-generated code must not be executed without precautions. |
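
Per the consideration above, a minimal precaution is to run model-generated code in a separate interpreter process with a timeout instead of in the calling process. The sketch below assumes a Python toolchain; the helper name `run_generated_code` is illustrative rather than part of the released tooling, and real deployments should add stronger isolation (containers, resource limits, restricted filesystem/network access).

```python
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout_s: float = 5.0) -> subprocess.CompletedProcess:
    """Execute untrusted, model-generated Python in a child process with a timeout.

    This is only a basic safeguard: it keeps the code out of the caller's
    interpreter and bounds its runtime, but it does not restrict filesystem
    or network access on its own.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    # -I runs the interpreter in isolated mode (ignores PYTHON* env vars and user site-packages).
    # subprocess.run raises TimeoutExpired if the child exceeds the timeout.
    return subprocess.run(
        [sys.executable, "-I", path],
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
```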
|
|
| Additional Notes | Pretrained on a mixture of open-source web text and Python code. |
|
| Training Details |
| Data Sources: | bigcode/the-stack, HuggingFaceFW/fineweb, Magicoder, StarCoder2 OSS-Instruct (see the loading sketch below) |
| Data Volume: | |
| Methodology: | Pretrained on open-source web text and Python code; instruction tuned on synthetic edit sequence data generated with the LintSeq algorithm (see the toy sketch below). |
| Training Time: | Pretraining took about two days for the 150M model and six days for the 400M model; instruction tuning took several hours. |
| Hardware Used: | A single H100 node (four GPUs) for pretraining; a single H100 GPU for instruction tuning |
| Model Architecture: | Autoregressive language models following the GPT-2 architecture, with the transformer modifications used in the OLMo models. |
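
The corpora listed under Data Sources are hosted on the Hugging Face Hub and can be streamed with the `datasets` library. The snippet below is a sketch covering the two entries with explicit Hub IDs; the `data_dir`, config name, split, and column names are assumptions chosen to keep the example small, not a description of this model's actual data pipeline.

```python
from datasets import load_dataset

# Python subset of The Stack (gated: requires accepting the dataset terms on the Hub).
the_stack_py = load_dataset(
    "bigcode/the-stack",
    data_dir="data/python",   # The Stack stores each language under data/<language>
    split="train",
    streaming=True,           # stream instead of downloading the full corpus
)

# FineWeb web text; "sample-10BT" is one of the published sample configs,
# used here only to keep the example lightweight.
fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",
    split="train",
    streaming=True,
)

# Peek at a few source files; "content" holds the raw file text in The Stack.
for example in the_stack_py.take(3):
    print(example["content"][:200])
```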
|
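LintSeq-style edit sequence data recasts a finished program as a chain of insertion edits: lines are deleted step by step while a linter confirms each intermediate program stays error-free, and the reversed deletion trajectory becomes a sequence of diffs used for instruction tuning. The toy sketch below only illustrates that idea; it substitutes a parse check for a real linter and is not the authors' reference implementation.

```python
import ast
import difflib
import random

def lint_ok(code: str) -> bool:
    """Stand-in for a real linter: accept the code only if it still parses."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False

def sample_edit_sequence(program: str, max_steps: int = 20, seed: int = 0) -> list[str]:
    """Backward-sample deletions that keep the program 'lint'-clean, then reverse
    the trajectory into a sequence of insertion edits expressed as unified diffs."""
    rng = random.Random(seed)
    states = [program]
    current = program.splitlines()
    for _ in range(max_steps):
        if not current:
            break
        i = rng.randrange(len(current))
        j = min(len(current), i + rng.randint(1, 3))
        candidate = current[:i] + current[j:]
        if lint_ok("\n".join(candidate)):
            current = candidate
            states.append("\n".join(current))
    states.reverse()  # from a reduced program back up to the full program
    edits = []
    for before, after in zip(states, states[1:]):
        diff = difflib.unified_diff(
            before.splitlines(), after.splitlines(),
            fromfile="before.py", tofile="after.py", lineterm="",
        )
        edits.append("\n".join(diff))
    return edits
```

Under this scheme the instruction-tuned model's completions are themselves sequences of diffs, which matches the output format noted below.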
|
| Safety Evaluation |
| Risk Categories: | Potential misuse for generating vulnerable or malicious code |
| Ethical Considerations: | Model-generated code must be handled with appropriate precautions. |
|
|
| Input Output |
| Input Format: | |
| Output Format: | Text and code outputs; the instruction-tuned models generate code as sequences of 'diffs' (see the sketch below). |
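
Because the instruction-tuned models emit code as diffs, a consumer must apply each generated edit to the current program state to recover runnable source. The sketch below assumes the diffs follow the standard unified format with well-formed hunks; `apply_unified_diff` is an illustrative helper, not part of the released tooling.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+\d+(?:,\d+)? @@")

def apply_unified_diff(source: str, diff: str) -> str:
    """Apply a single unified diff to `source` and return the patched text.

    Assumes well-formed hunks and exact context matches (no fuzzy matching).
    """
    src = source.splitlines()
    out: list[str] = []
    pos = 0  # next unconsumed line of the original source
    lines = diff.splitlines()
    i = 0
    while i < len(lines):
        m = HUNK_HEADER.match(lines[i])
        if not m:
            i += 1  # skip ---/+++ headers and anything outside hunks
            continue
        old_start = int(m.group(1))
        old_count = int(m.group(2)) if m.group(2) is not None else 1
        # Hunk positions are 1-based; a zero-length old range points *after* the given line.
        start = old_start - 1 if old_count > 0 else old_start
        out.extend(src[pos:start])  # copy untouched lines up to the hunk
        pos = start
        i += 1
        while i < len(lines) and not lines[i].startswith("@@"):
            line = lines[i]
            if line.startswith(" "):
                out.append(line[1:])  # context line: keep it
                pos += 1
            elif line.startswith("-"):
                pos += 1              # deletion: skip the original line
            elif line.startswith("+"):
                out.append(line[1:])  # insertion: add the new line
            i += 1
    out.extend(src[pos:])  # copy the remainder of the original file
    return "\n".join(out)

# A full edit sequence can be resolved by folding the applier over the diffs:
#   program = ""
#   for d in generated_diffs:
#       program = apply_unified_diff(program, d)
```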
|
|