| Field | Details |
| --- | --- |
| Model Type | Text generation, code generation |
| Use Cases | |
| Primary Use Cases | Code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt issues, vulnerability detection, code translation (see the generation sketch below) |
| Limitations | Generated code is not guaranteed to work as intended; the model may generate problematic outputs; smaller models carry a risk of verbatim copying due to memorization |
| Considerations | Caution is urged against complete reliance on generated outputs, given the potential for problematic outputs and hallucination |
| Supported Languages | 116 programming languages, with comprehensive understanding across them |
| Training Details | |
| Data Sources | Publicly available datasets (e.g., GitHub Code Clean, StarCoder data), plus additional public code repositories and issues from GitHub |
| Data Volume | Phase 1: 3 trillion tokens; Phase 2: 1 trillion tokens |
| Methodology | Two-phase training strategy |
| Hardware Used | IBM's Vela and Blue Vela supercomputing clusters, equipped with NVIDIA A100 and H100 GPUs |
| Model Architecture | Decoder-only architecture designed for code-generation tasks |
| Responsible AI Considerations | |
| Mitigation Strategies | HAP content filter, PII redaction (illustrated below), malware scanning |
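The Primary Use Cases row lists code generation among the supported tasks. As a minimal sketch of what that looks like in practice, assuming the model is published as a standard causal language model checkpoint loadable with Hugging Face `transformers` (the model identifier below is a placeholder, not an actual checkpoint name):

```python
# Minimal code-generation sketch with a decoder-only causal LM.
# "your-org/your-code-model" is a placeholder identifier, not an actual checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-code-model"  # placeholder; substitute the real model name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Prompt the model with a function signature and docstring to complete.
prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the example deterministic; settings are illustrative only.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

As the Limitations row notes, generated code is not guaranteed to work as intended, so completions produced this way should be reviewed and tested before use.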
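The Mitigation Strategies row states that PII redaction was applied as part of data preparation. The actual filtering pipeline is not described here, so the following is only an illustrative sketch of a regex-based redaction pass over raw text; the patterns and replacement tags are assumptions, not the real implementation:

```python
# Illustrative regex-based PII redaction over raw text (not the actual pipeline).
import re

# Example patterns only; a production system would cover far more PII categories.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def redact_pii(text: str) -> str:
    """Replace each matched PII span with a tag such as <EMAIL> or <IPV4>."""
    for tag, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{tag}>", text)
    return text

print(redact_pii("Contact alice@example.com from host 192.168.0.1"))
# -> Contact <EMAIL> from host <IPV4>
```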