| Model Type | text generation, code generation |
| Use Cases |
| Areas: | enterprise, software engineering productivity |
| Applications: | code generation, code explanation, code fixing, generating unit tests, generating documentation, addressing technical debt issues, vulnerability detection, code translation (see the usage sketch below) |
| Limitations: | Has not undergone safety alignment; potentially susceptible to hallucination in generation scenarios |
| Considerations: | Should not be relied on for critical decisions; ethical and responsible usage is recommended. |
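For the code-oriented applications above, a minimal generation sketch using the Hugging Face `transformers` API; the model ID is a hypothetical placeholder, not this card's actual checkpoint name:

```python
# Minimal sketch: prompting the model for code completion.
# MODEL_ID is a hypothetical placeholder -- substitute the real checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "org/code-model"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding keeps completions deterministic, which suits code tasks.
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pattern covers the other listed applications (explanation, test generation, translation) by changing the prompt.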
| Additional Notes | Trained on extensive code datasets across 116 languages. |
| Supported Languages | Python (high), C (high), C++ (high), Go (high), Java (high), JavaScript (high), TypeScript (high) |
| Training Details |
| Data Sources: | codeparrot/github-code-clean, bigcode/starcoderdata, open-web-math/open-web-math, math-ai/StackMathQA, bigcode/humanevalpack, repoqa, lcc, repobench |
| Data Volume: | |
| Methodology: | Continual pretraining with repository-level file packing and per-language length upsampling (sketched below) |
| Context Length: | 128K tokens (extended from 2K; see Release Notes) |
| Hardware Used: | NVIDIA A100 GPUs, NVIDIA H100 GPUs |
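A rough illustration of the repository-level file packing named in the methodology row: files from one repository are concatenated into a single training sample so the model learns cross-file context. The separator token, file ordering, and length cap below are assumptions for the sketch, not details taken from this card:

```python
# Sketch of repository-level file packing (assumed details marked below).
from typing import Dict, List

FILE_SEP = "<file_sep>"  # hypothetical separator token, not from this card
MAX_CHARS = 512_000      # rough character cap standing in for a token budget

def pack_repository(files: Dict[str, str]) -> str:
    """Join one repo's files (path -> source text) into a single sample."""
    parts: List[str] = []
    for path in sorted(files):  # deterministic order; real pipelines may differ
        parts.append(f"{path}\n{files[path]}")
    return FILE_SEP.join(parts)[:MAX_CHARS]

repo = {
    "src/main.py": "from util import add\nprint(add(1, 2))\n",
    "src/util.py": "def add(a, b):\n    return a + b\n",
}
print(pack_repository(repo))
```

Per-language length upsampling would then, presumably, sample longer packed documents more often within each language so that long-context examples are well represented during training.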
| Responsible AI Considerations |
| Mitigation Strategies: | The model has not undergone safety alignment; handling the risks of problematic outputs and malicious use remains an active area of research. |
| Input/Output |
| Input Format: | text input (e.g., code snippets) |
| Accepted Modalities: | text |
| Output Format: | text (e.g., generated code) |
| Release Notes |
| Date: | |
| Notes: | Extended context length from 2K to 128K with continual pretraining. |
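A quick way to confirm the extended window on a loaded checkpoint; the model ID is again a placeholder, and 131072 is simply 128 × 1024, not a value taken from this card:

```python
# Sketch: check the configured context window after the 128K extension.
# The model ID is a hypothetical placeholder; the expected value is an assumption.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("org/code-model")  # placeholder ID
print(config.max_position_embeddings)  # expect ~131072 for a 128K window
```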