| Training Details | |
| --- | --- |
| Data Sources: | Wikipedia (English, German, Spanish, French), Project Gutenberg, 45 subreddits, OpenWebText, news data, Amazon Reviews, Europarl and UN data from WMT, ELI5, and the MRQA shared tasks |
| Data Volume: | 140 GB of text |
| Methodology: | Pre-trained with a language modeling objective, with a control code prepended as the first token of each sequence (see the generation sketch below the table) |
| Training Time: | Approximately 2 weeks (800k iterations with a global batch size of 1024) |
| Hardware Used: | Cloud TPU v3 Pod (256 cores) |
| Model Architecture: | CTRL has model dimension d = 1280, inner dimension f = 8192, 48 layers, and 16 heads per layer. Dropout with probability 0.1 follows the residual connections in each layer. (A parameter-count sketch follows the table.) |
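The control-code conditioning described in the Methodology row can be exercised directly at inference time. Below is a minimal generation sketch, assuming the Hugging Face `transformers` port of CTRL and the public `Salesforce/ctrl` checkpoint; the control code used here (`Wikipedia`) is one of the codes CTRL was trained with.

```python
# Minimal sketch: conditioning CTRL generation on a control code.
# Assumes the Hugging Face transformers port and the Salesforce/ctrl checkpoint.
from transformers import CTRLLMHeadModel, CTRLTokenizer

tokenizer = CTRLTokenizer.from_pretrained("Salesforce/ctrl")
model = CTRLLMHeadModel.from_pretrained("Salesforce/ctrl")

# The control code must be the first token, exactly as during pre-training;
# everything generated afterwards is conditioned on it.
prompt = "Wikipedia The history of machine translation"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

output = model.generate(input_ids, max_length=60, repetition_penalty=1.2)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Swapping the control code (e.g. `Reviews`, `Horror`, `Legal`) changes the domain and style of the continuation without changing the rest of the prompt.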
|
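As a sanity check on the Model Architecture row, these hyperparameters imply a parameter count close to CTRL's reported 1.63 billion. A back-of-the-envelope sketch, assuming the standard Transformer layer layout and CTRL's roughly 246k-token BPE vocabulary (biases and LayerNorm parameters ignored):

```python
# Rough parameter count implied by the architecture hyperparameters above.
d, f, n_layers = 1280, 8192, 48
vocab = 246_534  # assumed: CTRL's BPE vocabulary size

attn = 4 * d * d        # query, key, value, and output projections
ffn = 2 * d * f         # the two feed-forward projections (d -> f -> d)
per_layer = attn + ffn  # ~27.5M parameters per layer

total = n_layers * per_layer + vocab * d  # embedding matrix tied with output
print(f"~{total / 1e9:.2f}B parameters")  # ~1.64B, close to the reported 1.63B
```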
|