| Training Details | |
| --- | --- |
| Data Sources | Korean blog posts, Korean news dataset, Modu corpus, Korean patent dataset, Korean Q&A dataset, KcBERT dataset, Korean fiction dataset, Korean online comments, Korean Wikipedia, ClovaCall, Naver Sentiment Movie Corpus, Korean hate speech dataset, OpenSubtitles, AIHub datasets for various tasks, Standard Korean Language Dictionary |
| Data Volume | 863 GB (1.2 TB before processing) |
| Methodology | Trained with a cross-entropy loss to maximize the likelihood of predicting the next token, using the EleutherAI GPT-NeoX framework (see the loss sketch below). |
| Context Length | |
| Training Time | |
| Hardware Used | |
| Model Architecture | 40 transformer layers, model dimension 5120, feed-forward dimension 20480, 40 attention heads with head dimension 128, rotary position embeddings (RoPE) applied to 64 of the 128 dimensions of each head, and a tokenizer vocabulary of 30,003 (see the configuration sketch at the end of this section). |
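
The training objective above is the standard causal language-modeling loss. Below is a minimal PyTorch sketch of that objective; the helper name `causal_lm_loss` is illustrative, and this is not the actual GPT-NeoX training code.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, input_ids: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy. logits: (batch, seq, vocab); input_ids: (batch, seq)."""
    # Position t predicts token t+1, so drop the last logit and the first label.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = input_ids[:, 1:].contiguous()
    # Minimizing cross-entropy maximizes the likelihood of the observed next tokens.
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
    )
```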
|
|
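For reference, the architecture hyperparameters listed above can be expressed as a decoder-only model configuration. The sketch below uses the Hugging Face `GPTNeoXConfig` class purely as an assumed target format; it is not the released model's actual configuration file, and the context length is omitted because it is not specified in the table.

```python
from transformers import GPTNeoXConfig

# Illustrative mapping of the hyperparameters above onto a GPT-NeoX-style
# decoder-only configuration; not the released checkpoint's exact config.
config = GPTNeoXConfig(
    vocab_size=30003,         # tokenizer vocabulary size
    num_hidden_layers=40,     # 40 transformer layers
    hidden_size=5120,         # model dimension
    num_attention_heads=40,   # head dimension = 5120 / 40 = 128
    intermediate_size=20480,  # feed-forward dimension
    rotary_pct=0.5,           # RoPE applied to 64 of the 128 dimensions per head
    # max_position_embeddings (context length) is left unspecified here,
    # matching the empty "Context Length" field above.
)
```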