Two-phase training method from MiniCPM. Phase 1 uses a constant learning rate with linear warmup and trains on 1 trillion tokens from large-scale open-source pretraining datasets, including RefinedWeb, Pile, GitHub data, etc. Phase 2 uses exponential learning-rate decay and trains on 250 billion tokens drawn from the phase 1 datasets plus additional high-quality open-source datasets.
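The schedule can be pictured as linear warmup into a long constant-LR phase, followed by an exponential decay phase. The sketch below illustrates this shape only; all step counts and learning-rate values are illustrative assumptions, not the values used to train JetMoE-8B.

```python
# Illustrative sketch of the two-phase schedule: linear warmup -> constant LR
# (phase 1), then exponential decay (phase 2). Every constant below is an
# assumption for demonstration purposes.
import math

WARMUP_STEPS = 2_000        # assumed linear-warmup length
PHASE1_STEPS = 250_000      # assumed number of phase-1 steps (~1T tokens)
PHASE2_STEPS = 60_000       # assumed number of phase-2 steps (~250B tokens)
PEAK_LR = 5e-4              # assumed peak learning rate
FINAL_LR = 5e-5             # assumed learning rate at the end of decay


def learning_rate(step: int) -> float:
    """Return the learning rate for a given optimizer step."""
    if step < WARMUP_STEPS:                      # linear warmup
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < PHASE1_STEPS:                      # phase 1: constant LR
        return PEAK_LR
    # phase 2: exponential decay from PEAK_LR down to FINAL_LR
    progress = min(step - PHASE1_STEPS, PHASE2_STEPS) / PHASE2_STEPS
    return PEAK_LR * math.exp(progress * math.log(FINAL_LR / PEAK_LR))


if __name__ == "__main__":
    for s in (0, 1_000, 100_000, 250_000, 280_000, 310_000):
        print(f"step {s:>7}: lr = {learning_rate(s):.6f}")
```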
Hardware Used:
96×H100 GPU cluster
Model Architecture:
24 blocks, each containing two sparsely activated layers: a Mixture of Attention heads (MoA) layer and a Mixture of MLP Experts (MoE) layer. Each MoA and MoE layer has 8 experts, with 2 activated per input token, giving 8 billion total parameters and 2.2 billion active during inference.
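A minimal sketch of the sparse MLP-expert part of such a block is shown below: 8 expert MLPs with top-2 routing per token. The real model pairs this with a Mixture-of-Attention-heads layer, which is omitted here for brevity; the dimensions and routing details are illustrative assumptions, not the exact JetMoE implementation.

```python
# Simplified sketch of a top-2 Mixture-of-Experts MLP layer (8 experts),
# in the spirit of the MoE layers described above. Dimensions are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKMoE(nn.Module):
    def __init__(self, d_model: int = 512, d_hidden: int = 2048,
                 num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> route each token independently
        tokens = x.reshape(-1, x.size(-1))
        logits = self.router(tokens)                        # (tokens, experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep top-2 experts
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = TopKMoE()
    y = layer(torch.randn(2, 16, 512))
    print(y.shape)  # torch.Size([2, 16, 512])
```

With 2 of 8 experts active per token, only a fraction of the layer's parameters participate in each forward pass, which is how the model keeps 8B total parameters but roughly 2.2B active at inference time.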