| Model Type | | text-generation, content moderation |
|
| Use Cases |
| Areas: | | Safety and content moderation |
|
| Applications: | | Online platforms requiring content moderation |
|
| Primary Use Cases: | | Classifying content for safety in both inputs and responses |
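To show what "both inputs and responses" means in practice, here is a minimal sketch of a moderation gate applied at the two checkpoints of a chat pipeline. `classify`, `generate_reply`, and the toy blocklist policy are hypothetical stand-ins, not part of this model's API:

```python
BLOCKLIST = {"how to build a bomb"}  # toy placeholder policy, for illustration only

def classify(text: str) -> bool:
    """Toy stand-in for the moderation model; returns True when 'safe'."""
    return text.lower() not in BLOCKLIST

def generate_reply(prompt: str) -> str:
    """Toy stand-in for the chat model."""
    return f"Echo: {prompt}"

def moderated_chat(user_prompt: str) -> str:
    # Checkpoint 1: screen the user input before generation.
    if not classify(user_prompt):
        return "Sorry, I can't help with that."
    reply = generate_reply(user_prompt)
    # Checkpoint 2: screen the model's response before returning it.
    if not classify(reply):
        return "Sorry, I can't help with that."
    return reply

print(moderated_chat("What's the weather like?"))
```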
|
| Limitations: | | Performance is limited by the training data; not designed for chat use cases; susceptible to adversarial attacks |
|
| Considerations: | | Recommended for use alongside additional solutions that cover the unsupported categories |
|
|
| Additional Notes | | Supports 11 of the 13 categories in the MLCommons AI Safety taxonomy; the Election and Defamation categories are not addressed. |
|
| Training Details |
| Data Sources: | | Llama Guard training set, MLCommons taxonomy, hard samples from Llama 2 70B |
|
| Methodology: | | Fine-tuned for safety classification |
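To make the fine-tuning setup concrete, the sketch below shows one plausible way a conversation turn could be wrapped in a safety-classification prompt. The template wording, the truncated category list, and the `build_prompt` helper are illustrative assumptions, not the model's published format:

```python
# Illustrative prompt wrapper for safety classification. The exact template
# and category codes are assumptions; consult the model's documentation.
UNSAFE_CATEGORIES = (
    "S1: Violent Crimes\n"
    "S2: Non-Violent Crimes\n"
    "S3: Sex-Related Crimes"
    # ... remaining categories omitted for brevity
)

def build_prompt(user_message: str, assistant_reply: str | None = None) -> str:
    """Wrap a single conversation turn in a classification prompt."""
    conversation = f"User: {user_message}"
    if assistant_reply is not None:
        conversation += f"\nAgent: {assistant_reply}"
    return (
        "Task: Check whether the conversation below contains unsafe content "
        "according to the listed categories.\n\n"
        f"<BEGIN UNSAFE CONTENT CATEGORIES>\n{UNSAFE_CATEGORIES}\n"
        "<END UNSAFE CONTENT CATEGORIES>\n\n"
        f"<BEGIN CONVERSATION>\n{conversation}\n<END CONVERSATION>\n\n"
        "Answer 'safe' or 'unsafe'."
    )

print(build_prompt("How do I pick a lock?"))
```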
|
|
| Safety Evaluation |
| Methodologies: | | Harm taxonomy evaluation, MLCommons taxonomy alignment |
|
| Findings: | | Strong adaptability to other policies; superior trade-off between F1 score and false positive rate |
|
| Risk Categories: | | Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Suicide & Self-Harm, Sexual Content |
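For reference, the 11 categories above can be held in a simple code-to-name mapping. The S1–S11 codes follow the convention used by MLCommons-aligned safety taxonomies and are an assumption here; verify them against the model's own documentation:

```python
# Assumed S1-S11 code convention for the 11 supported categories.
RISK_CATEGORIES: dict[str, str] = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Specialized Advice",
    "S6": "Privacy",
    "S7": "Intellectual Property",
    "S8": "Indiscriminate Weapons",
    "S9": "Hate",
    "S10": "Suicide & Self-Harm",
    "S11": "Sexual Content",
}
```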
|
| Ethical Considerations: | |
|
| Responsible AI Considerations |
| Mitigation Strategies: | | Pairing the model with external components, such as k-nearest-neighbor (kNN) classifiers, to cover unsupported categories (see the sketch below) |
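As a concrete, simplified illustration of the kNN idea, the sketch below labels a new text by nearest-neighbor vote over a small set of pre-labeled embeddings. The `embed` function and the example texts/labels are placeholders; a real deployment would use a proper sentence-embedding model and curated data for the unsupported categories (e.g. Election, Defamation):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def embed(texts: list[str]) -> np.ndarray:
    """Placeholder: replace with a real sentence-embedding model."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 384))

# Tiny labeled set covering a category the main model does not support.
train_texts = [
    "unverified claim that the election was stolen",  # hypothetical examples
    "how do I register to vote in my state?",
    "the weather will be sunny tomorrow",
]
train_labels = ["election_misinfo", "safe", "safe"]

knn = KNeighborsClassifier(n_neighbors=1, metric="cosine")
knn.fit(embed(train_texts), train_labels)

print(knn.predict(embed(["rumor that votes were flipped overnight"])))
```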
|
|
| Input Output |
| Input Format: | | Text containing the user prompt and/or model response to be classified |
| Accepted Modalities: | | Text |
| Output Format: | | Binary classification (safe/unsafe) |
|
| Performance Tips: | | Align the model with your platform's specific safety policy and taxonomy for better moderation results |
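A minimal sketch of consuming the binary verdict, assuming the model emits `safe` or `unsafe` on the first line of its completion; the convention that violated category codes may follow on a second line is an assumption here, not documented behavior:

```python
def parse_verdict(completion: str) -> tuple[bool, list[str]]:
    """Parse a moderation completion into (is_safe, violated_category_codes)."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    if not lines:
        return False, []  # conservatively treat an empty completion as unsafe
    is_safe = lines[0].lower() == "safe"
    # Assumed convention: an 'unsafe' verdict may be followed by a
    # comma-separated list of violated category codes (e.g. "S1,S9").
    codes = lines[1].split(",") if (not is_safe and len(lines) > 1) else []
    return is_safe, codes

print(parse_verdict("safe"))           # (True, [])
print(parse_verdict("unsafe\nS1,S9"))  # (False, ['S1', 'S9'])
```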
|
|