| Model Type | | LLM-based content safety classifier |
|
| Use Cases |
| Areas: | | Content moderation, Security systems, Search and code interpretation tools |
|
| Applications: | | Enterprise content management systems, Search optimizations and secure AI deployments |
|
| Primary Use Cases: | | Moderating harmful content in LLM prompts and responses, Ensuring safety in search-tool interactions and code-interpreter applications |
|
| Limitations: | | Performance limited by training data context and scope, Potential susceptibility to prompt injection attacks |
|
| Considerations: | | Recommended for deployment alongside broader moderation systems, particularly for cases that require up-to-date factual evaluation. |
|
|
| Additional Notes | | Dataset expansions include multilingual conversation data and challenging borderline cases to further reduce false positives. |
|
| Supported Languages | | English (full), French (full), German (full), Hindi (full), Italian (full), Portuguese (full), Spanish (full), Thai (full) |
|
| Training Details |
| Data Sources: | | Llama Guard 1 and 2 generations, multilingual conversation data, Brave Search API query results |
|
| Methodology: | | Fine-tuning on pre-trained Llama 3.1 model |
|
|
| Safety Evaluation |
| Methodologies: | | Aligned with MLCommons standardized hazards taxonomy, Internal tests against multilingual dataset, Comparison with prior versions and competitor models, Synthetic generation of safety data |
|
| Findings: | | Improved safety classification while lowering false positive rates |
|
| Risk Categories: | | Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Defamation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Suicide & Self-Harm, Sexual Content, Elections, Code Interpreter Abuse |
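In the MLCommons-aligned taxonomy, these categories are commonly referenced by short hazard codes. A minimal sketch of a code-to-name lookup, assuming the codes S1–S14 follow the category order listed above:

```python
# Hazard category codes mapped to names, assumed to follow the
# MLCommons-aligned taxonomy order listed above (S1-S14).
HAZARD_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def category_name(code: str) -> str:
    """Return the human-readable name for a hazard code such as 'S10'."""
    return HAZARD_CATEGORIES.get(code.strip().upper(), "Unknown")
```

A lookup table like this is useful when rendering classifier output for human reviewers, who see category names rather than raw codes.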
|
| Ethical Considerations: | | Focus on reducing false positives while ensuring comprehensive content moderation across categories |
|
|
| Responsible AI Considerations |
| Fairness: | | Supports multilingual content moderation intended for a wide range of languages with consistent policy alignment. |
|
| Transparency: | | Commercial conditions and usage thresholds are clearly outlined in the licensing terms. |
|
| Accountability: | | Users are accountable for adhering to the Acceptable Use Policy and Meta's broader policy guidelines. |
|
| Mitigation Strategies: | | Policies and thresholds are provided to manage usage levels, and community reporting mechanisms are in place for policy violations. |
|
|
| Input Output |
| Input Format: | | Text: user prompts and model responses (conversation turns) for classification |
|
| Accepted Modalities: | |
| Output Format: | | Safety classification ("safe"/"unsafe") with violated content category codes |
|
| Performance Tips: | | Use the model with the recommended transformers library version and follow community guidelines for optimal safety checks. |
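The output format above can be parsed mechanically. A minimal sketch, assuming the Llama Guard convention in which the model emits "safe", or "unsafe" followed by a comma-separated list of category codes on the next line:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyVerdict:
    is_safe: bool
    categories: list = field(default_factory=list)  # violated S-codes, if any

def parse_guard_output(text: str) -> SafetyVerdict:
    """Parse a classifier response such as 'safe' or 'unsafe\nS1,S10'."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    # Treat an empty or explicit "safe" response as safe.
    if not lines or lines[0].lower() == "safe":
        return SafetyVerdict(is_safe=True)
    codes = lines[1].split(",") if len(lines) > 1 else []
    return SafetyVerdict(is_safe=False, categories=[c.strip() for c in codes])
```

Parsing into a structured verdict lets downstream moderation systems route content by category rather than string-matching raw model output.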
|
|
| Release Notes |
| Version: | |
| Date: | |
| Notes: | | Improved safety evaluations, expanded multilingual capabilities, and refined moderation taxonomies. |
|
|
|