**Model Type:** Content safety classification, LLM

**Use Cases**

- **Areas:** Content moderation, security systems, search and code-interpretation tools
- **Applications:** Enterprise content management systems, search optimization, and secure AI deployments
- **Primary Use Cases:** Moderating harmful content in AI-generated inputs and outputs; ensuring safety in search-tool interactions and code-interpretation applications
- **Limitations:** Performance is limited by the context and scope of the training data; potentially susceptible to prompt-injection attacks
- **Considerations:** Recommended for use alongside broader moderation systems when up-to-date factual evaluation is required

**Additional Notes:** Dataset expansions include multilingual conversation data and challenging borderline cases, further reducing false positives.

**Supported Languages:** English (full), French (full), German (full), Hindi (full), Italian (full), Portuguese (full), Spanish (full), Thai (full)

**Training Details**

- **Data Sources:** Llama Guard 1 and Llama Guard 2 generations, multilingual conversation data, Brave Search API query results
- **Methodology:** Fine-tuning of the pre-trained Llama 3.1 model

**Safety Evaluation**

- **Methodologies:** Alignment with the MLCommons standardized hazards taxonomy; internal tests against a multilingual dataset; comparison with prior versions and competitor models; synthetic generation of safety data
- **Findings:** Improved safety classification with lower false-positive rates
- **Risk Categories:** Violent Crimes, Non-Violent Crimes, Sex-Related Crimes, Child Sexual Exploitation, Defamation, Specialized Advice, Privacy, Intellectual Property, Indiscriminate Weapons, Hate, Suicide & Self-Harm, Sexual Content, Elections, Code Interpreter Abuse
- **Ethical Considerations:** Focus on reducing false positives while ensuring comprehensive content moderation across categories

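The fourteen risk categories above follow the MLCommons-aligned hazards taxonomy, which classifiers of this kind conventionally report as short codes. A minimal sketch of resolving a reported code to its human-readable category, assuming an S1–S14 ordering that matches the list above (verify against the model's own documentation):

```python
# Hazard taxonomy from the Risk Categories list above.
# The S1..S14 codes are an assumption based on the MLCommons-aligned
# taxonomy ordering; confirm against the model's documentation.
RISK_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def category_name(code: str) -> str:
    """Resolve a reported hazard code (e.g. 'S10') to its category name."""
    return RISK_CATEGORIES.get(code.strip().upper(), "Unknown")
```

A lookup like this keeps downstream moderation logs readable without hard-coding category strings at every call site.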
**Responsible AI Considerations**

- **Fairness:** Supports multilingual content moderation across a wide range of languages with consistent policy alignment.
- **Transparency:** Commercial conditions and usage thresholds are clearly outlined in the licensing terms.
- **Accountability:** Users are accountable for adhering to the Acceptable Use Policy and Meta's broader policy guidelines.
- **Mitigation Strategies:** Policies and thresholds are provided to manage usage levels, and community reporting mechanisms are in place for policy violations.

**Input / Output**

- **Input Format:** Prompt and response classification
- **Accepted Modalities:**
- **Output Format:** Safety classification and content-category indication
- **Performance Tips:** Use the model with the specified transformers versions and follow community guidelines for optimal safety checks.

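The output format above (a safety classification plus a content-category indication) usually needs light post-processing before it can drive moderation decisions. A minimal parsing sketch, assuming the classifier emits `safe`, or `unsafe` followed by a comma-separated list of category codes on the next line; the exact textual format is an assumption and should be checked against the model's documentation:

```python
from dataclasses import dataclass, field

@dataclass
class SafetyVerdict:
    is_safe: bool
    categories: list = field(default_factory=list)  # e.g. ["S1", "S10"]

def parse_verdict(raw: str) -> SafetyVerdict:
    """Parse raw classifier text into a structured verdict.

    Assumed format (verify against the model docs):
        "safe"
        "unsafe\nS1,S10"
    """
    lines = [ln.strip() for ln in raw.strip().splitlines() if ln.strip()]
    if not lines:
        # Fail closed: empty output should never pass as safe.
        raise ValueError("empty classifier output")
    if lines[0].lower() == "safe":
        return SafetyVerdict(is_safe=True)
    codes = lines[1].split(",") if len(lines) > 1 else []
    return SafetyVerdict(is_safe=False, categories=[c.strip() for c in codes])
```

Raising on empty output rather than defaulting to "safe" is a deliberate fail-closed choice for moderation pipelines.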
**Release Notes**

- **Version:**
- **Date:**
- **Notes:** Improved safety evaluations, expanded multilingual capabilities, and refined moderation taxonomies.