Meta’s Purple Llama

4 min readJan 18, 2024

Introduction-

Safety and encryption of the data have always been a primary duty of any developing machine. Due to the immense use of AI generative models/tools, there has been a lot of production of data which thus makes us think of security. Therefore, an approach to open trust and safety in the era of generative AI was an origination of Purple Llama. It is a broad initiative with open trust and safety tools and assessments designed to provide developers with equal opportunities to appropriately implement generative AI models and experiences in line with best practices described in our Responsible Use Guide.

Announcing Purple Llama: an approach to safety in generative AI

A completely new wave of invention has emerged thanks to generative AI. With it, we can effectively summarize vast document corpora, make realistic graphics, and conduct easy conversations with conversational AIs — all from simple instructions. Since Llama models have been downloaded over 100 million times, open models are largely responsible for this invention.

Taking the inspiration from “Purple teaming” it embraces both offensive (red team) and defensive (blue team) strategies. The goal is to empower developers to deploy generative AI models responsibly, aligning with best practices outlined in our Responsible Use Guide. Our investment in Purple Llama reflects a comprehensive approach, seamlessly integrated to guide developers through the AI innovation landscape, from ideation to deployment. It brings together tactics for testing, improving, and securing generative AI, to support your mitigation strategies.

The first step there is a release of CyberSec Eval, a set of cybersecurity safety evaluation benchmarks for LLMs; and Llama Guard, a safety classifier for input/output filtering that is optimized for ease of deployment.

First Step: CyberSec Eval(overview)

Large Language Models (LLMs) have shown tremendous progress on a variety of tasks related to human cognition. Still, they have stood out in their ability to write functional code in response to natural language requests. At the same time, much attention has been paid to safety issues that these models present but only a limited number of works exist on measuring and mitigating risk in the domain of cybersecurity. In this work, development and measurement approaches for issues of cyber security, open-source CYBERSECEVAL.

To ultimately mitigate this kind of risk, CYBERSECEVAL is designed to integrate into the development and testing processes of those designing code-producing large language models. By identifying insecure coding practices in LLM output across many languages, CYBERSECEVAL can help identify risks and offer clear directions for improvement. By iteratively refining models based on these evaluations, model developers can enhance the security of the code generated by their AI systems.

Llama Guard

Introduction to a safety risk taxonomy that is associated with interacting with AI agents. The taxonomy covers a set of potential legal and policy risks that can apply to several developer use cases. Llama Guard, an LLM-based input-output safeguard model, fine-tuned on data labeled according to our taxonomy. It includes the applicable taxonomy as the input and uses instruction tasks for classification. This allows users to customize the model input to adapt to other taxonomies appropriate for their use case with zero-shot or few-shot prompting. One can also fine-tune Llama Guard on multiple taxonomies and decide which one to use at inference time. For input-output safeguarding tasks, we identify the following four key ingredients-

A set of guidelines. Each task takes a set of guidelines as input, which consists of numbered categories of violation, as well as plain text descriptions as to what is safe and unsafe within that category. The model should only consider the given categories and their descriptions for making a safety assessment.

The type of classification. Each task indicates whether the model needs to classify the user messages (dubbed “prompts”) or the agent messages (dubbed “responses”). The distinction of prompt vs. response classification is an important one.

The conversation. Each task contains a conversation where users and agents take turns. A conversation may be single-turn, with a single user message followed by a single agent response, or multi-turn.

The output format. Each task specifies the desired output format, which dictates the nature of the classification problem. In Llama Guard, the output contains two elements. First, the model should output “safe” or “unsafe”.

Conclusion-In this we introduced CYBERSECEVAL, a comprehensive benchmark for evaluating the cybersecurity risks of large language models (LLMs). Our evaluation suite assesses the propensity of LLMs to generate insecure code and their compliance with requests to assist in cyberattacks also with Llama Guard, an LLM-based input-output safeguard model applicable for human-AI conversations. We also introduced a safety risk taxonomy and the application.

References

1) https://scontent.fhyd11-1.fna.fbcdn.net/v/t39.2365-6/408725049_3688557441468029_8103913771964668529_n.pdf?_nc_cat=100&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=8CMAJaa-jCMAX-pnEFB&_nc_ht=scontent.fhyd11-1.fna&oh=00_AfB5u591tIsglOfOqkqErfNgtF8vAj4NLXEXiCL8dgnt5A&oe=65870719

2) https://scontent.fhyd11-1.fna.fbcdn.net/v/t39.2365-6/10000000_1118625475774351_8659185461585946233_n.pdf?_nc_cat=102&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=CFBP36gjTJQAX-O5BdI&_nc_ht=scontent.fhyd11-1.fna&oh=00_AfDiipkNH4lMX_GoZLKWPWrsin4ronkDwCTi3UbuvjPsDg&oe=65870DC7

3) https://ai.meta.com/blog/purple-llama-open-trust-safety-generative-ai/

Written by Swati Kapoor

Meta’s Purple Llama

Written by Startup Grind Chapter Bhubaneswar