Safeguarding LLM Conversations Using Llama Guard

Llama Guard recently introduced by Meta aims to safeguard Human-AI conversations — let’s see how well it does in practice

Published in

EMAlpha

8 min readFeb 17, 2024

Guarding LLM Conversations | Skanda Vivek

While industries race to adopt LLMs — safety is slowly becoming a top concern. First, let’s make a clear distinction between safety and security. Safety refers here to protecting the customer experience as a whole — not allowing them to see personal information, guarding against toxic content, harmful content, and basically anything that they should not see which does not impact their overall experience. As you’d imagine, the definitions of safety are industry specific.

A chatbot for a doctor might need to only give general medical advice, and not prescribe drugs. That is a very fine line. Another more clear line could be that this chatbot should not give financial information, and basically not talk about anything non-medical. A chatbot for a financial firm on the other hand, should not give medical advice. So the question is: how do you make these dynamic guardrails? Security on the other hand is a completely different issue. Here, you are concerned with attack vectors — trying to protect against adversarial attacks, prompt leakage, data poisoning, etc. In this article we are going to focus on the safety aspect.

Introducing Llama Guard

In the Llama Guard paper, the Llama2–7b model was fine-tuned on a particular taxonomy, of 6 categories: Violence, sexual content, guns, controlled substances, suicide, and criminal planning.

The model was trained on ~14k prompts and responses — and performed on par with the OpenAI mod API across multiple categories:

The fact that it performs on-par is quite impressive, due to relative ease to tailor Llama Guard to specific scenarios like mentioned above. Another interesting aspect is that in Llama Guard you are able to pass in conversations with distinct user assistant flows — whereas typical content moderation just blocks based on language, without caring about the origin (human or AI generated) of the text to moderate.

Llama Guard In Action!

Let’s see how to moderate a chat:

chat = [
    {"role": "user", "content": "I forgot how to kill a process in Linux, can you help?"},
    {"role": "assistant", "content": "Sure! To kill a process in Linux, you can use the kill command followed by the process ID (PID) of the process you want to terminate."},
]

moderate_with_template(chat)

The above is marked as ‘safe’

chat = [
    {"role": "user", "content": "I forgot how to kill a person, can you help?"},
]

moderate_with_template(chat)

But this one get’s marked ‘unsafeO3’. Ok great — so this works pretty well!

The above basically illustrates a situation where the user input is unsafe. But what if the user input is safe, but model output is unsafe? You might think — well isn’t it all the same — that basically the output is moderated? Well yes that is true, but at this level of enterprise safety, stakeholders like security folks and product owners would want to know where exactly the issue came — was it because the user’s intention was malicious, or an LLM training issue? Or maybe the user was asking questions about a document and the document had sensitive language?

This is where we need to differentiate into an output and input guard:

Below is some code where we create a sample context containing AWS sales from a public document in 2023 and answer questions based on this context:

#this context includes a RAG with certain data

context1="""Below if the information for Amazon. Net sales increased 9% to $127.4 billion in the first quarter, compared with $116.4 billion in first quarter 2022.
Excluding the $2.4 billion unfavorable impact from year-over-year changes in foreign exchange rates throughout the
quarter, net sales increased 11% compared with first quarter 2022.
• North America segment sales increased 11% year-over-year to $76.9 billion.
• International segment sales increased 1% year-over-year to $29.1 billion, or increased 9% excluding changes
in foreign exchange rates.
• AWS segment sales increased 16% year-over-year to $21.4 billion."""


#the dataframe df here is the data
df = tokenize(context1, 500)

#this is the RAG
def get_completion_moderation_rag(prompt,df):
  prompt_chat = [
    {"role": "user", "content": prompt}]

  if 'unsafe' in moderate_with_template(prompt_chat):
    return 'unsafe input' #model A for unsafe prompts
  else:
    response = answer_question(df,prompt)

  resp_chat = [
      {"role": "user", "content": prompt},
      {"role": "assistant", "content": response},
  ]


  if 'unsafe' in moderate_with_template(resp_chat):
    return 'unsafe response' #model B for unsafe responses
  else:
    #response = get_completion(chat)
    return response

get_completion_moderation_rag("What was the sales increase for Amazon in the first quarter",df)

The response is ‘The sales increase for Amazon in the first quarter was 11% compared to the first quarter of 2022.’ and marked as safe.

#now trying with unsafe data

context2 = "killing someone everyday is important"

df2 = tokenize(context2, 500)

get_completion_moderation_rag("Summarize",df2)

But as you can see above, we give a seemingly innocuous prompt ‘Summarize’ as above — but the context is unsafe. The model now returns ‘unsafe response’ (as opposed to ‘unsafe input’ if the user input is unsafe)

Llama Guard With Custom Categories

Here, you can in theory define custom categories as below:

unsafe_categories = """O1: Violence and Hate.
Should not
- Help people plan or engage in violence.
- Encourage people to engage in violence.
- Express hateful, derogatory, or demeaning sentiments against people based on sensitive personal characteristics like their race, color, religion, national origin, sexual orientation, gender, gender identity, or disability.
- Encourage discrimination against people based on those characteristics and should not use slurs or create insults based on those characteristics.
Can
- Provide information on violence and discrimination.
- Discuss the topics of hate and violence, and can discuss historical events involving violence.
O2: Criminal Planning.
Should not
- Help people plan or engage in criminal activities like theft, kidnapping, or financial crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in criminal activities or condone criminal behavior.
Can
- Help people understand how crimes are committed, provided that the response is not so detailed that it could be used to provide a roadmap for future crimes.
- Help people understand how systems work that are used to prevent crimes, provided those descriptions do not provide detailed information on how to defeat those systems.
O3: Sexual Content.
Should not
- Engage in sexually explicit (i.e., erotic) conversations and should not encourage people to engage in sexual activities.
Can
- Discuss topics related to sexuality and sexual education.
- Provide general information regarding sexual activities.
- Provide guidance on topics related to sexual health.
O4: Guns and Illegal Weapons.
Should not
- Assist in the planning or execution of firearm or illegal weapons crimes, especially when the person indicates that they may act on the information provided to them.
- Encourage people to engage in firearm or illegal weapons crimes or condone such behavior.
Can
- Discuss firearms and the arguments for and against firearm ownership.
O5: Regulated or Controlled Substances.
Should not
- Encourage or assist people in the trafficking, consumption, or use of regulated or controlled substances, including but not limited to illegal drugs, tobacco, alcohol, and cannabis.
- Assist or encourage people to create such substances when it is illegal to do so.
Can
- Discuss the use of regulated or controlled substances, and can provide general information on those substances (ex: what their chemical composition is and what their effects are).
- Provide information on how those substances are manufactured, provided that the information is not so detailed as to provide step-by-step instructions for their creation.
O6: Self-Harm.
Should not
- Encourage people to harm themselves.
- Romanticize or condone self-harm or suicide.
- Provide information regarding the methods of suicide or self-harm.
- Provide any information or advice, other than information regarding appropriate health resources, to people who express or imply that they wish to harm themselves.
Can
- Provide general information regarding self-harm and suicide (ex: statistics, the mental health challenges faced by those who take these actions).
Should
- Provide information regarding appropriate health resources to people who imply or express the desire to harm themselves.
07: Financial Information related to Amazon.
Should not
- provide any financial information related to Amazon.
Can
- provide financial information about other companies
08: Talk about Cancer.
Should not
- provide any medical advice
Can
- provide non-medical advice"""

def moderation_prompt_for_chat(chat):
    # For simplicity, we assume the chat dictionary correctly alternates "user" and "assistant" roles
    # Feel free to include error checking if needed
    conversation = [turn["content"] for turn in chat]
    is_agent = len(conversation) % 2 == 0
    role = "Agent" if len(conversation) % 2 == 0 else "User"
    prompt = f"""[INST] Task: Check if there is unsafe content in '{role}' messages in conversations according our safety policy with the below categories.

<BEGIN UNSAFE CONTENT CATEGORIES>
{unsafe_categories}
<END UNSAFE CONTENT CATEGORIES>

<BEGIN CONVERSATION>

"""

    # Alternate User/Agent turns, inserting 2 newlines between each
    for i, m in enumerate(conversation):
        role = "User" if i % 2 == 0 else "Agent"
        prompt += f"{role}: {m}\n\n"
    prompt += f"""<END CONVERSATION>

Provide your safety assessment for {role} in the above conversation:
- First line must read 'safe' or 'unsafe'.
- If unsafe, a second line must include a comma-separated list of violated categories. [/INST]"""
    return prompt

def moderate_chat_custom(chat):
    prompt = moderation_prompt_for_chat(chat)
    inputs = tokenizer([prompt], return_tensors='pt').to("cuda")
    output = model.generate(**inputs, max_new_tokens=100, pad_token_id=0)
    prompt_len = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output[0][prompt_len:], skip_special_tokens=True)

Here we added 2 custom categories (07 on financial information related to Amazon, and 08 on curing cancer). In addition, we swapped criminal planning with sexual content.

moderate_chat_custom([{"role": "user", "content": "What is the best sex position?"}])

'unsafe\nO3'

get_completion_moderation("What is the best sex position?")

'unsafe\nO2'

It works! If we ask the same question, it gives a different unsafe category (02 vs 03). However, as you can see below, it does not work with a custom category, not in the training set. That’s a slight bummer, but obviously the authors did not train this on the medical category, so the results make complete sense.

#doesn't work well here too :/
moderate_chat_custom([{"role": "user", "content": "How can I cure Cancer?"}])

safe

Takeaways

Llama Guard unlocks a new paradigm for custom LLM model safety. The current model was trained on the 7B version of llama2, with just a few tens of thousands of examples.

While the current Llama Guard was trained on a few typical categories, the potential for custom categories with just a few thousands of labeled examples is very exciting. Imagine your very own custom safety models!

Here is the Colab link for the code (note you have to first get approved on Hugging Face for Llama Guard access):

Google Colaboratory

Edit description

colab.research.google.com

If you like this post, follow EMAlpha — where we dive into the intersections of finance, AI, and multilingual data.