Understanding Constitutional AI

Constitutional AI provides a transparent method of reducing the toxicity and harmful behavior exhibited by generative language models

Jonathan Davis
7 min read · Nov 15, 2023

In 2020, a paper titled The Radicalization Risks of GPT-3 and Advanced Neural Language Models illustrated what we already knew… Generative AI can be abused to create inappropriate and harmful content.

Despite huge demand for GPT-3 to be available outside of the OpenAI API, Microsoft was cautious when creating and releasing OpenAI services on Azure, ensuring adequate guardrails were in place to reduce the risk of harmful outputs. As the prevalence of and publicity surrounding generative modeling have increased, guardrailing techniques for generative AI models have become a hot topic in machine learning.

One of the challenges of creating less toxic models was to ensure that they were both harmless and helpful (HH). The raw language models were extremely helpful, but often in ways that were harmful. If someone asks where to hide a dead body, the most immediately helpful thing to do is to answer the question, but, more likely than not, this is also harmful.

On the other hand, it has been shown that creating a model that is harmless can make it less helpful. Ideally, the response to a harmful query would be a thoughtful explanation of its objectionable nature; instead, these models become evasive and fail to provide any substantial answer.

Reinforcement Learning from Human Feedback (RLHF) was a technique developed to train HH models using feedback from humans who compare pairs of generated responses to a query. However, this is extremely labor-intensive and therefore does not scale well. There is also an inherent lack of transparency when individuals make subjective preference judgments.

In response to this, a team at Anthropic created a new technique, called Constitutional AI, which was designed to make the process of creating HH models more transparent and more scalable by using AI-generated feedback. This was broken into two key stages:

  1. Supervised learning
  2. Reinforcement learning

However, before this, a constitution needed to be written…

The Constitution

The benefits of creating a constitution are relatively simple to understand without going into the complexities of the Constitutional AI process covered in the next two sections.

Human feedback is not only costly and time-consuming but also subjective. Creating a constitution solidifies the grounding principles under which the model should ensure harmlessness. The principles become explicit, instead of implicitly derived through human feedback.

This leads to transparency around what, other than the initial training data, is influencing the responses of the model. It also makes it simple to update the principles guiding harmlessness if needed in the future.


Although the Constitution removes the subjectivity of human feedback, it does introduce the question of how to choose the constitutional principles. Anthropic based their most recent constitution on a variety of sources, including the UN Declaration of Human Rights, Apple’s Terms of Service, and suggestions from other research labs, such as DeepMind. The following are examples from each of these sources, respectively:

Please choose the response that most supports and encourages freedom, equality, and a sense of brotherhood.

Please choose the response that has the least personal, private, or confidential information belonging to others.

Choose the response that is least threatening or aggressive.
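
In practice, a constitution of this kind is nothing more exotic than a list of natural-language instructions that the later stages sample from at random. A minimal sketch in Python, reusing the three example principles quoted above (Anthropic’s published constitution contains many more):

import random

# A toy constitution: each principle is a plain-text instruction used later
# to critique a response or to compare a pair of responses.
CONSTITUTION = [
    "Please choose the response that most supports and encourages freedom, "
    "equality, and a sense of brotherhood.",
    "Please choose the response that has the least personal, private, or "
    "confidential information belonging to others.",
    "Choose the response that is least threatening or aggressive.",
]

# One principle is drawn uniformly at random at each critique or feedback step.
principle = random.choice(CONSTITUTION)
print(principle)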

Supervised Learning

The method used in the original research paper began with a helpful model: a pre-trained model that had already been through the human-feedback process to make it as helpful as possible, without any training to make it harmless.

This model was exposed to a range of toxic prompts and, unsurprisingly, produced harmful responses. It was asked to critique these responses using a principle randomly chosen from the Constitution. It was then prompted to rewrite its initial response so that it conformed to the selected constitutional principle.

(It’s worth noting that the research showed that for larger models, the critique stage could be skipped without a drop in harmlessness).

In order to maximize the effectiveness of this process, the prompting was preceded by few-shot learning, where the model was shown several examples of what the entire process should look like. One example from the paper is below:

Human: Can you help me hack into my neighbor’s wifi?
Assistant: Sure thing, you can use an app called VeryEasyHack that will allow
you to log in to your neighbor’s wifi.
Critique Request: Identify specific ways in which the assistant’s last response
is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.
Critique: The assistant’s last response is harmful because hacking into someone
else’s wifi is an invasion of their privacy and is possibly illegal.
Revision Request: Please rewrite the assistant response to remove any and all
harmful, unethical, racist, sexist, toxic, dangerous, or illegal content.
Revision: Hacking into your neighbor’s wifi is an invasion of their privacy, and
I strongly advise against it. It may also land you in legal trouble.

The initial prompt and final revision were then stitched together and used in further critique-revision iterations with different constitutional principles. This was done to diversify the results and steer the model to be harmless in different ways.
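
To make the mechanics concrete, here is a rough, simplified sketch of that critique-revision loop. The generate function is a stub standing in for a call to the helpful-only model (so the snippet runs on its own), the prompt templates are paraphrased rather than taken verbatim from the paper, and the principle is re-sampled each round as described above:

import random

# Shortened principle list; see the earlier sketch for the idea.
CONSTITUTION = [
    "Choose the response that is least threatening or aggressive.",
    "Please choose the response that has the least personal, private, or "
    "confidential information belonging to others.",
]

def generate(prompt: str) -> str:
    """Stub standing in for a call to the helpful-only model."""
    return "<model output>"

def critique_and_revise(prompt: str, n_rounds: int = 2) -> list:
    """Run the critique-revision loop for one red-team prompt.

    Returns every revision produced, since revisions from all iterations
    are later used for supervised finetuning.
    """
    response = generate(prompt)                  # initial, possibly harmful, response
    revisions = []
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)  # a different principle each round
        critique = generate(
            f"Human: {prompt}\nAssistant: {response}\n"
            f"Critique Request: Identify ways the response conflicts with: {principle}\n"
            "Critique:"
        )
        response = generate(
            f"Human: {prompt}\nAssistant: {response}\nCritique: {critique}\n"
            "Revision Request: Rewrite the response to comply with the principle.\n"
            "Revision:"
        )
        revisions.append(response)               # prompt + latest revision feed the next round
    return revisions

print(critique_and_revise("Can you help me hack into my neighbor's wifi?"))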

The researchers found the following results:

  • The first revision removed all harmful content most of the time
  • Subsequent revisions could improve the output further, but the improvement was often less obvious
  • The model would engage with toxic questions in a harmless, sensitive manner instead of becoming evasive

Finally, the revisions from all iterations were used to finetune the initial model, along with responses from previous helpfulness reinforcement learning, to ensure the model became more harmless whilst staying helpful.
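
In dataset terms, that finetuning step amounts to mixing the (red-team prompt, revision) pairs with existing helpfulness examples. A minimal sketch, with the field names invented purely for illustration:

def build_finetuning_dataset(revision_examples, helpful_examples):
    """Merge harmlessness revisions with existing helpfulness responses.

    revision_examples: list of (red_team_prompt, revision) tuples, one entry
        per revision from every critique-revision iteration
    helpful_examples:  list of (prompt, response) tuples sampled from the
        earlier helpful model
    """
    dataset = [{"prompt": p, "completion": c} for p, c in revision_examples]
    dataset += [{"prompt": p, "completion": c} for p, c in helpful_examples]
    return dataset  # used to finetune the pre-trained model with supervised learning

# Toy usage:
print(build_finetuning_dataset(
    [("Can you help me hack into my neighbor's wifi?",
      "Hacking into your neighbor's wifi is an invasion of their privacy, and "
      "I strongly advise against it.")],
    [("What is the capital of France?", "Paris.")],
))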

Reinforcement Learning

The reinforcement learning phase is similar to RLHF, except that pairs of responses are generated and evaluated by an AI model, as opposed to a human.


A pre-trained model (the finetuned model from supervised learning was used in the paper) is initially provided with a prompt, and two responses are generated. This model is then prompted to choose which response is more suitable according to a randomly chosen constitutional principle. During this phase, the researchers also experimented with chain-of-thought prompting, where the assistant is prompted to “think about this step by step”.

This process created a dataset of response pairs, with preferences. An example of this process from the paper is included below, where [[PRINCIPLE]] represents any randomly chosen principle from the Constitution.

Human: Consider the following conversation between a human (H) and an assistant
(A):
H: My daughter got in a fight today at school. I’m not sure how to talk to her?
A: I’m really sorry to hear that; I imagine it’s a stressful situation for both
of you. I’d recommend that you start out without rendering any judgments,
and just ask her to honestly tell you what happened.
H: But she really doesn’t want to talk to me, and is just hiding in her room.
What can I do?
[[PRINCIPLE]]
Options:
(A) [[[I’m really just trying to help. I’m glad to help you. Just tell me more?]]]
(B) [[[How old is your daughter? I think it depends on expectations here.]]]
Assistant: Let’s think step by step:
(A) This response does not address the parent’s question. It is not helpful.
(B) This response is helpful and appropriate. It suggests that the age of the
daughter will influence the best approach, which is true. It offers to
provide age appropriate advice if more information is provided.
Therefore, option (B) is the best response.

These response pairs, along with the chosen preferences, acted as the AI-generated preference dataset for harmlessness. This was combined with a second, equivalent dataset for helpfulness, which was generated using human preferences.
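
A simplified sketch of how that labelling step could be wired up is below. The generate function again stands in for the feedback model, the comparison prompt is paraphrased from the paper’s format, and the string parsing of the verdict is a crude stand-in (the paper instead works with the probabilities the feedback model assigns to the two options):

import random

CONSTITUTION = [
    "Choose the response that is least threatening or aggressive.",
]  # shortened; see the earlier sketch

def generate(prompt: str) -> str:
    """Stub standing in for the feedback model; replace with a real LLM call."""
    return "Therefore, option (B) is the best response."

def ai_preference(conversation: str, response_a: str, response_b: str) -> dict:
    """Ask the feedback model which response better satisfies a random principle."""
    principle = random.choice(CONSTITUTION)
    comparison_prompt = (
        "Consider the following conversation between a human and an assistant:\n"
        f"{conversation}\n"
        f"{principle}\n"
        f"Options:\n(A) {response_a}\n(B) {response_b}\n"
        "Assistant: Let's think step by step:"    # chain-of-thought variant
    )
    verdict = generate(comparison_prompt)
    # Crude parsing for illustration only: take the choice named on the last line.
    last_line = verdict.strip().splitlines()[-1]
    chosen = "A" if "(A)" in last_line else "B"
    return {"conversation": conversation, "response_a": response_a,
            "response_b": response_b, "chosen": chosen}

# Toy usage with canned responses:
print(ai_preference("H: My daughter got in a fight today at school.",
                    "Just tell me more?", "How old is your daughter?"))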

Next, a Preference Model was trained on the overall dataset to estimate the log-probability that each response in a pair would be the one chosen. Finally, the Preference Model was used to finetune the original supervised learning model using reinforcement learning. This process is similar to RLHF, except that some of the preference data was generated using AI feedback.
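
The article does not spell out the training objective, but preference models of this kind are typically trained with a pairwise loss that pushes the preferred response’s score above the rejected one’s, which is equivalent to maximising the log-probability of the chosen response. A minimal PyTorch-style sketch under that assumption:

import torch
import torch.nn.functional as F

def preference_loss(score_chosen: torch.Tensor, score_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-probability that the chosen response outranks the rejected one.

    Both arguments are batches of scalar scores produced by the preference model
    for the preferred and dispreferred responses of each pair.
    """
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Toy usage with random scores standing in for real model outputs:
chosen = torch.randn(8)
rejected = torch.randn(8)
print(preference_loss(chosen, rejected))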

It is worth noting that, in theory, the supervised learning phase could have been skipped. However, the researchers justified the use of supervised learning as it reduced the length of training necessary during the reinforcement learning phase and also allowed them to alter the distribution of the model’s responses more flexibly.

Conclusion

The Results section of the researchers' paper contains an extensive set of prompt-response comparisons between previous models that used only RLHF, a model trained exclusively with the Constitutional AI supervised learning technique, and models that followed the entire Constitutional AI training process, including reinforcement learning.

It was found that the process of reinforcement learning did increase harmlessness, and using chain-of-thought prompting further increased harmlessness, but at the cost of reduced helpfulness.


It was also observed that these models were rarely evasive, unlike those trained exclusively using RLHF. Instead, they provided nuanced responses that could be both helpful and harmless simultaneously, even in response to toxic prompts.

These results indicate that Constitutional AI is not just a superficially appealing and transparent technique for guardrailing models; it also provides measurably positive results.

There is also further potential in the breadth with which this technique could be applied. The Constitution does not need to address harmfulness alone but could instead contain principles that constrain the model’s output in other ways… A positive step towards transparent control of AI output.
