Bias, Toxicity, and Jailbreaking Large Language Models (LLMs)
A review of recent research on concerning characteristics of LLMs
CONTENT WARNING: This post contains examples of biased, toxic text generated by LLMs.
This post provides a deep dive into recent research on bias, toxicity, and jailbreaking of large language models (LLMs), especially ChatGPT and GPT-4. I’ll discuss the ethical guidelines companies currently use in LLM development and the approaches they take to safeguard against the generation of undesirable content. Then I’ll review recent research papers studying toxic content generation, jailbreaking, and bias from multiple angles: gender, race, medicine, politics, the workplace, and fiction.
Bias refers to prejudice in favor of or against a specific group, person, or thing, while toxicity refers to disrespectful, vulgar, rude, or harm-promoting content. LLMs are biased and have the capacity to generate toxic content because they are trained on vast quantities of Internet data, which unfortunately represents both the good and bad sides of humanity, including all of our biases and toxicity. Thankfully, developers of LLMs like OpenAI and Google have taken steps to reduce the chances of LLMs producing overtly biased or toxic content. However, as we will see, that doesn’t mean the models are…