We audited RoBERTa, an open source large language model

andrea b
high stakes design

We found two security vulnerabilities and a backdoor.

Today IQT Labs published our second audit report, focused on a Large Language Model called RoBERTa.

Large Language Models (LLMs) have been in the press a lot recently, thanks to OpenAI’s release of ChatGPT. These models are tremendously powerful, but also concerning, in part because of their potential to generate offensive, stereotyped, and racist text. Since LLMs are trained on extremely large text datasets scraped from the internet, it is difficult to know how they will perform in a specific context, or to anticipate undesirable biases in model output.
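
As a concrete illustration (ours, not the audit's), here is a minimal sketch of how you might probe a model like RoBERTa for skewed word associations. It assumes the Hugging Face transformers package and the public roberta-base checkpoint, neither of which is specified in this post:

    from transformers import pipeline

    # Load the public roberta-base checkpoint as a fill-mask model.
    fill_mask = pipeline("fill-mask", model="roberta-base")

    # Compare top completions for two otherwise identical prompts;
    # systematically skewed completions are one signal of bias.
    for prompt in ["The man worked as a <mask>.", "The woman worked as a <mask>."]:
        predictions = fill_mask(prompt, top_k=5)
        print(prompt, "->", [p["token_str"].strip() for p in predictions])

Simple probes like this don't prove bias on their own, but they make abstract concerns about training data tangible.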

Our report describes several concerns we uncovered while auditing this model, including:

  • Two security vulnerabilities
  • A backdoor
  • The potential for undesirable biases in model output

We also describe how we did the audit, building on the methodology we developed while auditing FakeFinder, a deepfake detection tool. For example:

  • We mined the AI Incident Database for previous failures of LLMs and used these incidents to help us construct an ethical matrix for RoBERTa (see the sketch after this list).
  • We worked with BNH.AI to define general categories of bias and create a high-level bias testing plan.
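
To make "mining" concrete, here is a hypothetical sketch of the kind of filtering involved. The file name and column names below are illustrative assumptions; they are not the AI Incident Database's actual schema or export format:

    import csv

    # Keywords that might flag LLM-related incidents.
    llm_keywords = ("language model", "chatbot", "text generation")

    # Hypothetical export file and columns: assumptions for illustration only.
    with open("aiid_incidents.csv", newline="", encoding="utf-8") as f:
        llm_incidents = [
            row for row in csv.DictReader(f)
            if any(kw in row["description"].lower() for kw in llm_keywords)
        ]

    # Each match is a candidate failure mode to consider in the ethical matrix.
    for incident in llm_incidents:
        print(incident["incident_id"], "-", incident["title"])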

For more information, check out Interrogating RoBERTa: Inside the challenge of learning to audit AI models and tools, or read the full audit report.

***

Photo by Jr Korpa on Unsplash.


Andrea is a designer, technologist & recovering architect who is interested in how we interact with machines. For more info, check out: andreabrennen.com