The dangerous imbalance of Generative AI development: Scale over safety

David Bartram-Shaw
8 min read · Mar 27, 2023


“Most researchers and companies releasing these hugely powerful Generative models do not even know what’s in the data they are trained on.”

Nobody knows, or cares, what data was used to train ChatGPT

Originally I wrote this article with the above title, but after the recent release of GPT-4, where the researchers from OpenAI flatly refused to share any details of the model or indeed its training data, I had to turn this into a wider piece. I will cover what we do know of the underlying training data for ChatGPT, as it demonstrates the problem quite well. But for now, I’ll focus on the underlying root of the problem.

left: riding a well trained horse, right: “Young bull riders try to hold on”

This is a call-out that we are spending a disproportionate amount of time making these algorithms more and more powerful (through size, depth and compute) and nowhere near enough time looking at the actual data that goes in and the dangers that come with it.

The de facto approach is to push as much data as humanly possible into training (because, typically, the more data these types of algorithms have, the better they can “learn”), then deal with the issues after the fact with downstream “fixes”. Data issues such as bias, harmful/hurtful content, misogyny, racism, expletives, falsehoods and so on.

I’ll get into more detail on these “downstream fixes”, or what I call band-aids/plasters, later in the article. For now, here’s a visual of some well-known image bias found in LAION-400M, an image-text dataset built from Common Crawl, the same web scrape whose filtered text forms a core component of the GPT training corpus.

Image from the paper: “Multimodal datasets: misogyny, pornography, and malignant stereotypes”

Open or Closed AI

There’s much debate over whether the details (or the models themselves) should be released (the “open” side of the argument) or whether, because they have become so powerful, it is too dangerous* and we should keep the details within the respective research groups (the “closed” side of the argument). *Dangerous because of the harmful applications and the actors that may use these models for harm: social engineering, disinformation campaigns etc.

Putting aside the sharing of algorithmic training methodology, I believe that not disclosing details of the data used to train the model is utterly wrong and prevents the wider research community from:

  • (A) Understanding what these powerful models have learnt from.
  • (B) Uncovering issues and fixes with that underlying data (as we’ve seen done to great effect in the past).

On point B, we’ve typically seen that those who care about the ethical side of data and AI come from outside these large tech companies (or they have been forced out), like the amazing researchers over at the Distributed AI Research Institute (DAIR) or the Algorithmic Justice League.

Those unasked questions:

Thousands of posts on the topic of Generative AI, but not a single one asking these important questions:

  • “What data was it actually trained on?” (In detail, I mean, not just “large chunks of the internet”.)
  • “Are there any things within that data we should be worried about?”
  • “What are the implications of after-the-fact fixes, the band-aids that cover up the flaws originating in the original training data?”

Don’t get me wrong, both ChatGPT and Stable Diffusion are awesome (other language and image models are available), but amid the huge strides in capability (i.e. a chatbot with a sense of humour that can write essays, or an image app that can generate professional-looking brand content) we have forgotten to question the fundamentals.

I may come across as a disbeliever in AI. That’s absolutely not the case: I love our field, what the likes of OpenAI have done for it, and the pace of innovation. But I believe we have created a massive imbalance between the amount of effort spent developing the algorithms and the amount of effort spent considering the data we use to train them and, consequently, what they learn.

The same question you should ask of any algorithm — what data was it actually trained on?!

After all, AI can only learn from what it sees, from what we as humans feed into it.

Images created on Midjourney by Author

High Level: What is Generative AI trained on?

I think most people have a vague idea that ChatGPT and DALL-E 2/Stable Diffusion were trained on large subsets of the internet, and people’s opinion of the internet is mainly formed from what they consume on a day-to-day basis. We don’t consider (or admit) the biases, the hate or the explicit nature of large parts of the internet.

From a development perspective there is a wide range of datasets used, all of which are “huge chunks of the internet”.

What are the cleansing steps performed on training data?

There are a small number of “cleansing” steps used to try to resolve some of the aforementioned issues before training these huge models; however, the maturity of, and focus on, these cleansing tasks is disappointing.

For example, from the GPT-3 paper:

  • (1) we downloaded and filtered a version of CommonCrawl based on similarity to a range of high-quality reference corpora,
  • (2) we performed fuzzy deduplication at the document level, within and across datasets, to prevent redundancy and preserve the integrity of our held-out validation set as an accurate measure of overfitting, and
  • (3) we also added known high-quality reference corpora to the training mix to augment CommonCrawl and increase its diversity
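To make step (2) concrete, here is a minimal, standard-library-only sketch of document-level fuzzy deduplication using character n-gram (shingle) Jaccard similarity. It is an illustration only: the GPT-3 paper describes MinHash-based deduplication run at vast scale, and the threshold, function names and toy corpus below are my own assumptions.

```python
# Toy fuzzy deduplication: drop documents that are near-duplicates of ones
# already kept. Real pipelines use MinHash/LSH to avoid the O(n^2) comparisons.

def shingles(text: str, n: int = 5) -> set:
    """Break a document into overlapping character n-grams (shingles)."""
    text = " ".join(text.lower().split())  # normalise case and whitespace
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |intersection| / |union| of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def fuzzy_dedup(docs, threshold: float = 0.8):
    """Keep a document only if it is not too similar to any document already kept."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc)
        if all(jaccard(s, k) < threshold for k in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept

corpus = [
    "The quick brown fox jumps over the lazy dog.",
    "The quick brown fox jumps over the lazy dog!",   # near-duplicate, removed
    "A completely different document about training data.",
]
print(fuzzy_dedup(corpus))
```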

Now, to quote the recent paper “Handling and Presenting Harmful Text in NLP Research” (February 2023):

“Filtering, while well-motivated, must be cautiously approached because it can censor and erase marginalised experiences. For example, crudely removing language that could be considered offensive (e.g., any use of potentially reclaimed terms, such as “sex” or “gay”) risks excluding the language of entire communities who may use such terms to communicate about sexual health (Dodge et al., 2021).”

Although filtering is well intended, its implications, especially if it is not done in a thorough and considered manner, are potentially as dangerous as not filtering at all.

Filtering has become an established practice, stemming back to the ongoing creation of Common Crawl. For example, C4, a widely adopted training corpus across LLMs, was created by taking the April 2019 snapshot of Common Crawl and filtering it in the following way:

the exclusion of documents that contain any word from a blocklist of “bad” words with the intent to remove “offensive language” (Raffel et al., 2020), i.e., hateful, toxic, obscene, sexual, or lewd content. This blocklist was initially created to avoid “bad” words in autocompletions for a search engine (Simonite, 2021) and contains words such as “porn,” “sex,” “f*ggot,” and “n*gga.”
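Here is a deliberately simplified sketch of what that kind of blocklist filtering looks like. The blocklist and documents below are invented for illustration (the real C4 list contains hundreds of terms), but the failure mode is exactly the one Dodge et al. describe: a sexual-health document is dropped right alongside the spam the filter was meant to catch.

```python
# Toy C4-style blocklist filtering: a document is removed if it contains
# ANY word on the blocklist, however benign the document's purpose.

BLOCKLIST = {"sex", "porn"}  # tiny illustrative stand-in for the real list

def passes_filter(document: str) -> bool:
    """Return True if the document survives the blocklist filter."""
    words = {w.strip(".,!?").lower() for w in document.split()}
    return not (words & BLOCKLIST)

docs = [
    "Clinic guidance on safe sex and sexual health screening.",  # benign, removed
    "A recipe blog post about sourdough starters.",              # benign, kept
    "Spam page advertising porn sites.",                         # target, removed
]
for d in docs:
    print(passes_filter(d), "->", d)
```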

The analysis in the paper “Documenting Large Webtext Corpora” shows the impact on marginalised groups (for example, the disproportionate removal of text associated with minority identities) of something that was well intended but not fully thought through.

Downstream fixes: Post Base Model Band-Aids

Rather than spending more time analysing training data and applying sophisticated cleaning methods, such as mitigating harmful bias, the typical approach to “safety” has been applied after the main base model has been trained:

From the paper “Regulating ChatGPT and other Large Generative AI Models”

We should give a lot of credit to OpenAI, who have spent an awful lot of time and energy building a safety framework and a prompt/response moderation process, but this is still ultimately “fixing predictions after the model has made them”. These sophisticated downstream techniques, such as content classifiers and RLHF (Reinforcement Learning from Human Feedback), are valuable, but imagine if they were combined with an additional focus on removing issues from the underlying model itself.
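To make the band-aid pattern concrete, here is a heavily simplified sketch of post-hoc moderation. Nothing below is OpenAI’s actual implementation: the stub generator, keyword check and refusal message are placeholders for the trained classifiers and RLHF-tuned models real systems use. The point is only that the check runs after generation, leaving whatever the base model learnt from its training data untouched.

```python
# The "downstream fix" pattern: generate first, then decide whether to show it.

FLAGGED_TERMS = {"flagged_example_term"}  # stand-in for a trained content classifier

def base_model_generate(prompt: str) -> str:
    """Stand-in for the raw base model, which may reproduce anything it learnt."""
    return f"(base model completion for: {prompt})"

def looks_unsafe(text: str) -> bool:
    """Toy post-hoc moderation check on a piece of text."""
    return any(term in text.lower() for term in FLAGGED_TERMS)

def guarded_generate(prompt: str) -> str:
    """The band-aid: filter the prediction after the model has made it."""
    completion = base_model_generate(prompt)
    if looks_unsafe(prompt) or looks_unsafe(completion):
        return "I can't help with that."  # downstream fix; the training data is untouched
    return completion

print(guarded_generate("write a friendly poem about data quality"))
```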

Who are these “Human in the Loop” labellers?

There are very few details on the specifics of ChatGPT’s RLHF, but for the original InstructGPT, the precursor to ChatGPT, the paper stated that this “human in the loop” feedback was collected from just 40 people (contractors) whom OpenAI describe as being:

mostly English-speaking people living in the United States or Southeast Asia hired via Upwork or Scale AI.

What did they label?

training data is determined by prompts sent by OpenAI customers to models on the OpenAI API Playground, and thus we are implicitly aligning to what customers think is valuable and, in some cases, what their end-users think is valuable to currently use the API for.

Again, another warning from the researchers themselves:

“However, this group is clearly not representative of the full spectrum of people who will use and be affected by our deployed models.”

Now, we can’t make concrete assumptions about the labelling that was carried out for ChatGPT, but we can learn from the limiting factors of this approach and the challenges of overcoming them.
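For context, here is a rough, illustrative sketch of the kind of labelled data this produces in the InstructGPT-style setup: for a given prompt, a labeller ranks several model completions, and those rankings train the reward model used in RLHF. The prompt, completions and field names below are invented.

```python
# Illustrative shape of RLHF comparison data: a labeller's preference ranking
# over several completions for one prompt. A few dozen labellers producing
# data like this is the lens through which "helpful" and "harmless" get defined.

from dataclasses import dataclass
from typing import List

@dataclass
class Comparison:
    prompt: str               # drawn from real customer prompts to the API
    completions: List[str]    # several sampled outputs from the model
    ranking: List[int]        # labeller's preference order, best first

example = Comparison(
    prompt="Explain photosynthesis to a 10-year-old.",
    completions=[
        "Plants catch sunlight and use it to make their own food...",   # friendly
        "Photosynthesis is the process by which autotrophs convert...",  # dry
        "I don't know.",                                                 # unhelpful
    ],
    ranking=[0, 1, 2],
)
print(example.ranking)
```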

Final Thoughts

The power of LLMs is how they learn: from a mass of data, making connections at enormous scale. Trying to fix the underlying issues, caused by the underlying data, with the limitations of labelled data and all the issues that come with it (scale, bias from labellers, etc.) seems like an oversight to me and to a number of the AI community.

Putting pressure on the researchers to release details of the data and indeed the mitigation steps used to achieve their exhaustive safety framework is a must.

Thanks for reading 🙏

Image from Midjourney. Prompt: “share your ideas”

Do you agree/disagree? What did I miss? Let me know your thoughts!

References and Useful Links

GPT-3 Paper: Language Models are Few-Shot Learners

InstructGPT Paper: Training language models to follow instructions with human feedback

Documenting Large Webtext Corpora: A Case Study on the Colossal Clean Crawled Corpus

Regulating ChatGPT and other Large Generative AI Models

Handling and Presenting Harmful Text in NLP Research

Multimodal datasets: misogyny, pornography, and malignant stereotypes


David Bartram-Shaw

Chief AI Officer @ Mesh-AI. I write personal viewpoints about AI, data, tech and leadership.