Don’t Blame the AI: It’s the Training Data

Armando Gonzalez
RavenPack
Mar 10, 2023 · 5 min read

Expectations for language AI are through the roof. It is supposed to significantly change and improve how we work, learn, manage our money and health, and interact with technology overall. Virtually every industry (not just tech) is hiring for Natural Language Processing skills, from agriculture to healthcare.

In 2022, according to RavenPack data, more than 7% of companies were hiring for Natural Language Processing (NLP) skills, versus 4.5% in 2018, with the number of NLP-related job postings more than doubling in four years. By 2025, up to 97 million AI-related jobs are expected to be created across industries, according to the Future of Jobs Report by the World Economic Forum.

The possibilities and potential seem endless.

Photo by Greg Rakozy on Unsplash

And, yet…

High expectations can lead to high disappointment. ChatGPT set a record as the fastest app to reach 100 million users, hitting that milestone just 64 days after launch and surpassing TikTok, which took nine months to get there. As more people turn to AI, the questions multiply, and it is becoming increasingly obvious that the answers they get can be misleading, incomplete, inaccurate, or even uncomfortably off-putting.

The road from excitement to disillusionment is paved with bad training data

The Internet went wild after the release of “Nothing, Forever,” an AI-generated ‘Seinfeld’ show. Unfortunately, it was banned shortly after a transphobic stand-up bit. Or you might remember the curious case of Tay, a conversational Twitter bot launched by Microsoft that was shut down shortly after launch for bigotry.

And more recently, the Washington Post reported that Microsoft’s AI chatbot is “going off the rails” after it displayed a “bizarre, dark and combative alter ego.” The eyebrow-raising interactions with Bing have also been superbly documented by Stratechery, which describes an “engrossing, yet roguelike experience,” and by Simon Willison, who offers examples of the demo being full of errors and of the bot gaslighting users.

Some of these models are brilliant (a comedy show that runs forever? Yes, please). The real issue was not how they were built or structured, but the data they were trained on.

Training data is essentially a set of examples used to teach a machine learning model to make accurate predictions. It serves as the foundation of the entire project and provides the basis for the model to learn from.

The success, and equally the failure, of any language AI application depends heavily on the quality of the training data used to develop it. “Garbage in, garbage out” is the mantra of every ML developer out there.
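To make “garbage in, garbage out” concrete, here is a minimal sketch using a toy scikit-learn text classifier and a handful of made-up example sentences (purely illustrative, not anyone’s actual pipeline). The architecture and the texts stay the same; only the labels change, and with them what the model learns.

```python
# Toy illustration of "garbage in, garbage out" (hypothetical data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "revenue beat expectations this quarter",
    "the product launch was a huge success",
    "the company missed its earnings target",
    "customers are cancelling their subscriptions",
]
clean_labels = ["positive", "positive", "negative", "negative"]
noisy_labels = ["negative", "positive", "positive", "negative"]  # half mislabeled

def train(labels):
    # Same model, same text: only the quality of the labels differs.
    model = make_pipeline(TfidfVectorizer(), LogisticRegression())
    model.fit(texts, labels)
    return model

query = ["earnings beat expectations"]
print(train(clean_labels).predict(query))  # likely "positive"
print(train(noisy_labels).predict(query))  # may well come out wrong
```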

Trust in AI must be rooted in understanding

In order to add value to society, AI models must be trusted, from a very basic chatbot answering clients’ questions to complex content-generating applications.

I believe the best way to earn that trust is for AI developers to be transparent about how these models work. This means being able to account for the sources and mechanisms behind the outputs (algorithm explainability) and for the quality of the training data.

A lack of trust in AI language models is, at its core, a lack of viable training data that can produce coherent outputs. AI developers have the responsibility to make sure the training data they use helps build that trust and, with it, the progress of this exciting new field. Luckily, they don’t need to carry that responsibility alone.

The devil is in the data

Lately, I’ve been hearing more people talk about “the dark side” of AI, a label that is dangerous because it is broad, obscure, and not at all constructive. In order to move forward, we need to name the culprit: the training data. There is no “dark alter ego” in AI; it is just a reflection of what we are teaching it.

OpenAI hired Kenyan workers to filter harmful content and help train ChatGPT. Take into account the scale involved: the model behind ChatGPT has some 175 billion parameters and was trained on roughly 570 GB of books, articles, websites and other textual data scraped from the Internet. Manually filtering all that content would have been a futile task. That is why they essentially needed to build another AI application to keep that huge training data set as non-toxic as possible.
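Conceptually, that second application is just a filter that scores every document in the corpus and sets aside whatever crosses a threshold. The sketch below is a rough illustration of the idea only; the `score_toxicity` helper is a trivial keyword stand-in, not OpenAI’s (unpublished) moderation pipeline.

```python
# Rough sketch: screening a raw text corpus before it becomes training data.
# score_toxicity is a crude keyword stand-in for a real moderation model.

BLOCKLIST = {"offensive_term_1", "offensive_term_2"}  # placeholder terms

def score_toxicity(text: str) -> float:
    """Return a crude 0..1 score: the fraction of blocklisted words in the text."""
    words = text.lower().split()
    return sum(word in BLOCKLIST for word in words) / max(len(words), 1)

def filter_corpus(documents, threshold=0.1):
    """Keep documents scoring below the threshold; flag the rest for review."""
    kept, flagged = [], []
    for doc in documents:
        (kept if score_toxicity(doc) < threshold else flagged).append(doc)
    return kept, flagged
```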

But surely there must be a better way — one that doesn’t have such a high human cost. And one that is actually more consistent and scalable.

There is a growing need for high-quality training data to support the development of advanced and trustworthy language AI applications. In order to be useful, language models need to accurately represent the world and its complexities.

The industry badly needs training data sets that are accurate, diverse, and inclusive or, depending on the project, very specific and specialized. Such data sets are essential for language models to perform effectively across different use cases and to avoid biases.

In order to deliver outstanding, powerful AI applications, ML developers need quality data: clean, accurate, unbiased, carefully annotated, and grounded in taxonomies. They also need volume (a large data set allows for more accurate and reliable predictions), a constant flow of new training data (which helps the model generalize better and make more accurate predictions on unseen data) and, very importantly, speed (eliminating the need to spend time and resources manually annotating large corpora of text, so that AI models can be trained quickly with high accuracy).
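As a sketch of what those quality requirements can look like in practice (my own simplified illustration, not any particular vendor’s pipeline), here are a few basic hygiene checks a labeled data set can be run through before it ever reaches a model: duplicates, missing labels, and class balance.

```python
# Simplified training-data hygiene checks (illustrative only).
from collections import Counter

def audit_dataset(examples):
    """examples: list of (text, label) pairs. Returns a small quality report."""
    texts = [text for text, _ in examples]
    labels = [label for _, label in examples]
    return {
        "n_examples": len(examples),
        "n_duplicate_texts": len(texts) - len(set(texts)),           # exact duplicates
        "n_missing_labels": sum(1 for label in labels if not label),  # empty/None labels
        "label_distribution": Counter(label for label in labels if label),  # class balance
    }

print(audit_dataset([
    ("revenue beat expectations", "positive"),
    ("revenue beat expectations", "positive"),  # duplicate example
    ("guidance was cut sharply", None),         # missing label
]))
```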

While there are providers in the market capable of producing training data, their offerings so far have been limited in scope: mostly low-volume, noisy datasets tagged by people, meaning data that is incomplete or wrongly annotated. That is why big data providers need to step up their game and meet the demands of today’s fast-paced AI requirements.

So the next time a search engine or chatbot gives you a wrong or weird answer, remember: don’t blame the AI; it’s the training data that is letting you down.

Simply put, AI models are not infallible; they are only as good as the data they are trained on, and inaccurate or strange answers are simply a reflection of the limitations and biases of that data. By harnessing this understanding, we can strive to enhance the quality of training data, which will inevitably lead to more precise and effective language models and to the growth of the AI space as a whole.


Armando Gonzalez
RavenPack

CEO & Co-Founder of RavenPack. Harnessing the power of language AI to empower business professionals to make data-driven decisions.