ChatGPT — Show me the Data Sources

Dennis Layton
7 min read · Jan 30, 2023


What are the data sources for ChatGPT? We are all struggling with the heuristic nature of AI, but knowing where the data comes from might provide some insight into the kinds of outcomes we should expect.

There is also a widespread misconception that OpenAI scoured the entire web, training as it went. In truth, a great deal of data curation went into preparing the data used for training, and not all of it was done by OpenAI.

ChatGPT (GPT-3) Data Sources

The table shown below is from the paper entitled Language Models are Few-Shot Learners. It shows the datasets used to train GPT-3, the base model for ChatGPT. This information was not easy to find, and I expect it will only get more difficult to find over time. It is also interesting to note that this is actually a very small amount of data: approximately 570 GB.

Here is what we know. Let’s start with the Common Crawl dataset, which made up the bulk of the training data. You can find out more about this dataset at https://commoncrawl.org.

The Common Crawl is an open, free-to-use dataset that contains petabytes of data collected from the web since 2008. Training for GPT-3, the base model of ChatGPT, used a subset of that data covering 2016 to 2019. This was 45 TB of compressed plain text before filtering and only 570 GB after, roughly equivalent to 400 billion byte-pair-encoded tokens.
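To make the “tokens” figure a little more concrete, here is a small sketch using tiktoken, the open-source tokenizer library OpenAI later released; its “gpt2” encoding is the same style of byte-pair encoding used for GPT-3. The example text is my own and is purely illustrative.

```python
# A rough illustration of byte-pair encoding (BPE) token counts using
# OpenAI's tiktoken library (assumes: pip install tiktoken).
import tiktoken

# "gpt2" is the BPE vocabulary associated with the GPT-2/GPT-3 family.
enc = tiktoken.get_encoding("gpt2")

text = "The Common Crawl is an open, free-to-use dataset collected from the web."
tokens = enc.encode(text)

print(f"{len(text)} characters -> {len(tokens)} BPE tokens")
# The article's figures (570 GB of filtered text, ~400 billion tokens)
# come from the GPT-3 paper; this snippet only shows what a "token" is.
```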

WebText2 is the text of web pages gathered from all outbound Reddit links in posts with 3+ upvotes. Books1 and Books2 are two internet-based books corpora. English-language Wikipedia pages are also part of the training corpus. Note that during training, the datasets the OpenAI team viewed as higher quality were sampled more frequently. As a result, the Common Crawl and Books2 datasets are sampled less than once during training, while the other datasets are sampled two to three times. This small amount of overfitting was accepted in exchange for higher-quality training data.
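To illustrate what “sampled more frequently” means in practice, here is a minimal sketch of weighted sampling. The percentages are roughly the mixture weights reported in the GPT-3 paper; the sampling loop itself is illustrative and is not OpenAI’s actual training code.

```python
# A minimal sketch of quality-weighted sampling during training.
# Weights are approximately the mixture proportions reported in
# "Language Models are Few-Shot Learners"; the loop is illustrative only.
import random

mixture = {
    "Common Crawl (filtered)": 0.60,
    "WebText2": 0.22,
    "Books1": 0.08,
    "Books2": 0.08,
    "Wikipedia": 0.03,
}

datasets = list(mixture)
weights = list(mixture.values())

# Each batch draws its documents according to these weights, so a small,
# high-quality corpus like Wikipedia is revisited 2-3 times ("epochs"),
# while the huge Common Crawl portion is never fully consumed even once.
batch_sources = random.choices(datasets, weights=weights, k=8)
print(batch_sources)
```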

The size of the Common Crawl dataset is more than sufficient to train the largest models; however, unfiltered or lightly filtered versions of the Common Crawl tend to be of lower quality than more curated datasets.

There was a three-step process in preparing the data for training (a simplified sketch of the first two steps follows the list):

  1. Download and filter a version of the Common Crawl dataset based on similarity to a range of high-quality reference corpora.
  2. Deduplicate at the document level, within and across datasets.
  3. Add high-quality reference corpora to augment and increase the diversity of the Common Crawl dataset.
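Here is the promised sketch of the first two steps. It is deliberately simplified: the paper describes a trained quality classifier and fuzzy deduplication, whereas this toy version uses a crude keyword score and exact hashing just to illustrate the idea. Every name in it is hypothetical.

```python
# Deliberately simplified sketch of steps 1 and 2 above.
import hashlib

def quality_score(doc: str) -> float:
    # Stand-in for the real approach: a classifier trained to tell
    # high-quality reference corpora (WebText, books, Wikipedia)
    # apart from raw Common Crawl text.
    reference_words = {"the", "because", "however", "therefore"}
    words = doc.lower().split()
    return sum(w in reference_words for w in words) / max(len(words), 1)

def filter_and_dedupe(docs, threshold=0.05):
    seen_hashes = set()
    for doc in docs:
        if quality_score(doc) < threshold:
            continue  # step 1: drop documents unlike the reference corpora
        digest = hashlib.sha1(" ".join(doc.split()).encode()).hexdigest()
        if digest in seen_hashes:
            continue  # step 2: drop exact duplicates at the document level
        seen_hashes.add(digest)
        yield doc

corpus = [
    "However, the results were better because the data was curated.",
    "However, the results were better because the data was curated.",  # duplicate
    "buy now click here free free free",
]
print(list(filter_and_dedupe(corpus)))
```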

This all seems to go against the more conventional view that these models are as smart as they are because of the vast amount of data on the internet today. One widely cited estimate, attributed to Eric Schmidt, the former CEO of Google, is that the internet contains roughly 5 billion GB. Given this, why is using a very small subset of that data (570 GB) a better way of training these models?

In a nutshell, what has been learned over the last few years is that training a larger model (measured in parameters) on a smaller amount of high-quality data is often the better approach. The trend in language models has been to increase the number of parameters from roughly 100 million in 2018 to the 175 billion parameters of GPT-3.

Consider GPT-2 and GPT-3. The jump in training data was modest compared with the jump in parameters: GPT-2 was trained on roughly 40 GB of WebText, while GPT-3 used roughly 570 GB, yet GPT-2 has 1.5 billion parameters while GPT-3 has 175 billion. By increasing the number of parameters, the model is able to learn more complex patterns and structures of natural language, which allows it to generate more human-like text and understand natural language better.
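As a rough sanity check on those parameter counts, the widely used approximation of about 12 × layers × d_model² for a decoder-only transformer’s non-embedding weights, combined with the published configurations (GPT-2 XL: 48 layers, hidden size 1,600; GPT-3: 96 layers, hidden size 12,288), lands close to the quoted figures:

```python
# Back-of-the-envelope check on the parameter counts quoted above.
# The ~12 * layers * d_model^2 rule ignores embeddings and biases,
# so treat it as a rough estimate only.
def approx_params(n_layers: int, d_model: int) -> float:
    return 12 * n_layers * d_model ** 2

configs = {
    "GPT-2 (1.5B)": (48, 1600),    # published GPT-2 XL configuration
    "GPT-3 (175B)": (96, 12288),   # published GPT-3 "davinci" configuration
}

for name, (layers, d_model) in configs.items():
    print(f"{name}: ~{approx_params(layers, d_model) / 1e9:.1f}B parameters")
# GPT-2 (1.5B): ~1.5B parameters
# GPT-3 (175B): ~173.9B parameters
```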

As an aside, GPT-3 also introduced architectural refinements over GPT-2, such as alternating dense and sparse attention patterns in its transformer layers. It is not all about the parameters.

It is interesting to note that this is comparable to the way humans learn language. We typically require only a few examples to learn most of what we know to a reasonable level of competence. A few examples of what a tree is, and the fact that a forest consists of a number of trees, is all it takes. It is our capacity to learn that matters more than the number of examples we are given. This seems to hold true for machine learning as well.

The table below shows how, given the same number of examples in context, accuracy improves (the model learns more) for larger models. This is one of the reasons we can expect GPT-4 to be orders of magnitude larger than GPT-3.
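For readers unfamiliar with what “examples in context” means: the examples are simply pasted into the prompt, and no weights are updated. Below is a minimal sketch, assuming the OpenAI Python library as it existed around the time of writing (the model name and API shape have since changed).

```python
# Few-shot, in-context prompting: the "training examples" live in the
# prompt itself. Uses the legacy completions API (openai==0.x era).
import openai

few_shot_prompt = """Translate English to French.

English: The tree is tall.
French: L'arbre est grand.

English: The forest is quiet.
French: La forêt est calme.

English: The data is curated.
French:"""

response = openai.Completion.create(
    model="text-davinci-003",   # GPT-3 family completion model of that era
    prompt=few_shot_prompt,
    max_tokens=30,
    temperature=0,
)
print(response["choices"][0]["text"].strip())
```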

How did ChatGPT (GPT-3) learn to write program code?

So what about programming: how did GPT-3 learn to code? It’s in the training data. GPT-3 can generate programming code because it has been trained on a large dataset of text that includes examples of programming code. This allows it to learn the patterns, structures, and syntax of various programming languages.

When generating code, GPT-3 uses its understanding of programming languages and its ability to generate human-like text to produce code that is syntactically correct and follows the conventions of the language. In a previous article, I showed how ChatGPT was able to discern a great deal from only the schema of a dataset, including how to compute total sales from the price and quantity fields in that schema. That required more than knowledge about coding.
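For illustration only, here is the kind of snippet ChatGPT might produce when asked for total sales against a schema with price and quantity fields; the field names and sample rows are hypothetical, not taken from that earlier article.

```python
# Illustrative only: computing total sales from price and quantity
# fields, the sort of code a model might generate from a schema alone.
orders = [
    {"product": "widget", "price": 9.99, "quantity": 3},
    {"product": "gadget", "price": 4.50, "quantity": 10},
]

total_sales = sum(row["price"] * row["quantity"] for row in orders)
print(f"Total sales: {total_sales:.2f}")  # 74.97
```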

However, it’s important to note that GPT-3 does not have the same level of understanding of programming concepts and logic as a human programmer. It is more of a code-completion tool: it can generate code snippets that are syntactically correct and follow the conventions of the language, but it may not always understand the logic or purpose of the code it generates. Most importantly, it cannot validate the code it generates, nor can it debug it on its own.

As an aside, I learned of these limitations by asking ChatGPT. Critics of AI currently make great sport of finding these same limitations with ChatGPT and then citing them as reasons not to rely on it, as though relying on a team member with less-than-perfect programming skills were never a good idea.

What are the DALL-E 2 Data Sources for Text-to-Image Generation?

Before there was ChatGPT, there was text-to-image generation, and OpenAI’s model is known as DALL-E 2. The first version of DALL-E used an extension of the technology behind GPT-3, producing images by predicting the next chunk of an image (image tokens) much as it predicts the next word in a sentence. This worked, but not well.

For DALL-E 2, OpenAI used a diffusion model. Diffusion models are neural networks trained to clean images up by removing noise that the training process adds. The process involves taking images and changing a few pixels at a time, over many steps, until the original images are erased and you’re left with nothing but random pixels. The magic happens when the neural network is trained to reverse that process and predict what a less noisy version of a given image would look like.

This process is guided by the language model that’s trying to match a prompt to the images the diffusion model is producing. This pushes the diffusion model toward images that the language model considers a good match.
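A minimal NumPy sketch of the “noising” half of that process is below; the trained network learns the reverse direction, and the text-guided steering described above is only hinted at in the comments.

```python
# Forward ("noising") half of a diffusion model: a clean image is
# gradually replaced by random noise over many steps. The trained
# network learns the reverse direction (predicting the less-noisy
# image); in DALL-E 2 that reversal is steered by a text-conditioned
# model, which is not shown here.
import numpy as np

rng = np.random.default_rng(0)
image = rng.random((64, 64, 3))              # stand-in for a training image
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)   # per-step noise schedule
alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal retained

def noisy_version(x0, t):
    """Image after t noising steps: mostly signal early, pure noise late."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

for t in (0, 500, 999):
    x_t = noisy_version(image, t)
    print(f"step {t}: fraction of original signal = {np.sqrt(alpha_bar[t]):.3f}")
```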

So, where do the images that DALL-E 2 is trained on come from? Much like the Common Crawl, there is a free-to-use source called LAION, which contains billions of pairings of text and images scraped from the internet.

See: https://laion.ai/

Remember the Common Crawl dataset mentioned earlier in this article? LAION finds images by parsing the Common Crawl data and identifying all the HTML IMG tags that contain an alt-text attribute. According to the LAION website, after filtering more than 50 billion candidates they are left with just under 6 billion pairs, hence the dataset is referred to as LAION-5B.
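As a toy illustration of the parsing idea, Python’s standard-library HTML parser can pull (image URL, alt text) pairs out of a page. LAION’s real pipeline runs over Common Crawl’s multi-petabyte archives and applies heavy additional filtering on top of this, none of which is shown here.

```python
# Toy version of the parsing idea: collect (image URL, alt text) pairs
# from HTML using only the standard library.
from html.parser import HTMLParser

class ImgAltCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        if attrs.get("alt"):  # keep only images that carry alt text
            self.pairs.append((attrs.get("src"), attrs["alt"]))

html = '<p><img src="cat.jpg" alt="a cat on a sofa"><img src="pixel.gif"></p>'
collector = ImgAltCollector()
collector.feed(html)
print(collector.pairs)   # [('cat.jpg', 'a cat on a sofa')]
```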

So why does all this matter? Generative AI models are in part the product of their architecture and scale, measured in parameters and layers, but they also rely on high-quality data that is well curated in advance. We are all adjusting to the fact that AI models are heuristic by nature: unlike programs based on algorithms, the results are not entirely predictable, and often not reproducible. Knowing where the data used to train these generative models comes from gives us some insight into what the results may be.

That said, the question that remains is this: with all the investments being made in AI, how much longer will we be able to find this information?


Dennis Layton

Dennis Layton is a Solution Architect and a proponent of the responsible adoption of AI technologies.