Textbooks Are All You Need and Scaling Laws for LLMs

This blog will go into detail on the findings of the “Textbooks Are All You Need” paper.

Matthew Gunton
6 min read · Feb 1, 2024
Image from the author, generated by DALL-E

Initial Shape of Scaling Laws

Once you have an LLM, there are different ways you can improve it. If you feel it is already a good part of the way to the performance you want, you can fine-tune it: take a new dataset and apply a method like QLoRA so the model gets better at tasks resembling that dataset. However, if the model is not already fairly close to the level you want, fine-tuning alone will likely not be enough to substantially improve it.
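For context, here is a minimal sketch of what QLoRA-style fine-tuning looks like in practice, assuming the Hugging Face transformers, peft, and bitsandbytes libraries. The base checkpoint (gpt2) and the LoRA hyperparameters below are purely illustrative stand-ins, not anything from the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Load the base model in 4-bit precision (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "gpt2",  # illustrative stand-in; any causal LM checkpoint works
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Attach small trainable LoRA adapters; the frozen 4-bit weights stay untouched.
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's attention projection; model-specific
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# From here you would train on your own dataset with a standard Trainer loop.
```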

Figure 1 from the Scaling Laws for Neural Language Models paper, illustrating the relationship between network size, compute, dataset size, and loss (a proxy for model quality)

The field has largely turned to three distinct ways to improve a model substantially: executing more training cycles (in other words, spending more on compute during training), adding more parameters to the network (making it more complex), and training on more tokens (using larger datasets).
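Concretely, Kaplan et al. fit test loss to a power law in each of these factors when the other two are not the bottleneck. Roughly (the exponents are the approximate values reported in that paper):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C_{\min}) \approx \left(\frac{C_c}{C_{\min}}\right)^{\alpha_C}$$

where $N$ is the number of parameters, $D$ the number of training tokens, and $C_{\min}$ the compute budget, with $\alpha_N \approx 0.076$, $\alpha_D \approx 0.095$, and $\alpha_C \approx 0.050$. The small exponents are why each factor has to grow by orders of magnitude before the loss moves noticeably.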

Nevertheless, the Textbooks Are All You Need paper shows an interesting new way to improve your model — increasing the quality of the data you are training on.

Results from Focusing on Data Quality Above Traditional Scaling Laws

From a high level, this paper introduces Phi-1, an LLM focused on outputting quality code when prompted. Phi-1 has 1.3 billion parameters and was trained on about 7 billion tokens. To put that into context, most comparable small models have about 5x more parameters and nearly 100x more training tokens.

Nevertheless, Phi-1 performs well. Take a look at the following interaction from the paper:

Student: I have a Python pyplot, I want to increase its resolution and rotate it, what should I do?

TA:

The output generated by Phi-1, from the paper
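The paper shows Phi-1’s completion as an image. Purely as an illustration of the kind of answer being asked for (this is not the model’s output), one common approach is to save the figure at a higher DPI and then rotate the saved image, e.g. with matplotlib and Pillow:

```python
import matplotlib.pyplot as plt
from PIL import Image

# Make a simple plot.
fig, ax = plt.subplots()
ax.plot([1, 2, 3], [1, 4, 9])

# Higher DPI -> higher-resolution output file.
fig.savefig("plot.png", dpi=300)

# Rotate the saved image by 90 degrees; expand=True keeps the whole canvas.
Image.open("plot.png").rotate(90, expand=True).save("plot_rotated.png")
```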

To give a sense of comparative performance, the researchers compared Phi-1 against a few other LLMs on the HumanEval and Mostly Basic Python Problems (MBPP) benchmarks. When comparing the models, note the number of training tokens and the number of parameters (many of these models do not disclose their training cycles).

Table 1 from the paper showing comparative performance between models of varying parameter size and dataset sizes

LLMs are incredibly expensive to train: the more tokens and parameters your model has, the more expensive it will be. Getting better results with fewer tokens and parameters can open the door to higher-quality LLMs at a lower production cost.

How did the researchers accomplish this kind of performance? By focusing on higher quality data.

Overview of Data Created

The researchers created their data with a guiding philosophy: if a person would be confused using this data as a way to learn, then it shouldn’t be in the dataset. As such, they created 3 distinct data sets:

(1) The Stack and Stack Overflow datasets filtered so that they exclusively use the Python coding language and have high quality examples

(2) GPT-3.5 generated textbook sections showing exactly how to execute coding concepts in Python

(3) GPT-3.5 generated textbook exercises, with answers, in Python

Using sets (1) and (2), they trained their model, and then with set (3) they fine-tuned it.

Let’s go through each dataset, explaining how they created it and why.

Dataset 1 and Filtering Data Sources for Educational Value

An example from the paper of two code snippets with differing educational value. The high-value entry was kept, while the low-value one was removed

The Stack and Stack Overflow datasets are, on balance, a good source of information on programming; however, they have flaws the researchers wanted to address. First, the data is composed of many different programming languages. If you were teaching a person how to code, showing them many different languages at once would likely confuse more than help, so the researchers wanted to focus entirely on one language. Second, not every entry in the data is especially educational. For a good example of high versus low educational value, see the image above. Once all of this filtering was completed, they were left with roughly 6 billion tokens in the dataset. To put into context how small this is, Llama 2 was trained on roughly 2 trillion tokens (or about 3 orders of magnitude more data)!
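As a rough illustration of those two filtering steps (keep Python, keep educationally valuable code), here is a minimal sketch. The language check and the toy quality heuristic below are stand-ins, not the paper’s actual filtering model:

```python
def is_python(example: dict) -> bool:
    # Keep only entries tagged as Python source.
    return example.get("language", "").lower() == "python"

def educational_value(code: str) -> float:
    # Toy stand-in: reward snippets with docstrings/comments and penalize
    # very short fragments. A real filter would be a learned model.
    score = 0.0
    if '"""' in code or "#" in code:
        score += 0.5
    if len(code.splitlines()) >= 5:
        score += 0.5
    return score

def filter_examples(examples: list[dict], threshold: float = 1.0) -> list[dict]:
    return [
        ex for ex in examples
        if is_python(ex) and educational_value(ex["content"]) >= threshold
    ]
```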

Dataset 2 and Generating Good Examples

An example of a generated textbook section from the paper

For the textbook sections, they wanted the generated content to be diverse. For an example of the opposite, imagine you are studying a topic and only ever hear it explained one way; when you discuss the topic with friends who use different vocabulary, you may be thrown off. In the LLM world, this kind of narrowness can contribute to hallucination, so the researchers wanted the dataset to frame topics in as many different ways as possible. To get this right, they had to prompt GPT-3.5 in many different ways. This ended up being roughly 1 billion tokens.
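A minimal sketch of one way to induce that kind of diversity is to randomize constraints inside the prompt template before calling the generation model. The topic and audience lists below are illustrative placeholders, not the paper’s actual setup:

```python
import random

TOPICS = ["list comprehensions", "recursion", "file I/O", "decorators"]
AUDIENCES = ["a first-year student", "a data analyst", "a self-taught hobbyist"]

def build_textbook_prompt() -> str:
    # Randomly vary what is taught and who it is framed for, so the
    # generated sections explain the same ideas in many different ways.
    topic = random.choice(TOPICS)
    audience = random.choice(AUDIENCES)
    return (
        f"Write a short textbook section that teaches {topic} in Python "
        f"to {audience}. Include a worked code example."
    )

print(build_textbook_prompt())
```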

Dataset 3 and Generating Good Problems

An example problem in the dataset from the paper

Similarly, in generating the exercises they wanted to maintain diversity so the model would not only understand problems framed in a specific way. Each exercise is a function whose docstring describes what the function is meant to do, with the correct answer beneath it, as sketched below. This ended up being roughly 180 million tokens.
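To make the format concrete, here is an illustrative exercise in that style (made up for this post, not taken from the paper’s dataset): the docstring states what the function should do, and the completed body serves as the answer.

```python
def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in the given string,
    ignoring case."""
    return sum(1 for ch in text.lower() if ch in "aeiou")

assert count_vowels("Hello World") == 3
```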

Avoiding Data Contamination

At this point, you may begin wondering about this method’s susceptibility to data contamination, the idea that test data can leak into the training data, effectively giving away the answers before the test begins. Since GPT-3.5 was potentially trained on HumanEval and other benchmarks, the research team tried three methods to detect and avoid data contamination: N-gram overlap, embedding similarity, and syntax-based similarity.

N-gram overlap works by checking whether any sequence of n consecutive words appears in both datasets. When they ran it, they found a number of false positives, causing them to turn to embedding and syntax-based similarity.
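A minimal sketch of the n-gram check, using whitespace tokenization and an illustrative window size (the paper’s exact tokenization and choice of n are not reproduced here):

```python
def ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    # All contiguous windows of length n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def shares_ngram(train_text: str, test_text: str, n: int = 13) -> bool:
    # Flag the training snippet if any n-gram also appears in the test snippet.
    return bool(ngrams(train_text.split(), n) & ngrams(test_text.split(), n))
```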

Embedding similarity requires using the same embedding model on both datasets and then comparing the distances between the resulting vectors. With this methodology, code that is similar in meaning gets flagged, rather than just code that shares the same word sequences.
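Sketching the embedding check: embed both snippets with the same model (the vectors below are assumed to come from whatever code embedding model you use; it is not specified here) and flag pairs whose vectors are too close. The threshold is illustrative:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_contaminated(train_vec: np.ndarray, test_vec: np.ndarray,
                    threshold: float = 0.95) -> bool:
    # Near-identical embeddings suggest the training snippet paraphrases
    # (or copies) the test snippet, even if the exact tokens differ.
    return cosine_similarity(train_vec, test_vec) >= threshold
```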

For syntax-based similarity, they computed an abstract syntax tree for the code in both datasets and then measured the edit distance between the trees (how many operations it takes to turn one tree into the other).

An Abstract Syntax Tree (AST), a tool used when compiling code, which also abstracts away variable names, allowing for easier comparisons between code datasets (courtesy of WikiMedia)
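A rough sketch of the syntax-based check in Python: the standard library’s ast module produces a structural dump of each snippet, and difflib’s match ratio stands in here for a true tree edit distance, which is more involved than shown:

```python
import ast
import difflib

def ast_signature(code: str) -> str:
    # Structural dump of the parse tree (field names omitted for compactness).
    return ast.dump(ast.parse(code), annotate_fields=False)

def ast_match_rate(code_a: str, code_b: str) -> float:
    # 1.0 means the structural dumps are identical, 0.0 means nothing in common.
    return difflib.SequenceMatcher(
        None, ast_signature(code_a), ast_signature(code_b)
    ).ratio()

print(ast_match_rate("def f(x):\n    return x + 1",
                     "def g(y):\n    return y + 1"))
```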

In the end, the team relied on embedding similarity and syntax-based similarity. Once both values were computed, they set a threshold on the embedding similarity and a match rate on the tree edit distance to decide whether to accept or reject each piece of data. With this filtering in place, they were confident that data contamination would be kept to a minimum.

Conclusion

The big questions this paper raises revolve around data generation and the limits of using LLMs to generate data that trains other LLMs. Time will tell if LLMs can be used as a quality source for generating data and if they can also be a cost-effective way to do so. If they can, then it will be a major force democratizing the creation of LLMs.

[1] S. Gunasekar, et al., Textbooks Are All You Need (2023), arXiv

[2] J. Kaplan, et al., Scaling Laws for Neural Language Models (2020), arXiv


Matthew Gunton

Hi, my name is Matthew Gunton. I like talking about the latest ways technology can be used to improve the world.