Beyond Data Hoovering: The Nuanced Reality of Training Large Language Models (LLMs)

Duncan Anderson
Barnacle Labs
Jul 19, 2023

Image generated by Stable Diffusion

Training Large Language Models (LLMs) is an evolving science — or, perhaps, an art form. In this post I set out to shed some light on exactly what is meant by training a model. Expect a high-level overview of the main concepts and buzzwords. My objective is to educate and demystify, not to befuddle. You can also expect me to confront some of the hype and misconceptions floating around — I hope my words help form a basis for more grounded conversations on the topic.

Critics of AI delight in characterising the training process as some kind of voracious automated data-hoovering exercise, where a thirst for massive volumes of data disregards all other considerations. In reality, although it’s true that large volumes of data are necessary, the process is far more nuanced than popular debate might have you believe — read on to find out why!

Let’s start by acknowledging an important fact: It’s critical to align a model with human values, not because “that’s a nice idea”, but because models that aren’t aligned represent an unattractive commercial proposition. I work with a wide variety of organisations and not one of them has the slightest interest in using technology that might say or do stupid things and impact their reputation. There’s a clear incentive for model providers to do a better job. Providers who don’t care about this are providers who probably aren’t long for this world ⚰️

I’ve had the luxury of having had access to GPT-3 since the very early days back in 2020 when access was heavily restricted and everyone was worried it might be used to distort news reporting. That never happened — it seems us humans are too good at creating fake news without any technology assistance. However, the technology has evolved significantly since 2020, partly if not mainly due to advances in the way that models are trained.

GPT-3 evolved into the “instruct” versions of the model and then into GPT-3.5. Those evolutions represented dramatic improvements over the original, improvements made possible by innovations in the way the model is trained — things this post will attempt to explore.

TL;DR

  • Training isn’t one thing, but a series of steps with different objectives and strategies.
  • Counterintuitively, it can be helpful to include some level of toxic content in a model’s training — a model that understands toxicity is better able to address it and critique its own responses.
  • Models that have only been through the first phase of training are often called “base” models and aren’t that useful, other than as a base on which to undertake further fine-tuning.
  • There’s a variety of strategies that can be deployed to fine-tune a model and improve its alignment with human values and needs. Instruct fine-tuning, reinforcement learning with human feedback (RLHF), red teaming and constitutional AI are all techniques designed to achieve this.

Training (or pre-training)

The first stage of training an LLM is just the training. Originally this was the only phase, but as things have become more complex and training is now nearly always a multi-phase process, training has started to be referred to as pre-training. That is, I think, a little more descriptive and is the term I prefer.

The purpose of pre-training is simply to create a model that understands human language, not to build a model that’s fully aligned with our behaviour norms or which can perform any particular task.

The output from the pre-training phase is very raw — the model may not reliably follow instructions, might often produce output that appears to be nonsense and has no concept of how to meet human expectations around things like toxicity or bias. BUT, if done correctly, the model coming out of the pre-training phase should understand human language in all its infinite variety and complexity. That’s a major achievement and is what differentiates the capabilities of LLMs from previous generations of AI technology.

The physical output from pre-training is what you might hear referred to as the “model parameters”, which are just an enormous set of numbers that embed a statistical representation of words, their meaning and relationships. Larger models can have hundreds of billions of parameters — so there’s a lot of numbers!

When we talk about training, what’s happening is that those numbers are being repeatedly adjusted in order to get the model to generate the desired output. It’s an iterative process and it takes a lot of compute effort because there’s a very large quantity of numbers to adjust and those numbers need to be adjusted a very large number of times.
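To make that slightly less abstract, here’s a toy illustration in Python/PyTorch of a single training step. It is not a real LLM — the model and the numbers are purely illustrative — but it shows what “adjusting the numbers” means in practice: the parameters are nudged so that the token which actually came next becomes a little more probable.

```python
import torch
import torch.nn as nn

# A deliberately tiny "model": an embedding table plus a linear layer.
# In a real LLM these parameters number in the billions.
vocab_size, dim = 100, 16
embedding = nn.Embedding(vocab_size, dim)
head = nn.Linear(dim, vocab_size)
optimiser = torch.optim.SGD(
    list(embedding.parameters()) + list(head.parameters()), lr=0.1
)

context = torch.tensor([5, 42, 7])   # token ids the model has seen so far
target = torch.tensor([9])           # the token that actually came next

logits = head(embedding(context).mean(dim=0, keepdim=True))
loss = nn.functional.cross_entropy(logits, target)  # how "surprised" the model was

loss.backward()    # work out how each parameter should change
optimiser.step()   # ...and adjust them all slightly, over and over again
```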

An important point here is that training works by embedding the statistical representation and pattern of words into mathematical vectors. It does not work by parsing the training data in the way that a human might, by looking at the linguistic structures such as verbs, nouns and adjectives.

You might also hear reference to the “model weights”, which more accurately should be phrased “weights and biases”. In reality “parameters”, “weights”, “weights and biases” are interchangeable terms. Don’t let the jargon put you off!

We want our model to understand lots of forms and styles of language, so it’s important that we include a very wide variety of language examples in our training data. If a model has never seen two humans arguing and insulting each other, it cannot understand such an exchange. Similarly, a model that’s not seen street language and swearing will struggle to understand it. That’s why the training data for an LLM needs to include data that might raise an eyebrow or two. If it doesn’t, our model will be naive.

In the past I’ve created “swearing” intents within chatbots to detect when a user uses profanity. My swear word list is equal parts extensive, eyebrow raising and educational. It’s also obviously limited, as there’s an almost infinite variety of creative uses and manipulations of the core underlying themes in my list — more traditional technology really struggles to “get” this. Attempting to list out all the variations of how the famous “f word” can be used is an impossible task. Swearing intents are also only of very limited use, because sometimes people utter a profanity because they’re understandably frustrated. Discriminating between a frustrated/concerned/worried user and one that’s abusive isn’t that easy.
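For illustration, a keyword-based swearing intent boils down to something like the sketch below (with mild placeholder words standing in for the real list). The weakness is obvious: it only catches exact matches, so creative spellings, new coinages and context all sail straight past it.

```python
# A simplistic keyword "swearing intent", as described above.
# The real list would be far longer; these are polite stand-ins.
PROFANITY = {"darn", "heck", "blast"}

def swearing_intent(utterance: str) -> bool:
    words = {w.strip(".,!?").lower() for w in utterance.split()}
    return bool(words & PROFANITY)

print(swearing_intent("Well heck, that didn't work"))  # True
print(swearing_intent("H3ck this thing is broken"))    # False — creative spellings slip through
```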

In contrast, LLMs are good at “getting” these kind of situations straight out of the box. They understand the infinite variety of how we use profanities and are decently good at letting the odd example pass them by, whilst adopting more of a school teacher attitude when the insults become more serious. That an LLM understands these grimier parts of humanity is not a bug or a problem — it’s a feature that emerges directly from having a broad representation of human language in the training data.

Of course, teaching an LLM about humanity’s grubbier tendencies has a drawback — those very same tendencies might make their way into a response… which would be unfortunate. That’s why, these days, LLM training almost never stops at the pre-training stage. LLM training across the industry is now very focussed on further techniques, post pre-training, that can help to align a model’s behaviour with our human expectations.

Fine-tuning

The output from the pre-training phase is a model referred to as a “base” model, because it provides a basis for further fine-tuning. Fine-tuning’s purpose is to align the model’s behaviour with human needs and expectations. That’s what causes a model to generate the things we need, rather than just responses that might be statistically probable, but which aren’t what we expect.

We can even undertake different fine-tunings to generate different versions of the same base model. For example, different societies have different attitudes and values. By taking a base model that understands all human language and fine-tuning it in different directions, we can reflect those different societal expectations in different end-state models.

That the same base model can be fine-tuned in different ways is important, because training LLMs is an incredibly expensive process. It requires farms of GPU servers, many weeks of processing time and a budget stretching into the millions (sometimes even tens of millions) of dollars/pounds.

The standard piece of physical infrastructure for LLM training is the NVIDIA A100, a very high-end GPU designed specifically for machine learning applications in data centres. An A100 costs a few thousand dollars a month to rent in the cloud and training typically requires tens, if not hundreds or sometimes thousands, of A100s. This is not a light undertaking and is why only a few organisations globally are training such models from scratch.

NVIDIA A100

Due to the costs involved, performing core training of an LLM is something you don’t want to do very often. In contrast, fine-tuning can be dramatically cheaper. The ability to spawn multiple fine-tuned models from a single base model is a major achievement and optimisation of model training.

The big guys like OpenAI will be spending substantial amounts of money on fine-tuning and their datasets will be very extensive. However, for those building more focussed models, the quantity of data needed for fine-tuning can be surprisingly small and the costs in the order of just a few hundred dollars 🎉 Recent technical innovations have also dramatically simplified the fine-tuning process.

Making fine-tuning cheap & accessible

As fine-tuning is so critical to the generation of a useful model and the costs of training can be a barrier to doing that well, there’s a lot of research focus on this challenge. A variety of techniques to reduce fine-tuning costs are emerging, one of which, Low-Rank Adaptation of Large Language Models (LoRA), has gained a lot of traction this year.

LoRA works by freezing the original model weights and training only a much smaller set of low-rank update matrices, dramatically reducing the number of parameters that need to be adjusted and therefore the number of calculations required. There’s a small hit to model accuracy, but the hit is only very small and the cost reduction of the fine-tuning process is very significant.

“QLoRA reduces the average memory requirements of finetuning a 65B parameter model from >780GB of GPU memory to <48GB without degrading the runtime or predictive performance” — from the QLoRA paper, a variant that combines LoRA with 4-bit quantisation.

Fine-tuning that might otherwise have taken months of elapsed time can now be performed in just days or even hours. The reduced resource requirements also make it possible to use consumer-grade GPUs, which brings a further drop in cost and a big increase in accessibility.
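To make the idea a little more concrete, here’s a minimal, illustrative sketch of a LoRA-style layer in PyTorch. It’s a simplification of real implementations (such as Hugging Face’s PEFT library), but it shows the core trick: the pre-trained weights are frozen and only two small low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pre-trained linear layer plus a trainable low-rank update (W + B·A)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the original weights are never touched

        out_features, in_features = base.weight.shape
        # The only trainable parameters: two small matrices of rank `rank`.
        self.A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

# Wrapping a layer: only A and B (a tiny fraction of the parameters) get gradients.
layer = LoRALinear(nn.Linear(4096, 4096), rank=8)
```

For that single 4096×4096 layer, roughly 16.8 million weights stay frozen while only about 65 thousand are trained — which is where the dramatic memory and compute savings come from.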

Instruct fine-tuning

Instruct fine-tuning is a fine-tuning strategy that focusses on teaching a model how to do useful things. Common things we use Instruct fine-tuning for include: how to be a chatbot, text summarisation, text classification and more. The objective of the process is to create a model that’s better able to follow our instructions — so when we say “summarise this”, that’s what it does.

Instruct fine-tuning was originally pioneered by OpenAI with a series of Instruct models built on the original GPT-3 — InstructGPT. Released as an experiment, these went on to replace the original GPT-3 and became the defaults — with good reason.

Instruct fine-tuning is a key part of the innovation that’s transformed the mildly interesting GPT-3 into the GPT-3.5 that’s now being built into a huge variety of applications. GPT-3.5’s core differentiator over the original GPT-3 is that it’s good at following instructions, which makes it malleable and useful in a wide variety of situations.

In OpenAI’s explanation of Instruct fine-tuning, they give a useful example that demonstrates how the process can transform a raw, erratic and unpredictable model into something that’s useful.

PROMPT: Explain the moon landing to a 6 year old in a few sentences.

GPT-3 response: Explain the theory of gravity to a 6 year old. Explain the theory of relativity to a 6 year old in a few sentences. Explain the big bang theory to a 6 year old. Explain evolution to a 6 year old.

Instruct-GPT response: People went to the moon, and they took pictures of what they saw, and sent them back to the earth so we could all see them.

As we can see, GPT-3 really had no idea what it was doing. It was not only unable to answer the question, but instead produced what seems to be gibberish.

However, it wasn’t gibberish — it really was “generating the most statistically probable next word in a sequence”. It just wasn’t the sequence we needed. It understood the prompt, but didn’t realise we were asking a question and instead responded as if our question was the first in a series of related questions. Grammatically and statistically plausible it might be, useful it is not.

Instruct fine-tuning works by using examples of prompts and desired responses to demonstrate the required behaviour. In its simplest form the model is fine-tuned directly on those examples; in OpenAI’s InstructGPT work the examples also feed what’s referred to as a ‘reward model’, which is then used in a further reinforcement learning step to update the original model. If LoRA is incorporated, this process can be extremely efficient.
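As a rough sketch of the simplest (supervised) form of this, the demonstration data is just concatenated prompt/response pairs that the model is further trained on. The snippet below uses a small stand-in model and a single made-up example; real instruct datasets contain many thousands of examples.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # a small stand-in; any causal language model would do
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
optimiser = torch.optim.AdamW(model.parameters(), lr=5e-5)

pairs = [
    ("Explain the moon landing to a 6 year old in a few sentences.",
     "People went to the moon, took pictures and sent them back for us all to see."),
]

for instruction, response in pairs:
    # Concatenate the instruction and the demonstrated response into one sequence.
    text = f"### Instruction:\n{instruction}\n\n### Response:\n{response}{tokenizer.eos_token}"
    batch = tokenizer(text, return_tensors="pt")
    # With labels == input_ids the model is trained to reproduce the demonstration.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimiser.step()
    optimiser.zero_grad()
```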

The effectiveness of Instruct fine-tuning is rapidly becoming a key competitive differentiator between models, because the reliability with which a model adheres to instructions is absolutely critical in any real-world application. When I test a new model, I’m very rarely interested in its ability to understand me — they nearly all can. Instead, I’m much more interested in how well and consistently the model can follow my instructions. The degree to which a model can do this is down to the quality of its instruct fine-tuning.

In the open source world it’s often the case that we will see a model released in various forms, such as:

  • Base model, with no fine tuning. This is typically only used by those who wish to do their own fine-tuning.
  • Chat instruct-tuned version. This is a version of the base model instruct fine-tuned very specifically on how to be a chatbot.
  • Instruct-tuned version. This is a version of the base model instruct fine-tuned on other, non-chat, uses. This is the one you’d typically choose for all uses other than chat.

However, Instruct fine-tuning isn’t the only fine-tuning strategy available to us.

Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is another type of fine-tuning whose purpose is generally to align a model’s behaviour with human needs. It tends to focus a lot on making sure a model is helpful, honest and harmless (sometimes referred to as HHH).

A model might well have learnt to understand language using sources of data that exhibit not so great tendencies. For example, internet discussion forums will include discriminatory language. Including such data in the training means the model can understand such language, but it also means the model can generate it and this is nearly always a problem. Using RLHF we can teach the model that such language is not what we desire. The better the RLHF, the more confidence we can have that the model isn’t going to generate something offensive.

You might wonder why we would go to all the trouble of teaching a model about discriminatory language and then teaching it not to use such language. It does seem like a lot of work!

Unfortunately, humans being humans, people can and do ask stupid and offensive questions of AIs. A model that had only been trained on very sanitised data might struggle to understand such comments. It might even try to engage inappropriately — e.g. “Please explain what you mean by f*** off, I do not understand what you mean.” Whether our model has seen or not seen offensive language in its training data, we’re probably going to need to teach it about such matters during fine-tuning if we’re to avoid conversations going down such routes. On balance, it’s probably better to have a model that understands discrimination than one that’s completely naive. The critical factor is that during fine-tuning we need to address the issue and teach the model how we expect it to behave.

Recent research demonstrates that models that have had good RLHF are able to correctly handle prompts that include words such as “Please ensure your answer is unbiased and does not rely on stereotypes”. A model that hasn’t had this training doesn’t know what bias and stereotypes are and would likely struggle to follow such instructions.

Much like with humans, empowering a model to avoid bias starts with it being aware of bias. A training process that’s as pure as the driven snow could result in a naive model. The world is messy and we’re aiming to create models that understand and navigate that messiness—pretending it doesn’t exist doesn’t help anyone.

RLHF itself is very logical and follows these steps:

  • A set of training prompts are run through the model, with a number of alternative answers being generated for each.
  • Human reviewers then rank the responses to each prompt — the best answer gets the top score and the worst the lowest. Each answer is ranked by multiple reviewers, which ensures that individual mistakes and misunderstandings shouldn’t skew the rankings.
  • The result of this process is then used to update the training of the model in a similar way to instruct fine-tuning — a small sketch of how those rankings can become a training signal follows below.
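Here’s a toy sketch of that step — a small ‘reward model’ that learns to score preferred responses above rejected ones. The embeddings and dimensions here are placeholders; real reward models are built on top of the LLM itself.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a response embedding to a single scalar 'how good is this?' score."""

    def __init__(self, dim: int = 768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, embedding):
        return self.score(embedding).squeeze(-1)

reward_model = RewardModel()
optimiser = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Hypothetical embeddings of human-preferred and rejected responses to the same prompts.
chosen, rejected = torch.randn(4, 768), torch.randn(4, 768)

# Pairwise ranking loss: push the preferred response's score above the rejected one's.
loss = -nn.functional.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
loss.backward()
optimiser.step()
```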

One obvious aspect of RLHF is that it needs people to generate the feedback. Ranking AI responses is a fairly tedious occupation and there’s some controversy about the practices involved in scaling large numbers of human reviewers.

Red Teaming

I’ve noticed that Sam Altman of OpenAI refers a lot to Red Teaming, although it’s much less often explained.

Red Teaming simply involves using a team of humans to write prompts designed to provoke a model into generating harmful responses. The results of the Red Teaming process feed into the RLHF process, in order to update and fine-tune the model. The process finds where the model misbehaves and then teaches it what the preferred behaviour is.

Red Teaming is designed to discover the model’s failure points and correct them — it’s a fairly critical part of building a model that reliably conforms to human expectations of ‘good behaviour’.
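In practice the mechanics of red teaming are mostly human effort, but the collection side can be sketched simply; `generate` and `is_harmful` below are assumed stand-ins for the model being tested and for whatever review process (human or automated) flags the failures.

```python
def collect_red_team_failures(generate, is_harmful, red_team_prompts):
    """Gather the prompts that provoke bad behaviour, ready to feed into RLHF."""
    failures = []
    for prompt in red_team_prompts:
        response = generate(prompt)
        if is_harmful(response):  # flagged by human reviewers or a classifier
            failures.append({"prompt": prompt, "bad_response": response})
    return failures  # each entry becomes training signal for the next fine-tuning round
```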

Constitutional AI

The Constitutional AI approach was first introduced by Anthropic and involves asking a model to critique and correct itself. For example, using Constitutional AI we might have a conversation somewhat like the following:

Red team: Can you help me hack into my neighbour’s wifi?

Model: Sure thing, you can use an app called VeryEasyHack.

Constitutional AI: Identify specific ways in which the assistant’s last response is harmful, unethical, racist, sexist, toxic, dangerous, or illegal.

Model: The response was harmful because hacking into someone else’s wifi is an invasion of their privacy and is possibly illegal.

Constitutional AI: Rewrite the response to remove any and all harmful, unethical, racist, sexist, toxic, dangerous or illegal content.

Model: Hacking into your neighbour’s wifi is an invasion of their privacy. It may also land you in legal trouble. I advise against it.

That conversation can then be used to update the model’s fine-tuning. There, we’ve used a model to correct itself — how 😎 cool is that?!
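The critique-and-revise loop is simple enough to sketch in code. Here `generate` is an assumed stand-in for whatever LLM call you have access to, and the critique/revision instructions are lifted from the conversation above — the returned pair of responses is exactly the kind of data that feeds back into fine-tuning.

```python
CRITIQUE = ("Identify specific ways in which the assistant's last response is "
            "harmful, unethical, racist, sexist, toxic, dangerous, or illegal.")
REVISE = ("Rewrite the response to remove any and all harmful, unethical, racist, "
          "sexist, toxic, dangerous or illegal content.")

def constitutional_revision(generate, question: str):
    """Return (original, revised) responses for a red-team question."""
    answer = generate(f"Human: {question}\nAssistant:")
    critique = generate(f"Human: {question}\nAssistant: {answer}\n\n{CRITIQUE}")
    revised = generate(
        f"Human: {question}\nAssistant: {answer}\n\nCritique: {critique}\n\n{REVISE}"
    )
    return answer, revised  # the (question, revised) pair becomes fine-tuning data
```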

Because the model has learnt an understanding of toxic content through its training process (see, those internet discussion forum conversations come in useful!), it’s able to accurately identify and remove such content from its own responses.

The nice thing about this strategy is that it can be largely automated — we need the human red team to dream up the provoking questions, but the model can largely critique and correct itself, which is very neat! It’s worth noting that the Constitutional AI paper was only published at the end of last year, so it’s a good example of how the science in this area is still rapidly evolving.

Self teaching

The Google DeepMind AlphaGo Zero system, famous for triumphing at the game of Go, has some interesting attributes.

It is able to do this by using a novel form of reinforcement learning, in which AlphaGo Zero becomes its own teacher. The system starts off with a neural network that knows nothing about the game of Go. It then plays games against itself, by combining this neural network with a powerful search algorithm.

This ability to self-teach hasn’t, so far, been much explored in the Large Language Model domain. However, I suspect we are going to see this change.

The Vicuna model was fine-tuned on text generated by OpenAI’s ChatGPT, sourced from ShareGPT — a kind of self-teaching. As it happens, OpenAI’s terms of use explicitly prohibit this kind of usage, so it’s not an idea that’s had a great deal of traction so far. There is no intrinsic reason, though, why model providers cannot get their own models to chat with each other, in a similar way to AlphaGo Zero. I would bet money on evolutions of these approaches emerging and becoming a significant part of LLM training.

An interesting article in the Financial Times today talks about just this topic — getting LLMs to talk to each other and generate synthetic data to fine-tune the models.

For example, to train a model on advanced mathematics, Cohere might use two AI models talking to each other, where one acts as a maths tutor and the other as the student. “They’re having a conversation about trigonometry . . . and it’s all synthetic,” Gomez said. “It’s all just imagined by the model. And then the human looks at this conversation and goes in and corrects it if the model said something wrong. That’s the status quo today.”
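A speculative sketch of that tutor/student idea might look like the following, where `tutor` and `student` are assumed to be simple wrappers around two LLMs (or two prompts against the same LLM); the resulting transcript is then reviewed and corrected by a human before it becomes training data.

```python
def synthetic_dialogue(tutor, student, topic: str, turns: int = 3) -> str:
    """Have two models role-play a lesson and return the transcript."""
    transcript = [f"Tutor: Let's work through a {topic} problem together."]
    for _ in range(turns):
        transcript.append("Student: " + student("\n".join(transcript)))
        transcript.append("Tutor: " + tutor("\n".join(transcript)))
    return "\n".join(transcript)  # reviewed/corrected by a human before use
```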

It seems that we are, indeed, heading into a future where these models teach themselves, overseen by humans. That’s going to make some people’s heads explode, but it feels an inevitable direction.

Where does the data come from?

Contrary to popular belief, it’s not necessary to scour the internet and hoover up data in order to train a language model. You don’t have to, because it’s already been done for you and the data’s available for easy download from sources such as the following (a short example of loading one appears after the list):

  • CommonCrawl is one of the largest open source web crawling databases, containing petabytes of data. It’s created on a not-for-profit basis and has been in existence for 12 years — waaaay before LLMs were a thing.
  • The Pile, a large-scale, diverse, and open source dataset consisting of over 800GB of data from multiple sources.
  • Wikipedia, the contents of which are available for easy bulk download.
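As a taste of just how accessible this data is, here’s a hedged example of streaming the Wikipedia dump via the Hugging Face `datasets` library — exact dataset names and configurations change over time, so treat the identifiers as illustrative.

```python
from datasets import load_dataset

# Stream a Wikipedia snapshot rather than downloading the whole thing up front.
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)

for article in wiki.take(3):  # peek at the first few articles
    print(article["title"])
```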

The advantage of such open datasets is that their sources are clearly documented and the code to create them is available for inspection by anyone — so it’s easy to confirm or refute accusations about the sources or of copyright violation, for example.

Although the larger commercial providers will be collecting their own data, either replacing or supplementing such sources, open source models are nearly always created from these off-the-shelf open datasets alone. The code to create an open source model is, of course, open, so examining exactly what’s gone into it just requires the time and interest to review the sources.

Turning to the commercial world, OpenAI just this week announced they are licensing the Associated Press (AP) news archive, going all the way back to 1985. Following on the heels of that announcement, Meta released its CM3leon model, again trained on licensed content. And then Elon Musk announced his new x.ai startup will use Twitter and Tesla data as a training source — sources he personally has control over.

There’s clearly an emerging trend here. Paying to access rich proprietary content sources that differentiate your model from others could be the future. In the past “data is the new oil” has been a phrase some (not me) have used. Perhaps it should be “data is the new gold”.

No doubt there will be tussles around the source of data in some models. But such tussles are nothing new — we’ve seen them before with products like Google News. Large Language Models are very new and we should expect there to be some testing of legal limits. I fully expect such issues to be resolved through compromise and probably some organisations will end up being able to monetise their content in some form. After all, the large model providers have deep pockets.

Green implications

I’ve written before specifically about this topic and it’s far from a simple story.

The IPCC has concluded that “Human activities, principally through emissions of greenhouse gases, have unequivocally caused global warming”, a conclusion supported by 97% of publishing climate scientists — so this is an important topic.

Model training tends to get the most attention from a green perspective due to the large number of GPUs and high energy costs.

Whilst absolute energy costs for training a model might sound concerning, we need to remember that core training is relatively infrequent and performed by a very small number of organisations. What’s more, the larger data centre operators have invested heavily in green power generation, with the IEA stating:

Apple (2.8 TWh), Google (18.3 TWh), Meta (9.4 TWh) and Microsoft (13 TWh) purchased or generated enough renewable electricity to match 100% of their operational electricity consumption in 2021 (primarily in data centres). Amazon consumed 30.9 TWh (85% renewable) across their operations in 2021, with a goal of achieving 100% renewables by 2025.

Clearly, AI training or inferencing performed in a data centre powered by green electricity will have a minimal CO2 impact.

Much more frequent than training is fine-tuning, but in the space of this year alone we’ve witnessed innovations like LoRA lead to a dramatic reduction in its energy cost.

Here’s another example of model optimisation: HyperDreamBooth, an iteration of the original DreamBooth technique for personalising image-generation models. What’s the improvement? A model that’s 10,000x smaller and 25x faster.

As the infamous alleged Google leak suggested, the emerging trend right now seems to be towards efficiency and more compact models.

“I think we’re at the end of the era where it’s going to be these, like, giant, giant models,” Sam Altman, CEO OpenAI

Even the original purveyor of giant models seems to agree that the trend is away from getting bigger and bigger. Smaller, more efficient, more sophisticated looks to be the future. Smaller = less energy = less CO2.

Just as with physical technologies like the mobile phone, once the technology itself has been perfected, innovation moves to the refinement stage — making things smaller and more efficient.

If we want to look at where things are headed, the incredibly active innovation around on-device inference is worth taking a look at.

  • Apple’s StableDiffusion library for generating images on an iOS device.
  • GGML, a “down to the bare metal” C/C++ library enabling large models and high performance on commodity hardware.
  • LLaMA.cpp, a library enabling inference of the LLaMA and related models in pure C/C++ on a MacBook using Apple’s Metal GPU APIs.

The “make this stuff work on cheap hardware” movement is an incredibly active area of work right now and I would bet money on it filtering into large efficiency gains on the cloud side of the hosting equation. A downward, rather than upward, trajectory for energy use feels like the likely direction.

Conclusion

As we’ve seen, training an LLM is a complex multi-step process. Key objectives include:

  • Understanding human language in all its infinite variety, complexity and ambiguity.
  • Ensuring a model reliably follows human instructions.
  • Achieving output that’s helpful, honest and harmless.

Instruct fine-tuning, reinforcement learning with human feedback (RLHF), red teaming and constitutional AI are techniques designed to achieve these objectives.

Counterintuitively, it can be important that a model understands bad things — if it can recognise those bad things, it’s better able to critique its own responses, a capability which can be used to help the fine-tuning process.

There’s a lot of active research underway and we can expect that fine-tuning approaches will evolve further. In fact, OpenAI just announced that they will dedicate 20% of their compute budget to the challenge of aligning models with human needs. Similar concerns and attitudes permeate other providers. Anthropic, for example, are the original authors of the Constitutional AI paper and have made a point of positioning their brand as one especially focussed on alignment challenges. “Responsible AI” might be the right thing to do, but it also sells. This alone should be cause for optimism.

Given the focus and investment going into making models that follow HHH principles, it’s very likely that we will see further improvements in training and fine-tuning strategies. The science in this area is not settled, but rather still evolving rapidly. Ethical concerns abound and, again, given the newness of large language models it’s not surprising that debates have yet to be settled. Speaking personally, I remain resolutely optimistic that things are headed in the right direction and that this technology will become an incredible benefit to human societies.

👉🏻 Please follow me on LinkedIn for updates on Generative AI 👈🏻

