How Large Language Models Work (LLMs: Zero-to-Hero)

Wayland Zhang
9 min read · Jan 24, 2024


This is the 2nd article in my Zero-to-Hero series. In this article, we aim to provide an accessible explanation of how a large language model (LLM) operates.

How LLMs Work

Let’s first take a look at how a document completer model works:

user prompt:

A banana is

model response:

an elongated, edible fruit

Then, a document generator model works like this:

user prompt:

I want to buy a new car

model response:

What kind of car do you want to buy?

Note the differences between the two above.

The first model is just a document completer: it simply continues the prompt with whatever it finds most likely to come next. This is the model trained on a large chunk of internet data, and it’s called the base model.
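To make this concrete, here is a minimal sketch of document completion in code. It uses the Hugging Face transformers library and the small GPT-2 checkpoint purely as an illustration (ChatGPT’s base model is far larger and not publicly available):

# A minimal sketch of a base model acting as a document completer.
# GPT-2 here is only an illustrative stand-in for a real base model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("A banana is", return_tensors="pt")
# The base model just keeps appending the most likely next tokens.
output_ids = model.generate(**inputs, max_new_tokens=10, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# Prints a plain continuation of the text, not an assistant-style answer.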

The second model is a document generator: it generates a more human-like response to the prompt. This is the ChatGPT model.

The ChatGPT model is an inference model that can generate a response based on the prompt question. I’d say it’s 99% the base model, but with two extra steps of training: a fine-tuning step and a reinforcement learning from human feedback (RLHF) step.

Pre-training: Base Model

This constitutes the very core of the AI revolution and is where the magic truly lies.

Training a model is the process of feeding it a lot of data and letting it learn from that data.
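What “learning from it” means in practice is next-token prediction: show the model some tokens, ask it to predict the token that follows, and nudge its weights to reduce the error. Here is a toy sketch of that objective; the tiny model and the random token ids are only stand-ins, not how a real GPT is built:

# A toy sketch of the pre-training objective: predict the next token and
# minimize cross-entropy loss. The tiny model and random "data" are only
# stand-ins for a real transformer and real internet text.
import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 16
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),   # token ids -> vectors
    nn.Linear(embed_dim, vocab_size),      # vectors -> logits over the vocabulary
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab_size, (8, context_len + 1))  # pretend text
inputs, targets = tokens[:, :-1], tokens[:, 1:]              # shift by one token

for step in range(100):
    logits = model(inputs)                                   # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()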

As described in the GPT-3 paper, the base model is trained on a large chunk of internet data. That’s not an easy task for individuals like you and me: it requires not only obtaining the data but also a huge amount of computing power in the form of GPUs and TPUs.

But don’t worry, we can still learn to train a small GPT model on our own computer. I’ll show you how to do it in the next topic.

The innovation behind LLM training lies in the introduction of the Transformer architecture, which enables the model to learn from vast quantities of data while preserving crucial contextual relationships between different parts of the input.

By maintaining these connections, the model can effectively infer new insights based on the provided contexts, whether they be individual words, sentences, paragraphs, or beyond. With this capability, LLM training has opened up new opportunities for natural language processing and generation tasks, allowing machines to better understand and respond to human communication.

The transformer architecture used to train the base model is shown below:

This is neural-network-based training that combines some old and new techniques: tokenization, embeddings, positional encoding, feed-forward layers, normalization, softmax, linear transformations, and, most importantly, multi-head attention.
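As a small taste of what’s coming, here is a minimal, single-head sketch of the scaled dot-product attention that multi-head attention is built from. The dimensions are arbitrary, and the causal mask, residual connections, and normalization are left out for brevity:

# A single-head sketch of scaled dot-product attention. Real transformers
# stack many such heads and layers, with masking and normalization around them.
import math
import torch
import torch.nn.functional as F

def attention(x, w_q, w_k, w_v):
    # x holds one embedding vector per token (positions already added).
    q, k, v = x @ w_q, x @ w_k, x @ w_v        # queries, keys, values
    scores = q @ k.T / math.sqrt(k.shape[-1])  # how strongly each token attends to every other token
    weights = F.softmax(scores, dim=-1)        # normalize scores into attention weights
    return weights @ v                         # each token becomes a weighted mix of value vectors

seq_len, embed_dim = 5, 32
x = torch.randn(seq_len, embed_dim)            # stand-in for embedded tokens
w_q, w_k, w_v = (torch.randn(embed_dim, embed_dim) for _ in range(3))
print(attention(x, w_q, w_k, w_v).shape)       # torch.Size([5, 32])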

This architecture is the part that you and I are most interested in. We want to clearly understand the ideas behind it and how exactly the training is done. So from the next article onward, we will start digging into the papers, code, and mathematics used to train the base model.

Fine-tuning: Train the Assistant

Fine-tuning is a very clever step, and I believe it was first done by OpenAI. The idea is super simple but works remarkably well: hire human labelers to create lots of Q&A conversation pairs (on the order of 100k conversations), then feed those conversation pairs to the model and let it learn from them.

This process is called fine-tuning. And you know what happens after those ~100k sample conversations are trained into the model? The model starts responding like a human!

Let’s take a look at those sample labeled conversations:

Human labeled Q&A

Q: What is your name?
A: My name is John.
Human labeled Q&A

Q: What's the capital of China?
A: China's capital is Beijing.
Human labeled Q&A

Q: Summarize the plot of the movie Titanic.
A: The movie Titanic is about a ship that sinks in the ocean.

Whoa, these sample Q&As mimic the way we talk to each other.

By teaching the model these response styles, we make contextually relevant answers far more probable, so that is what the model produces in response to a user’s prompt. Training on a variety of conversational styles increases the likelihood that it will respond in a relevant, contextually appropriate way.
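As a rough illustration, here is how such labeled conversations might be turned into fine-tuning text. The template below is my own made-up format; every lab uses its own chat template and special tokens:

# A rough sketch of turning human-labeled Q&A pairs into fine-tuning text.
# The template is an assumption for illustration, not a real chat format.
conversations = [
    {"question": "What is your name?", "answer": "My name is John."},
    {"question": "What's the capital of China?", "answer": "China's capital is Beijing."},
]

TEMPLATE = "### Question:\n{question}\n\n### Answer:\n{answer}"
training_texts = [TEMPLATE.format(**c) for c in conversations]

# Each formatted string is then tokenized and trained on with the same
# next-token objective used during pre-training.
for text in training_texts:
    print(text, end="\n---\n")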

This is how language models can appear so intelligent and human-like; by learning to mimic the rhythms and patterns of real-world conversations, they can convincingly simulate a back-and-forth dialogue with users.

At this step, we can say we obtained an Assistant Model.

Below is a diagram showing some highlights, from pre-training the Base Model to fine-tuning the Assistant Model:

(from Andrej Karpathy’s build a GPT model from scratch)

RLHF: Reinforcement Learning from Human Feedback

In January 2022, OpenAI published their work on Aligning language models to follow instructions. In their blog post they describe how the model was further fine-tuned with human feedback:

This one is a bit trickier. The idea is to let the model learn from human feedback. Instead of providing another ~100k labeled Q&A pairs, they gather users’ prompts together with several model responses and have humans rank those responses. The ranked conversations then serve as the most-desired Q&A samples, which are fed back to the model so it can learn from them and improve its overall performance.

This process is introduced by OpenAI on its blog:

To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.
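A common way to use those rankings is to first train a separate reward model: a network that scores a response, trained so that the response humans preferred gets a higher score than the one they rejected. Here is a minimal sketch of that pairwise idea; the tiny scoring head and the random “response features” are stand-ins for a real model:

# A minimal sketch of the reward-model idea behind RLHF. The scoring head
# and random features below are illustrative stand-ins, not a real pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_model = nn.Linear(64, 1)   # stand-in for a scoring head on top of an LLM
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-3)

# Stand-in features for (human-preferred, human-rejected) response pairs.
preferred = torch.randn(16, 64)
rejected = torch.randn(16, 64)

for step in range(100):
    r_pref = reward_model(preferred)             # score of the preferred response
    r_rej = reward_model(rejected)               # score of the rejected response
    loss = -F.logsigmoid(r_pref - r_rej).mean()  # push preferred scores above rejected ones
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained reward model then guides a reinforcement-learning step that
# further fine-tunes the assistant.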

Here is a base model vs fine-tuned/RLHF response comparison:

You can see that without fine-tuning and RLHF, the model is just a document completer.

Prompt Engineering

Even with fine-tuning and RLHF, the model still needs some help to produce the desired response. This is where prompt engineering comes in.

In simple words, we can carefully design the prompt to get the desired response from the model (sometimes even without fine-tuning).

If you are not trying to dive too deep into the mathematics and code, then prompt engineering is a good area to focus on, because it can get the best out of an LLM simply by writing a better prompt.

Now let’s look at an example:

prompt:

The sky is

output:

blue.

Let’s try to improve it a bit:

prompt:

Complete the sentence:
The sky is

output:

blue during the day and dark at night.

By including some instructions in the prompt, the model knows what to do and how to respond.

Let’s look at another interesting example:

prompt:

When I was 6 my sister was half my age. Now
I’m 70 how old is my sister?

output:

35

The answer is wrong; the correct answer is 67. When the speaker was 6, the sister was half that age, i.e. 3, so she is always 3 years younger: 70 - 3 = 67. It looks like the model understood the question but fell back on a rough calculation instead of logical reasoning.

Without any fine-tuning or RLHF, we can get the correct answer simply by adding a few worked examples to the prompt:

prompt:

Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today. After they are done,
there will be 21 trees. How many trees did the grove workers plant today?
A: We start with 15 trees. Later we have 21 trees. The difference must be the number of trees they planted.
So, they must have planted 21 - 15 = 6 trees. The answer is 6.
Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are in the parking lot?
A: There are 3 cars in the parking lot already. 2 more arrive. Now there are 3 + 2 = 5 cars. The answer is 5.
Q: Leah had 32 chocolates and her sister had 42. If they ate 35, how many pieces do they have left in total?
A: Leah had 32 chocolates and Leah’s sister had 42. That means there were originally 32 + 42 = 74
chocolates. 35 have been eaten. So in total they still have 74 - 35 = 39 chocolates. The answer is 39.
Q: Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 lollipops. How many lollipops
did Jason give to Denny?
A: Jason had 20 lollipops. Since he only has 12 now, he must have given the rest to Denny. The number of
lollipops he has given to Denny must have been 20 - 12 = 8 lollipops. The answer is 8.
Q: Shawn has five toys. For Christmas, he got two toys each from his mom and dad. How many toys does
he have now?
A: He has 5 toys. He got 2 from mom, so after that he has 5 + 2 = 7 toys. Then he got 2 more from dad, so
in total he has 7 + 2 = 9 toys. The answer is 9.
Q: There were nine computers in the server room. Five more computers were installed each day, from
monday to thursday. How many computers are now in the server room?
A: There are 4 days from monday to thursday. 5 computers were added each day. That means in total 4 * 5 =
20 computers were added. There were 9 computers in the beginning, so now there are 9 + 20 = 29 computers.
The answer is 29.
Q: Michael had 58 golf balls. On tuesday, he lost 23 golf balls. On wednesday, he lost 2 more. How many
golf balls did he have at the end of wednesday?
A: Michael initially had 58 balls. He lost 23 on Tuesday, so after that he has 58 - 23 = 35 balls. On
Wednesday he lost 2 more so now he has 35 - 2 = 33 balls. The answer is 33.
Q: Olivia has $23. She bought five bagels for $3 each. How much money does she have left?
A: She bought 5 bagels for $3 each. This means she spent $15. She has $8 left.
Q: When I was 6 my sister was half my age. Now I’m 70 how old is my sister?
A:

output:

When I was 6 my sister was half my age, so she was 3. Now I am 70, so she is 70 - 3 = 67. The answer is 67.

output 2:

When the narrator was 6, his sister was half his age, which is 3. Now that the narrator is 70, his sister would be 70 - 3 = 67 years old. The answer is 67.

Both answers are correct! We simply added some worked examples with their reasoning to the prompt, then asked the same question again. The model can now understand the question and answer it correctly.

The above example was introduced by Wang et al. (2022); computing the final answer involves a few intermediate steps.

Strong prompts can be used to guide the model through complex tasks, such as solving math problems or summarizing text, so prompt engineering also plays a very important role in the LLM ecosystem.
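If you build prompts in code, the few-shot trick above is just string assembly. Here is a small sketch that stitches a couple of the worked examples onto a new question before it is sent to a model:

# A small sketch of assembling a few-shot prompt programmatically.
# The worked examples play the same role as the Q&A pairs shown above.
examples = [
    ("If there are 3 cars in the parking lot and 2 more cars arrive, "
     "how many cars are in the parking lot?",
     "There are 3 cars in the parking lot already. 2 more arrive. "
     "Now there are 3 + 2 = 5 cars. The answer is 5."),
    ("Jason had 20 lollipops. He gave Denny some lollipops. Now Jason has 12 "
     "lollipops. How many lollipops did Jason give to Denny?",
     "Jason had 20 lollipops. Since he only has 12 now, he must have given "
     "20 - 12 = 8 lollipops to Denny. The answer is 8."),
]
question = "When I was 6 my sister was half my age. Now I'm 70 how old is my sister?"

prompt = "".join(f"Q: {q}\nA: {a}\n" for q, a in examples) + f"Q: {question}\nA:"
print(prompt)   # this assembled string is what gets sent to the model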

For more about prompt engineering, here is a good prompting guide tutorial.

Summary

If you have read all the way down here, I’m sure it took a while to digest all the information, especially if you are new to the LLM world.

Now I believe we have covered sufficient ground in terms of basic concepts and background information. It is time to begin preparing to build our very own large language model. Enough with the theory; in the next article, we move on to the crucial components of the Transformer architecture.
