What Is a Large Language Model? (LLMs: Zero-to-Hero)

Wayland Zhang
5 min read · Jan 24, 2024

--

This is the first article in my Zero-to-Hero series. In it, we will take a high-level look at what a Large Language Model (LLM) actually is.

Background

In November 2023, I stumbled upon the news of OpenAI’s developer conference, an event that left me captivated by its groundbreaking advancements in artificial intelligence. Eager to delve deeper into this fascinating field, I signed up for ChatGPT, and it wasn’t long before I found myself thoroughly motivated to learn the technology behind Large Language Models (LLMs).

But like most of us who begin with zero knowledge of LLMs, I found it hard to know where to start.

After hundreds of hours of reading and experimenting, I’ve sorted out the best resources to help us get started. Assuming you only have a basic understanding of programming and middle school mathematics, this learning path will guide you through the fundamentals of AI and machine learning, and then help you to build your own Large Language Model.

Model Definition

Large language models (LLMs) like ChatGPT are taking the tech world by storm. From Wikipedia, the definition of an LLM is:

A large language model (LLM) is a language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by learning statistical relationships from text documents during a computationally intensive self-supervised and semi-supervised training process. LLMs are artificial neural networks following a transformer architecture.

In other words:

LLMs are trained on vast datasets of text such as books, websites, and user-generated content. They can generate new text that continues an initial prompt in a natural way.

An LLM is basically a neural network with a very large number of parameters. Roughly speaking, the more parameters, the more capable the model. That is why we always hear about a model’s size, which is its number of parameters. For example, GPT-3 has 175 billion parameters, and GPT-4 may have over 1 trillion.

But what exactly does a model look like?

(from Andrej Karpathy’s build a GPT model from scratch)

Model is just a file

A language model is just a binary file:

In the image above, the parameters file is Meta’s Llama-2-70b model; it is about 140GB in size and contains 70 billion parameters stored as numbers. The run.c file is the inference program, which is used to query the model.
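As a quick sanity check on those numbers, here is a back-of-the-envelope sketch. It assumes the weights are stored as 16-bit (2-byte) floating-point values, which the 140GB figure suggests but the image does not state:

# 70 billion parameters at 2 bytes each should take roughly 140 GB on disk.
num_parameters = 70_000_000_000   # Llama-2-70b
bytes_per_parameter = 2           # assuming float16 storage
total_bytes = num_parameters * bytes_per_parameter
print(f"{total_bytes / 1e9:.0f} GB")  # -> 140 GB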

Training these super-large models is very expensive; it costs millions of dollars to train a model like GPT-3.

(from Andrej Karpathy’s build a GPT model from scratch)

As of today, the most outstanding model, GPT-4, is reportedly no longer a single model but a mixture of several models. Each one is trained or fine-tuned on a specific domain, and they work together to deliver the best performance during inference.

But don’t worry, our goal is to understand and learn the fundamentals of large language models. Fortunately, we can still train a model (with far fewer parameters) on our personal computers. We will dive in and code one together, step by step, in later articles.

Model Inference

Once the model is trained and ready, a user queries it with a question: the question text is fed through the parameters in that 140GB file, processed character by character, and the most relevant text is returned as the output.

By "the most relevant", we mean the model returns the text that is most likely to come next after the input text.

For example,

> Input: "I like to eat"
> Output: "apple"

"apple" is returned as the continuation because it is the most likely text to follow "I like to eat", based on the large dataset the model was trained on.
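To make that concrete, here is a minimal sketch of a single greedy generation step. The next_token_distribution function is a hypothetical stand-in for the real neural network, and its probabilities are the illustrative apple/banana numbers used a bit further below, plus invented values for "rice" and "pizza" so the toy distribution sums to 1:

def next_token_distribution(context: str) -> dict[str, float]:
    # Hypothetical stand-in: in a real LLM this distribution comes from
    # running the neural network over the context.
    if context.endswith("I like to eat"):
        return {" apple": 0.375, " banana": 0.146, " rice": 0.250, " pizza": 0.229}
    return {".": 1.0}

def generate(prompt: str, steps: int = 1) -> str:
    text = prompt
    for _ in range(steps):
        dist = next_token_distribution(text)
        # Greedy decoding: append the continuation with the highest probability.
        text += max(dist, key=dist.get)
    return text

print(generate("I like to eat"))  # -> "I like to eat apple"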

Remember the books and websites we mentioned above? Based on that chunk of training data, you can think of it this way:

"I like to eat apple" is a very common sentence that the model saw many times during training.

"I like to eat banana" is also a common sentence, but less common than the one above.

So during training, the model:

recorded "apple" with a 0.375 probability of appearing after "I like to eat"

recorded "banana" with a 0.146 probability of appearing after "I like to eat"

… other character probabilities …

These probabilities form the probability sets saved in the model’s parameters file. (In the machine learning field, such learned values are usually called weights.)

So basically, an LLM is a probabilistic database that assigns a probability distribution over the next character, given the contextual characters that came before it.
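A crude way to build intuition for this "probabilistic database" is to simply count continuations in a toy corpus. Real LLMs do not store raw counts like this; they compress such statistics into neural-network weights. Still, the sketch below (with an invented four-line corpus) shows where the probabilities conceptually come from:

from collections import Counter

# A tiny invented corpus standing in for books, websites and user-generated content.
corpus = [
    "I like to eat apple",
    "I like to eat apple",
    "I like to eat banana",
    "I like to eat rice",
]

context = "I like to eat"
continuations = Counter(
    line[len(context):].strip() for line in corpus if line.startswith(context)
)

total = sum(continuations.values())
for word, count in continuations.most_common():
    print(f"P({word!r} | {context!r}) = {count / total:.3f}")
# P('apple' | 'I like to eat') = 0.500
# P('banana' | 'I like to eat') = 0.250
# P('rice' | 'I like to eat') = 0.250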

This sounded impossible before. But since the paper "Attention Is All You Need" was published in 2017, the transformer architecture it introduced has made such large-scale contextual understanding possible by training a neural network on a very large dataset.

Model Architecture

Before the advent of LLMs, training neural networks was limited to relatively small datasets, and the resulting models had minimal capacity for contextual understanding. This means that early models could not understand text the way humans do.

When the paper was first published, it was aimed at training language-translation models. However, the team at OpenAI discovered that the transformer architecture was also the crucial ingredient for character prediction: once trained on a huge portion of internet data, the model could potentially comprehend the context of any text and coherently complete any sentence, much like a human.

Below is a diagram showing what happens inside the model training process:

Of course we won’t understand all of that the first time we see it, but don’t worry, we will walk through it in the following articles.
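That said, if you would like a tiny preview of the math at the heart of that diagram, here is a minimal sketch of scaled dot-product attention, the core operation of the transformer. The shapes and random inputs are made up purely for illustration:

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # how strongly each position attends to the others
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over each row
    return weights @ V                            # weighted mix of the value vectors

# Toy example: 4 positions (e.g. characters), each represented by an 8-dimensional vector.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)       # self-attention: Q, K and V all come from x
print(out.shape)  # (4, 8)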

Before we delve into the programming and mathematical details, the next article will walk through the concepts and the process of exactly how the model operates.
