An easy-to-understand explanation of how large language models (LLMs) work and the different types of LLMs!

Yuma Ueno · Published in The Deep Hub · Jan 30, 2024

Hello! I’m Yuma Ueno (https://twitter.com/stat_biz) from Japan. I work in the AI industry and run my own small AI company.

The emergence of various generative AI models (ChatGPT, Stable Diffusion, and so on) has led to an AI boom.

Among these, generative AI models trained on large amounts of text, such as ChatGPT, are attracting attention as large language models (LLMs).

In this article, I would like to explain such large language models (LLMs)!

What is a Large Language Model (LLM)? How does it work?

First, what kind of model is the Large Language Model (LLM)? Let’s take a look at how it works!

Large language models (LLMs) are booming right now, but they are themselves an outgrowth of natural language processing, a field that has existed for a long time.

For a long time, there has been a lot of research on how to make machines understand human language and apply it to various tasks such as translation and language generation.

In 2017, a breakthrough came with the introduction of the Transformer model. Combined with improvements in computing power, it became possible to train models on huge amounts of data and significantly increase their accuracy, giving rise to the large language model (LLM).

The Transformer is a deep learning model introduced in the 2017 paper “Attention Is All You Need”.

The following is a citation from Attention Is All You Need.

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
(Citation: Google, “Attention Is All You Need”)

What this paper shows is that accuracy improved substantially by using only attention layers instead of recurrent and convolutional layers!

The title “Attention Is All You Need” means exactly what it says: only the attention layer is necessary!
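To give a concrete feel for what an attention layer actually computes, here is a minimal NumPy sketch of the scaled dot-product attention described in the paper (this is my own illustration, and the random matrices are just placeholder data):

```python
# A minimal sketch of scaled dot-product attention, the core operation of the
# Transformer: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    # How similar each query is to each key, scaled to keep the values stable.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns the scores into attention weights that sum to 1 per query.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output vector is a weighted mix of the value vectors.
    return weights @ V

# Toy example: 4 tokens with 8-dimensional representations (placeholder data).
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```

In a real Transformer, Q, K, and V are computed from the token embeddings, and many such attention “heads” run in parallel inside each layer.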

With the Transformer as a base, new models such as BERT and GPT have emerged.

Now, then, what are the different types of such large language models (LLMs)? Let’s take a look!

BERT

BERT is a model released by Google in October 2018, and the paper describes it as follows:

BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks
(Citation: Google, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”)

In other words, BERT is simple yet very powerful, and it achieved state-of-the-art results on eleven common natural language processing tasks.

In fact, BERT is a pre-trained model: it shows its real power when it is combined with a task-specific layer and fine-tuned for each downstream task.

BERT stands for “Bidirectional Encoder Representations from Transformers.”

As the name suggests, BERT learns context from both directions: each word is represented using both the words before it and the words after it.
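To make “learning context from both directions” concrete, here is a minimal sketch (my own illustration, not from the paper) using the Hugging Face transformers library. BERT’s masked-language-model head fills in a blank by looking at the words on both sides of it:

```python
# A minimal sketch of BERT's masked-language-model task via Hugging Face's
# transformers library. BERT predicts the [MASK] token from the words on BOTH
# sides of it -- this is what "bidirectional" means in practice.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Both the left context ("Paris is the") and the right context ("of France")
# are available to the model when it predicts the masked word.
for prediction in fill_mask("Paris is the [MASK] of France."):
    print(f"{prediction['token_str']}: {prediction['score']:.3f}")
```

The top prediction should be “capital”; note that the right-hand context (“of France”) is visible to the model when it makes this prediction.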

GPT model

Next up is the GPT model, made famous by ChatGPT!

What kind of algorithm is the GPT model?

Let’s take a brief look.

The GPT paper was published by OpenAI in 2018.

Improving Language Understanding by Generative Pre-Training

The paper states the following:

we explore a semi-supervised approach for language understanding tasks using a combination of unsupervised pre-training and supervised fine-tuning. Our goal is to learn a universal representation that transfers with little adaptation to a wide range of tasks.
(Citation: OpenAI, “Improving Language Understanding by Generative Pre-Training”)

GPT is a two-stage model: it is first pre-trained on large amounts of text data, and then trained further for each specific task in a step called fine-tuning.

The first stage, pre-training, consists of training a Transformer decoder to predict the next token on large amounts of unlabeled text.

After pre-training with this decoder, the second step, fine-tuning for each task, is performed.

While previous approaches required large amounts of labeled data to be fed into an encoder-decoder model such as the original Transformer, GPT makes it possible to learn from huge datasets by first performing unsupervised learning on unlabeled text during pre-training and then fine-tuning the model for a specific task.
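As a small sketch of what the pre-training objective looks like in practice, here is my own illustration using the Hugging Face transformers library with the publicly available GPT-2 checkpoint as a stand-in for GPT. The point is that the loss needs nothing but the raw text itself:

```python
# Stage 1 (pre-training) in miniature: the model is scored on how well it
# predicts each next token in plain, unlabeled text. GPT-2 is used here as a
# stand-in because the original GPT-1 weights are rarely used today.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

text = "Large language models are trained to predict the next token."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels makes the model compute the next-token
# prediction loss -- no human-made labels are required.
outputs = model(**inputs, labels=inputs["input_ids"])
print("next-token prediction loss:", outputs.loss.item())

# Stage 2 (fine-tuning) would continue training this same model on a specific
# task's data, e.g. outputs.loss.backward() followed by an optimizer step.
```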

I explained the GPT model in more detail in the article below.

PaLM

PaLM is a model published by Google in 2022 and stands for “Pathways Language Model”.

The paper is below.

PaLM: Scaling Language Modeling with Pathways

The paper is a huge work, 87 pages long, but to be honest it does not present a particularly innovative architecture; PaLM is based on the Transformer architecture that Google also introduced in 2017.

PaLM, to put it simply, is a model with a huge number of parameters, trained on a huge amount of data using an extremely powerful machine configuration!

It has 540 billion parameters.

OpenAI’s GPT-3, announced before PaLM, has 175 billion parameters.

And the Megatron-Turing NLG announced by Microsoft has 530 billion parameters.

PaLM beats Megatron-Turing NLG by exactly 10 billion parameters, which shows Google’s determination.

The paper also carefully describes how strong PaLM is compared to other large language models (LLMs).

LLaMA

LLaMA stands for “Large Language Model Meta AI” and is a large language model published by Meta in February 2023.

The following is the LLaMA paper.

LLaMA: Open and Efficient Foundation Language Models

LLaMA is a model with far fewer parameters than the GPT model developed by OpenAI or the PaLM model developed by Google.

LLaMA achieves high accuracy while keeping the number of parameters low, making it possible for researchers around the world to explore building various new large language models on top of LLaMA.

The LLaMA code is up on GitHub and is available as open source.

LLaMA code repository

LLaMA features a small number of parameters.

The number of parameters for each model is as follows

(Source: LLaMA: Open and Efficient Foundation Language Models)

LLaMA comes in four sizes:

7 billion (6.7 billion to be exact)
13 billion
33 billion (32.5 billion to be exact)
65 billion (65.2 billion to be exact)

This may seem like a lot of parameters, but compared to the 175 billion parameters of GPT-3 or the 540 billion parameters of Google’s PaLM, it is by far the smallest!

The amazing thing about LLaMA is that it achieves accuracy comparable to much larger models even with this small number of parameters.
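To see why the small parameter count matters so much in practice, here is a rough back-of-the-envelope sketch (my own illustration, not from the paper) of the memory needed just to store each model’s weights in 16-bit precision:

```python
# Rough rule of thumb: each parameter stored in 16-bit (fp16) precision takes
# 2 bytes, so weight memory ~= parameters x 2 bytes (ignoring everything else).
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory in GB needed just to hold the model weights."""
    return num_params * bytes_per_param / 1e9

for name, params in [("LLaMA-7B", 6.7e9), ("LLaMA-65B", 65.2e9),
                     ("GPT-3", 175e9), ("PaLM", 540e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB of weights in fp16")
```

A 7-billion-parameter model fits on a single high-end GPU, which is a big part of why researchers could experiment with LLaMA so freely.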

There are many other large language models (LLMs), and their trajectories are summarized in detail in the following paper!

A Survey of Large Language Models

So far, I have summarized the major LLMs!

Please clap, comment, and follow if you liked it!

See you next time!
