Unveiling the Power of Large Language Models: the Basics

Sarah Mestiri
Nov 5, 2023
Photo by Ross Sneddon on Unsplash

As I start my journey exploring Large Language Models (LLMs), I decided to share it with you in this series, “Unveiling the Power of Language Models”, both to keep myself motivated and because community and knowledge sharing is one of my values.

This series captures what I’m learning about LLMs while working on a project in that area.

Each week, I’m going to post an article with my learnings.

First, let’s explore the basics.

Although most people have already heard of Large Language Models (LLMs), I’d like to go back to the origins of language models. These models aren’t new; they simply gained visibility with the appearance of ChatGPT, one of the most recent achievements in the area of language models.

1) Language Models:

From Wikipedia, “a language model is a probabilistic model of a natural language that can generate probabilities of a series of words, based on text corpora in one or multiple languages it was trained on.”

So these models rely on probabilities estimated from training data: large collections of text treated as sequences of words.

Language models first appeared in the 1980s and are useful for a variety of tasks, including speech recognition, machine translation, and text generation.

The methods behind language models can be split into two main categories:

1- Pure probabilistic models (n-gram models [1])

Put simply, an n-gram model predicts the probability of a word’s appearance based on the (n-1) words before it.
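To make this concrete, here is a minimal Python sketch of a bigram model (n = 2). It is my own toy illustration rather than anything from the references: the probability of a word given the previous word is just a ratio of counts.

```python
from collections import Counter

# Toy corpus; a real model would be estimated from a much larger text collection.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count bigrams (pairs of consecutive words) and single words.
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev_word, word):
    """P(word | prev_word) = count(prev_word, word) / count(prev_word)."""
    return bigrams[(prev_word, word)] / unigrams[prev_word]

print(bigram_prob("the", "cat"))  # "the cat" occurs 2 times, "the" occurs 4 times -> 0.5
```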

2- Neural Models:

Neural network-based language models use word embeddings [2] to ease the sparsity [3] problem.

Neural networks are a set of algorithms, modeled loosely after the human brain, that are designed to recognize patterns. They interpret sensory data through a kind of machine perception, labeling or clustering raw input. The patterns they recognize are numerical, contained in vectors, into which all real-world data, be it images, sound, text or time series, must be translated. [4]
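As a rough illustration of why embeddings help (the vectors below are invented for the example, not learned by any real model): one-hot vectors treat every pair of words as equally unrelated, while dense embeddings place similar words close together.

```python
import numpy as np

vocab = ["cat", "kitten", "car"]

# One-hot vectors: one dimension per vocabulary word, all orthogonal,
# so "cat" and "kitten" look just as unrelated as "cat" and "car".
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}

# Hypothetical learned embeddings (values invented for illustration):
# dense, low-dimensional, and similar words get similar vectors.
embedding = {
    "cat":    np.array([0.80, 0.10, 0.30]),
    "kitten": np.array([0.75, 0.15, 0.35]),
    "car":    np.array([-0.40, 0.90, 0.00]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(embedding["cat"], embedding["kitten"]))  # close to 1: similar words
print(cosine(embedding["cat"], embedding["car"]))     # much lower: dissimilar words
```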

Main Limitations

The main limitations faced while developing and improving language models were the sparsity problem and the context problem.

Next, I define those problems.

1- Sparsity Problem in Language Models

So let’s use the power of LLMs and ask ChatGPT to explain the sparsity problem to a child. Here is the answer:

Imagine you have a big puzzle with many pieces, and you want to put them all together to see the whole picture. But what if some pieces are missing? That’s what sparsity is like in language models. When we try to understand or generate sentences, words are like puzzle pieces. If some words are missing or very rare, it’s hard for the model to understand what they mean or how they fit in. So, the sparsity problem in language models is like having some puzzle pieces missing, making it tricky for the model to understand everything.
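In n-gram terms, sparsity shows up as word combinations that never occur in the training text and therefore get probability zero. Here is a tiny sketch building on the toy bigram idea above; add-one (Laplace) smoothing is just one classical workaround, shown for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab_size = len(set(corpus))  # 5 distinct words

# "cat mat" never appears in the corpus, so the raw estimate is zero,
# even though both words are known and the phrase is plausible.
print(bigrams[("cat", "mat")] / unigrams["cat"])  # 0.0

# Add-one (Laplace) smoothing reserves a little probability for unseen pairs.
def smoothed_prob(prev_word, word):
    return (bigrams[(prev_word, word)] + 1) / (unigrams[prev_word] + vocab_size)

print(smoothed_prob("cat", "mat"))  # small but non-zero (about 0.17)
```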

2- Context Problem in Language Models

Imagine you’re telling a story to a friend, but you forget to mention some important details. They might get a little confused, right? Well, the context problem in language models is similar. When they try to understand what we’re saying or writing, they need to know what happened before, just like we need to know what happened earlier in a story to understand what’s happening now. If they don’t have enough context, it’s like missing parts of the story, and it can be hard for them to understand what we mean.

2) Development of Language Models

Neural network-based language models addressed the sparsity problem, but the context problem remained.

Over the following decades, we witnessed the appearance of Recurrent Neural Networks (RNNs); in particular, the LSTM (Long Short-Term Memory) architecture contributed to the evolution of language models’ abilities.

These models were capable of capturing sequential relationships in linguistic data and producing coherent output.

However, the main drawback of these models was their sequential nature: tokens are processed one at a time, which makes them slow to produce results and translates into long training times.
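To see why this sequential nature is a bottleneck, here is a rough sketch of the core loop of a vanilla RNN (the weights are random stand-ins for learned parameters): step t cannot start before step t-1 has finished, so the computation over a sequence cannot be parallelized.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, hidden_dim, seq_len = 8, 16, 5

# Random stand-ins for learned parameters and input word embeddings.
W_xh = rng.normal(size=(embed_dim, hidden_dim))
W_hh = rng.normal(size=(hidden_dim, hidden_dim))
inputs = rng.normal(size=(seq_len, embed_dim))

h = np.zeros(hidden_dim)
for x_t in inputs:  # one step per word: inherently sequential, hard to parallelize
    h = np.tanh(x_t @ W_xh + h @ W_hh)

print(h.shape)  # final hidden state summarizing the whole sequence
```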

That’s where OpenAI’s GPT models and Google’s BERT later came into play to address this challenge, using the transformer architecture.

Here is a transformer’s definition based on Wikipedia:

A transformer is a deep learning architecture, initially proposed in 2017, that relies on the parallel multi-head attention mechanism, and its later variations have been prevalently adopted for training large language models on large (language) datasets, such as the Wikipedia corpus and Common Crawl, by virtue of the parallelized processing of input sequences.

So a transformer uses parallelization, which means less training time, and a mechanism called attention [5], which handles the context challenge mentioned earlier.
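As a rough numerical sketch of the attention idea from [5] (leaving out the learned projections, masking, and multiple heads of a real transformer): each position compares its query against every position’s key and takes a weighted average of the values, and the whole thing is a couple of matrix multiplications that can run in parallel across the sequence.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # how much each position attends to each other one
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over positions
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # e.g., 4 tokens, 8-dimensional representations
Q = rng.normal(size=(seq_len, d_model))
K = rng.normal(size=(seq_len, d_model))
V = rng.normal(size=(seq_len, d_model))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.round(2))  # each row sums to 1: how much each token looks at the others
```

In a real transformer, Q, K, and V come from learned projections of the token embeddings, and this operation is repeated across several heads and layers.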

Here is a great explanation video of how transformers work.

Timeline of natural language processing models — Wikipedia

This takes us to Large Language Models.

3) Large Language Models

In a nutshell, Large Language Models (LLMs) allow computers to understand and generate text better than ever before. An LLM is a language model with a very large number of parameters, trained on massive amounts of text.

A large language model (LLM) is a type of language model notable for its ability to achieve general-purpose language understanding and generation. LLMs acquire these abilities by using massive amounts of data to learn billions of parameters during training and consuming large computational resources during their training and operation. LLMs are artificial neural networks (mainly transformers) and are (pre-)trained using self-supervised learning and semi-supervised learning. [Wikipedia]

LLMs power OpenAI’s GPT models (e.g., GPT-3.5 and GPT-4, used in ChatGPT), Google’s PaLM (used in Bard), and Meta’s LLaMA, as well as BLOOM, Ernie 3.0 Titan, and Anthropic’s Claude 2.

To understand further concepts involved in how LLMs work, like tokens, encoders, and decoders, refer to the article How Transformers Work, which gives a detailed step-by-step overview.
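As a small taste of what a token is before you read that article, here is a deliberately simplified word-level tokenizer (real LLM tokenizers use subword schemes such as byte-pair encoding, so this is only an analogy): it turns text into the integer IDs a model actually consumes.

```python
# Toy word-level tokenizer; real tokenizers split rare words into subword pieces.
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

def decode(ids):
    return " ".join(id_to_token[i] for i in ids)

ids = encode("The cat sat on the mat")
print(ids)          # [1, 2, 3, 4, 1, 5]
print(decode(ids))  # "the cat sat on the mat"
```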

I hope this gave you a simplified overview of the basics of Large Language Models. In my next article, I will talk about one of the areas where LLMs can be applied: summarization.

That’s all for this article, thanks for reading!

References:

[1] What’s an n-gram?

[2] Word Embedding and Word2Vec, Clearly Explained!!!

[3] What is Sparsity? explained in a video.

[4] Neural Network Definition by Pathmind Wiki.

[5] Attention is all you need.

More references used in this article:

A beginner’s guide to Language Models

Evolution of Neural Networks to Large Language Models in Detail

What are Large Language Models?
