Everything to learn about Large Language Models (Part 1/3: How do LLMs work?)

Keremaydin
7 min read · Nov 30, 2023


image generated by DALL-E

Large language models (LLMs) are a subset of deep learning: large, general-purpose language models that can be pre-trained and then fine-tuned for specific purposes. LLMs can be used for a variety of tasks, including language translation, sentence completion, text classification, and question answering. Because they are pre-trained on broad data, they need only minimal domain-specific training data to reach decent performance on a new problem, which makes them well suited to few-shot or zero-shot scenarios. This series of articles goes through every aspect of large language models; the first part is dedicated to the inner workings of an LLM.

Introduction

Large language models fall within generative artificial intelligence, which is itself part of the broader field of machine learning. The objective of generative AI is to extract statistical patterns from extensive datasets created by humans.

Interacting with large language models is very different from working with other machine learning approaches. Instead of writing code, we use human-written instructions called prompts to get an LLM to perform a task.

ChatGPT logo

Today, ChatGPT is the perfect example of a large language model application. As we know, ChatGPT is a chatbot you can have a conversation with. Behind the curtain, what ChatGPT actually does is predict the next word from the previous words. Traditional language models do the same thing, but they have faced challenges such as predicting words without sufficient context, handling words with multiple meanings, and resolving the ambiguity of a sentence.

But how does a language model predict the next word from the previous words?

Predicting the word ‘book’ from previous words (image by author)
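To make the prediction task concrete before we look inside the model, here is a deliberately simplistic sketch: a count-based bigram "model" that always predicts the most frequent word observed after the previous word. The toy corpus and word choices are made up for illustration; real LLMs do nothing this crude, as the rest of the article explains.

```python
# A deliberately tiny, count-based "language model": it predicts the next
# word purely from bigram counts over a handful of example sentences.
from collections import Counter, defaultdict

corpus = [
    "the teacher teaches the student with the book",
    "the teacher reads the book",
    "the student reads the book with the teacher",
]

# Count how often each word follows each previous word.
bigram_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        bigram_counts[prev][nxt] += 1

def predict_next(prev_word: str) -> str:
    """Return the most frequent word observed after `prev_word`."""
    candidates = bigram_counts[prev_word]
    return candidates.most_common(1)[0][0] if candidates else "<unknown>"

print(predict_next("the"))    # e.g. 'teacher' or 'book', whichever is most frequent
print(predict_next("reads"))  # 'the'
```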

Let’s dive into the heart of language models and see why they are so effective today.

Traditional language models were built with recurrent neural networks (RNNs). A recurrent neural network generates each output from both the current input and its previous output, which lets it imitate a kind of memory. However, this type of model struggled with long sequences: it tended to forget the beginning of a sequence once the sequence grew too long. That’s why complex applications like ChatGPT could not be built with this type of architecture.

image from Analytics Vidhya[9]
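To make the recurrence concrete, here is a minimal NumPy sketch of an RNN cell with randomly initialised weights (the dimensions are arbitrary placeholders). Note that the loop over time steps cannot be parallelised, and all memory of the sequence has to be squeezed into a single hidden-state vector, which is where the forgetting problem comes from.

```python
# A minimal recurrent cell: the new hidden state depends on the current
# input AND the previous hidden state, which is how an RNN imitates memory.
import numpy as np

rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16

W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden -> hidden
b_h = np.zeros(hidden_dim)

def rnn_forward(inputs):
    """Process a sequence one step at a time, carrying the hidden state."""
    h = np.zeros(hidden_dim)
    for x_t in inputs:                       # sequential: no parallelism over time
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
    return h                                 # final hidden state summarises the sequence

sequence = rng.normal(size=(20, input_dim))  # 20 time steps of dummy input vectors
print(rnn_forward(sequence).shape)           # (16,)
```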

This all changed when the transformers attacked. In 2017, the transformer architecture was introduced in the paper ‘Attention Is All You Need’ by Vaswani et al. What made the transformer so effective was its ability to learn the relevance and context of all of the words in a sentence: not just each word’s relation to its immediate neighbors, but to every other word in the sentence. This is what separates self-attention from the earlier attention mechanisms used with RNNs. Contrary to recurrent neural networks, transformers can process the entire sentence at once rather than one word at a time. This allows parallelization during training, which is what made training huge models like large language models possible.

To understand the transformer architecture, we first have to go through its building blocks.

Building Blocks of a Transformer

1. Embeddings

We know that computers can only work with numbers and do not understand text. So before passing text into the model, you must first tokenize it. Tokenization is simply partitioning text into smaller units. These units can be words, characters, or subwords (groups of characters). Most of the time, tokenization is done at the subword level; these character groups can be whole words, prefixes, suffixes, or just sequences of characters with no meaning on their own. With a typical tokenizer, a text of 1,000 words yields roughly 1,200 tokens.

image by author
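As a quick illustration, here is what subword tokenization looks like with the GPT-2 tokenizer from the Hugging Face transformers library (assuming the library is installed; the exact splits depend on the tokenizer’s learned vocabulary):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The teacher teaches the student with the book."
tokens = tokenizer.tokenize(text)       # subword pieces (the 'Ġ' marks a preceding space)
token_ids = tokenizer.encode(text)      # integer ids the model actually consumes

print(tokens)
print(token_ids)
print(f"{len(text.split())} words -> {len(token_ids)} tokens")
```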

After tokenization, it is time to give the tokens a numerical representation. The most popular representation is called an embedding. Embeddings are fixed-size vectors that capture the semantic and syntactic meaning of the input so that the model can understand context. You can read my other article or watch Jay Alammar’s word2vec video [6] to understand how embedding vectors are learned.
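Here is a minimal sketch of the lookup step, using PyTorch’s nn.Embedding as a stand-in for the learned embedding table inside a transformer; the vocabulary size, dimension, and token ids below are arbitrary placeholders:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 512
embedding = nn.Embedding(vocab_size, d_model)        # a learnable [vocab_size, d_model] table

token_ids = torch.tensor([[464, 11429, 318, 1327]])  # hypothetical token ids, batch of 1
token_embeddings = embedding(token_ids)              # lookup: one 512-d vector per token
print(token_embeddings.shape)                        # torch.Size([1, 4, 512])
```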

2. Positional Encoding

The greatest achievement of the transformer architecture is the parallel processing of the input, unlike traditional methods. In traditional methods the input had to be processed in order, so there was no room for parallelization. With transformers, processing is parallel, which means every word is processed at the same time, and the positional information is lost. That’s why we add a positional encoding to each word’s embedding vector to preserve the positional information.
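The original transformer paper uses fixed sinusoidal positional encodings (other models learn them instead). Here is a small NumPy sketch of that scheme, producing one d_model-dimensional vector per position that is added to the token embedding at that position:

```python
# Sinusoidal positional encoding from 'Attention Is All You Need':
#   PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
#   PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, None]            # [seq_len, 1]
    dims = np.arange(0, d_model, 2)[None, :]           # even dimension indices (2i)
    angle_rates = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle_rates)                  # even dimensions get sine
    pe[:, 1::2] = np.cos(angle_rates)                  # odd dimensions get cosine
    return pe

pe = positional_encoding(seq_len=10, d_model=512)
print(pe.shape)   # (10, 512) -- same shape as the embeddings it is added to
```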

3. Multi-head Self-Attention

Attention weights represent the relevance and importance of a particular word. In the case of self-attention, each weight measures the importance of one word with respect to another word; in other words, it quantifies the relation between two distinct words.

image from Coursera course[1]

For example, the word ‘book’ is more closely related to the word ‘teacher’ than to the word ‘with’. Multi-head self-attention, on the other hand, trains multiple sets of self-attention weights, and each set captures a different aspect of the relations between words, such as syntax, semantics, or rhyme.
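Here is a NumPy sketch of scaled dot-product self-attention for a single head, with random matrices standing in for the learned projection weights; the sentence length and dimensions are arbitrary:

```python
# Each token's output is a weighted mix of every token's value vector, with
# the weights given by softmax(QK^T / sqrt(d_k)).
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 512, 64               # e.g. 5 tokens in the input sentence

x = rng.normal(size=(seq_len, d_model))          # token embeddings + positional encodings
W_q = rng.normal(scale=0.02, size=(d_model, d_k))
W_k = rng.normal(scale=0.02, size=(d_model, d_k))
W_v = rng.normal(scale=0.02, size=(d_model, d_k))

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

Q, K, V = x @ W_q, x @ W_k, x @ W_v              # [seq_len, d_k] each
scores = Q @ K.T / np.sqrt(d_k)                  # relevance of every token to every other
attention_weights = softmax(scores, axis=-1)     # each row sums to 1
output = attention_weights @ V                   # context-aware representation per token

print(attention_weights.shape)  # (5, 5): one weight for every token pair
print(output.shape)             # (5, 64)
# Multi-head attention repeats this with several independent sets of
# W_q, W_k, W_v and concatenates the per-head outputs.
```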

4. Feed-forward Neural Network

A feed-forward neural network is made up of multiple fully connected layers that model the linear or non-linear relationship between inputs and outputs. It has the most basic structure of all the neural network architectures.

Feed-forward neural network
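Inside a transformer layer, this block is a position-wise feed-forward network: two fully connected layers with a non-linearity in between, applied to each token’s vector independently. Here is a sketch with placeholder weights (the sizes follow the original paper, d_model = 512 and d_ff = 2048):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 512, 2048

W1, b1 = rng.normal(scale=0.02, size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(scale=0.02, size=(d_ff, d_model)), np.zeros(d_model)

def feed_forward(x: np.ndarray) -> np.ndarray:
    """x: [seq_len, d_model] -> [seq_len, d_model], applied per token."""
    hidden = np.maximum(0, x @ W1 + b1)   # linear layer followed by ReLU
    return hidden @ W2 + b2               # project back down to d_model

tokens = rng.normal(size=(5, d_model))
print(feed_forward(tokens).shape)         # (5, 512)
```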

The whole process of a Transformer

Let’s go through the whole process, now that we have covered the unique building blocks of a transformer.

The architecture can be seen below:

image from Coursera[1]

The steps of the transformer, with illustrations:

1. You first tokenize the input text, and these tokens are then encoded as numbers and converted into embeddings. In the example below, the semantic information of each word is preserved in an embedding vector with a shape of [512, 1].
image from Coursera[1]

2. As you feed the token vectors into the base of the encoder and decoder, you also add the positional encodings. The model processes each of the input tokens in parallel, so by adding the positional encodings you preserve the information about word order.

image from Coursera[1]

3. Once you have summed the token embeddings and the positional encodings, you pass the resulting vectors to the self-attention layer. Here the model analyzes the relationships between the tokens in your input sequence. The self-attention weights that are learned during training and stored in these layers reflect the importance of each word in the input sequence to all the other words in the sequence. The transformer architecture actually uses multi-headed self-attention, which means that multiple sets of self-attention weights are learned in parallel, independently of each other, and each learns a different aspect of language.

image from Coursera[1]

4. Now that all of the attention weights have been applied to your input data, the output is processed through a fully connected feed-forward network.

image from Coursera[1]

5. The output of this layer is a vector of logits: one unnormalized score for each token in the tokenizer’s vocabulary. You can then pass these logits to a final softmax layer, where they are normalized into a probability score for each word in the vocabulary.

image from Coursera[1]

6. The word with the highest probability is the one generated by the transformer. The generated token is then fed back into the decoder to generate the next word, and this process repeats until an end-of-sequence token is produced (a small sketch of these last two steps follows this list). Below you can see a language translation task being performed by a transformer:

image from Coursera[1]
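Here is the promised sketch of those final two steps: turning logits into probabilities with a softmax and greedily picking the next token until an end token appears. The model below is a random placeholder, not a trained transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, eos_id = 10_000, 0

def model(generated_ids):
    """Placeholder for a trained decoder: returns one logit per vocabulary token."""
    return rng.normal(size=vocab_size)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

generated = [42]                     # hypothetical start / prompt token id
for _ in range(20):                  # cap the number of generated tokens
    logits = model(generated)        # unnormalised scores for every vocabulary token
    probs = softmax(logits)          # normalised probability distribution
    next_id = int(np.argmax(probs))  # greedy decoding: take the most probable token
    generated.append(next_id)
    if next_id == eos_id:            # stop once the end token is produced
        break

print(generated)
```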

Conclusion

Following these steps, a transformer can generate a sequence from a given sequence. And by fine-tuning these transformer models on a specific task, we can develop applications like chatbots that generate an answer to the user’s input. ChatGPT was built by fine-tuning GPT (Generative Pre-trained Transformer) on the task of conversational AI. In the next article, we will examine how these large language models are pre-trained and fine-tuned.

References

[1] Generative AI with Large Language Models, DeepLearningAI, Coursera course

[2] All you need to know to Develop using Large Language Models by Sergei Savvov, https://towardsdatascience.com/all-you-need-to-know-to-develop-using-large-language-models-5c45708156bc

[3] Mastering Large Language Models: A 7-Step Learning Journey, Youssef Hosni, https://levelup.gitconnected.com/mastering-large-language-models-a-7-step-learning-journey-0135a4dc822d

[4] Mastering AI Jargon — Your Guide to OpenAI & LLM Terms, https://www.youtube.com/watch?v=q4G6X09NEu4

[5] Natural Language Processing and Large Language Models by Serrano.Academy, https://www.youtube.com/playlist?list=PLs8w1Cdi-zvYskDS2icIItfZgxclApVLv

[6] The Illustrated Word2vec — A Gentle Intro to Word Embeddings in Machine Learning by Jay Alammar, https://www.youtube.com/watch?v=ISPId9Lhc1g

[7] What is a large language model (LLM)? https://www.elastic.co/what-is/large-language-models

[8] What are Large Language Models, https://machinelearningmastery.com/what-are-large-language-models/

[9] A Brief Overview Recurrent Neural Networks, https://www.analyticsvidhya.com/blog/2022/03/a-brief-overview-of-recurrent-neural-networks-rnn/
