Transformers (simplified) in the context of LLMs! — 2024

Parichay Pothepalli
6 min read · Jan 9, 2024


Transformers

If the phrase “Attention Is All You Need” doesn’t ring a bell, chances are high that you haven’t been keeping up with Data Science. Today we shall demystify Transformers and attention in the context of Large Language Models.

With the boom of Large Language Models, I wanted to focus this article on three major components:

  1. Why did Transformers come into existence?
  2. Why are they more useful than traditional generative AI approaches (before 2017)?
  3. Modern-day use cases for Transformers in Large Language Models (including ChatGPT, Bard, and T5).

I. Why did Transformers come into existence?

Transformers are based on the idea that paying attention to context in natural language is extremely important. Before 2017 (and still today for some use cases), Recurrent Neural Networks (including bi-directional variants) were the best-known architectures for generative language tasks.

Generative AI: a term used when we apply Machine Learning/Neural Networks across multiple modalities (vision, language, video) to generate new content from a small to large piece of input data.

In this article we shall focus on generating text from a given prompt/input text. RNNs have been useful for predicting a word given a prior sequence of words.

RNN Next word prediction

The above image is a basic representation of a model generating multiple candidate responses for “next word prediction,” given the set of words that precede it. The model (an RNN in this case) assigns a probability to each candidate next word and outputs the one with the highest probability.
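
To make the idea concrete, here is a toy sketch in Python (the vocabulary, context, and scores are made up purely for illustration, not values from any real model) of how a next-word predictor turns scores into probabilities and keeps the most likely word:

```python
# Toy next-word prediction: the model scores every word in a small vocabulary
# and we keep the one with the highest probability (greedy decoding).
import math

context = "the cat sat on the"

# Pretend these are the raw scores (logits) an RNN produced for the next word.
logits = {"mat": 3.2, "dog": 1.1, "banana": -0.5, "moon": 0.3}

# Softmax turns raw scores into a probability distribution over the vocabulary.
total = sum(math.exp(score) for score in logits.values())
probs = {word: math.exp(score) / total for word, score in logits.items()}

next_word = max(probs, key=probs.get)
print(f"{context} -> {next_word}  (p={probs[next_word]:.2f})")
```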

Challenges:

  1. Limited context: the model predicts mainly from the immediately preceding words rather than the meaning hidden in the whole sentence.
  2. Scaling is expensive: RNNs process tokens sequentially, so training them on large corpora consumes a lot of compute, which gets very costly at scale.
  3. When not enough input data is provided for training, the model produces underwhelming output.

Thus comes the concept of Attention and the evolution of Transformers.

II. Transformers as a Generative AI solution:

Transformers are a deep learning architecture built on the mechanism of multi-headed self-attention.

Multi-headed self-attention:

It is the ability of the model to learn the relevance of every word in a sentence to a particular word, not just the preceding words. The attention weights tell us the strength of the relationship between words at different positions in a sentence. It is multi-headed because the model learns several independent sets of attention weights (heads) in parallel during training.

Multi-headed self-attention map

E.g., one head may learn the relationships between entities in a sentence, another may learn whether words rhyme, another may focus on phonetics, and so on.
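
Below is a minimal PyTorch sketch of multi-headed scaled dot-product self-attention (the sequence length, dimensions, and random inputs are arbitrary choices for illustration, not values from the article):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model, n_heads = 6, 64, 8      # 6 tokens, 64-dim embeddings, 8 heads
d_head = d_model // n_heads

x = torch.randn(seq_len, d_model)          # token embeddings (stand-in values)

# One linear projection each for queries, keys, and values.
w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

# Project, then split the model dimension into n_heads smaller heads.
q = (x @ w_q).view(seq_len, n_heads, d_head).transpose(0, 1)  # (heads, seq, d_head)
k = (x @ w_k).view(seq_len, n_heads, d_head).transpose(0, 1)
v = (x @ w_v).view(seq_len, n_heads, d_head).transpose(0, 1)

# Each head computes its own attention weights: how relevant every token
# is to every other token, regardless of position.
scores = q @ k.transpose(-2, -1) / d_head ** 0.5   # (heads, seq, seq)
weights = F.softmax(scores, dim=-1)
out = weights @ v                                   # (heads, seq, d_head)

# Concatenate the heads back into a single representation per token.
out = out.transpose(0, 1).reshape(seq_len, d_model)
print(weights.shape, out.shape)   # torch.Size([8, 6, 6]) torch.Size([6, 64])
```

Each of the eight heads produces its own 6x6 weight matrix, which is what lets different heads specialize in different kinds of relationships.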

Step 1: Convert text to tokens/IDs

We convert text into machine-readable form (numbers) and pass it to the model as the initial input.

Input for the Transformer (encoder end)
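
As a toy illustration (the vocabulary and sentence are made up; real tokenizers use subword vocabularies with tens of thousands of entries), a word-level tokenizer might look like this:

```python
# Toy word-level tokenizer: each word in a small, made-up vocabulary
# gets an integer ID the model can work with.
vocab = {"<unk>": 0, "the": 1, "teacher": 2, "taught": 3, "student": 4}

def tokenize(text):
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

print(tokenize("The teacher taught the student"))  # [1, 2, 3, 1, 4]
```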

Step 2: Convert each token ID into a vector representation:

We then send this input to the embedding layer, which represents each token ID as a vector in an n-dimensional space.

Embedding layer representation
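
A minimal PyTorch sketch of this step (the vocabulary size and embedding dimension are arbitrary placeholders):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512         # sizes chosen only for illustration
embedding = nn.Embedding(vocab_size, d_model)

token_ids = torch.tensor([1, 2, 3, 1, 4])    # output of the tokenizer step
token_vectors = embedding(token_ids)          # one d_model-dim vector per token
print(token_vectors.shape)                    # torch.Size([5, 512])
```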

Step 3: Add a positional encoding to give structural understanding:

These vectors are then further encoded based on the position of each word in the sentence. So now we have two encodings: the token embedding and the positional embedding, which are added together.
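
One common scheme, used in the original “Attention Is All You Need” paper, is the sinusoidal positional encoding; a sketch is shown below (the dimensions are placeholders, and this is not necessarily the exact scheme every model uses):

```python
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    # Each position gets a unique pattern of sine/cosine values,
    # so the model can tell token order apart.
    pos = torch.arange(seq_len).unsqueeze(1).float()     # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()               # even dimensions
    angle = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

# Token embedding + positional embedding = input to the attention layers.
pe = sinusoidal_positional_encoding(seq_len=5, d_model=512)
# x = token_vectors + pe   (token_vectors from the embedding step above)
print(pe.shape)  # torch.Size([5, 512])
```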

Step 4: Feed the embeddings into the feed-forward layer:

These embeddings are sent through a feed-forward layer to the final softmax layer. We receive a probability score for each token in the vocabulary, and the one with the highest probability is predicted as part of the completion.
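
A sketch of this final projection and softmax in PyTorch (the hidden states, layer sizes, and vocabulary size are stand-ins, not values from the article):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model, vocab_size = 512, 10000
hidden = torch.randn(5, d_model)         # output of the attention layers (stand-in)

# Position-wise feed-forward network, applied to every token independently.
ffn = nn.Sequential(nn.Linear(d_model, 2048), nn.ReLU(), nn.Linear(2048, d_model))
# Final projection maps each token representation back onto the vocabulary.
to_vocab = nn.Linear(d_model, vocab_size)

logits = to_vocab(ffn(hidden))            # (5, vocab_size)
probs = F.softmax(logits, dim=-1)         # a probability for every vocabulary token
print(probs[-1].argmax())                 # highest-probability next-token ID
```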

Transformer Architecture

Step 5: Feed the final output from the encoder to the decoder:

We send a deep representation of the input text from the encoder to the decoder. We then feed a start-of-sequence token as input to the decoder, which uses the context provided by the encoder to predict the next word in the sentence.

We repeat this on the decoder end until a complete sentence is produced.

In summary, the encoder takes in the prompt, builds a contextual understanding of it, and produces one vector per input token; the decoder accepts input tokens and generates new tokens.
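
The whole encoder-decoder loop can be sketched with PyTorch’s built-in nn.Transformer module. The model below is untrained, so its predictions are meaningless noise; the point is only the shape of the loop: encode the prompt once, then feed each predicted token back into the decoder:

```python
import torch
import torch.nn as nn

vocab_size, d_model, sos_id = 10000, 512, 1   # placeholder sizes and start token

embed = nn.Embedding(vocab_size, d_model)
transformer = nn.Transformer(d_model=d_model, batch_first=True)
to_vocab = nn.Linear(d_model, vocab_size)

src_ids = torch.tensor([[1, 2, 3, 1, 4]])     # encoder input (token IDs)
tgt_ids = torch.tensor([[sos_id]])            # start-of-sequence token

for _ in range(5):                             # generate 5 tokens
    out = transformer(embed(src_ids), embed(tgt_ids))
    next_id = to_vocab(out[:, -1]).argmax(dim=-1, keepdim=True)
    tgt_ids = torch.cat([tgt_ids, next_id], dim=1)   # feed the prediction back in

print(tgt_ids)   # the growing decoder sequence
```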

III. Modern-day use cases and types of Transformer architectures:

Sequence-to-Sequence Models [BART]

The architecture we discussed above mirrors sequence-to-sequence modeling with Transformers.

Seq to Seq model

The focus is on mapping one sequence to another, which makes these models ideal for translation-related tasks. They don’t fare as well on open-ended text generation.
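
As a usage sketch (assuming the Hugging Face transformers library is installed; the checkpoint downloads on first use, and the input text here is just an example), a BART model maps an input sequence to an output sequence, in this case a summary:

```python
from transformers import pipeline

# Sequence-to-sequence: the encoder reads the full input, the decoder writes
# a new sequence conditioned on it.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

text = (
    "Transformers are a deep learning architecture based on multi-headed "
    "self-attention. They replaced RNNs for most language tasks after 2017."
)
print(summarizer(text, max_length=30, min_length=5)[0]["summary_text"])
```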

Auto-Regressive Models [GPT-3/3.5]

These models leverage the prior tokens/words to predict the next token iteratively. They use probabilistic inference to generate text, relying heavily on the decoder component of the Transformer. These models do not require an explicit input sequence and are thus very popular for today’s text-generation tasks.
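
A usage sketch of autoregressive generation with GPT-2 via the Hugging Face pipeline API (again assuming the transformers library is installed; the prompt and settings are arbitrary):

```python
from transformers import pipeline

# Decoder-only generation: each new token is predicted from the tokens
# generated so far, one step at a time.
generator = pipeline("text-generation", model="gpt2")
result = generator("Transformers are", max_new_tokens=20, do_sample=False)
print(result[0]["generated_text"])
```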

One problem is that the model doesn’t really have to understand the underlying text to produce results. It produces whatever output fits best structurally, positionally, and mathematically, which often leads to hallucinations. Unethical responses and incorrect facts are some of the outcomes one may face if the model is not grounded with techniques such as Retrieval-Augmented Generation (which we shall discuss in the next article).

Auto-Encoding Models [BERT]

These models tend to be more robust in nature but are less widely used for generation today. They mask tokens to deliberately corrupt the training data, and the model learns by reconstructing the missing tokens. Being bi-directional in nature, they understand context from both before and after each word, which captures relationships and dependencies among words better.

Auto-encoding models are primarily used for Natural Language Understanding and classification tasks.
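
A usage sketch of the masked-token objective with BERT via the Hugging Face fill-mask pipeline (the sentence is just an example, and the library/model download is assumed):

```python
from transformers import pipeline

# Auto-encoding: BERT fills in the masked token using context from both
# sides of the blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("The teacher [MASK] the student."):
    print(prediction["token_str"], round(prediction["score"], 3))
```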

Types of Transformer architectures

Sequence-to-sequence models excel at mapping sequences between languages, autoregressive models are powerful for text generation, and auto-encoding models focus on language understanding and classification.

This article gives us the foundation to understand Large Language Models, how they work, and how to fine-tune them for organizational use cases.


Parichay Pothepalli

Data Scientist who believes that AI/Deep Learning is here to change the way we view the world. Join me in this journey of growth, sharing experiences.