What are LLMs? Understanding different LLM families

GPT, Llama, phi, Mistral and more

Mehul Gupta
Data Science in your pocket
6 min read · Dec 13, 2023



I have been talking a lot about GenAI in my previous posts, be it LangChain-based tutorials, AI agents, etc. But I have missed an important topic that sits at the core of all this GenAI fuss: the different LLM families. You must have heard of Mistral, GPT, Llama, phi, Anthropic's Claude, etc., and, like me, you may have wondered how these models differ and when to use which model.

My debut book “LangChain in your Pocket” is out now !!

In this post, we will explore

What an LLM is, along with some of its important features

Why do we need so many LLMs?

Some important LLM families like GPTs, Llama, and others

But before we jump ahead, let’s take a step back and understand

What is an LLM?

  • An LLM is no black box but an ML model (based on neural networks) that predicts the ‘next’ token given the input prompt and the sequence of previously generated tokens.
  • How is it able to capture the context of the input? Multi-head attention helps the model focus on the important tokens in the input relative to the others. If you’re interested in the mathematics, the ‘Attention Is All You Need’ paper covers it in detail.
  • At any point in time, the LLM itself doesn’t know the final output as predictions are done one token at a time.
  • Such models can handle a variety of problems and aren’t restricted to a specific one (say, classification) because of their zero-shot abilities, i.e. generating results for a problem the model has never explicitly seen during training.
  • LLMs are pretty huge (generally some GBs) and are trained on huge datasets.
  • Every LLM follows the Transformer architecture explained in ‘Attention Is All You Need’, or a subset of it (like GPT, which is just the decoder part of the Transformer). So, if you know Transformers, you more or less know every LLM’s architecture.
  • LLMs can be thought of as ‘generalists’: decent at almost any task, but not outstanding at any single one. Such a version of an LLM is called a ‘pre-trained’ LLM, trained on a huge general-purpose corpus.
  • To make an LLM a ‘specialist’ for specific tasks, you need to ‘fine-tune’ it, i.e. further train the pre-trained model on your task-specific data.
  • Most LLMs come in various sizes like small, large, XL, XXL, etc. These sizes refer to the number of weight parameters, from the fewest in ‘small’ to the most in ‘XXL’.
  • Why would the size vary if the architecture remains the same? A few reasons:

A larger input (e.g. a bigger vocabulary or longer supported context), which changes the dimensions of some weight matrices (such as the token and positional embeddings) and hence the parameter count.

A larger attention matrix (i.e. a bigger embedding dimension per token), which enlarges every weight matrix in the model.

More attention blocks (layers), or minor tweaks in the architecture.

  • You might have also heard of Llama-70B or Llama-7B. What’s this number in the name? It signifies the number of parameters in that model: 70B means 70 billion parameters. Some LLM families don’t follow the small/large/XL terminology but mention the parameter count in the name itself.
  • Exact information about the training data or architecture isn’t usually open-sourced for every LLM. Hence, there may be some gaps when detailing certain LLM families.
  • Loading LLMs in memory is usually a challenge due to their humongous size.
  • As LLMs are generative models, there is usually an element of randomness (sampling) in the final output. Hence, for the same input, you are unlikely to get exactly the same output twice (a minimal decoding sketch follows this list).
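
To make the ‘one token at a time’ and randomness points concrete, here is a minimal decoding-loop sketch using the Hugging Face transformers library. GPT-2 is used purely because it is small and openly downloadable; the loop itself is a simplification of what any decoder-style LLM does internally.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small, openly available model used only to illustrate the decoding loop
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large language models are"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                               # generate 20 tokens, one at a time
        logits = model(input_ids).logits              # scores for every vocabulary token
        next_token_logits = logits[:, -1, :] / 0.8    # temperature: <1 sharpens, >1 flattens
        probs = torch.softmax(next_token_logits, dim=-1)
        next_token = torch.multinomial(probs, 1)      # sampling is where the randomness comes in
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))

Greedy decoding (always picking the highest-probability token) would make the output deterministic; sampling with a temperature is what makes repeated runs differ.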

One of my teammates asked me a very interesting question,

Why do we need so many LLMs?

To answer this, I will draw an analogy from machine learning itself, namely classification. Have you ever wondered why we have so many classification models? This is because

Different models come in handy for different types of datasets, and no single model yields good results for all sorts of problems. The same is true for LLMs: different LLMs have expertise in different domains. Some may be great at coding tasks, while others shine at general NLP.

This difference in expertise comes down to a) the datasets used for training these LLMs, b) the size of the corpus, and c) minor tweaks in the training process.

You might not always be able to use the best-performing LLM, whether due to its size, cost, safety gaps, or other reasons. In such cases, you need alternatives. So, depending on your use case, you pick the most suitable model.

At times, you want to accomplish an easy task that a smaller model can handle rather than going for the big guns. So why waste resources?

For all these reasons, we need multiple models.

Now that we know what an LLM is at its core, let’s go through some popular LLM families and their significant members. Though every LLM comes in many variants of different sizes, we will consider the size of the most significant model in each family.

GPT (GPT-3.5, GPT-3.5 Turbo & GPT-4)

This is where it all started. ChatGPT is based on one of the variants of this family, GPT-3.5. Let’s quickly talk about its core features:

  • Architecture: GPTs are based on the decoder part of a Transformer, hence they have generative capabilities.
  • Parameters: GPT-3.5 → 175B; GPT-4 → reportedly ~1,760B (OpenAI has not officially disclosed it).
  • GPT-3.5 Turbo is an enhanced, more optimized version of GPT-3.5, reportedly with just ~20B parameters (again unconfirmed), tuned specifically for chat/NLP tasks.
  • Out of all these, only GPT-4 is multi-modal, i.e. it can handle inputs beyond plain text, such as images (a minimal API-call sketch follows this list).
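
For completeness, here is a minimal sketch of calling a GPT-family model through OpenAI’s API. The model name and client interface shown assume the openai Python package v1.x; check the official docs for the current version.

from openai import OpenAI

client = OpenAI()  # expects the OPENAI_API_KEY environment variable to be set

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # or "gpt-4"
    messages=[{"role": "user", "content": "Explain in one line what an LLM is."}],
    temperature=0.7,
)
print(response.choices[0].message.content)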

Llama (Llama & Llama-2)

This family is Meta’s representative in the GenAI world.

  • Architecture: Like GPT, Llama is a decoder-only model, i.e. it uses just the decoder part of the Transformer rather than the full encoder-decoder architecture.
  • Parameters: Llama → 65B, Llama-2 → 70B
  • A key feature of this family is its focus on safety and security; it has a lower safety-violation rate compared to other models.
  • Regarding performance, Llama-2, the best member of the family, is roughly on par with GPT-3.5 but not better than GPT-4.

Mistral (Mistral, Mixtral)

Mistral AI is a startup that came into the limelight with its first model, Mistral-7B, which beat some big names in the market in terms of performance, and that too with just 7B parameters.

  • Architecture: Similar to GPT, these models are based on the Decoder part of the Transformer
  • Parameters: Mistral → 7B; Mixtral (8x7B) → roughly 47B in total.
  • Mixtral is a Mixture of Experts (MoE) model. Although all of its parameters have to be loaded into memory, only around 13B of them are active for any given token, so its per-token compute cost is closer to that of a ~13B model.

What is a Mixture of Experts?

A Mixture of Experts (MoE) is a machine learning model architecture that combines the capabilities of multiple “expert” models to improve overall performance. The basic idea is to divide the input space into regions and assign different experts to handle each region. Each expert specializes in a subset of the input space and is responsible for making predictions within that region. A gating network is used to determine which expert is activated for a particular input.
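
Here is a toy PyTorch sketch of the gating idea described above. The sizes, number of experts, and top-k routing below are illustrative only, not Mixtral’s actual configuration.

import torch
import torch.nn as nn

class ToyMoE(nn.Module):
    def __init__(self, dim=32, num_experts=4, top_k=2):
        super().__init__()
        # Each "expert" is a small feed-forward network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(dim, num_experts)  # gating network: scores each expert per input
        self.top_k = top_k

    def forward(self, x):                                  # x: (batch, dim)
        scores = torch.softmax(self.gate(x), dim=-1)       # expert weights per input
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = (topk_idx == i).any(dim=-1)             # which inputs route to expert i
            if mask.any():
                weight = scores[mask, i].unsqueeze(-1)     # (real MoEs renormalize over the top-k)
                out[mask] += weight * expert(x[mask])      # only selected experts do any compute
        return out

moe = ToyMoE()
print(moe(torch.randn(8, 32)).shape)  # torch.Size([8, 32])

In Mixtral’s reported setup, each token is routed to 2 of 8 experts, which is why only a fraction of the total parameters is used per token.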

Flan (Flan-T5, Flan-Alpaca)

Flan models are a little different from the model families discussed above, as they are LLMs fine-tuned using instruction fine-tuning.

What is Instruction Fine-Tuning?

In instruction fine-tuning, you provide an ‘instruction’ alongside the input and output while training the model. A general (non-instruction) fine-tuning training sample for an LLM may look like this:

Input : 'Write a story on animals and humans'
Output : '........'

Input : 'Extract entities from this text : he is a boy'
Output :'......'

Now, instruction fine-tuning looks like this (a small formatting sketch follows these examples):

Instruct: 'Write a story on a given topic'
Input : 'animals and humans'
Output : '........'

Instruct : 'Extract entities from given text'
Input : 'He is a boy'
Output : '.......'
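
In practice, such (instruction, input, output) triplets are usually flattened into a single prompt string before training. The template below is just an illustrative, Alpaca-style example, not the exact format used by the Flan models:

def build_training_text(instruction: str, input_text: str, output: str) -> str:
    # Illustrative template; real projects define their own formats
    return (
        "### Instruction:\n" + instruction + "\n\n"
        "### Input:\n" + input_text + "\n\n"
        "### Response:\n" + output
    )

print(build_training_text(
    instruction="Extract entities from the given text",
    input_text="He is a boy",
    output="...",   # the target completion goes here
))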

While Flan-T5 is an instruction-fine-tuned version of Google’s T5 model, Flan-Alpaca is Flan-T5 further fine-tuned on Alpaca-style instruction data (Alpaca itself being an instruction-tuned version of Llama).

  • Architecture: the Transformer encoder-decoder architecture inherited from T5
  • Parameters : Flan-T5-XXL → 11B, Flan-Alpaca → 11B (a quick usage sketch follows this list)
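
As a quick usage sketch, an instruction-tuned Flan-T5 checkpoint can be prompted directly via the Hugging Face transformers library. "google/flan-t5-small" is used below only to keep the download small; larger checkpoints expose the same interface.

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

prompt = "Extract entities from the given text: He is a boy"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))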

phi (phi-1, phi-1.5, phi-2)

phi, by Microsoft, is focused on delivering good results while keeping the model size comparatively very small, hence making it more production-friendly.

  • Architecture: they also have the Transformer (decoder-style) architecture at their core
  • Parameters : phi-1 & phi-1.5 → 1.3B, phi-2 → 2.7B
  • phi-2 is said to beat models many times its size (such as Llama-2-13B) and even ChatGPT on some tasks (a minimal usage sketch follows this list).
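
A minimal sketch of trying phi-2 locally with transformers. The checkpoint name "microsoft/phi-2" is the Hugging Face model id at the time of writing; depending on your transformers version, trust_remote_code=True may or may not be needed.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", trust_remote_code=True)

inputs = tokenizer("Write a one-line definition of a large language model.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))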

The list of LLMs is almost inexhaustible. Some other important models I couldn’t cover here but that are worth knowing are Claude by Anthropic (reported to give results similar to GPT-3.5; its parameter count, often cited as ~130B, has not been officially disclosed), Cohere’s models, PaLM & T5 (by Google), Falcon, etc. Explore them as well!
