Large Language Models 101

This guide does not propose new insights into LLM architecture, nor does it provide a deep dive into how LLM hyperparameters are tuned. It doesn’t assume you know the Chinchilla paper’s ratio of parameters to training data, nor does it assume you catch every episode of AI Daily. (However, if you would like daily updates on the latest and greatest happenings in the world of LLMs, be sure to check out The Sequence newsletter.) You want straight facts, and you want them in the most distilled medium (pun intended) possible.

At a high level, large language models (LLMs) generate human-readable text from the input a user provides. While there are myriad details behind LLMs, we will stick to the important points:

  • Model Architecture
  • Training Data
  • Storing Inputs
  • Inference
  • Use Cases

Model Architecture

At a high level, an LLM attempts to predict the next word or sequence of words, given the words it has already seen. This is referred to as “next-token prediction,” where each word (or piece of a word) is represented as a token. To scale this idea out further, tokens can also represent groups of words, known as n-grams. A 1-gram is a single word like “new,” a 2-gram is a pair of words like “New York,” and a 3-gram is a phrase like “New York City.” Over the past few decades, researchers have developed a range of mathematical and computational techniques for analyzing input tokens and predicting new ones. Terms like “transformers,” “attention,” and “masking” are just fancy names for methods that researchers have spent countless hours refining to improve next-token prediction.
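
To make the n-gram idea concrete, here is a minimal Python sketch that splits a sentence into 1-grams, 2-grams, and 3-grams (production LLMs actually use subword tokenizers, so treat this purely as an illustration):

```python
def ngrams(text, n):
    """Split whitespace-separated text into overlapping n-grams."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I just moved to New York City"
print(ngrams(sentence, 1))  # ['I', 'just', 'moved', 'to', 'New', 'York', 'City']
print(ngrams(sentence, 2))  # ['I just', 'just moved', ..., 'New York', 'York City']
print(ngrams(sentence, 3))  # ['I just moved', ..., 'New York City']
```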

Parameters in a model represent the connections the model makes when learning new material. At a high level, this is similar to how the human brain functions, albeit as a loose analogy. These parameters are given random weights at the beginning of training. Each time the model iterates over training data, it strengthens or weakens those connections, a process known as adjusting the weights of the parameters. More parameters generally mean a more capable model, but also more time for every connection to reach an optimal balance with the others. Once the model is trained, these parameters are “frozen” and stored in memory, ready for predictions to be made on new input data, otherwise known as “inference.”
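
To see what “adjusting the weights” means, the toy sketch below nudges a single parameter to reduce prediction error and then freezes it for inference; real LLMs do the same thing with billions of parameters and far more sophisticated optimizers:

```python
# Toy example: learn w so that prediction = w * x approximates y = 2 * x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # (input, target) training pairs

w = 0.5             # start with an arbitrary weight
learning_rate = 0.05

for epoch in range(100):                 # iterate over the training data many times
    for x, y in data:
        error = (w * x) - y              # how wrong is the current weight?
        w -= learning_rate * error * x   # nudge the weight to reduce the error

# After training, the weight is "frozen" and used for inference on new inputs.
print(round(w, 3))         # ~2.0
print(round(w * 10.0, 2))  # prediction for a new input: ~20.0
```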


There has been exciting advancement in the natural language processing (NLP) domain over the past several years. In 2018, an LLM called BERT (Bidirectional Encoder Representations from Transformers) was released to great fanfare. Built on the transformer architecture from the 2017 paper “Attention Is All You Need,” it was the first truly “approachable” LLM: straightforward to fine-tune and quick to provide results (inference). Most notable was that BERT was trained on 64 TPU chips over 4 days, and it has since spawned 23 smaller versions of BERT that reach 95% of its performance while running 60% faster. Rightly so, HuggingFace (a popular open-source NLP library and model hub) calls BERT the “Swiss army knife solution” to most common NLP tasks.

In 2020, T5 (Text-to-Text Transfer Transformer) was released. Its largest version was trained with 11 billion parameters, and it built on the same transformer foundations as BERT, albeit with a full encoder-decoder architecture that we won’t dive into here (but if you want, look up encoder-decoder architectures). While more updated versions of T5 have been released, the original architecture is very much still in use today.

OpenAI released GPT-1 in 2018 and GPT-2 in 2019. GPT-2 was a transformative moment due to the size of its training data and its 1.5 billion parameters. Then in 2020, GPT-3 was unveiled. This monolith was more than 100 times larger than its predecessor (GPT-2) at 175 billion parameters and exhibited unprecedented few-shot and zero-shot learning. In all, GPT-3 was trained on roughly 300 billion tokens (more on tokens in a moment). GPT-4, released in 2023, has surpassed the capabilities of GPT-3.

Models you may have heard of:

  • LLaMA by Meta AI (open source)
  • PaLM by Google (closed)
  • GPT-4 by OpenAI (closed)
  • Alpaca by Stanford (open source)

Training Data

And what about the data these models are trained on? LLMs like GPT-3 are typically trained on enormous amounts of text collected from the internet. The exact mix varies from model to model, but common sources include:

  • Web Pages
  • Books and Literature
  • Wikipedia
  • News Articles
  • Other Text Sources (think places like Reddit)
  • Code Repositories (think GitHub)

A recent article by The Washington Post has a phenomenal breakdown of the 15 million websites that have been used to train LLMs, including Google’s T5 and Facebook’s LLaMA.


The article does a superb job of visually highlighting where the data behind some of the most popular LLMs comes from and what share of each dataset various popular public information sources represent.


Now there is an important caveat here. The training data is typically unfiltered, and the models learn from whatever patterns and information are present in it. Not unfiltered like #unfiltered on Instagram, but unfiltered in the sense that the raw data going into the model is not curated to remove data points that are unrepresentative of the world. This means the model’s results can reflect any biases present in the data. For example, part of the data used to train ChatGPT came from Reddit, where 67 percent of users are men, and 64 percent of those are between the ages of 18 and 29. Furthermore, ChatGPT is also trained on Wikipedia, where only 15 percent of contributors identify as female.


Storing Inputs

Now that we have a model, how do we process data through it? This is where embeddings come into play. A model can only read numbers to make sense of the world, so any text it receives has to be converted into numerical form. Embeddings do exactly that: an embedding algorithm maps individual words, or whole strings of text, to vectors of numbers. These vectors are placed in a semantic space, where words or strings with similar meanings sit close to each other. See the image below, where each dot represents a token or string in an n-dimensional semantic space.

Words or phrases that are not related sit farther apart. See the image below, on the far right, for a visual comparison of text embeddings as clusters.
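
To make this concrete, here is a minimal sketch using the open-source sentence-transformers library and one popular small embedding model (both are just one possible choice, not the only option) to embed three strings and compare them with cosine similarity:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# One popular, freely available embedding model; any embedding model works the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = ["The cat sat on the mat.",
             "A kitten is resting on the rug.",
             "Quarterly revenue grew by 12 percent."]
vectors = model.encode(sentences)  # each sentence becomes a fixed-length vector of numbers

def cosine_similarity(a, b):
    """Similarity of two vectors: close to 1 means similar meaning, close to 0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(vectors[0], vectors[1]))  # high: both sentences are about cats resting
print(cosine_similarity(vectors[0], vectors[2]))  # low: unrelated topics
```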

To store this data, we use a vector database. The vector DB is a means of storing these vector embeddings so they can be queried later. To query the semantic space, a search algorithm is used, most notably Approximate Nearest Neighbors (ANN). ANN methods trade a small amount of accuracy for a large gain in speed: rather than comparing the query against every stored vector, they use index structures (clustering, graphs, or hashing) to approximate where the closest embeddings live. If the algorithm needed to check every possible match EXACTLY, query time would become a major issue as the database grows.

For more information on vector databases, please refer to this article by Pinecone.
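
As a minimal sketch of that store-and-query workflow, here is what it looks like with ChromaDB, one of the open-source tools listed below (the exact API may vary slightly between versions):

```python
import chromadb

# In-memory client; production setups would use a persistent or hosted instance.
client = chromadb.Client()
collection = client.create_collection(name="articles")

# Chroma embeds the documents with a default embedding model under the hood.
collection.add(
    documents=["LLMs predict the next token in a sequence.",
               "Vector databases store embeddings for fast similarity search.",
               "The 2017 transformer paper introduced attention."],
    ids=["doc1", "doc2", "doc3"],
)

# The query text is embedded and matched against stored vectors via approximate nearest neighbors.
results = collection.query(query_texts=["How do I store embeddings?"], n_results=1)
print(results["documents"])  # the most similar stored document
```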

Tools you may have heard of:

  • ChromaDB (open source)
  • Weaviate (open source / paid; runs in containers such as Docker or Kubernetes)
  • Zilliz Cloud (third-party tool within the AWS Marketplace; managed Milvus)
  • GCP Matching Engine (managed service)
  • Pinecone (paid)
  • Milvus (open source; Zilliz Cloud is its managed offering)

Inference

One will frequently hear the term “shots” when using and evaluating LLMs. Shots refer to the number of worked examples the user includes in the prompt to guide the LLM toward an acceptable response. With zero-shot prompting, the model receives no examples at all and must answer from its training alone; with few-shot prompting, the user supplies a handful of example inputs and outputs so the model can follow the pattern. Let’s keep in mind that in applications built around a vector DB, those examples and earlier exchanges can also be embedded and stored for future retrieval.
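
As a minimal sketch of the difference, here is a zero-shot and a few-shot prompt built as plain strings (the sentiment-classification task and wording are just a hypothetical example):

```python
# Zero-shot: no examples, just the instruction.
zero_shot_prompt = (
    "Classify the sentiment of this review as positive or negative.\n"
    "Review: The battery died after two days.\nSentiment:"
)

# Few-shot: a handful of worked examples guide the model toward the desired pattern.
examples = [
    ("I love this phone, the camera is fantastic.", "positive"),
    ("Terrible build quality, it broke within a week.", "negative"),
]
few_shot_prompt = "Classify the sentiment of each review as positive or negative.\n\n"
for review, label in examples:
    few_shot_prompt += f"Review: {review}\nSentiment: {label}\n\n"
few_shot_prompt += "Review: The battery died after two days.\nSentiment:"

print(few_shot_prompt)  # this full string is what gets sent to the LLM
```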

Moving on, there are phrases like “prompt engineering” and “fine-tuning.” Prompt engineering simply means changing the prompt to be more specific about what you want to see. By tailoring one’s wording to a more precise representation of the desired response, or even using multiple “shots” and querying based on the previous response, the user can create a more customized LLM experience. Fine-tuning, on the other hand, means continuing to train the model’s weights on domain-specific data. Beyond both of these, more engineer-focused users can turn the proverbial dials at inference time, adjusting parameters that control how the LLM retrieves vectors from the vector DB (how exact a match needs to be) or how deterministic versus creative the returned text is.
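
One of the most common of these dials is “temperature,” which reshapes the model’s next-token probabilities before one is sampled. A self-contained sketch with toy scores (invented for illustration, not taken from a real model):

```python
import numpy as np

def softmax_with_temperature(scores, temperature):
    """Convert raw token scores into probabilities; lower temperature sharpens the distribution."""
    scaled = np.array(scores) / temperature
    exp = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
    return exp / exp.sum()

tokens = ["York", "Jersey", "Zealand", "car"]
scores = [4.0, 2.5, 2.0, 0.5]  # hypothetical raw scores for the token that follows "New"

for temp in (1.0, 0.2, 2.0):
    probs = softmax_with_temperature(scores, temp)
    print(temp, {t: round(float(p), 2) for t, p in zip(tokens, probs)})
# Low temperature concentrates probability on "York" (more deterministic);
# high temperature spreads it out (more varied, "creative" output).
```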

What is happening under the hood with our friend the vector DB? Like any traditional ML process, the data going to the model for inference first needs to be transformed. Similar to how the training data was transformed for model training, the text input is embedded so it can be compared against the vector DB for similarity matches. This is where the ANN, or a similar algorithm, enters the picture: it looks for the stored embeddings most similar to the embedded user input, retrieves them, and the prediction process begins with the model weights. These weights are the numbers behind the model’s parameters and are often obfuscated by the vendor if the model is not open source. Once the model has determined a predicted probability for the next token in the sequence, it emits that token and moves on to the next. This is why responses often appear to be written out in front of you like a typewriter: the model is literally deciding what text to put next, token by token, based on predicted probabilities and a set of decoding rules!
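
To illustrate the “typewriter” effect, here is a toy decoding loop; next_token_probs is a hypothetical stand-in for a real model’s forward pass, and the probabilities are invented:

```python
import random
import sys
import time

def next_token_probs(text_so_far):
    """Hypothetical stand-in for a real model's forward pass: returns token -> probability."""
    return {" the": 0.05, " vector": 0.25, " database": 0.30,
            " stores": 0.20, " embeddings": 0.15, ".": 0.05}

text = "The"
sys.stdout.write(text)
for _ in range(6):                                      # generate six tokens, one at a time
    probs = next_token_probs(text)
    tokens, weights = zip(*probs.items())
    token = random.choices(tokens, weights=weights)[0]  # sample the next token by its probability
    text += token
    sys.stdout.write(token)                             # stream each token as it is chosen,
    sys.stdout.flush()                                  # which produces the "typewriter" effect
    time.sleep(0.2)
print()
```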

Use Cases

LLMs have a wide variety of uses. From text summarization to text generation to enhanced chatbots, LLMs are capable of handling both traditional NLP tasks and more modern ones. The most visible modern use case is content generation. For this article, ChatGPT was used to ideate, but not to directly pull text. It is very possible that I could have used ChatGPT to cut and paste content: if I asked it to tell me how a vector database worked, but prompt engineered it to give me high-level bullets, I would get back a concise, bulleted explanation ready to drop into a draft.

With more prompt engineering, I could very well create output from ChatGPT (or another LLM of choice) that mirrors my own language and has a more natural flow. In fact, I could even go as far as to use a model such as Flan-T5 (available through AWS), GCP’s PaLM 2, or an open-source option like PrivateGPT. This is the territory of “foundation models,” where an LLM trained on a large general corpus is then adapted to a specific domain, like my personal style of writing in this article. The foundation model’s weights would stay fixed, while a lightweight additional layer, together with embeddings of my writing stored in the vector database, could more coherently replicate my tone and style!

Limitations

There are limitations to these use cases that should also be kept in mind. The primary one is a phenomenon called “hallucination,” which refers to the LLM’s tendency to make up answers that don’t exist. Out of the box, a model can simply invent a plausible-sounding response when the input touches on material that was never in the training data or in the vector representations stored in the vector DB. Without LLM engineers tuning the model to the use case, such made-up results could have dire consequences.

Another key limitation is the size of the input prompt. The original ChatGPT accepts roughly 4,096 tokens (a token is roughly three-quarters of a word), and anything beyond that is ignored by the model. This becomes an issue when using multiple shots in the prompt engineering process. Since the model itself has no persistent memory between calls, we rely on chained prompting, which allows the user to break up complex prompts into multiple intermediate steps. Most notably with the use of LangChain, these steps can be chained together to form a living memory for the LLM. In turn, this allows for more efficient use of context, as the LLM can leverage previous user inputs for more refined responses.
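
Because the limit is counted in tokens rather than words, it helps to count before sending a prompt. Here is a minimal sketch using OpenAI’s open-source tiktoken library (assuming the cl100k_base encoding used by ChatGPT-era models):

```python
import tiktoken

# cl100k_base is the encoding used by ChatGPT-era OpenAI models.
encoding = tiktoken.get_encoding("cl100k_base")

prompt = "Explain, in high-level bullets, how a vector database works."
tokens = encoding.encode(prompt)
print(len(prompt.split()), "words ->", len(tokens), "tokens")

# Check the prompt against the model's context window before sending it.
CONTEXT_LIMIT = 4096
if len(tokens) > CONTEXT_LIMIT:
    tokens = tokens[:CONTEXT_LIMIT]      # naive truncation; chained prompting is the better fix
    prompt = encoding.decode(tokens)
```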

Future Discussions


The ethical and legal dilemmas that LLMs raise for the Gen AI community are for another post. While there are tremendous areas for business growth irrespective of industry, there are as yet no settled answers for how to deal with this newer implementation of NLP tech. Namely, these open questions include data bias, attribution of work, use for disinformation, authentication of ownership, siloing of tech, security, model explainability, and sensationalism around product breakthroughs.

New Horizons


LLMs have a storied history, as they build upon the foundations laid before them. From the probabilistic and statistical models of the 1990s, to neural networks in the 2000s, to more robust neural networks leveraging stronger GPUs and advanced architectures in the 2010s, generative AI has brought a new era, if not a paradigm shift, to NLP. We continue to see advancements in training data, new benchmarks in graphics processing unit (GPU) capacity, refined models, and innovative infrastructure to support novel business use cases.

Author’s note: A very special appreciation to the numerous team members at Eviden who helped review and comment on this post!


Nicholas Beaudoin
Eviden Data Science and Engineering Community

Nicholas is an accomplished data scientist with 10 years in federal and commercial consulting practice. He specializes in ML operations (MLOps).