A Brief Guide To LLM Numbers: Parameter Count vs. Training Size

Greg Broadhead
7 min read · Aug 25, 2023

Ever wondered how AI systems effortlessly understand and generate natural language text, whatever the input and context? How do they craft answers to your burning questions, draft your emails, or even write code? The magic behind this is the Large Language Models (LLMs) fuelling these systems.

LLMs are incredibly complex machine learning behemoths with an uncanny ability to process and produce human-like text. Their prowess stems from extensive training on vast corpora of text gathered from many different domains, through which they learn how patterns within that text create context and meaning.

LLMs are typically described by two headline numbers: parameter count and training size. In this article we will dig into the difference between these numbers and how each influences the capabilities of a model.

Decoding the Numbers Behind LLMs

In the ever-evolving realm of natural language processing (NLP), you might have stumbled upon names like GPT-4, BLOOM, or LLaMA, often followed by intriguing numbers such as 175B, 65B, or 7B. These aren’t just random combinations of numbers and letters; they refer to the size, and often the capability, of Transformer-based LLMs.

LLMs are the current powerhouse of artificial intelligence (AI) systems that almost magically comprehend and craft natural language text.

But what story do these numbers tell, and why should we care? Let’s dive in and unravel what these figures represent, how they relate to the tokens used during training, and how they influence an LLM’s vocabulary size, creativity, and raw capability.

Understanding the Numbers Behind LLMs

The numbers trailing the names of open-source LLMs denote the model’s parameter count. Think of parameters as the conductors orchestrating how the model interprets input data and produces outputs. They take the form of weights and biases, which determine how much influence specific input features have on the generated output.

A larger parameter count generally equates to a model with greater complexity and adaptability (this is not strictly true across different architectures, but it holds reasonably well within the Transformer family). A model with more parameters can discern more intricate patterns in the data, paving the way for richer and more precise outputs. But, as with many things in life, there’s a trade-off: more parameters mean higher computational demands, greater memory needs, and a looming risk of overfitting.
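
To make the idea of a "parameter" concrete, here is a minimal sketch in plain Python that counts the weights and biases in the feed-forward layers of a toy Transformer. The dimensions are illustrative only, not taken from any particular model; real counts also include attention projections, embeddings, and layer norms.

```python
# Minimal sketch: counting the parameters of dense (feed-forward) layers.
# All dimensions below are illustrative, not from any real model.

def dense_layer_params(d_in: int, d_out: int) -> int:
    """A dense layer y = Wx + b has a weight matrix W (d_in x d_out)
    and a bias vector b (d_out), so it adds d_in*d_out + d_out parameters."""
    return d_in * d_out + d_out

d_model = 4096        # hidden size (illustrative)
d_ff = 4 * d_model    # feed-forward expansion, a common Transformer convention
n_layers = 32         # number of Transformer blocks (illustrative)

# Each block's feed-forward sub-layer: one expansion layer and one projection layer.
ff_params_per_block = dense_layer_params(d_model, d_ff) + dense_layer_params(d_ff, d_model)
total_ff_params = n_layers * ff_params_per_block

print(f"Feed-forward parameters across {n_layers} blocks: {total_ff_params:,}")
# Attention weights, token embeddings, and layer norms make up the rest of a
# model's headline figure (7B, 175B, and so on).
```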

Training Data, Parameters, and Variations Across Models

The Training Ground for LLMs

The most prominent recent Large Language Models owe their adeptness to vast libraries of text data, primarily drawn from the Internet. By learning to predict the next word or token during training, they pick up lifelike language patterns and linguistic intricacies. The foundational knowledge gained from this training data allows them to be further fine-tuned for specialized tasks or domains, transforming them into chatbots, summarizers, translators, or even coding assistants.
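
The "predict the next token" objective is simpler than it sounds. The toy sketch below, with a made-up four-token vocabulary, shows how a single sentence yields several training examples: every prefix becomes an input and the token that follows it becomes the target.

```python
# Toy illustration of the next-token prediction setup used to train LLMs.
# The vocabulary and sentence are made up for demonstration.

vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
sentence = ["the", "cat", "sat", "down"]
token_ids = [vocab[w] for w in sentence]

# Each training pair is (all tokens so far, the token that comes next).
examples = [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

for context, target in examples:
    print(f"context={context} -> predict token {target}")

# During training, the model's parameters are nudged so that the probability it
# assigns to each 'target' given its 'context' increases (a cross-entropy loss).
```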

Deciphering LLM Parameters

But what defines the proficiency and versatility of LLMs? The heart of the matter lies in the parameters, which form a blueprint of the model’s architecture and base capabilities. Parameters are the processing guideposts that govern how the model transforms input data into output. Whether weights or biases, they determine the impact of specific input features on the resultant output.

The complexity and expressiveness of an LLM increase as its parameter count grows. While this empowers the model to discern a broader spectrum of patterns, it’s a double-edged sword. The flip side? Increased computational demands, larger memory requirements, and the ever-present spectre of overfitting.

The Many Faces of LLM Parameters

LLM parameters aren’t one-size-fits-all. They vary across models, shaped by the model’s size, architectural blueprint, and training data. A model’s magnitude is most often gauged by its parameter count. For instance, OpenAI’s GPT-3 boasts roughly 175 billion parameters (trained on around 45 terabytes of raw text before filtering), BLOOM stands tall with 176 billion parameters, and Meta’s LLaMA offers a choice of four sizes: 7B, 13B, 33B, and 65B parameters.
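
One practical way to read these parameter counts is as a memory budget. The back-of-the-envelope sketch below assumes 2 bytes per parameter (16-bit weights) and ignores activations, optimizer state, and other overhead, so treat the figures as rough lower bounds.

```python
# Back-of-the-envelope memory estimate: parameters x bytes per parameter.
# Assumes 16-bit (2-byte) weights; activations and optimizer state are ignored.

BYTES_PER_PARAM_FP16 = 2

models = {
    "GPT-3 (175B)": 175e9,
    "BLOOM (176B)": 176e9,
    "LLaMA (65B)": 65e9,
    "LLaMA (7B)": 7e9,
}

for name, n_params in models.items():
    gigabytes = n_params * BYTES_PER_PARAM_FP16 / 1e9
    print(f"{name}: ~{gigabytes:,.0f} GB just to hold the weights")
```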

While the majority of LLMs find their roots in the Transformer architecture, with its layers of attention and feed-forward networks, their individuality shines through in the specific attention mechanisms they use, whether that’s sparse attention, global-local attention, or standard dense self-attention.
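
For the curious, here is a minimal NumPy sketch of the scaled dot-product self-attention that these variants build on. The input and projection matrices are random stand-ins; real models add multiple heads, masking, and learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a sequence x of
    shape (seq_len, d_model). Wq, Wk, Wv are learned parameters in a real model."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # how strongly each token relates to every other
    weights = softmax(scores, axis=-1)        # the attention weights
    return weights @ V                        # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 5, 16                      # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)    # (5, 16)
```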

Training Data and Influence of Parameters: Tokens

The Data Behind the Model

The training data for a model isn’t merely about quantity but also quality and diversity. While some models, like GPT-4, are nurtured on expansive datasets enriched with human feedback and adversarial testing, others like BLOOM are groomed on meticulously curated datasets that deliberately omit high-toxicity sequences. Yet others, like LLaMA, benefit from a composite of public data infused with top-tier annotations.
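
As a rough illustration of what "curated" means in practice, the sketch below deduplicates a toy corpus and drops high-toxicity documents before training. The `toxicity_score` function is a hypothetical stand-in for whatever trained classifier or word-list heuristic a real pipeline would use.

```python
# Illustrative data-curation step: deduplicate documents and drop
# high-toxicity ones before they reach the training run.

def toxicity_score(text: str) -> float:
    """Hypothetical stand-in; real pipelines use trained classifiers or curated word lists."""
    flagged = {"<flagged-term-1>", "<flagged-term-2>"}  # placeholder terms
    words = text.lower().split()
    return sum(w in flagged for w in words) / max(len(words), 1)

def curate(documents, max_toxicity=0.01):
    seen, kept = set(), []
    for doc in documents:
        key = doc.strip().lower()
        if key in seen:                       # drop exact duplicates
            continue
        seen.add(key)
        if toxicity_score(doc) <= max_toxicity:
            kept.append(doc)
    return kept

corpus = ["An article about gardening.", "An article about gardening.", "Another document."]
print(curate(corpus))  # duplicates removed, flagged documents filtered out
```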

The Ripple Effect of Parameters on Input and Output

The parameters of an LLM are instrumental in shaping its input and output dynamics, encompassing aspects of quality, diversity, and reliability. Here’s a breakdown:

  • Quality: Refers to the precision, relevance, and coherence of the generated content.
  • Diversity: Speaks to the variety, originality, and creativity of the content.
  • Reliability: Addresses the consistency, robustness, and accuracy of the content.

In general, models with a higher parameter count will produce content of superior quality and diversity: their vast parameter set equips them to learn and predict more comprehensively. Yet it’s essential to understand that more isn’t always better. In specific contexts, a leaner model optimized for a particular domain might outperform its bulkier, general-purpose counterpart.

Similarly, models with a richer parameter tapestry can better navigate diverse and intricate inputs; their large parameter set lets them recognize a wider array of tokens and features. Again, though, it’s not a universal rule. At times a model with fewer parameters, fine-tuned to align with human values, can be the underdog that surpasses its bigger peers.

Ultimately, the parameters determine the weight, or influence, that specific tokens have on the semantic links, context, and probabilities the model uses to turn its internal representation of the user’s input into the resulting output.
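
To see how parameters turn a context into probabilities, here is a toy NumPy sketch of the final step most LLMs share: the hidden representation of the context is multiplied by a learned output matrix and passed through a softmax to score every token in the vocabulary. The dimensions are made up; real models use vocabularies of tens of thousands of tokens.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d_model = 8, 16                      # illustrative sizes only

hidden = rng.normal(size=d_model)                # the model's internal representation of the context
W_out = rng.normal(size=(d_model, vocab_size))   # learned output parameters

logits = hidden @ W_out                          # one raw score per vocabulary token
probs = np.exp(logits - logits.max())
probs /= probs.sum()                             # softmax: scores become probabilities

next_token = int(probs.argmax())                 # greedy choice; real decoders often sample instead
print(f"most likely next token id: {next_token}, p={probs[next_token]:.2f}")
```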

Vocabulary Size and a Deep Dive into Parameters and Training Data

Parameters and Tokens: Unravelling the Connection

While parameters offer a glimpse into an LLM’s complexity, tokens provide a window into its breadth of knowledge. Contrary to some beliefs, the two aren’t directly linked. Tokens, which can be words, subwords, characters, or symbols, are the chunks of text the model processes. The sheer number of tokens a model trains on stands testament to its exposure: more tokens mean a more worldly-wise model.

However, a huge token count isn’t a silver bullet. It drags along challenges like protracted data collection, ethical conundrums, and a plateauing performance curve.
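
Since "token" can be a slippery term, the sketch below uses a toy word-and-punctuation tokenizer to show how a sentence becomes a token count. Real LLMs use subword schemes such as byte-pair encoding, so their counts differ, but the principle is the same: training-set sizes quoted in billions or trillions of tokens are this count taken over the whole corpus.

```python
import re

def toy_tokenize(text: str) -> list[str]:
    """Toy tokenizer: splits on words and punctuation. Real LLMs use subword
    schemes (e.g. byte-pair encoding), which break rare words into pieces."""
    return re.findall(r"\w+|[^\w\s]", text)

sample = "LLMs don't read words; they read tokens."
tokens = toy_tokenize(sample)
print(tokens)
print(f"{len(tokens)} tokens")
```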

The Vocabulary Size: A Linguistic Toolbox

The vocabulary of an LLM serves as its linguistic toolbox: the set of unique tokens it recognizes and wields. A model’s vocabulary richness is tied to its parameters (weights and biases) and the number of tokens it trained on. Although more parameters and training tokens generally accompany a more extensive vocabulary, this expansion isn’t without its pitfalls, such as increased storage demands and computational costs.
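
The link between vocabulary size and parameter count is easy to quantify: the token embedding table alone contributes vocabulary size × embedding dimension parameters. The figures below use GPT-3’s commonly cited values as an approximate example.

```python
# The embedding table alone costs vocab_size * embedding_dim parameters.
# Figures are the commonly cited GPT-3 values, used here as an approximation.

vocab_size = 50_257       # GPT-3's BPE vocabulary (approximate)
embedding_dim = 12_288    # GPT-3's hidden size for the 175B model

embedding_params = vocab_size * embedding_dim
print(f"Embedding table: {embedding_params:,} parameters "
      f"(~{embedding_params / 175e9:.2%} of the 175B total)")
# Doubling the vocabulary roughly doubles this table, which is one reason a
# larger vocabulary carries extra storage and compute cost.
```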

Inside the World of LLM Parameters

Parameters, such as attention weights and embedding vectors, are the unsung heroes behind an LLM’s prowess. The attention mechanism, for instance, empowers a model to selectively zoom in on the pivotal segments of its input, sidelining the extraneous bits. The attention weights, a crucial set of parameters, dictate this selective focus. In tasks such as translation or summarization, these weights become invaluable.

Embedding vectors, another set of parameters, transmute textual tokens into numerical representations that capture their meaning. These vectors can live inside the model itself or in an external store known as a vector database. For example, when a system reads a PDF from the internet or a local file, the text is first extracted (using Optical Character Recognition, a form of machine vision, if the document is a scanned image), then split into smaller chunks, embedded, and placed into a vector database that can later be searched and used as knowledge when the model is asked a question. In machine translation, embedding vectors bridge the linguistic gap, encoding tokens from the source language so they can be decoded into the target language.
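
To tie these pieces together, here is a minimal sketch of that retrieval pattern: chunks of extracted text are embedded into vectors, stored, and later matched against an embedded question by similarity. The `embed` function is a hypothetical stand-in; real systems use a trained embedding model and a dedicated vector database.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Hypothetical stand-in for a trained embedding model: hashes words into a
    fixed-size, normalized vector so texts sharing words land near each other."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Toy "vector database": text chunks extracted from a document, stored with their embeddings.
chunks = [
    "The warranty covers parts and labour for two years.",
    "Returns are accepted within 30 days with a receipt.",
    "The device charges fully in about 90 minutes.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# At question time, embed the query and return the most similar chunk (cosine similarity,
# since the vectors are normalized).
query = "How long does the warranty last?"
q_vec = embed(query)
best_chunk, _ = max(index, key=lambda item: float(item[1] @ q_vec))
print(best_chunk)   # the retrieved chunk is handed to the LLM as context
```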

Wrapping Up the World of LLMs

The mystique behind the numbers trailing LLMs such as GPT-4, BLOOM, or LLaMA is richly layered. These numbers, indicative of parameter counts, cast light on a model’s complexity and expressive capability. While they dwarf more familiar yardsticks, like the contents of a smartphone or the entirety of Wikipedia’s articles, they aren’t the zenith of the AI model universe.

The parameters of an LLM are the linchpins, shaping its ability to process and regurgitate natural language text in diverse contexts. Whether it’s the attention weights that determine the model’s focus or the embedding vectors that translate tokens into meaningful numerical representations, these parameters are the cogs in the vast machinery of LLMs.

The number of tokens used to train the base model defines the inherent knowledge a model possesses, whether that be knowledge of medicine or of the linguistic style associated with a particular person in the training corpus; this is how you can get an LLM to respond in the style of a famous figure, like Sir Michael Caine.

In the grand scheme of AI and NLP, understanding these numbers and their significance is akin to possessing the Rosetta Stone, offering insights into the intricate world of Large Language Models and their differing capabilities.
