Llama 3.1: Everything You Need to Know About Meta’s Latest AI Language Model

Abhishek Selokar
8 min read · Aug 5, 2024


Meta’s Llama 3.1 is here, and it’s revolutionizing the AI landscape. If you’ve been curious about the latest advancements in natural language processing, Llama 3.1 is a name you’ll want to remember. Built on the success of its predecessors, this cutting-edge language model is pushing the boundaries of what AI can do, from generating human-like text to assisting in complex research. In this blog, we’ll dive deep into everything you need to know about Llama 3.1 and explore its features.

Source: Meta Llama3

1. Overview

  1. Meta has released a new set of foundation models, extending its existing Llama 3 family.
  2. The new members, collectively called Llama 3.1, come in three sizes: 8B, 70B, and a massive 405B parameters.
  3. The 405B-parameter variant is the largest openly available dense transformer model to date, with a context window of 128K tokens.
  4. These models natively support multilinguality, coding, reasoning (solving complex problems), and tool usage.
Source: Meta Llama3

2. Training

The development of these foundation models consists of two stages:

2.1 Pre-training

The model is exposed to a massive amount of text, and its task is simply to predict the next word (token).

In classical ML terms, this is a classification task, but with a huge number of classes: one for every token in the vocabulary of the training data. During this stage, the model learns the nuances of language, mastering grammar and syntax, and ultimately develops a comprehensive “worldview”, becoming a jack of all trades capable of generating human-like text.
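
To make the classification framing concrete, here is a minimal, purely illustrative PyTorch sketch (the vocabulary size roughly matches Llama 3’s ~128K-token tokenizer; everything else is toy-sized):

```python
import torch
import torch.nn.functional as F

vocab_size = 128_256              # roughly Llama 3's vocabulary size (illustrative)
batch, seq_len, d_model = 2, 16, 64

# Pretend these are the transformer's hidden states for each position.
hidden = torch.randn(batch, seq_len, d_model)

# The "classifier" is just a linear projection onto the vocabulary (the LM head).
lm_head = torch.nn.Linear(d_model, vocab_size, bias=False)
logits = lm_head(hidden)          # (batch, seq_len, vocab_size)

# The target for each position is the *next* token in the sequence.
targets = torch.randint(0, vocab_size, (batch, seq_len))

loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
print(loss.item())
```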

The 405B-parameter model is pre-trained on 15.6T tokens, initially with a context window of 8K tokens that is gradually increased to 128K tokens in a later stage of pre-training.

2.1.1 Pre-Training Data

Most of the training data was collected from the web, and various data-cleaning strategies were applied to retain high-quality data and discard low-quality data. Some of those strategies include:

  • Personally Identifiable Information (PII) and safety filtering: Filters were designed and implemented to remove PII and adult content from the data.
  • De-duplication: Duplicate data was carefully removed at the URL, document, and line levels to avoid unnecessary repetition during training (a rough sketch of line-level de-duplication follows this list).
  • Model-based quality filtering: Various models, such as Llama 2, DistilRoBERTa, and fastText (a library for efficient text classification and representation learning), were used to sub-select high-quality tokens.
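
As a rough illustration of one of these ideas (not Meta’s actual pipeline), line-level de-duplication can be sketched by hashing every line and dropping lines that repeat too often across documents; lines that repeat are usually boilerplate such as cookie banners or navigation menus:

```python
import hashlib
from collections import Counter

def line_hash(line: str) -> str:
    return hashlib.sha256(line.strip().lower().encode("utf-8")).hexdigest()

def dedupe_lines(documents: list[str], max_occurrences: int = 2) -> list[str]:
    """Drop lines that appear more than `max_occurrences` times across the corpus."""
    counts = Counter(line_hash(line) for doc in documents for line in doc.splitlines())
    cleaned = []
    for doc in documents:
        kept = [line for line in doc.splitlines()
                if counts[line_hash(line)] <= max_occurrences]
        cleaned.append("\n".join(kept))
    return cleaned

docs = [
    "Accept cookies\nLlama 3.1 is a language model.",
    "Accept cookies\nTraining data matters.",
    "Accept cookies\nQuality filtering helps.",
]
print(dedupe_lines(docs))   # the repeated "Accept cookies" line is removed
```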

2.1.2 Data Mix

Choosing the right proportion of different data sources in the pre-training mix is essential for obtaining a high-quality language model.

The data mix consists of approximately 50% general knowledge tokens, 25% tokens related to mathematics and reasoning, 17% code tokens, and 8% multilingual tokens.
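
A toy sketch of how such a mix might be realized at training time, by sampling the source of each batch according to the stated proportions (the real pipeline is more involved; the weights below are just the approximate percentages quoted above):

```python
import random

# Approximate Llama 3 pre-training data mix, as reported.
data_mix = {
    "general_knowledge": 0.50,
    "math_and_reasoning": 0.25,
    "code": 0.17,
    "multilingual": 0.08,
}

def sample_source(mix: dict[str, float]) -> str:
    """Pick the data source for the next training batch according to the mix."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

counts = {s: 0 for s in data_mix}
for _ in range(10_000):
    counts[sample_source(data_mix)] += 1
print(counts)   # roughly 5000 / 2500 / 1700 / 800
```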

2.2 Post-Training

After pre-training, the model becomes good at predicting the next word or token in a sequence. However, it still struggles to follow instructions accurately and may produce responses that aren’t quite right or don’t sound human-like.

To address this, the model undergoes instruction fine-tuning, where it is further trained using a dataset of instructions paired with their correct responses. This helps the model learn how to generate more accurate and appropriate answers.
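
A minimal sketch of what one instruction-tuning example can look like under the hood, assuming the common convention of masking the prompt tokens so that the loss is computed only on the response (token ids and shapes here are illustrative, not from the Llama 3 tokenizer):

```python
import torch
import torch.nn.functional as F

vocab_size = 128_256                                        # illustrative vocabulary size
prompt_ids = torch.tensor([101, 2054, 2003, 1996, 3007])   # "What is the capital ..."
answer_ids = torch.tensor([3000, 1012])                     # "Paris."

input_ids = torch.cat([prompt_ids, answer_ids])
labels = input_ids.clone()
labels[: len(prompt_ids)] = -100   # mask the prompt: loss is computed only on the response

# Random logits stand in for a forward pass of the model.
logits = torch.randn(len(input_ids), vocab_size)

# Standard next-token shift: position i predicts token i+1.
loss = F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)
print(loss.item())
```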

After instruction fine-tuning, the model goes through an alignment process to improve its responses further. In this stage, human reviewers manually check and rank the model’s answers, ensuring they are accurate and aligned with human expectations.

Processes like Rejection Sampling (RS) and Direct Preference Optimization (DPO) help the model learn to generate even more reliable and human-like responses.
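
For intuition, here is a minimal sketch of the DPO objective (Rafailov et al., 2023), not Meta’s training code: given log-probabilities of a preferred and a rejected response under the policy and a frozen reference model, the loss pushes the policy to widen the margin in favor of the preferred response.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Direct Preference Optimization loss.

    Each argument is the summed log-probability of a full response
    under the policy model or the frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy numbers standing in for real model outputs.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss.item())
```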

3. Model Architecture

Llama 3 is based on a standard decoder-only transformer architecture.


To maximize training stability, Meta made only minor adaptations to this standard dense architecture rather than adopting a mixture-of-experts (MoE) design.

Source: Meta Llama3
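
For intuition, a heavily simplified decoder-only block is sketched below; it uses generic LayerNorm, GELU, and standard multi-head attention instead of Llama’s actual RMSNorm, SwiGLU, RoPE, and grouped-query attention, so treat it purely as a teaching example:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: causal self-attention followed by an MLP."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))

    def forward(self, x):
        seq_len = x.size(1)
        # Causal mask: True entries mark positions a token is NOT allowed to attend to.
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out
        return x + self.mlp(self.norm2(x))

block = DecoderBlock()
print(block(torch.randn(1, 16, 512)).shape)   # torch.Size([1, 16, 512])
```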

4. Multimodality

Experiments on adding multimodal capabilities involve three additional stages:

  1. Multi-modal Encoder Pre-training: Separate encoders were trained for images and speech to enhance the model’s understanding of different types of data. For the image encoder (ViT-H/14 variant, 630M parameters), a vast number of image-text pairs were used, enabling the model to learn how visual information corresponds to natural language descriptions. For the speech encoder, a self-supervised approach was employed, where parts of the speech input were masked and the model was tasked with reconstructing the missing segments using discrete token representations. This technique helps the model grasp the structure and patterns within speech signals.
  2. Vision adapter training: A separate adapter is trained to integrate the pre-trained image encoder with the pre-trained language model, feeding the image encoder’s representations into the language model. The adapter is essentially a series of cross-attention layers trained on text-image pairs. During adapter training, the image encoder’s parameters are updated while the language model’s parameters remain frozen.
  3. Speech adapter training: The speech encoder is integrated via an adapter that converts speech encodings into token representations compatible with the fine-tuned language model. During a supervised fine-tuning stage, the adapter and encoder parameters are updated jointly while, as with the vision adapter, the language model’s parameters stay fixed.

Keeping the language model’s parameters frozen ensures that its performance remains stable and is not negatively affected by the integration of the visual and speech components (a rough sketch of the adapter idea follows).
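
The sketch below illustrates the cross-attention adapter idea with hypothetical module names and shapes (not Meta’s implementation): image-encoder features act as keys and values, while the frozen language model’s hidden states act as queries.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Lets frozen LM hidden states attend to image-encoder features."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, lm_hidden, image_features):
        attended, _ = self.cross_attn(query=self.norm(lm_hidden),
                                      key=image_features, value=image_features,
                                      need_weights=False)
        return lm_hidden + attended   # residual: text states enriched with vision

lm_hidden = torch.randn(1, 32, 512)        # hidden states from the (frozen) language model
image_features = torch.randn(1, 256, 512)  # patch embeddings from the image encoder
adapter = CrossAttentionAdapter()
print(adapter(lm_hidden, image_features).shape)   # torch.Size([1, 32, 512])

# During adapter training, only adapter (and encoder) parameters receive gradients:
# for p in language_model.parameters(): p.requires_grad_(False)
```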

As of this writing, the multimodal models have not yet been released and are still under active development.

Source: Meta Llama3

5. Capabilities

5.1. Code

  • Supports high-priority programming languages: Python, Java, JavaScript, C/C++, TypeScript, Rust, PHP, HTML/CSS, SQL, and Bash/Shell.

5.2. Multilinguality

  • Unlike earlier models in the Llama 3 family, the latest 8B, 70B, and 405B models are multilingual, supporting English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.

5.3. Math and Reasoning

  • Various methodologies, such as augmenting training data with step-wise reasoning traces, filtering out incorrect reasoning traces, interleaving code and text reasoning, and employing feedback-based learning, improved the model’s ability to reason accurately and solve complex problems.

5.4. Long Context

  • Llama 3.1 models support a 128K-token context window, a huge jump from the 8K-token context window of previous models in the same family.

5.5. Tool Use

  • Llama 3 is trained to interact with several external tools, enhancing its ability to handle complex queries.
  • It can use the Brave Search engine to retrieve up-to-date information beyond its knowledge cutoff, a Python interpreter to execute code for tasks like data analysis or visualization, and the Wolfram Alpha API for solving math and science problems with precision.
  • The model is adept at using these tools within a chat setup, effectively solving multi-turn queries by planning and executing tool calls in sequence, and reasoning at each step.
  • Additionally, Llama 3 is trained to use tools in a zero-shot manner: it can understand and correctly apply new, previously unseen tools based purely on their in-context definitions (a rough illustration follows this list).
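
The snippet below sketches the zero-shot tool-use idea with a generic JSON convention; the exact prompt and tool-call format that Llama 3.1 expects is different, so treat this purely as an illustration of the loop between model and application.

```python
import json

# Hypothetical in-context tool definition (illustrative only).
tools = [{
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {"city": {"type": "string"}},
}]

system_prompt = (
    "You may call one of the following tools by replying with a JSON object "
    'of the form {"tool": <name>, "arguments": {...}}:\n' + json.dumps(tools, indent=2)
)

# A model reply to "What's the weather in Kharagpur?" might then look like this:
model_reply = '{"tool": "get_weather", "arguments": {"city": "Kharagpur"}}'

call = json.loads(model_reply)
if call["tool"] == "get_weather":
    # The application executes the call and feeds the result back to the model,
    # which then composes the final natural-language answer.
    tool_result = {"city": call["arguments"]["city"], "temperature_c": 31}
    print(tool_result)
```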

5.6. Factuality

  • The post-training process is designed to help the model recognize what it knows, rather than add new knowledge.
  • A knowledge-probing technique is used, leveraging Llama 3’s in-context abilities. This involves extracting data snippets from the pre-training data, generating factual questions about these snippets, and having the model produce and then evaluate its own responses (a schematic of this loop follows the list).
  • The aim is to encourage the model to answer only questions it knows about and to refuse to answer when it is unsure. Additionally, a small set of labeled data is used to address factual inconsistencies, especially in sensitive topics, ensuring the model’s responses remain accurate and reliable.
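
The schematic below makes that flow concrete; `generate` is a hypothetical callable standing in for a call to the model, so this is an illustration of the loop, not Meta’s actual code.

```python
def probe_factuality(pretraining_snippets, generate):
    """Build post-training examples that teach the model to answer what it knows
    and refuse what it does not. `generate` sends a prompt to the model and
    returns its text response (hypothetical helper)."""
    examples = []
    for snippet in pretraining_snippets:
        question = generate(f"Write a factual question answered by:\n{snippet}")
        answer = generate(question)   # the model answers without seeing the snippet
        verdict = generate(
            f"Context:\n{snippet}\nQuestion: {question}\nAnswer: {answer}\n"
            "Is the answer supported by the context? Reply yes or no."
        )
        # Correct answers become positive examples; incorrect ones are replaced
        # by refusals, encouraging the model to decline when it is unsure.
        target = (answer if verdict.strip().lower().startswith("yes")
                  else "I'm not sure about that.")
        examples.append({"prompt": question, "response": target})
    return examples
```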

5.7. Steerability

  • Steerability is the ability to control or guide a model’s behavior to meet specific goals or requirements set by developers or users.
  • For Llama 3, this means its responses can easily be adjusted in length, format, tone, or even persona using simple natural language instructions (see the example below), making the model flexible and adaptable to different applications and user needs.
Source: Meta Llama3
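
For example, a system prompt along the following lines (illustrative wording) is enough to steer persona, tone, and response length without any fine-tuning:

```python
# Illustrative chat messages; most chat-completion APIs that serve Llama 3.1
# Instruct accept a system message like this to steer the model's behavior.
messages = [
    {"role": "system",
     "content": "You are a cheerful pirate. Answer in at most two sentences."},
    {"role": "user",
     "content": "Explain what a context window is."},
]
# Persona, tone, and response length are all steered purely through the
# natural-language instruction in the system message above.
```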

6. Conclusion

Llama 3.1 models demonstrate strong performance across a range of benchmarks, especially in key areas like reasoning, code generation, and multilingual tasks. Compared to other leading models such as GPT-4 and Claude 3.5, Llama 3.1 consistently performs well, particularly on the ARC Challenge (reasoning), GSM8K (math), and BFCL (tool use) benchmarks. Its strong results in these categories suggest that Llama 3.1 is well suited for complex, multi-faceted tasks that require robust understanding and manipulation of information.

Furthermore, Llama 3.1’s ability to handle diverse data types, combined with its competitive performance across benchmarks, makes it a versatile and reliable choice for a wide range of applications. Whether you’re working on general knowledge tasks, advanced reasoning, coding, or multilingual projects, Llama 3.1 delivers strong performance, making it a compelling option to consider for your needs.
