Papers Explained 64: Mistral

Ritvik Rastogi
Published in DAIR.AI
Oct 23, 2023

Mistral 7B is an LLM engineered for superior performance and efficiency. It leverages grouped-query attention (GQA) for faster inference, coupled with sliding window attention (SWA) to effectively handle sequences of arbitrary length with a reduced inference cost.
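To make the GQA idea concrete, here is a minimal PyTorch sketch (not Mistral’s implementation; the causal mask is omitted for brevity) in which 32 query heads share 8 key/value heads, the head counts Mistral 7B uses:

```python
import torch

def grouped_query_attention(q, k, v):
    """GQA: the query heads are split into groups, and each group shares one
    key/value head. q: (batch, n_q_heads, seq, d); k, v: (batch, n_kv_heads, seq, d).
    Causal masking is omitted to keep the sketch short."""
    group = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(group, dim=1)  # broadcast each KV head to its query group
    v = v.repeat_interleave(group, dim=1)
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

q = torch.randn(1, 32, 8, 128)  # Mistral 7B: 32 query heads
k = torch.randn(1, 8, 8, 128)   # ...sharing 8 key/value heads
v = torch.randn(1, 8, 8, 128)
out = grouped_query_attention(q, k, v)  # (1, 32, 8, 128)
```

Because only 8 KV heads are cached instead of 32, the KV cache shrinks by 4x, which is where the faster inference comes from.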

Mistral 7B outperforms the best open 13B model (Llama 2) across all evaluated benchmarks, and the best released 34B model (Llama 1) in reasoning, mathematics, and code generation.

Mistral 7B-Instruct, a model fine-tuned to follow instructions using instruction datasets publicly available on the Hugging Face repository, surpasses the Llama 2 13B-Chat model on both human and automated benchmarks.

Architecture

Mistral 7B is based on the transformer architecture.

Compared to Llama, it introduces a few changes:

Sliding Window Attention

Sliding Window Attention leverages the stacked layers of a transformer model to extend its attention beyond a fixed window size, denoted as W. In SWA, the hidden state at position i in layer k can attend to hidden states from the preceding layer within the range of positions i - W to i. Since each layer adds up to W tokens of reach, information can propagate from tokens up to W * k positions back. With a window size of W = 4096 and 32 layers, SWA achieves a theoretical attention span of approximately 131K tokens. In practice, with a sequence length of 16K and W = 4096, the SWA modifications made to FlashAttention and xFormers yield a 2x speed improvement over vanilla attention methods.
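A minimal sketch of the mask SWA applies at each layer, assuming a PyTorch-style boolean mask (this is illustrative, not the FlashAttention/xFormers kernel):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean mask: position i may attend to the preceding `window` positions,
    i.e. keys j with i - window < j <= i (causal + sliding window)."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)

# Each layer only looks back `window` tokens, but stacking k layers lets
# information flow from roughly window * k positions back.
print(sliding_window_mask(seq_len=8, window=4).int())
```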

Rolling Buffer Cache

A Rolling Buffer Cache exploits the fixed attention span to limit cache size. The cache has a fixed size W: the keys and values for timestep i are stored at position i mod W in the cache, so once i exceeds W, earlier entries are overwritten and the cache stops growing. For instance, on a sequence of 32k tokens with W = 4096, this reduces cache memory usage by 8x without compromising model quality.
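A toy sketch of the rolling-buffer indexing (the class and its interface are hypothetical, for illustration only):

```python
class RollingKVCache:
    """Fixed-size KV cache: the entry for timestep i lives at slot i % W."""

    def __init__(self, window: int):
        self.window = window
        self.keys = [None] * window
        self.values = [None] * window

    def store(self, i: int, k, v):
        # Once i >= window, older entries are overwritten,
        # so memory stays O(window) regardless of sequence length.
        self.keys[i % self.window] = k
        self.values[i % self.window] = v
```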

Pre-fill and Chunking

In sequence generation, tokens are predicted one at a time, each conditioned on the prior context. To optimize efficiency, the (k, v) cache is pre-filled with the known prompt. If the prompt is very long, it is chunked into window-sized segments, and each chunk pre-fills the cache in turn: attention is computed both over the cache (previous chunks) and over the current chunk, enabling more efficient sequence generation.
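A minimal sketch of the chunked pre-fill loop, reusing the hypothetical RollingKVCache sketch above and storing token ids as stand-in keys/values:

```python
def chunked_prefill(prompt_tokens: list, window: int, cache: RollingKVCache):
    """Pre-fill the KV cache chunk by chunk. For each chunk, attention would be
    computed over the cached entries (earlier chunks) plus the chunk itself."""
    for start in range(0, len(prompt_tokens), window):
        chunk = prompt_tokens[start:start + window]
        # ...attention over cache + chunk would happen here...
        for offset, tok in enumerate(chunk):
            cache.store(start + offset, k=tok, v=tok)  # stand-in K/V

cache = RollingKVCache(window=4)
chunked_prefill(list(range(10)), window=4, cache=cache)
print(cache.keys)  # only the last W entries survive: [8, 9, 6, 7]
```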

Results

Mistral is evaluated against the following benchmarks:

  • Commonsense Reasoning (0-shot): Hellaswag, Winogrande, PIQA, SIQA, OpenbookQA, ARC-Easy, ARC-Challenge, CommonsenseQA
  • World Knowledge (5-shot): NaturalQuestions, TriviaQA
  • Reading Comprehension (0-shot): BoolQ, QuAC
  • Math: GSM8K (8-shot) with maj@8 and MATH (4-shot) with maj@4
  • Code: HumanEval (0-shot) and MBPP (3-shot)
  • Popular aggregated results: MMLU (5-shot), BBH (3-shot), and AGI Eval (3–5-shot, English multiple-choice questions only)
Performance of Mistral 7B and different Llama models on a wide range of benchmarks.

Comparison of Mistral 7B with Llama.
  • Mistral 7B surpasses Llama 2 13B across all metrics and outperforms Llama 1 34B on most benchmarks.
  • In particular, Mistral 7B displays superior performance in code, mathematics, and reasoning benchmarks.

Instruction Following

Comparison of Chat models.
  • Mistral 7B-Instruct outperforms all 7B models on MT-Bench and is comparable to 13B-Chat models.
  • In an independent human evaluation conducted on https://llmboxing.com/leaderboard, the outputs generated by Mistral 7B were preferred 5020 times, compared to 4143 times for Llama 2 13B.

Mistral 7B-v0.2

Mistral-7B-v0.2 has the following changes compared to Mistral-7B-v0.1:

  • 32k context window (vs 8k context in v0.1)
  • Rope-theta = 1e6
  • No Sliding-Window Attention
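Raising rope-theta stretches the rotary-embedding wavelengths, which is what allows the longer 32k context. A small sketch of where theta enters the standard RoPE frequency computation; the v0.1 value of 1e4 is the common default and is assumed here:

```python
import torch

def rope_frequencies(head_dim: int, theta: float) -> torch.Tensor:
    """Per-pair rotation frequencies used by rotary position embeddings."""
    return 1.0 / (theta ** (torch.arange(0, head_dim, 2).float() / head_dim))

# Raising theta from 1e4 (assumed v0.1 default) to 1e6 (v0.2) slows the
# lowest-frequency rotations, so distant positions remain distinguishable.
print(rope_frequencies(128, 1e4)[-1], rope_frequencies(128, 1e6)[-1])
```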

Mistral 7B-v0.3

Mistral-7B-v0.3 has the following changes compared to Mistral-7B-v0.2:

  • Extended vocabulary to 32768
  • Supports v3 Tokenizer
  • Supports function calling

Codestral 22B

Codestral is a 22B open-weight generative model designed specifically for code generation tasks. It is trained on a diverse dataset of over 80 programming languages, including popular ones like Python, Java, C, C++, JavaScript, and Bash, as well as more specialized ones like Swift and Fortran. This broad language base enables Codestral to assist developers in various coding environments and projects.

Codestral can save developers time and effort by completing coding functions, writing tests, and filling in partial code using a fill-in-the-middle mechanism. Interacting with Codestral can help developers improve their coding skills and reduce the risk of errors and bugs.
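A sketch of how a fill-in-the-middle prompt is typically assembled: the model is shown the code after the gap (suffix) and the code before it (prefix), and generates the missing middle. The sentinel token names below are illustrative placeholders, not necessarily Codestral’s actual control tokens:

```python
def fim_prompt(prefix: str, suffix: str) -> str:
    """Assemble a fill-in-the-middle prompt. [SUFFIX]/[PREFIX] are placeholder
    sentinel names for illustration; real deployments use the tokenizer's own
    special tokens."""
    return f"[SUFFIX]{suffix}[PREFIX]{prefix}"

prompt = fim_prompt(
    prefix="def fib(n: int) -> int:\n    if n <= 1:\n        return n\n",
    suffix="\n\nprint(fib(10))\n",
)
# The model's completion fills the gap, e.g. "    return fib(n-1) + fib(n-2)"
```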

Codestral is licensed under the Mistral AI Non-Production License, allowing it to be used for research and testing purposes.

Setting the Bar for Code Generation Performance

As a 22B model, Codestral sets a new standard on the performance/latency trade-off for code generation compared to previous models used for coding.

With its larger context window of 32k (compared to 4k, 8k or 16k for competitors), Codestral outperforms all other models in RepoBench, a long-range eval for code generation.

Additionally, Codestral’s performance was evaluated using HumanEval pass@1 in six languages in addition to Python (C++, Bash, Java, PHP, TypeScript, and C#), and the average of these evaluations was computed.

Codestral’s fill-in-the-middle performance was assessed using HumanEval pass@1 in Python, JavaScript, and Java, and compared to DeepSeek Coder 33B, whose fill-in-the-middle capability is immediately usable.
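For reference, pass@k is the probability that at least one of k sampled completions passes the unit tests. The standard unbiased estimator (from the original HumanEval paper) can be written as:

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n = samples generated per problem,
    c = samples that pass the tests. Computes 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# e.g. 20 samples, 5 passing: estimated pass@1
print(pass_at_k(n=20, c=5, k=1))  # 0.25
```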

Mathstral

Mathstral is a 7B model designed for math reasoning and scientific discovery. It is based on Mistral 7B, specializes in STEM subjects, and achieves state-of-the-art reasoning capabilities in its size category across various industry-standard benchmarks. The model has a 32k context window.

In particular, it achieves 56.6% on MATH and 63.47% on MMLU.

MMLU performance: difference by subject between Mathstral 7B and Mistral 7B.

Mathstral can achieve significantly better results with more inference-time computation: Mathstral 7B scores 68.37% on MATH with majority voting and 74.59% with a strong reward model among 64 candidates.
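A minimal sketch of the maj@N (majority voting) scheme: sample N solutions, extract each one’s final answer, and return the most frequent. The reward-model variant instead scores every candidate and keeps the highest-scoring answer.

```python
from collections import Counter

def majority_vote(final_answers: list) -> str:
    """maj@N: return the most common final answer among N sampled solutions."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. final answers extracted from 4 sampled chains of thought
print(majority_vote(["42", "41", "42", "42"]))  # -> "42"
```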

Paper

Mistral 7B: arXiv 2310.06825

Codestral: Hello, World!

MathΣtral

Hungry for more insights?

Don’t miss out on exploring other fascinating threads in this series. Simply click here and uncover the state-of-the-art research!

Do Subscribe for weekly updates!!
