Mistral 7B Beats Llama v2 13B on All Benchmarks: Overview and Fine-tuning
A few days ago, the Mistral AI team released Mistral 7B, which outperforms Llama 2 13B on all benchmarks, beats Llama 1 34B on many of them, and approaches CodeLlama 7B performance on code tasks.
“Community-backed model development is the surest path to fight censorship and bias in a technology shaping our future” — Mistral AI
Looking good!
If you are running macOS or Linux, you can try it locally using Ollama. Based on early reports, it delivers around 25 tokens/second even on a low-end MacBook, which is already reasonable. We'd love to hear what performance you get out of your system, so please drop it in the comments.
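As a quick start, the commands below assume Ollama is already installed and that the model is published in the Ollama library under the `mistral` tag (check the library for the current name):

```shell
# download the Mistral 7B weights (first run only) and start an interactive chat
ollama run mistral

# or send a single prompt non-interactively
ollama run mistral "Explain grouped-query attention in one sentence."
```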
Let’s see how the Mistral AI team created this beast.
Mistral 7B leverages three techniques to offer faster inference and to handle longer sequences with minimal computational overhead.
1. Grouped-query attention (GQA)
Attention mechanisms, the cornerstone of transformer architectures, have traditionally been computationally intensive, especially when…
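To make the idea concrete, here is a minimal NumPy sketch of grouped-query attention (the function name and shapes are my own illustration, not Mistral's actual code). The key point is that several query heads share a single key/value head, which shrinks the KV cache and the K/V projections by the group factor; Mistral 7B uses 32 query heads over 8 KV heads.

```python
import numpy as np

def grouped_query_attention(q, k, v, n_kv_heads):
    """Illustrative GQA: q has n_heads heads, k/v have only n_kv_heads.

    q: (n_heads, seq_len, head_dim)
    k, v: (n_kv_heads, seq_len, head_dim)
    """
    n_heads, seq_len, head_dim = q.shape
    group = n_heads // n_kv_heads  # query heads served by each K/V head

    # Broadcast each K/V head to its group of query heads.
    # (In a real implementation the cache stays small; the repeat is virtual.)
    k = np.repeat(k, group, axis=0)
    v = np.repeat(v, group, axis=0)

    # Standard scaled dot-product attention per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With `n_kv_heads == n_heads` this reduces to ordinary multi-head attention, and with `n_kv_heads == 1` it becomes multi-query attention; GQA sits between the two, trading a little quality for a much smaller KV cache.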