How Good Is Google Gemma — Comparing to Mistral and Qwen1.5
This article briefly evaluates Google's new Gemma against the baseline models Qwen1.5, Mistral, and Llama-2.
TL;DR
- Code Capability: Top tier with Qwen1.5, surpassing Mistral, but still some distance behind CodeLlama.
- Math Capability: Does not surpass Qwen1.5, tied for second place with Mistral.
- Text Modeling Capability: Trails Mistral by a significant gap, but is significantly more robust than Llama-2 and Qwen1.5.
- Academic Capability: Trails Mistral, tied for second place with Qwen1.5.
1. Code
Tested on the Github dataset.
CodeLlama is the best, but the lead has narrowed on the latest data.
Next are Qwen1.5 and Gemma, slightly ahead of Mistral, significantly ahead of Llama-2.
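The method link is not included above, but dataset comparisons like these are commonly perplexity-based (lower is better): each model scores a held-out text corpus, and the average per-token negative log-likelihood is exponentiated. A minimal sketch of the metric itself, assuming per-token log-probabilities have already been obtained from a model:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Toy example (not real model output): a model that assigns
# probability 0.25 to every token has perplexity exactly 4.
uniform_logprobs = [math.log(0.25)] * 10
print(perplexity(uniform_logprobs))  # 4.0
```

In practice the log-probabilities come from running each model over the same tokenized corpus; comparing perplexity across models with different tokenizers requires normalizing per byte or per character rather than per token.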
2. Math
Tested on StackOverflow Math.
Qwen1.5 is the strongest, Gemma and Mistral are next, both ahead of Llama-2.
3. Text Modeling
Tested on English Wikipedia.
Llama-2 performs best, but its lead narrows quickly over time, indicating a high risk of overfitting.
Qwen1.5 has the same overfitting problem.
Mistral and Gemma have very stable performance, indicating strong robustness.
4. Academics
Tested on the arXiv dataset.
Mistral is slightly ahead, Qwen1.5 and Gemma are tied for second, both significantly ahead of Llama-2.
Summary
Gemma performs very strongly overall, with excellent results in many scenarios.
Its lead over Llama-2 is unquestionable, and it trades wins and losses with Qwen1.5 and Mistral.
Reference
You can find more results for other LLMs, including Yi, LLaMA, InternLM, CodeLlama, Baichuan, and ChatGLM, at:
The Method: