How Good Is Google Gemma — Comparing to Mistral and Qwen1.5

Yucheng Li
2 min read · Mar 4, 2024


This article briefly evaluates Google’s new Gemma, compared to the baseline models of Qwen1.5, Mistral, and Llama-2.

TL;DR

  1. Code Capability: Top tier, tied with Qwen1.5 and surpassing Mistral, but still some distance behind CodeLlama.
  2. Math Capability: Does not surpass Qwen1.5; tied for second place with Mistral.
  3. Text Modeling Capability: Trails Mistral by a significant margin, but is significantly more robust than Llama-2 and Qwen1.5.
  4. Academic Capability: Trails Mistral; tied for second place with Qwen1.5.

1. Code

Tested on the GitHub dataset.

(Chart: compression ratio on the y-axis; lower is better.)
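The metric throughout is the compression ratio an LLM achieves when its next-token probabilities drive an (ideal) arithmetic coder: encoding a token costs -log2 p(token) bits, so a model that predicts the data well compresses it well. A minimal sketch of the computation, using toy stand-in probability models rather than an actual LLM:

```python
import math

def compressed_bits(tokens, prob_model):
    """Total bits to encode `tokens`: an ideal arithmetic coder
    spends -log2 p(token) bits on each token."""
    return sum(-math.log2(prob_model(t)) for t in tokens)

def compression_ratio(tokens, prob_model, raw_bits_per_token=8):
    """Compressed size divided by original size; lower is better."""
    raw_bits = raw_bits_per_token * len(tokens)
    return compressed_bits(tokens, prob_model) / raw_bits

# Hypothetical stand-ins for a model's next-token probabilities:
uniform = lambda t: 1 / 256    # knows nothing about the data
confident = lambda t: 0.5      # assigns every token probability 0.5

data = list(b"hello world")
print(compression_ratio(data, uniform))    # 1.0  (no compression)
print(compression_ratio(data, confident))  # 0.125 (1 bit per byte)
```

A real evaluation would replace the toy `prob_model` with the LLM's per-token log-probabilities over the test corpus; the ranking logic is the same.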

CodeLlama is the best, but the lead has narrowed on the latest data.

Next are Qwen1.5 and Gemma, slightly ahead of Mistral, significantly ahead of Llama-2.

2. Math

Tested on StackOverflow Math.

(Chart: compression ratio on the y-axis; lower is better.)

Qwen1.5 is the strongest, Gemma and Mistral are next, both ahead of Llama-2.

3. Text Modeling

Tested on English Wikipedia.

(Chart: compression ratio on the y-axis; lower is better.)

Llama-2 is the best, but its lead quickly narrows over time, indicating a high risk of overfitting to older data.

Qwen1.5 has the same overfitting problem.

Mistral and Gemma have very stable performance, indicating strong robustness.
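The robustness claim comes from tracking compression ratio across time-sliced test sets: a model that merely memorized older data compresses it cheaply but degrades on newer slices. A hypothetical helper to quantify that drift (the numbers below are illustrative, not taken from the article's plots):

```python
def ratio_drift(sliced_ratios):
    """Relative change in compression ratio from the earliest to the
    latest time slice. A large positive drift suggests overfitting:
    the model handles old (possibly seen) data far better than new data."""
    first, last = sliced_ratios[0], sliced_ratios[-1]
    return (last - first) / first

# Hypothetical monthly compression ratios for illustration only:
stable_model = [0.300, 0.301, 0.302, 0.301]    # Mistral/Gemma-like stability
drifting_model = [0.280, 0.300, 0.320, 0.340]  # Llama-2-like degradation
print(f"{ratio_drift(stable_model):.3f}")    # 0.003
print(f"{ratio_drift(drifting_model):.3f}")  # 0.214
```

A near-zero drift, as with the stable model above, is what "very stable performance" means here.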

4. Academics

Tested on the arXiv dataset.

(Chart: compression ratio on the y-axis; lower is better.)

Mistral is slightly ahead; Qwen1.5 and Gemma are tied for second, both significantly ahead of Llama-2.

Summary

Gemma performs very strongly overall, with excellent results in many scenarios.

Its lead over Llama-2 is unquestionable, and it trades wins and losses with Qwen1.5 and Mistral.

Reference

You can find more results for other LLMs, including Yi, LLaMA, InternLM, CodeLlama, Baichuan, and ChatGLM, at:

The Method:


Yucheng Li

PhD at University of Surrey, UK. NLP researcher.