[Paper Summary] Overview of Google’s First Multimodal Model: Gemini

Thomas Chong
9 min read · Dec 7, 2023
Source: Introducing Gemini: Google’s most capable AI model yet (blog.google)

On 6 December 2023, Google launched its new multimodal model, Gemini. It is a highly competitive model that achieves new state-of-the-art (SOTA) performance on a range of evaluation benchmarks. To keep you up to date, this article summarizes the Gemini technical report and showcases some of the interesting demos published by Google.

Screenshot from Gemini Technical Report

1. What is Gemini?

Gemini is a family of multimodal models that can handle text, image, audio, and video data and perform a variety of tasks across these modalities, such as language understanding, image recognition, video understanding, and audio processing. Gemini comes in three sizes, Ultra, Pro, and Nano, to suit different applications, from complex reasoning to on-device deployment (see the code sketch after the list below).

  • Ultra: The most capable model, designed for highly complex tasks such as multi-step reasoning and multimodal understanding. It will be available in Bard Advanced in January 2024.
  • Pro: Well balanced between performance and resource efficiency, and capable enough for most tasks. Bard currently uses Gemini Pro for text prompts.
  • Nano: Tailor-made for on-device deployment, designed to run efficiently on devices such as smartphones.
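
To make this concrete, here is a minimal sketch of how Gemini Pro can be queried through Google’s google-generativeai Python SDK. The model names (gemini-pro, gemini-pro-vision), the placeholder API key, and the image file are assumptions for illustration, not details from the technical report itself:

```python
# A minimal sketch of querying Gemini Pro via the google-generativeai SDK.
# Model names, the API key, and the image path below are illustrative
# placeholders, not values taken from the technical report.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder key

# Text-only prompt with Gemini Pro (the model currently backing Bard).
text_model = genai.GenerativeModel("gemini-pro")
response = text_model.generate_content(
    "Summarize the Gemini technical report in one sentence."
)
print(response.text)

# Multimodal prompt: mixing text and an image with the vision variant.
vision_model = genai.GenerativeModel("gemini-pro-vision")
image = PIL.Image.open("chart.png")  # any local image file
response = vision_model.generate_content(["What does this chart show?", image])
print(response.text)
```

Note how the same generate_content call accepts either a plain string or a mixed list of text and images, which reflects Gemini’s design as a single model family spanning multiple modalities.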
