[Paper Summary] Overview of Google’s First Multimodal Model: Gemini
On 6 December 2023, Google launched Gemini, its new multimodal model. Gemini is highly competitive, achieving new state-of-the-art (SOTA) results on a range of evaluation benchmarks. This article summarizes Gemini's technical report and showcases some of the demos Google has published.
1. What is Gemini?
Gemini is a family of multimodal models that can handle text, image, audio, and video data and perform various tasks across these modalities, such as language understanding, image recognition, video understanding, and audio processing. Gemini comes in three sizes, Ultra, Pro, and Nano, to suit different applications, from complex reasoning to on-device deployment.
- Ultra: The most capable model, designed for highly complex work such as multimodal reasoning tasks. It will be available in Bard Advanced in January 2024.
- Pro: A balance between performance and resource efficiency, capable of handling most tasks. Bard currently uses Gemini Pro for text prompts.
- Nano: Tailor-made for…