Meet Gemini: Google's Multimodal Masterpiece That Can Push AI Boundaries

Rosemary J Thomas, PhD · Published in Version 1 · 5 min read · Dec 8, 2023

[Header image created using Microsoft Bing Image Creator]

On 6 December 2023, Google announced Gemini, its next-generation foundation model. It looks like Google is moving to claim market advantage against its competitor OpenAI, challenging GPT-4 and aiming to build a better model. Gemini is claimed to be Google's most versatile model yet, able to run efficiently on a wide range of devices, from mobile phones to data centres. Google also states that it develops all of its AI models responsibly.

Gemini is a new family of multimodal and multilingual models from Google DeepMind that achieves state-of-the-art performance across text, image, audio, and video understanding. The family comes in Ultra, Pro, and Nano sizes to suit a wide range of applications: in short, Ultra is the most capable, Pro is optimised for performance and scale, and Nano is for on-device deployment. The models are trained on a large dataset of web documents, books, code, images, audio, and video, and can process interleaved inputs across these modalities.

Gemini builds upon the advancements made in PaLM 2, Google's previous model, and reflects several major improvements. First, Gemini is designed to be multimodal and multilingual from the foundation upwards, which means it can understand and generate not just text but also images and other types of media. Second, Gemini is highly competent at integrating with various tools and APIs, allowing it to interact with a wide range of applications and devices and enhancing its versatility and usability. From December 13, developers and enterprise customers can use Gemini Pro via the Gemini API in Google AI Studio or Google Cloud Vertex AI (a minimal sketch of what this looks like follows below). Third, Gemini is designed with future capabilities such as memory and planning in mind, which suggests it may handle more intricate tasks and reasoning, building upon PaLM's reasoning capabilities. Finally, before public release, Gemini was rigorously tested in line with responsible AI practices, to make sure it can be deployed across various products, applications, and devices for wider societal good.
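To give a flavour of what this looks like for developers, here is a minimal sketch of calling Gemini Pro through the google-generativeai Python SDK that accompanies Google AI Studio. The model name and the placeholder API key are taken from Google's announcement; treat the exact call shape as indicative rather than definitive.

```python
# Minimal sketch: text generation with Gemini Pro via the Google AI Studio SDK.
# Assumes the google-generativeai package and the "gemini-pro" model name.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key from Google AI Studio

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Summarise the key features of a multimodal model.")
print(response.text)
```

On Vertex AI the same model is reached through the Vertex SDK instead, but the request–response shape is broadly similar.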

The Making of Gemini

Gemini was trained on multiple pods of TPUv5 accelerators across several data centres. Redundant in-memory copies of the model state were maintained, so that on any unplanned hardware failure, recovery could be made directly from an intact model replica. This is a massive expansion and upgrade compared with the prior flagship model, PaLM 2: overall goodput for the largest training job rose from 85% to 97%. At the scale of Gemini's training runs, Silent Data Corruption (SDC) incidents can be expected to impact training every week or two. Google's fully deterministic infrastructure allowed the source of such problems (e.g., hardware failures) to be detected rapidly during development, which contributed substantially to stable training.
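Google has not published the recovery code, but the idea can be illustrated with a small, entirely hypothetical sketch: keep redundant in-memory copies of the model state and, when a step fails, restore from an intact replica instead of reloading a checkpoint from disk. All names and structures below are invented for illustration.

```python
# Hypothetical illustration of in-memory redundant model-state recovery.
import copy

class ReplicatedState:
    def __init__(self, state, n_replicas=2):
        # Keep several redundant in-memory copies of the model state.
        self.replicas = [copy.deepcopy(state) for _ in range(n_replicas)]

    def refresh(self, state):
        # Update every replica after a successful training step.
        self.replicas = [copy.deepcopy(state) for _ in self.replicas]

    def recover(self):
        # On an unplanned failure, restore directly from a complete replica.
        return copy.deepcopy(self.replicas[0])

state = {"weights": [0.0] * 4, "step": 0}
store = ReplicatedState(state)

for step in range(10):
    try:
        state["step"] = step          # stand-in for a real training step
        store.refresh(state)
    except RuntimeError:              # stand-in for a hardware failure
        state = store.recover()
```

Recovering from memory rather than disk is presumably what keeps goodput high: the job loses at most one step's work instead of the time it takes to reload a very large checkpoint.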

The responsible AI development process for Gemini included impact assessments, mitigation research, model policies, and external evaluations covering safety, fairness, and robustness.

Main Features of Google Gemini

Mastering Human-Style Conversations, Language, Content, and Image Interpretation: Gemini Ultra exceeds human-expert performance, scoring 90% on the Massive Multitask Language Understanding (MMLU) benchmark, which demonstrates its competence in human-style conversation, language, and content. MMLU has been used to measure progress in language models since it was introduced in 2020. Additionally, Gemini models can describe intricate images, including charts and infographics, and can explain interleaved sequences of images, audio, and text, demonstrating their ability to interpret images.
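As a concrete illustration of the image-interpretation claim, the sketch below extends the earlier SDK example to a multimodal prompt. The "gemini-pro-vision" model name follows Google's announced naming, and the image file is a placeholder.

```python
# Minimal sketch: image + text prompting, assuming a "gemini-pro-vision" model.
import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder key

model = genai.GenerativeModel("gemini-pro-vision")
chart = PIL.Image.open("sales_chart.png")  # placeholder image file
response = model.generate_content([chart, "Explain what this chart shows."])
print(response.text)
```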

Driving Data Analytics and Creating New AI Apps and APIs: Gemini's innovative capabilities open up a wide range of possibilities for new applications. Gemini can foster advances in areas including education, everyday problem-solving, multilingual communication, information summarisation and extraction, and creativity, which points to its potential for driving data analytics. Once the models are available, developers can use them to create new AI apps and APIs, steering AI towards good.

Gemini vs. Other LLMs

To date, OpenAI's GPT models, such as GPT-3 and GPT-4, and Anthropic's Claude are among the largest language models. Gemini establishes new state-of-the-art (SOTA) results across major language and text benchmarks, often under zero-shot evaluation. In direct comparisons, Gemini models closely match or exceed GPT performance on major language benchmarks while offering additional multimodal capabilities (see Figure 1 below). Gemini also does markedly well on more intricate reasoning tasks, such as maths and coding tests, compared with GPT. Gemini is further developed to be safer, more helpful, and better aligned with human preferences, and early findings suggest advantages over GPT here. In brief, Gemini aims to be competitive with the best existing LLMs such as GPT-4 while offering native multimodality, stronger reasoning, and more oversight of safety, with the goal of advancing towards more generally capable AI.
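To make "zero-shot evaluation" concrete: the model is given only the question and answer choices, with no worked examples in the prompt. The sketch below shows the shape of such an evaluation loop; the question, prompt format, and scoring rule are all invented for illustration.

```python
# Hypothetical zero-shot, MMLU-style evaluation loop (illustrative data only).
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # placeholder key
model = genai.GenerativeModel("gemini-pro")

items = [
    {"q": "Which gas makes up most of Earth's atmosphere?",
     "choices": ["A) Oxygen", "B) Nitrogen", "C) Carbon dioxide", "D) Argon"],
     "answer": "B"},
]

def zero_shot_prompt(item):
    # No worked examples are included: the model sees only the bare question.
    return item["q"] + "\n" + "\n".join(item["choices"]) + "\nAnswer with one letter."

correct = 0
for item in items:
    reply = model.generate_content(zero_shot_prompt(item)).text
    if reply.strip().upper().startswith(item["answer"]):
        correct += 1
print(f"Accuracy: {correct / len(items):.0%}")
```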

Figure 1. Gemini performance on text benchmarks compared with other models

Future of Gemini

It looks like Gemini was developed to boost Google's enterprise products. Products such as Google Docs, Slides, Chrome, Meet, and more could gain from Gemini's sophisticated language comprehension and generation capabilities, leading to more supportive workflows, more tailored customer interactions, and greater modernisation. Once developers can access Gemini via Google Cloud, it could spark the invention of a new generation of AI apps in the market.

Future work looks set to revolve around developing Gemini models into a modularised system with general cross-modal reasoning and understanding capabilities. The reported benchmarks appear to show extraordinary capabilities, but the familiar LLM limitations around reasoning and hallucination linger. Because current benchmarks are becoming saturated, more robust evaluations and further research are needed to advance LLMs.

With the broad range of applications of Google Gemini in the field of AI, we are yet to uncover its real-world capacities: how it can deliver intelligent solutions, promote scientific progress, and improve human welfare.

About the author

Rosemary J Thomas, PhD, is a Senior Technical Researcher here at the Version 1 AI Labs.
