Reading between the lines on Gemini

Joshua Marker
As A Large Language Model…
5 min read · Dec 8, 2023

Last night, Google released their trailer — I won’t call it a ‘demo’ — for their new system, Gemini. In it, they show some really wild real-time multi-modal reasoning. There was a handful of other teaser videos as well, each dedicated to a couple other examples.

All had the same theme: an apparent conversation with a multi-modal AI that displays a radical amount of adaptability and even a degree of wit, delivered by an apparently humble engineer deadpanning as though unaware he's claiming extraordinary advances. Hey, he seems to be saying, we're Google. This is just what we do when we spend a few billion dollars.

Castor & Pollux, showing off

But why the puzzling 1-year-later release date? As someone pointed out, maybe for the same reason Google promised their phones would schedule haircuts for you five years ago.

But it’s still impressive, and they also published a great overview of the technical changes they made to accomplish it. Here are a few thoughts.

A 32k context window is not huge, so what wins are giving them all this magic?

  • Some of the innovations seem to be aimed at training on Google-specific hardware; I also see modifications to the attention technique, multi-query attention (https://arxiv.org/abs/1911.02150), that speed up inference and so would support multi-modality more responsively, a big issue for freeform reasoning. (A minimal sketch of the idea appears after this list.)
  • They claim the ability to pre-train a model in weeks with these changes; note that this involved multiple datacenters in parallel, so this isn’t all about technique. It’s nice to be Google. [This is another thing the Transformer model makes possible; this wouldn’t have really worked well w/ RNNs.]
  • Their edge-capable models come “with 1.8B (Nano-1) and 3.25B (Nano-2) parameters” for low- and high-memory situations. That’s ‘tiny’ (relatively)! A few billion parameters. That’ll run on your next phone. Maybe even your current one. (Some back-of-the-envelope memory numbers after this list.)
  • They seem to have moved multi-modality directly into the model, as they informally note. Ditto for audio/speech. They say it takes audio at 16 kHz, which captures frequencies up to 8 kHz: enough to catch intonation and such. I know linguists who should be spontaneously combusting right about now.
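The attention paper linked above is the multi-query attention paper: share a single key/value head across all query heads, so the key/value tensors that dominate memory bandwidth during incremental decoding shrink by a factor of the head count. Here is a minimal PyTorch sketch of that idea; the class, dimensions, and layer names are illustrative, not anything from Gemini itself.

```python
# Minimal sketch of multi-query attention (Shazeer, 2019, arXiv:1911.02150).
# Queries keep separate heads; keys and values use one shared head.
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)      # per-head queries
        self.k_proj = nn.Linear(d_model, self.d_head)  # single shared key head
        self.v_proj = nn.Linear(d_model, self.d_head)  # single shared value head
        self.o_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)  # (b, h, t, d)
        k = self.k_proj(x).unsqueeze(1)  # (b, 1, t, d), broadcast over heads
        v = self.v_proj(x).unsqueeze(1)
        att = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5  # (b, h, t, t)
        out = (att.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, t, -1)
        return self.o_proj(out)

# Usage: MultiQueryAttention(512, 8)(torch.randn(1, 128, 512)).shape -> (1, 128, 512)
```

The win is at decode time: instead of caching keys and values for every head, you cache one copy, which is exactly the memory-bandwidth bottleneck the paper targets.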
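And on the “that’ll run on your phone” point, some quick arithmetic of my own (not numbers from the report); the bit-widths are assumptions about how an on-device model would typically be quantized.

```python
# My own back-of-the-envelope arithmetic for weight memory:
# parameter count * bits per weight / 8, in gigabytes.
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    return n_params * bits_per_weight / 8 / 1e9

for name, n_params in [("Nano-1", 1.8e9), ("Nano-2", 3.25e9)]:
    row = ", ".join(
        f"{bits}-bit: ~{weight_memory_gb(n_params, bits):.1f} GB" for bits in (16, 8, 4)
    )
    print(f"{name}: {row}")
# Nano-1: 16-bit: ~3.6 GB, 8-bit: ~1.8 GB, 4-bit: ~0.9 GB
# Nano-2: 16-bit: ~6.5 GB, 8-bit: ~3.2 GB, 4-bit: ~1.6 GB
```

At aggressive quantization the smaller Nano is under a gigabyte of weights, which is why a current flagship phone is plausible.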

Other innovations include:

  • Training the tokenizer more exhaustively (interesting, I thought we had reached the asymptote there; a sketch of what this might look like in practice follows this list)
  • More aggressive training-data quality control (again, it’s good to have lots of money, but this is a very, very good sign that a mature process could handle the issues OAI still hasn’t dealt with)
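For the tokenizer point, here is a minimal sketch of what “training the tokenizer more exhaustively” could look like in practice, using the off-the-shelf sentencepiece library. The corpus path, vocabulary size, and settings are placeholders of my own, not Gemini’s configuration.

```python
# Illustrative only: retraining a SentencePiece tokenizer on a bigger, broader
# corpus sample. "corpus_sample.txt" is a hypothetical file; all settings are
# placeholders, not Gemini's actual values.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_sample.txt",   # hypothetical corpus dump
    model_prefix="tokenizer",
    vocab_size=256_000,          # large vocabularies help multilingual coverage
    model_type="bpe",
    character_coverage=0.9995,   # keep rare scripts representable
)

sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
print(sp.encode("Gemini tokenizes this sentence.", out_type=str))
```

The point being made in the bullet is simply that the vocabulary itself is a training artifact worth spending compute on, not a solved preprocessing step.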

One experiment involved the open question of whether multiple modes would dilute the overall quality. Turns out with enough parameters, nope. Every bit helps. This is neat to me. It parallels what was found with RNNs, too: You’d think you would be better off having domain-specific reasoners as building blocks, but, nope: just piling it in works better. This is a good sign for multimodality.

Oversight & Tuning

It looks like they are doing a lot, lot, lot more model tuning, and pre-training, and data-cleanup. And a lot more human oversight.

They’ve also taken seriously some ‘Asimov’ techniques: so-called ‘constitutional’ modeling to decrease harm and hallucination.
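“Constitutional” here is presumably in the sense popularized by Anthropic’s Constitutional AI: draft a response, critique it against written principles, revise, and fine-tune on the revisions. A rough sketch of that loop follows; the generate() stub and the principle text are stand-ins of mine, not anyone’s actual system.

```python
# Rough sketch of a constitutional critique-and-revise loop (in the style of
# Bai et al., 2022). `generate` is a placeholder for a real LLM call, and the
# principle below is made up for illustration.
CONSTITUTION = [
    "Identify ways the response is harmful, unsupported, or overconfident.",
]

def generate(prompt: str) -> str:
    # Stand-in for an actual model API call; replace with your own.
    return f"[model output for: {prompt[:40]}...]"

def constitutional_revision(prompt: str) -> str:
    draft = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(f"Response: {draft}\nCritique this response. {principle}")
        draft = generate(
            f"Response: {draft}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return draft  # revised drafts become fine-tuning targets
```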

Factuality and attribution tuning cut inaccuracy by 50%, increased attribution correctness by 50%, and (most impressive to me) raised hedging (i.e., honesty about lack of confidence) enormously on a specially crafted challenge task set. This task set is not industry-standard yet (WHAT?), so I don’t have OAI numbers for it, but Google said Gemini without that training did a 0% job at it. I’m curious to see OAI’s numbers, because man, it comes across like... an engineer.

Finally, they did find evidence of the model ‘cheating’ — no, that’s anthropomorphizing. They did find that the model accidentally trained on data about the tests used to evaluate models, so they tossed out some of the benchmarks. Hope they found them all.
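They don’t spell out how the leak was caught. A common decontamination check, sketched below as the general technique rather than Google’s actual pipeline, is to flag any training document that shares a long n-gram with a benchmark’s test items.

```python
# Generic n-gram-overlap contamination check (not Google's stated method):
# flag a training document if it shares any long n-gram with a benchmark item.
def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_contaminated(train_doc: str, benchmark_items: list[str], n: int = 13) -> bool:
    doc_grams = ngrams(train_doc, n)
    return any(doc_grams & ngrams(item, n) for item in benchmark_items)
```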

Analysis

This comes across as the smart kid in the class doing the assignment well, versus the Altman rush/all-nighter. Not to derogate OAI’s work; I simply mean that Google is running thoughtful benchmarks and experiments, trying different mechanisms for dealing with hallucination, and publishing it all, with footnotes.

That said, this is not a demo. I don’t see how the 32k context window could accomplish the real-time multi-modality they claim.

They note that it can be deployed thanks to acceleration on their TPU hardware (“It is efficiently serveable at scale on TPU accelerators”), and they say you can interleave video and audio data at variable resolutions to tune lowercase-a attention to the multiple modes, but they never explicitly state inference latency. That seems like an odd oversight, given that the real POW of the video was the conversational ability to react fluidly across modes. Nothing in their research overview suggests a real-time approach. I guess that’s a detail we’ll learn about in a year, assuming aliens haven’t landed by then.

Yeah — about that year. The rule of thumb is that any delivery date a year out may as well be a thousand years out. This reads like a teaser trailer, not a finished product.

Reference: Noam Shazeer, “Fast Transformer Decoding: One Write-Head is All You Need,” arXiv:1911.02150 (2019). Proposes multi-query attention, in which keys and values are shared across all attention heads, greatly reducing the memory-bandwidth cost of incremental decoding.
