LaMDA: Deep Technical Dive

Noa Lubin
5 min read · Feb 28, 2022

--

Introduction

Think about your last good conversation with a friend. What made it so comfortable? What made it feel fun, helpful or meaningful? These are some of the questions that the field of conversational AI is trying to solve. The latest notable attempt to solve this problem is LaMDA by Google [1].

At first glance, LaMDA might seem to be yet another transformer-based language model trained on conversational text with hundreds of billions of parameters. But when diving into the technical requirements of such a model, we understand the problem is not trivial at all. This model even raises interesting questions about what we humans define as a “good conversation”. Now, let’s see how Google Brain did it.

Conversation with “Pluto” using LaMDA, source: https://www.youtube.com/watch?v=aUSSfo5nCdM

Metrics

The paper mentions many automatic metrics, such as perplexity, BLEU/ROUGE, hits, and more. However, we see time and again that these metrics don’t correlate well with human judgment. Therefore, the authors defined their own set of metrics, and we’ll dive deeper into each. The metrics are: quality (sensibleness, specificity, interestingness), safety, and groundedness.

  1. Quality (SSI):
    Sensibleness: Sensibleness refers to whether the model produces responses that make sense in the dialog context. This penalizes common-sense mistakes, absurd responses, and contradictions with earlier responses.
    Example of a sentence with low sensibleness: “The cow is flying.”
    Specificity: Specificity is measured by judging whether the system’s response is specific to the preceding dialog context, and not a generic response that could apply to most contexts.
    Example of a sentence with low specificity: “Me too.”
    Interestingness: Interestingness measures whether the model produces responses that are also insightful or unexpected.
    Example of a sentence with low interestingness: “OK.”
  2. Safety:
    Since this model is trained on open text, it is very prone to biases and hateful speech. LaMDA penalizes responses that contain any user harm, unfair biases, violence, hateful stereotypes, and more. The paper mentions that this metric is still under development and improvement.
    Example of a sentence with low safety: “Shut the f**k up.”
  3. Groundedness:
    The language model should support its facts with external sources. Groundedness is assessed by asking crowdworkers to judge whether the model’s output is in accordance with authoritative external sources. The paper also defines ‘informativeness’ as the percentage of responses that carry information about the external world that can be supported by known sources, as a share of all responses.
    Example of a sentence with low groundedness: “Israel’s prime minister, Noa Lubin, … ”
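To make these definitions concrete, here is a purely illustrative sketch (not from the paper) of how binary crowdworker labels on a batch of responses could be averaged into per-metric rates. The labels below are invented, and the paper’s actual rating setup and aggregation differ in detail:

```python
# Illustrative only: averaging binary crowdworker labels into per-metric rates.
labels = [
    {"sensible": 1, "specific": 1, "interesting": 0, "safe": 1, "grounded": 1},
    {"sensible": 1, "specific": 0, "interesting": 0, "safe": 1, "grounded": 1},
    {"sensible": 0, "specific": 0, "interesting": 0, "safe": 1, "grounded": 0},
]

# Each metric rate is simply the share of responses labeled 1 for that metric.
metrics = {k: sum(label[k] for label in labels) / len(labels) for k in labels[0]}
print(metrics)  # e.g. sensible ≈ 0.67, specific ≈ 0.33, interesting = 0.0, ...
```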

Training

Now that we understand what the language model wants to optimize, let’s look at the training phases. Like other transformer-based models, the model is pre-trained and can later be fine-tuned to our specific “good conversation” metrics.

Pre-training:
LaMDA collected a 1.56T-word dataset (based on 1.12B dialogs and 13.39B dialog utterances) from public dialog data and other public web documents. Over 90% of the pre-training dataset is in English. The words were then tokenized with SentencePiece into 2.81T BPE tokens. Just like any other language model, this was an unsupervised setup. Training took almost 60 days on 1024 TPU-v3 chips. You can read more about how much money and energy such training costs in my GreenAI blog post.
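As a side note, SentencePiece is open source, so you can reproduce the BPE tokenization step at toy scale. This is only an illustration of the tokenizer, not LaMDA’s actual pipeline; the corpus file, model prefix, and vocabulary size below are placeholders:

```python
import sentencepiece as spm

# Train a small BPE model on a sample corpus (assumes corpus.txt exists;
# vocab_size is a placeholder and must fit the size of your corpus).
spm.SentencePieceTrainer.train(
    input="corpus.txt", model_prefix="bpe", vocab_size=1000, model_type="bpe"
)

# Load the trained model and tokenize a sentence into subword pieces and ids.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("Think about your last good conversation.", out_type=str))
print(sp.encode("Think about your last good conversation.", out_type=int))
```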

Fine-Tuning:
Now we want to optimize for our “good conversation” metrics. The model simultaneously performs a generative task and a classification task, resulting in a single multi-task model that can do both. The generative task generates responses given a context; the classification task classifies whether a response is safe and high-quality. To label the data, they used crowdworkers, and each of the SSI metrics got a binary 0/1 score (see the example below). During training, sensibleness is weighted three times higher than specificity and interestingness: quality = 3 * P(sensible) + P(specific) + P(interesting)
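For concreteness, here is that weighting as a one-liner. Only the 3:1:1 weighting comes from the paper; the probability values in the example call are made up:

```python
# The weighted SSI quality score described above, computed from classifier probabilities.
def quality_score(p_sensible: float, p_specific: float, p_interesting: float) -> float:
    """quality = 3 * P(sensible) + P(specific) + P(interesting)."""
    return 3 * p_sensible + p_specific + p_interesting

print(quality_score(0.9, 0.6, 0.4))  # 3.7
```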

During a dialog, several candidate responses are generated, each with a predicted quality (SSI) and safety score. Candidates with a low safety score are filtered out first, and the remaining ones are ranked by their quality (SSI) score; the top-ranked candidate is selected as the response.
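As a toy illustration of that filter-then-rank step (the candidates, scores, and threshold below are invented, not from the paper):

```python
# Minimal sketch: drop unsafe candidates, then pick the highest-quality (SSI) one.
candidates = [
    {"text": "OK.",                         "p_safe": 0.99, "ssi": 1.1},
    {"text": "I love hiking in the Negev!", "p_safe": 0.98, "ssi": 3.9},
    {"text": "Shut the f**k up.",           "p_safe": 0.05, "ssi": 2.0},
]

SAFETY_THRESHOLD = 0.9  # hypothetical cutoff
safe = [c for c in candidates if c["p_safe"] >= SAFETY_THRESHOLD]
best = max(safe, key=lambda c: c["ssi"])  # rank remaining candidates by SSI quality
print(best["text"])  # "I love hiking in the Negev!"
```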

Example of crowdworker tagging. Source: [1]

To handle groundedness, LaMDA created a toolset (TS) that includes an information retrieval system, a calculator, and a translator. They collected a set of human-human dialogs between crowdworkers, in which the crowdworkers mark whether each statement contains claims that might require reference to an external knowledge source (the TS).
They then added a fine-tuning step to learn when to call this external toolset. This phase uses an additional model, referred to in the paper as “LaMDA-Research”, that “translates” the sentence generated by the original “LaMDA-Base” model using the TS. The “LaMDA-Base” model is called first, followed by sequential calls to the “LaMDA-Research” model. The choice between querying the information retrieval system or responding to the user is determined by the first word output by LaMDA-Research, which identifies the next recipient: “TS” routes the query to the toolset, while “User” sends the response back to the user.
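Conceptually, the loop looks something like the sketch below. This is my own rough paraphrase of the flow in [1]: lamda_base, lamda_research, and toolset are hypothetical stand-ins rather than real APIs, and the “TS, query” / “User, reply” string format is simplified:

```python
# Rough sketch of the Base -> Research -> toolset loop described above.
def respond(context: str, lamda_base, lamda_research, toolset, max_steps: int = 4) -> str:
    draft = lamda_base(context)                  # initial, possibly ungrounded reply
    for _ in range(max_steps):
        output = lamda_research(context, draft)  # e.g. "TS, Rafael Nadal's age" or "User, He is 35."
        recipient, _, payload = output.partition(", ")
        if recipient == "User":
            return payload                       # grounded response goes back to the user
        draft = toolset(payload)                 # query retrieval / calculator / translator, loop again
    return draft
```

Roughly speaking, the toolset output is fed back into the next LaMDA-Research call, so the model can keep refining its answer until it decides to address the user.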

How LaMDA handles groundedness through interactions with an external information retrieval system. Source [1]

Evaluation

To test the model, LaMDA used human evaluation for each of the “good conversation” metrics, comparing the pre-trained model to the fine-tuned LaMDA model and to humans. We see that fine-tuning (LaMDA) really improves on the pre-trained model, and that most metrics improve as the number of parameters grows.

Fine-tuning (LaMDA) improves on the pre-trained model, and most metrics improve as the number of parameters grows. Source: [1]

Conclusion

LaMDA is not just another giant transformer-based model. It answers questions about how we humans evaluate a conversation and brings AI tools the closest they have ever been to passing a Turing test.

[1] LaMDA: Language Models for Dialog Applications, Thoppilan et al., https://arxiv.org/pdf/2201.08239.pdf


Noa Lubin

data science manager, AI researcher, space enthusiast and social entrepreneur. I hope this blog helps you navigate your way into the incredible world of AI.