Language Models

Jeff Foster
Published in Ingeniously Simple
5 min read · Oct 13, 2021

A language model is a probability distribution over sequences of words. As an example, a language model for English should be able to predict the next item in a sequence, or ideally generalize to responding to a question with a well-formed response. As engineers, we use language models all the time! When you use tab-completion (or IntelliSense), you’re using a language model.

Eliza

Eliza is one of the first language models. It originated at MIT in 1964 and was really designed to demonstrate the superficial nature of conversation with a non-directive psychotherapist. Depending on the human (!), it’s one of the first chatbots capable of attempting the Turing test, but it’s important to note that it was never designed to do that. It was really just taking the piss out of psychotherapists.

A conversation with Eliza

So, how does Eliza work? Well, there’s really just one trick. Eliza uses pattern matching to recognize common phrases, and replies with a bit of substitution trickery.

A peek under the curtain of Eliza
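To make the trick concrete, here is a minimal sketch of pattern matching plus substitution in Python. The rules, the reflection table and the function names (respond, reflect) are all invented for illustration; they are not taken from the original Eliza script, which was far larger.

```python
import re

# Hypothetical rules for illustration: a regex to recognize a phrase,
# and a reply template that reuses the captured text.
RULES = [
    (re.compile(r"\bi need (.+)", re.IGNORECASE), "Why do you need {0}?"),
    (re.compile(r"\bi am (.+)", re.IGNORECASE), "How long have you been {0}?"),
    (re.compile(r"\bmy (mother|father)\b", re.IGNORECASE), "Tell me more about your {0}."),
]

# Swap first and second person so the echoed text reads naturally.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are", "you": "I", "your": "my"}

DEFAULT_REPLY = "Please go on."


def reflect(fragment: str) -> str:
    """Replace each word that has a reflection; leave the rest untouched."""
    words = fragment.lower().split()
    return " ".join(REFLECTIONS.get(word, word) for word in words)


def respond(message: str) -> str:
    """Return the reply for the first rule that matches, else a stock phrase."""
    cleaned = message.strip().rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(cleaned)
        if match:
            return template.format(reflect(match.group(1)))
    return DEFAULT_REPLY


print(respond("I need a holiday from my job."))
# prints: Why do you need a holiday from your job?
```

Anything the rules don’t anticipate falls through to the stock reply, which is exactly the weakness discussed next.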

It’s amazing how well this approach works! However, it has a drawback: to cover every scenario, Eliza requires a human to rigorously define the entire syntax and semantics of the language. That’s quite hard, because the English language is wonderfully ambiguous!

  • “I shot an elephant in my pyjamas” (Groucho Marx)
  • “Call me an ambulance!”
  • “I’ve never tasted dinner like that before”

Eliza is a neat trick, but it’s not a super generalizable model!

N-gram models

An N-gram model is a probabilistic language model for predicting the next item in a sequence. It uses N-grams (runs of N consecutive words) to pick the next item, based on how frequently each N-gram occurs in some sample text.

4-grams for “once upon a …”

The above are the 4-grams for “once upon a <blank>”. The percentages indicate what proportion of the four-word sequences in the sample corpus consist of the given words. We can see from this that if a sentence begins “once upon a”, then the probability of the next word being “time” is very high (in fact, “time” is 700x more likely than “long”).
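As a rough illustration of where such numbers come from, the sketch below counts which words follow a three-word context and turns the counts into probabilities. The corpus is a toy one invented for this example and the function name next_word_distribution is hypothetical, so the figures will not match those quoted above.

```python
from collections import Counter


def next_word_distribution(text: str, context: tuple) -> dict:
    """Estimate P(next word | context) from raw counts in `text`."""
    words = text.lower().split()
    size = len(context)
    # Count every word that directly follows an occurrence of the context.
    followers = Counter(
        words[i + size]
        for i in range(len(words) - size)
        if tuple(words[i:i + size]) == context
    )
    total = sum(followers.values())
    return {word: count / total for word, count in followers.items()}


# A toy corpus, invented for this example.
corpus = (
    "once upon a time there was a king . "
    "once upon a time there was a queen . "
    "once upon a time a fox ran . "
    "once upon a hill stood a castle ."
)

print(next_word_distribution(corpus, ("once", "upon", "a")))
# prints: {'time': 0.75, 'hill': 0.25}
```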

Building an N-gram model is dead simple and it’s a really fun program to write. Here’s my attempt.

All you do is find all pairs of adjacent words (pairs, because I’m making a 2-gram model) and index them. To generate text, pick a random starting word, then keep rolling the dice, picking each next word according to the frequency distribution of the words that followed the current one.
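A minimal sketch of those steps in Python, assuming simple whitespace tokenization. The function names build_bigram_model and generate are made up for this example, and this is not necessarily how the author’s program is structured.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(text: str) -> dict:
    """Index every adjacent word pair: word -> Counter of the words that follow it."""
    words = text.lower().split()
    model = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        model[current_word][next_word] += 1
    return model


def generate(model: dict, length: int, rng: random.Random) -> str:
    """Pick a random start word, then sample each next word by observed frequency."""
    word = rng.choice(sorted(model))
    output = [word]
    for _ in range(length - 1):
        followers = model.get(word)
        if not followers:
            # Dead end (the word was never followed by anything): restart at random.
            word = rng.choice(sorted(model))
        else:
            candidates = list(followers)
            weights = list(followers.values())
            word = rng.choices(candidates, weights=weights, k=1)[0]
        output.append(word)
    return " ".join(output)


# Assumes the sample text has at least two words, so the model is not empty.
sample = "the cat sat on the mat and the cat ran"
model = build_bigram_model(sample)
print(generate(model, 12, random.Random(42)))
```

Passing in a seeded random.Random makes a run repeatable, which helps when comparing output across different sample texts.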

N-grams are surprisingly powerful! Below is some generated text from Romeo and Juliet. It’s not going to win any awards, but if you squint a bit, it’s fairly realistic gibberish.

“the all-cheering sun for all forth: well, think it is it well. balthasar. news be taken. — stay awhile. — stand up. romeo. he of this contract tonight; for earth to a month, a princox; go: be fourteen. how should be talked on, transcribe and paper, and sails upon that very night by thy wild acts denote the other terms of all the singer. i married then starts up, i’faith. will you run mad. o, he’s a madness most you pardon me to move is yond yew tree lay fourteen of wax. lady capulet. o lamentable chance? the constable’s own word: and how my”

The problem with N-gram models is that they have no memory or context beyond the previous N-1 words; they just relentlessly follow the frequency table. Google have built a HUGE n-gram model (see https://ai.googleblog.com/2006/08/all-our-n-gram-are-belong-to-you.html), and it’s very useful for spelling correction and simpler tasks like named entity recognition.

Neural Networks

A neural network takes one or more inputs and produces some output.

Obligatory diagram of Neural Networks

For example, it might be a model to calculate the price of a house. The inputs might be the number of bedrooms and the square footage of the house, and the output is a prediction of the price.
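As a sketch of what “inputs in, prediction out” means, here is the forward pass of a tiny network with two inputs, two hidden units and one output. Every weight and bias below is invented for illustration; a real network would learn them from data, and predict_price is a hypothetical name.

```python
def relu(x: float) -> float:
    """Rectified linear unit: pass positive values through, clamp negatives to zero."""
    return max(0.0, x)


def predict_price(bedrooms: float, square_feet: float) -> float:
    """Forward pass of a tiny network: 2 inputs -> 2 hidden units -> 1 output."""
    # Hidden layer: each unit is a weighted sum of the inputs plus a bias,
    # passed through a non-linearity. These numbers are made up.
    hidden_1 = relu(20_000.0 * bedrooms + 50.0 * square_feet - 10_000.0)
    hidden_2 = relu(5_000.0 * bedrooms + 120.0 * square_feet)
    # Output layer: a weighted sum of the hidden units plus a bias.
    return 1.0 * hidden_1 + 0.5 * hidden_2 + 30_000.0


print(predict_price(bedrooms=3, square_feet=1500))
# prints: 252500.0
```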

Neural networks have been around forever, but until relatively recently they’ve been a bit dormant. What’s changed things is vectorization: the ability to get GPUs and custom silicon to execute and train neural networks with way more parameters than before. The Nvidia A100 delivers up to 312 teraFLOPS. That’s 312,000,000,000,000 floating-point operations per second: 312 followed by 12 zeros. And that’s quite a lot of zeros. Couple that with advances in software (differentiable programming) and you’ve entered the world of deep learning.

Neural networks are optimization over functions: throw enough hardware and layers at the problem, and they’ll find a way to minimize the error on a given training set. Recurrent neural networks take this further: they represent optimization over sequences. I’ll never be able to explain RNNs as eloquently as Karpathy, so please take a few minutes to read his post: http://karpathy.github.io/2015/05/21/rnn-effectiveness/
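To show what “minimize the error on a given training set” looks like in the simplest possible case, the sketch below fits a single linear unit (y = w * x + b) by gradient descent on mean squared error. It is a deliberately tiny stand-in for a real network, and the data, learning rate and function name are all assumptions made for this example.

```python
def train(samples, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in samples) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in samples) / n
        # Step downhill: nudge each parameter against its gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Made-up training set generated from y = 2x + 1.
samples = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
w, b = train(samples)
print(round(w, 3), round(b, 3))
```

Deep learning frameworks do the same thing at scale: they compute the gradients automatically (that’s the differentiable programming part) and run the arithmetic on GPUs.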

At the time of writing, the state-of-the-art model is GPT-3. It is trained on unlabelled text, and the goal of the model is to predict what comes next (in effect, to fill in the blanks). The model has 175 billion parameters (!). https://paperswithcode.com/method/gpt-3 is a great resource to learn more.

This model is tremendously powerful! The examples below illustrate the model being used to complete source code. Some of it is jaw-droppingly amazing, some of it is a bit more meh!

Search for GitHub Copilot to learn more.

Examples: Given the first 2 lines, generate the rest!

Conclusion

Deep learning is promising a revolution (with language models enabling more conversational magic), but it’s important to note that we’ve been here before (see the AI winter).

I think this time is different though! Almost everything is digital nowadays, giving a huge amount of training data for almost any scenario. Computing performance is still doubling every few years (at least in terms of the number of cores), and there’s rapid innovation around neural networks once again.

Exciting times!
