Understanding Generative AI: A Leap to Language Prediction

A conceptual understanding of Generative AI, without the maths and technical detail

Jonathan Davis
6 min read · Sep 2, 2023

Since the release of ChatGPT in November 2022, everyone has heard of ‘GenAI’ or Large Language Models (LLMs), and may even have tried using them. However, most people don’t know what’s happening behind the scenes. This elicits responses ranging from “I don’t understand it, so I don’t trust it”, to “It seems to work, so I’ll use it”.

Granted, there is a high level of mathematical complexity underpinning the inner workings of these models, but it is not hard to gain a basic, conceptual understanding of what they are doing, and that understanding can encourage not only uptake but also more responsible and productive application.

Generative AI

Generative AI is a subset of AI (or perhaps more precisely machine learning) involving models that can approximate and generate new data with similar characteristics to an original source.

For generating language, this is essentially achieved by the model “guessing” the next word in a sequence.

This answer, simple as it may seem, is the crux of a far more intricate and fascinating process, one that revolves around language modeling.

Modeling Language

Language models are statistical structures that represent the intricate, probabilistic aspects of language. That is, they provide a representation of the likelihood that a letter or word will appear within a sequence of letters or words. For example, in the following sentence:

“That was amazing, I’d […] to do it again”

The word “love” is much more likely to appear in the brackets than the word “hate”. And both words are more likely to appear than “table”.

A simple type of language model, known as the bigram model, predicts the next letter in a sequence based on the previous letter. It does this by gathering probabilities of letter pairs from large quantities of text. These statistics can easily be computed today on a basic computer and will tell you that, for example, an H is most likely to be followed by an E. However, even without a computer, you can try this by picking a book and doing the following:

  1. Write down a letter
  2. Open the book to a random page
  3. Look for the first occurrence of the last letter you wrote
  4. Write down the letter after it
  5. Repeat from 2

The key here is choosing a random page: because the page is random, the letter pairs you collect will naturally reflect the general statistics of the language. If you want to make your model more accurate, you could complete this exercise in a library and replace step 2 with “open a random book to a random page”.
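To make the exercise concrete, here is a minimal Python sketch of the same idea, assuming we have some text to learn from. The tiny corpus and function names are purely illustrative; a real model would be built from far more text.

```python
import random
from collections import Counter, defaultdict

def build_bigram_table(text):
    """Count how often each letter follows each other letter."""
    letters = [c for c in text.lower() if c.isalpha()]
    table = defaultdict(Counter)
    for prev, nxt in zip(letters, letters[1:]):
        table[prev][nxt] += 1
    return table

def sample_next(table, letter):
    """Pick the next letter in proportion to how often it followed `letter`."""
    counts = table[letter]
    choices, weights = zip(*counts.items())
    return random.choices(choices, weights=weights)[0]

# Toy corpus standing in for "a book" -- real models use far more text
corpus = "that was amazing, i'd love to do it again and again"
table = build_bigram_table(corpus)

# Generate a short sequence of letters, starting from 't'
letter = "t"
sequence = letter
for _ in range(10):
    letter = sample_next(table, letter)
    sequence += letter
print(sequence)
```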

However, there is a major shortfall in this model: the lack of context that comes from looking only at the previous letter. For example, QU is much more likely to be followed by a vowel than a consonant, but this is irrelevant if all you can see is the U. The model can be improved by using the previous two (trigram), three (4-gram) or more letters instead of just one. This extra context improves the model, but it raises another issue.

In a given text, one can imagine that the bigram QW appears with a probability of 0% (that is, it doesn’t appear), which is likely true for the English language as a whole. However, as the context size is increased, the number of possible combinations of letters increases from 676 for bigrams to 17,576 for trigrams and 456,976 for 4-grams. To ensure each possible combination not only appears, but that its probability is accurately represented, the text would need to be prohibitively large.
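A quick back-of-the-envelope calculation in Python shows how quickly the table of possible combinations outgrows what a small text can cover. The sample sentence is only illustrative; the point is the gap between possible and observed combinations.

```python
# Number of possible letter combinations for each context length (26 letters)
for n, name in [(2, "bigrams"), (3, "trigrams"), (4, "4-grams")]:
    print(f"{name}: {26 ** n:,} possible combinations")

# How many of those actually appear in a small sample text?
sample = "that was amazing, i'd love to do it again"
letters = "".join(c for c in sample.lower() if c.isalpha())

for n in (2, 3, 4):
    observed = {letters[i:i + n] for i in range(len(letters) - n + 1)}
    print(f"{n}-grams observed: {len(observed)} of {26 ** n:,} possible")
```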

From letters to words

We can complete the same exercise with words instead of letters: take a given number of words, and guess the next word based on probabilities gathered from existing text. However, it is estimated that there are around one million words in the English language. This is vastly more than the 26 letters we have considered so far, and the number of combinations quickly becomes unwieldy (10¹² just for a bigram word model). As a result, the model is unlikely to have seen any given prompt text before and cannot accurately ‘guess’ the next word.

From words to numbers

There are techniques that can create numerical representations of words, called word vectors. They do this by looking at the words that generally appear around the given word. In the words of Walter Kaiser,

It is plain to see that words, like people, are known by the company they keep

The concept of a numerical representation of words may seem strange, but imagine plotting words on a graph. The word “big” may appear near the word “large” and “red” near “blue”.

Extending this to more than two or three dimensions, we can no longer visualize them on a simple graph but can produce more complex representations of the words. For example, in one dimension “big” may be very close to “small” representing the fact that they both define size; but in another dimension they may be far from each other, representing their ‘opposite-ness’.

Numerical representations allow us to do maths with words. For example,

king − man + woman = queen

London − England + France = Paris

More generally, we can now identify ‘similar’ words by, for example, finding words that are near each other. Because words are now numbers, ‘near’ can be calculated as something as simple as the distance between them. The consequence of this is that, where we have not observed an exact sequence of words in our existing text, we can look for similar sequences and observe which words generally follow them, giving a suitable next-word prediction. This overcomes one of the major issues we had with our n-gram models.
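If you want to try this yourself, the snippet below uses the gensim library with a small set of pretrained GloVe vectors. The exact neighbours and similarity scores depend on the vectors that are downloaded, so treat the output as illustrative rather than definitive.

```python
import gensim.downloader as api

# Download a small set of pretrained word vectors (~65 MB on first run)
vectors = api.load("glove-wiki-gigaword-50")

# king - man + woman: the nearest remaining word should be something like "queen"
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# 'Near' is just a distance (cosine similarity) between the word vectors
print(vectors.similarity("big", "large"))   # relatively high
print(vectors.similarity("big", "table"))   # much lower
print(vectors.most_similar("paris", topn=5))
```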


Large Language Models

Now that we have a method of representing language numerically, we can start to employ more complex language modeling techniques.

Neural networks, a type of machine learning model, find underlying relationships between data in a way that is meant to mimic the human brain.

These neural networks can model language by learning from vast quantities of text data (think the scale of the whole internet). They use this data to optimize the conversion of words into a numerical representation in order to understand the complex underlying relationships of language.

One type of neural network configuration, known as a Transformer model, was designed specifically for language applications and is built to effectively predict a word based on its context. To understand the importance of this, consider what “it” refers to in the following two sentences:

The purse didn’t fit in the bag because it was too big

The purse didn’t fit in the bag because it was too small

Changing one word in the sentence changes what “it” refers to. This highlights the difficulty of understanding words without looking at their context.
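As a rough illustration of context-based next-word prediction, the sketch below uses the Hugging Face transformers library with GPT-2, a small, publicly available Transformer model (not the model behind ChatGPT itself). The prompt and the exact words it produces are illustrative; the point is that the prediction depends on everything that came before.

```python
from transformers import pipeline

# GPT-2 is a small Transformer that predicts the next token given the text so far
generator = pipeline("text-generation", model="gpt2")

prompt = "The purse didn't fit in the bag because it was too"
# Ask for only a few extra tokens so we can see the model's 'guess' at the next word
result = generator(prompt, max_new_tokens=3, num_return_sequences=1)
print(result[0]["generated_text"])
```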

As an example, ChatGPT has a context window of 4,096 tokens (approximately 3,000 words), which means that when it ‘guesses’ the next word, it does so based on a huge amount of previous context, generating text that makes sense not just at the sentence level, but at the paragraph and document level.
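To get a feel for what a ‘token’ is, you can count them with OpenAI’s tiktoken library, which provides the tokeniser used by the GPT-3.5 family. The example text is arbitrary; the counts simply show that a token is often a word or a piece of a word.

```python
import tiktoken

# Tokeniser used by the GPT-3.5 family of models
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "That was amazing, I'd love to do it again"
tokens = enc.encode(text)

print(f"{len(text.split())} words -> {len(tokens)} tokens")
print(tokens)              # the numeric IDs the model actually sees
print(enc.decode(tokens))  # ...and back to the original text
```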

So, when you next use a Generative AI model, you’ll know that it’s taking the text you give it and ‘guessing’ the next word. It’s just a well-educated guess…
