Chatbot evolution: how enterprises can use the latest language models like ChatGPT safely — Part 1

Chris Booth
8 min read · Feb 2, 2023


Credit for image to highradius.com

Introduction

In recent years, natural language processing (NLP) has advanced rapidly in capability thanks to the development of the transformer architecture, the technology behind large language models (LLMs) and the most recently hyped term: generative AI.

These models, such as GPT-3, have achieved remarkable results across a range of NLP tasks, including language translation, question answering, and text summarisation.

Thanks to the release of ChatGPT, the chatbot space is one to watch over the next few years.

As technology advances, chatbots have become a vital tool for businesses of all sizes. They are cost-effective, available 24/7, and can handle various tasks.

But even with the invention of transformers, enterprise chatbots have remained mostly big logic trees, rudimentary in their abilities, especially when compared to what the research community has demonstrated.

So why have these models achieved superhuman performance in research and NLP benchmarks, yet aren’t finding their way into public-facing chatbots?

How do these super-chatbots work, and what will it take for enterprises to leverage this technology?

This blog series seeks to answer these questions. I’ll take you on a journey through the strengths and weaknesses of this recently hyped technology, and explain what large organisations need to make the most of it.

I’ve split this series into 7 chapters:

  1. What are Large Language Models and Transformers?
  2. What Makes ChatGPT So Unique?
  3. The Strengths and Weaknesses of Large Language Models
  3. Knowledge Graphs: the Yin to Transformers’ Yang
  5. Achieving Chatbot Excellence: Lessons from the Top 5 Performers
  6. Real-time Inferencing — the opportunity for chatbots to provide a super-agent experience
  7. Bringing this into production: ML Ops

Onwards to our first chapter!

In this chapter, we’ll be covering:

  • What are Large Language Models and Transformers?
  • Why did large language models create a step change in performance vs previous NLP techniques?
  • How do we prove large language models are achieving super-human performance?
  • Summary

What are transformers and large language models?

No, sadly, Optimus Prime and his transformer friends will not become your personal assistants. Source.

Before we begin our deep dive into the strengths and weaknesses, we’ll do a quick crash course on the basics to make sure we’re all on the same page:

  • What are transformers and language models?
  • Why did transformers create a step-change in performance?

What are large language models?

Overview of the LLM landscape and popular models/tools. Source.

Have you ever noticed how Google suggests the next words in your search query as you type?

This convenient feature is powered by Large Language Models (LLMs).

In a nutshell, these models are responsible for predicting the likelihood of word sequences in a way that mimics human writing.

They’re powered by a clever neural network architecture called a transformer.
And they’re called “large” because, as we’ll dive into later, they have enormous numbers of parameters and are trained on vast swathes of text from across the internet.
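To make “predicting the next word” concrete, here is a rough sketch using a small open model (GPT-2) via the Hugging Face transformers library. GPT-2 is only a toy stand-in here; production models are far larger, but the principle is the same:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load a small open language model purely for illustration
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The best thing about chatbots is that they"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # a score per vocabulary token, at every position

# Turn the scores at the final position into probabilities for the *next* token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)
print([tokenizer.decode(int(i)) for i in top.indices])  # the model's five most likely continuations
```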

Why did large language models create a step change in performance vs previous NLP techniques?

BLEU scores (Higher is better) from Google’s Transformer: A novel approach article

The step change in language models is driven by four factors:

  • Transformer architecture
  • Attention mechanism
  • Parameter size
  • Quantity of data they’re trained on

Transformer mechanism (explained simply)

Transformer architecture

The transformer architecture is a type of neural network architecture introduced in the paper “Attention Is All You Need” by Google researchers in 2017.

The key innovation of the transformer architecture is using self-attention mechanisms in place of traditional recurrent neural networks (RNNs) or convolutional neural networks (CNNs).

The transformer architecture consists of an encoder and a decoder, each made up of multiple layers.

The encoder takes in an input sequence and produces a set of hidden states, which the decoder uses to create an output sequence.

Each layer in the encoder and decoder consists of two sub-layers: a multi-head self-attention mechanism and a feed-forward neural network.

The multi-head self-attention mechanism allows the model to weigh the importance of different parts of the input sequence when making predictions; the feed-forward network then applies a further transformation to each position’s representation before passing it on to the next layer (and, ultimately, to the decoder).
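As a rough sketch of how these pieces fit together, here is the encoder-decoder structure built from PyTorch’s built-in transformer layers. The sizes mirror the base model from the paper, and the random tensors stand in for real token embeddings; treat this as an illustration, not a full implementation:

```python
import torch
import torch.nn as nn

# Base-model sizes from "Attention Is All You Need": 512-dim embeddings, 8 heads, 6 layers
d_model, nhead, num_layers = 512, 8, 6

encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)

decoder_layer = nn.TransformerDecoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)

src = torch.randn(1, 10, d_model)   # one "sentence" of 10 token embeddings
tgt = torch.randn(1, 7, d_model)    # the output sequence generated so far

memory = encoder(src)               # encoder hidden states
out = decoder(tgt, memory)          # the decoder attends to the encoder's hidden states
print(out.shape)                    # torch.Size([1, 7, 512])
```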

Self Attention mechanism

The encoder’s self-attention distribution for the word “Law”: every word in the sentence is weighted against every other word.

Attention allows the model to focus on the most important parts of the text and ignore irrelevant information.

In summary, the transformer mechanism builds an understanding of a sentence’s context by breaking it into tokens and computing, for each token, how strongly it relates to every other token.

This architecture makes the model far better at capturing the context and nuance of human language than previous approaches.
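To make the attention idea concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. Real transformers use many heads, per-layer learned projections and more besides, so this is an illustration rather than a faithful implementation:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention: every token attends to every other token.
    X: (seq_len, d_model) token embeddings; Wq/Wk/Wv: learned projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])           # how relevant each token is to each other token
    scores -= scores.max(axis=-1, keepdims=True)      # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)    # each row sums to 1: the attention distribution
    return weights @ V                                # each output is a weighted mix of all tokens

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                               # four "tokens", eight-dimensional embeddings
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)            # (4, 8)
```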

Parameter size

The parameter size in LLMs refers to the number of variables or parameters the model has.

A parameter is a numerical value (a weight) inside the model that is adjusted during training. Collectively, these weights are what give the model its ability to learn and replicate the patterns and nuances of human language.

(Parameters shouldn’t be confused with hyperparameters, such as the number of layers in the neural network, the number of neurons in each layer, the activation function used, or the learning rate. Those are set by developers before training and can be tuned to improve the model’s performance.)

When a model is trained, its parameters are adjusted to find the values that best optimise performance on the training objective. In other words, the parameters are fitted to the data.

A larger parameter count gives the model more capacity: with more parameters, it can learn and reproduce more of the patterns and nuances of human language present in its training data.

Consequently, this makes the model more accurate and realistic than traditional NLP algorithms.

The most recent and high-performing LLMs have reached staggering sizes: GPT-3 has 175 billion parameters, and Google’s PaLM has 540 billion.
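If you want to see what a parameter count means in practice, a few lines of Python will count the learned weights of a small open model. GPT-2’s smallest variant has roughly 124 million parameters; GPT-3, at 175 billion, is over a thousand times larger:

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")

# Every trainable weight in the network counts as a parameter
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params:,} parameters")   # roughly 124,000,000 for the smallest GPT-2
```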

The quantity of data they’re trained on

Models like GPT-3 and PaLM are trained on massive amounts of text data. GPT-3, for example, was trained on a mixture that includes a filtered version of Common Crawl (roughly 570GB of text after filtering), the WebText2 corpus, two books corpora, and English Wikipedia.

The training data is typically drawn from a variety of sources, such as books, articles, and websites, and spans a diverse range of topics.

It’s hard to emphasise how large these corpora of text are. A typical Word document of 1,500 words is roughly 15KB. In contrast, the filtered dataset for GPT-3 was around 570GB: divide 570GB by 15KB and you get roughly 38 million, so the training corpus is tens of millions of times bigger than your average document!

By consuming more data, LLMs can learn a broader range of language patterns and variations, which allows them to replicate human language and perform a wide range of tasks such as language translation, question answering, and more.

How do we prove large language models are achieving super-human performance?

It’s much harder for enterprises to compare one another’s chatbot performance, for various political, governance and technical reasons.

That being said, I do believe there is room to close what is currently a large gap in this field (I’ll hopefully have an announcement very soon!).

In the academic space, however, there are numerous benchmark datasets and success criteria used to compare the latest models’ performance against each other.

SuperGLUE is currently one of the go-to benchmarks for language understanding.

As you can see in the current leaderboard, seven models are rated above human-level performance.

Note: the human baseline of 89.8 comes from average annotators given the same tests; using English literature or linguistics experts as the benchmark would set a much higher bar, which wouldn’t be a useful comparison at the current level of capability.

To show how these models are put to the test, let’s look at one of my favourites:

Winograd Schema Challenge

The Winograd Schema Challenge was proposed by Hector Levesque and is named in honour of Terry Winograd, a computer scientist whose early work on natural language understanding inspired its example sentences.

The Winograd Schema Challenge was created because of the limitations of traditional NLP algorithms, which struggle with interpreting pronoun antecedents and coreference resolution (understanding the relationships between words in a sentence and figuring out what words like “he”, “she”, and “it” refer to).

These are critical elements of language that humans process naturally but are difficult for computers to understand.

The Winograd challenge aims to test AI’s ability to understand and process these language nuances.

The challenge consists of a set of pairs of sentences, each containing a pronoun with an ambiguous antecedent.

The goal is to determine the pronoun’s antecedent correctly. The sentences are designed to be simple and straightforward, but the ambiguity of the pronoun makes them hard for AI algorithms. A classic example: “The trophy doesn’t fit in the brown suitcase because it is too big.” Does “it” refer to the trophy or the suitcase?

The test is deliberately more difficult than traditional NLP tasks such as sentiment analysis, so that the community is driven to overcome the challenge and ultimately push us towards more capable AI!
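As a rough sketch of how such a test can be run, one common zero-shot approach is to substitute each candidate antecedent for the pronoun and ask a language model which completed sentence it finds more likely. GPT-2 is used here purely as an illustration, and the log-likelihood estimate is approximate:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sentence_log_likelihood(sentence: str) -> float:
    """Approximate total log-likelihood the model assigns to a sentence."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=ids makes the model report its average next-token loss
        loss = model(ids, labels=ids).loss
    return -loss.item() * ids.shape[1]   # higher (less negative) = more plausible to the model

# The pronoun "it" has been replaced by each candidate antecedent in turn
candidates = [
    "The trophy doesn't fit in the brown suitcase because the trophy is too big.",
    "The trophy doesn't fit in the brown suitcase because the suitcase is too big.",
]
best = max(candidates, key=sentence_log_likelihood)
print(best)   # a capable model should prefer the first reading
```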

So, to summarise:

  • Transformers and large language models are advanced NLP models that predict word sequences in a human-like way.
  • Performance improvement is due to transformer architecture, attention mechanism, parameter size, and the amount of training data.
  • Transformer architecture uses self-attention instead of traditional neural networks, allowing the model to focus on important parts of the text.
  • The parameter size impacts the model’s ability to learn language patterns, with larger parameters allowing for more accurate results.
  • Large language models are trained on massive amounts of diverse text data, improving their ability to replicate human language.
  • Yes, we really can say that we’re entering the era of chatbots achieving super-human performance in narrow domains!

Onwards to the next chapter!

What makes ChatGPT so unique?
