Solving NLP One Hug at a Time

Marc Templeton
Kainos Applied Innovation
12 min read · Mar 22, 2021

A brief look into the change in the NLP landscape and a practical example of it with Hugging Face.

NLP word cloud. Source: Blume Global

Creating a single, general-purpose application that can perform a range of natural language tasks, such as question answering, language translation and text generation, has been at the forefront of natural language processing (NLP) since the field’s inception as far back as the 1950s!

Recent developments suggest we are one step closer to realising that goal.

Index

Along Came A Transformer

Transformers: Stuck Like GLUE At The Top

It’s Not Easy Being A Transformer

Does Bigger Always Mean Better?

The Cast of Transformers

Hugging Face

Fine-Tuning BERT for Contradiction Detection

Who’s Transforming

Along Came A Transformer

[Index]

Transformation of a toy Optimus Prime. Source: BrownBox Reviews

The NLP field has undergone a transform(er)-ation in the last few years, in both architecture and mindset. The architecture has evolved away from popular recurrent neural networks (RNNs), such as long short-term memory (LSTM) networks, and towards transformers — not the ones you’re thinking of! A key difference between these setups is the transformer’s ability to dynamically choose what to remember, a characteristic known as its ‘attention’ — check out Yannic Kilcher’s excellent video walkthrough to learn more!

Transformer architecture explanation from the paper that unveiled it, ‘Attention Is All You Need’. Source: YouTube
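
For the curious, the ‘attention’ at the heart of that paper boils down to a few lines. Below is a minimal sketch of single-head scaled dot-product attention in PyTorch; the real model adds masking and multi-head projections on top.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """softmax(QK^T / sqrt(d_k)) V: the weights decide what each token attends to."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5  # similarity of every query to every key
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1 per query
    return weights @ v                             # weighted mix of the values

# Toy example: a batch of 1 'sentence' with 4 tokens and 8-dimensional embeddings
q = k = v = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(q, k, v).shape)  # torch.Size([1, 4, 8])
```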

Furthermore, the discipline has taken a leaf out of image recognition’s book by producing pre-trained, general language models. These general language models are huuuggee — like 33GB plus for NVIDIA’s MegatronLM — and capture the underlying structure of language, which is then tailored to a specific use case; a technique called transfer learning. Previously, models were typically trained from scratch for each individual task.

Transformers: Stuck Like GLUE At The Top

[Index]

GLUE and SuperGLUE benchmark scores. Source: GLUE and SuperGLUE

What does this mean in practice when you put it all together? Well… its success cannot be overstated: the General Language Understanding Evaluation (GLUE) benchmark is a series of NLP tasks, from sentiment analysis to question answering to sentence entailment, designed to test and rank a model’s competence. These large, pre-trained transformer models came along and demolished the competition, often setting state-of-the-art (SOTA) results. Clearly, a harder set of NLP tasks had to be created; step in SuperGLUE.

It’s Not Easy Being A Transformer

[Index]

There are a few disadvantages to large general language models, for both those who create them and those who use them. I want to take each of these perspectives in turn over the next two sections — if you’re not interested in the drawbacks that transformer models bring and only want to see what they can churn out, skip forward to how Kainos’ Innovation team utilised BERT.

Does Bigger Always Mean Better?

[Index]

Creating these models requires both excessive computational power and a large data lake of text. The latter comes with its own headaches: collecting, parsing, standardising and storing hundreds of TBs of data. Computational power is a challenge in its own right; GPUs are matrix computation and job parallelism specialists — perfect for training machine learning models and speeding the process up n-fold. That being said, even using flagship GPU processors on efficient cloud networks still requires maaanny hours, e.g. NVIDIA’s MegatronLM took 512 V100 GPUs over 9 days to train!

It follows, too, that utilising these enormous models in production — where storing and inferencing models incurs cost — creates friction. For example, the popular model hosting platform Heroku has a 500MB slug size limit.

One way of addressing this issue is to scale back the complexity of these models, which negatively impacts performance. Finding the trade-off between model size and performance is just another thing for machine learning engineers to ponder. However, an interesting outcome of pre-training these models is that, as you scale the complexity up, the returns keep getting better without plateauing, creating SOTA models on many NLP tasks — as described in the video below.

Creating larger and larger language models seems to result in ever-increasing performance. Source: YouTube

However, this raises an interesting and valuable point from the world of… mathematical proofs — bear with me! There are many different techniques within mathematics for proving things, such as induction or contradiction, which involve various clever ways of thinking about and manipulating the problem. Proof by exhaustion doesn’t. It simply tests every one of a finite number of cases and concludes whether they all hold up. Whilst it is neither computationally refined nor beloved by the mathematical community, it can produce results — a famous example is the four colour theorem, which was originally solved using this technique… buuutt has since been proven in a more mathematician-friendly way. We have a similar situation with these large general language models: computationally heavy to create and run but, rather differently, beloved by the community. Hold up. Let’s backtrack a few sentences. I mentioned that this form of proof wasn’t held in high regard by mathematicians, and that is for one simple reason:

We don’t learn anything about the subject; we simply throw raw computational power at it

We’ve all heard the famous saying that if you gave infinite monkeys enough time they’d produce the works of William Shakespeare.

Infinite monkeys typing for an infinite amount of time would eventually type all of William Shakespeare’s plays. Source: The Creativity Post

This is an example of brute force, in which the outcome, in this case a Shakespeare play, is the result of mashing keys at random; the output is correct but the mechanism for achieving it is sub-optimal. We do not learn anything about how Shakespeare glued together the words he did, from the perspective he did, with the emphasis he did. It was simply the output of huge overheads, as was the exhaustive proof of the four colour theorem, and as are the outputs of these colossal language models. We cannot abstract away the understanding if we do not know how all the moving pieces connect together — or perhaps we can… let’s put a pin in that thought for now!

The Cast of Transformers

[Index]

There is no shortage of institutes and fully fledged companies investing time and resources into NLP and, specifically, into transformer models.

It took me just 30 minutes to find institutes and companies who have taken the idea of transformer models and run with it; I found 10! From the likes of Microsoft, Google, Facebook, OpenAI, Nvidia and AI2 no less — find the full list here!

A Lack of Standards

[Index]

However, there is another problem I want to home in on, one that is true not just of the recent developments in NLP but across the entire umbrella of artificial intelligence: a lack of standards. The end goal here has enormous potential to break through and disrupt the digital industry, from how we communicate with devices and each other on a daily basis to how creative written media is generated, and a whole host more! It is no surprise that we see big companies racing against each other to be the early bird that reaps the rewards. Although this new transformer architecture does produce SOTA results, how a model is constructed is still open to interpretation: the corpus it is trained on, for how long, and how it is implemented. Lacking traditional software development standards means that incorporating these models is wildly different from one to another — step forward Hugging Face!

Hugging Face

[Index]

Hugging Face is an NLP-specialist research and development group focused on transformer models, with its goal summed up on its website, “on a mission to solve NLP, one commit at a time.” But it is so much more than that.

We are striving to make machine learning models more explainable and accessible to users of all proficiencies by incorporating low-code and ‘bring your own data’ setups, an ambition these transformer models reflect. These new, powerful models — even early in their lifecycle — can add significant value to a business. Hugging Face understands this and so provides well-documented, easy-to-implement wrappers to over 3,000 open-source NLP models in both PyTorch and TensorFlow 2.0 — popular machine learning libraries. This enables quick turnarounds on research and development, which has proved popular with… ahem, certain companies.
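
To give a flavour of just how little code that involves, here is a minimal sketch using the transformers pipeline API (the default model it downloads and the exact output shown are illustrative):

```python
from transformers import pipeline

# Downloads a default pre-trained sentiment model from the Hugging Face hub
classifier = pipeline("sentiment-analysis")

print(classifier("Hugging Face makes transfer learning almost too easy."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```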

Apple downloaded over 45TB of Hugging Face models during a research experiment. Source: Twitter

Hugging Face understands the benefits that transformer models bring, but it also sees the environmental footprint left by training, deploying and inferencing these huge models.

Let’s unpin that earlier thought on whether we can shrink these models and retain their understanding without knowing their inner workings, because Hugging Face has a solution.

The boffins at Hugging Face have applied a technique — known as knowledge distillation — to create smaller versions of models. The technique uses a teacher-student setup: a small ‘student’ model is trained to reproduce the behaviour of the large ‘teacher’ model, so the teacher’s knowledge is retained within the student. They have already used this on BERT and GPT2 models to create DistilBERT and DistilGPT2 respectively, reducing the size of the original BERT model by 40% while retaining 97% of its understanding.
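
As a rough sketch of the idea (not Hugging Face’s exact training recipe, which blends in further signals), the student is trained against a mix of the teacher’s softened predictions and the ground-truth labels:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Teacher-student distillation: match the teacher's softened distribution
    as well as the hard labels. T and alpha are illustrative hyperparameters."""
    soft_targets = F.softmax(teacher_logits.detach() / T, dim=-1)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        soft_targets,
        reduction="batchmean",
    ) * (T * T)  # rescale so gradients are comparable across temperatures
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```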

Hugging Face is also striving to implement common standards among NLP models by pushing for models to ship with model cards. Model cards are short write-ups that sit alongside released models to provide clarity: the domains the model is intended to be used in, performance figures within those domains, and evaluation across the variety of situations the model is anticipated to be exposed to. For example, an NLP model labelled as creating newspaper headlines from short story descriptions must state what corpus it was trained on and in which language(s). Google has released its own version of a model card pipeline.

Fine-Tuning BERT for Contradiction Detection

[Index]

The Applied Innovation team within Kainos wanted to experiment with transfer learning in NLP models to detect contradiction within short pieces of text. The idea was that this general contradiction-detecting model could be utilised in a number of domains, such as legal, to flag up disagreements within service contracts.

The aim of the project was as follows:

Could we adapt a pre-trained language model to determine whether 2 pieces of text contradict each other?

The goals of the project were as follows:

Step 1: Attain a dataset with labelled text fields: contradiction or not

Step 2: Use Hugging Face’s library to get a mid-sized pre-trained language model

Step 3: Run a training job on AWS to fine-tune the base model

Step 4: Compare the performance of the base model against the fine-tuned model

Step 1: Attain contradiction dataset

I used the Stanford Natural Language Inference (SNLI) corpus, which has over 570,000 human-written English sentence pairs manually labelled as entailment, contradiction or neutral. This boded well and required only a small amount of processing to create a dataset with an equal number of contradiction and non-contradiction records.
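
As a sketch of that preprocessing, assuming the Hugging Face datasets library rather than whatever tooling was actually used at the time, the binarising and balancing might look like this:

```python
from datasets import load_dataset, concatenate_datasets

snli = load_dataset("snli", split="train")
# SNLI labels: 0 = entailment, 1 = neutral, 2 = contradiction, -1 = unlabelled
snli = snli.filter(lambda ex: ex["label"] != -1)

# Collapse to a binary task: contradiction (1) vs not contradiction (0)
snli = snli.map(lambda ex: {"labels": int(ex["label"] == 2)})

contradiction = snli.filter(lambda ex: ex["labels"] == 1)
other = snli.filter(lambda ex: ex["labels"] == 0).shuffle(seed=42)

# Down-sample the larger class so both are equally represented
balanced = concatenate_datasets(
    [contradiction, other.select(range(len(contradiction)))]
).shuffle(seed=42)
```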

Step 2: Download a mid-sized pre-trained general language model

I used the Hugging Face repository which — as mentioned before — contains over 3,000 transformer models. However, I was able to quickly whittle the selection down to a handful of the more popular ones before settling on BERT.

I chose BERT for several reasons:

  • It had options for cased and uncased models
  • It had various sized models — allowing me to reduce my ML footprint
  • It had decoupled tokeniser, config and vocab implementations
  • It allows for ‘heads’ to be switched out easily
  • Its implementation was well-documented
  • It’s a fan favourite — and allows me to use pictures of BERT from now on!

The pre-trained general language model selected was BERT. Source: TechViz

The first step involved using the Hugging Face transformers Python SDK to load in a pre-trained BERT model and tokenizer — a doddle when you have excellent documentation.
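
Roughly, that amounts to something like the following; the choice of bert-base-uncased and the example sentence pair are mine, not the project’s exact configuration:

```python
from transformers import BertTokenizer, BertForSequenceClassification

MODEL_NAME = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(MODEL_NAME)
model = BertForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# Sentence pairs are fed to BERT as one sequence separated by a [SEP] token
inputs = tokenizer("The cat sat on the mat.", "The cat is not on the mat.",
                   return_tensors="pt", truncation=True, padding=True)
```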

I saved this locally to minimise bandwidth calls using PyTorch’s model save function and similarly their load function to retrieve it again. However, since this POC was going to use AWS, the model components were also saved to S3.

A subset of the s3_utils.py file that deals with saving the model components to S3.
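
That file isn’t reproduced here, but a rough equivalent using torch.save and boto3 might look like this (the bucket, prefix and local paths are placeholders, not the project’s real names):

```python
import os

import boto3
import torch

def save_model_to_s3(model, tokenizer, bucket, prefix, local_dir="/tmp/bert"):
    """Save the model weights and tokenizer files locally, then upload them to S3."""
    os.makedirs(local_dir, exist_ok=True)
    torch.save(model.state_dict(), f"{local_dir}/model.pt")
    tokenizer.save_pretrained(local_dir)  # writes vocab.txt, tokenizer config, etc.

    s3 = boto3.client("s3")
    s3.upload_file(f"{local_dir}/model.pt", bucket, f"{prefix}/model.pt")
    s3.upload_file(f"{local_dir}/vocab.txt", bucket, f"{prefix}/vocab.txt")
```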

Step 3: Configure the model and run it on AWS as a SageMaker training job

Conscious of a point brought up earlier in this blog around ML footprints, the training dataset was reduced to just 20,000 records (plenty to achieve the aim of this project).

Loading the vocab file from S3 and providing it to the decoupled tokenizer proved a problem:

Currently, the implementation for loading a tokenizer using the BertTokenizer class only supports local file references, whereas reading the object from an S3 bucket returns the vocab as a string.

Inspecting the code, it was reasonable to change it to accommodate the string, as shown in the code snippet below from line 193 of the tokenization_bert.py file.

Changes made to Hugging Face’s tokenization_bert.py file to enable loading the vocab from a string.
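
That modified snippet isn’t shown above either; the spirit of the change, applied to the load_vocab helper in tokenization_bert.py, might look something like this (an illustrative reconstruction, not the project’s exact diff):

```python
import collections
import os

def load_vocab(vocab_file_or_string):
    """Load a BERT vocabulary from a local file path or, if the argument isn't
    a path on disk, treat it as the vocab contents read straight out of S3."""
    vocab = collections.OrderedDict()
    if os.path.isfile(vocab_file_or_string):
        with open(vocab_file_or_string, "r", encoding="utf-8") as reader:
            tokens = reader.readlines()
    else:
        tokens = vocab_file_or_string.splitlines()
    for index, token in enumerate(tokens):
        vocab[token.rstrip("\n")] = index
    return vocab
```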

The project was containerised using Docker and pushed into an existing AWS ECR repository.

docker tag <IMG NAME> <ECR REPO URL>
docker push <ECR REPO URL>

To initiate a SageMaker training job a custom estimator had to be created and called.

SageMaker estimator used to run the custom image found in ECR.

An instance with over 16GB of memory was required for the training job — as such an ml.m5.2xlarge instance type was used.
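
A minimal sketch of such an estimator with the SageMaker Python SDK is shown below; the image URI, role and S3 paths are placeholders for the real project values:

```python
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri="<ECR REPO URL>:latest",  # the Docker image pushed earlier
    role="<SAGEMAKER EXECUTION ROLE ARN>",
    instance_count=1,
    instance_type="ml.m5.2xlarge",      # 32GB of memory, comfortably over the 16GB needed
    output_path="s3://<BUCKET>/bert-contradiction/output",
)
estimator.fit({"train": "s3://<BUCKET>/bert-contradiction/train"})
```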

Running this instance took just over 1 hour for 2 epochs and achieved a minimum loss of 0.642.
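
For context, the loop running inside that container is conceptually simple; a bare-bones sketch is below, with illustrative hyperparameters rather than the job’s exact settings:

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader

def fine_tune(model, train_dataset, epochs=2, lr=2e-5, batch_size=16, device="cpu"):
    """Fine-tune a sequence classification model on batches of tokenised pairs.
    Each batch is expected to be a dict of tensors that includes 'labels'."""
    model.to(device)
    model.train()
    loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
    optimizer = AdamW(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # passing 'labels' makes the model return a loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```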

The trained model components were saved to S3.

Step 4: Compare the test accuracy of the base model against the fine-tuned one

Running the test dataset of 5,000 records against the pre-trained BERT model and the fine-tuned one resulted in the following:

The fine-tuned BERT model out performs the original. Source: Giphy

Pre-trained model accuracy: 50.94%

Fine-tuned model accuracy: 56.14%
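
For reference, the comparison itself is just an accuracy count over the held-out pairs, along these lines (assuming tokenised test batches served by a PyTorch DataLoader and a recent transformers version whose outputs expose .logits):

```python
import torch

def accuracy(model, dataloader, device="cpu"):
    """Fraction of test pairs the model labels correctly."""
    model.to(device)
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for batch in dataloader:
            labels = batch.pop("labels").to(device)
            outputs = model(**{k: v.to(device) for k, v in batch.items()})
            preds = outputs.logits.argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.size(0)
    return correct / total
```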

Let’s not get too far ahead of ourselves: it’s a small margin at best. So why is that? Well, there are 2 main reasons:

  1. Training size — 20,000 records just isn’t that many; using more of our original 300,000 records could help.
  2. Complex task — detecting contradiction is far more than simply spotting negating words or identifying antonyms; it relies heavily on context, which these models struggle to infer, as recognised by Stanford researchers.

Whilst these numbers were not fantastic, the team gained value and experience by achieving the project’s aim of:

Predicting whether 2 pieces of text contradicted each other using a pre-trained language model

It also validated the use of Hugging Face as a quick development resource capable of producing POC standard results with minimal cost and time investment.

The Applied Innovation team identifies this chaotic and disparate development as a problem within the field of artificial intelligence and has researched topics such as how to open up the black boxes of machine learning models and the area of low-code/no-code applications.

If this blog has interested you — check out the Innovation team’s publication below!

AI2: Embeddings from Language Models (ELMo)
