Reflections from a workshop on transformers and NLP

This blog summarises my learnings from a workshop conducted by NVIDIA on leveraging transformers to build NLP applications. Do note that this is not a sponsored post, just a reflection on a workshop I happened to participate in!

Recently, I happened to attend a workshop on leveraging transformers to build NLP applications. Working in the NLP space, I realized that this workshop would be relevant for my professional growth and decided to give it a go. I must say I was not let down, as it was a rigorous and intensive one.

Overview of the course

NLP, or Natural Language Processing, refers to a family of techniques that enable us to extract value out of text data. The need to process textual data has been on the rise alongside the internet and social media era, and it has grown to cater to related areas such as voice assistants, where NLP is applied after speech processing through machine translation, sentiment analysis, etc. The course covered the evolution of NLP, focusing in particular on how transformers became a revolutionary component of recent times, paving the way for NLP applications that approach (or even surpass) human accuracy with better latency.

The course, conducted by Adam Grzywaczewski, Senior Deep Learning Solution Architect at NVIDIA, consisted of three parts: the evolution of NLP, how BERT and transformers revolutionised the NLP realm, and how to deploy large-scale NLP workloads. For those interested in more details, a course factsheet containing an overview can be found here.

For more courses offered by the NVIDIA Deep Learning Institute, I recommend visiting https://www.nvidia.com/en-in/training/ to find out more about self-paced and instructor-led courses.

Below, I have compiled my learnings on the evolution of text representations and how that led to the advent of techniques such as BERT. Do read on!

The first part focused on leveraging machine learning techniques in NLP: how to represent text, and how to leverage ML to derive value out of such representations. Broadly, NLP consists of these two facets (text representations and the corresponding ML algorithms), as depicted below.

Image courtesy: NVIDIA Deep Learning Institute

At each step of NLP modelling, numerous design decisions need to be taken, such as how to represent tokens, determining the length of the representation, etc. The sections below outline how such techniques have evolved over time.

Pitfalls of one-hot encodings for text

One of the first aspects to consider when modelling text data is how to represent it in a machine-readable form. The earliest approaches involved encoding text as one-hot vectors. This led to significant memory and sparsity problems, since even a corpus with a limited vocabulary gives rise to p >> n (p, the number of distinct tokens, tends to significantly exceed n, the number of training examples).

Image courtesy: NVIDIA Deep Learning Institute. Although the two sentences differ very little in semantics, one can imagine the sparsity problem introduced by such representations.

Also, for related words, such as the singular and plural forms ‘cat’ and ‘cats’, these encodings were unable to capture any semantic similarity. As a next step, counts of words in a document were taken (a bag-of-words representation), but this suffers from the same pitfalls.
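To make the sparsity point concrete, here is a minimal sketch (assuming scikit-learn is available; the toy sentences are my own) of count/one-hot style vectors. Note how ‘cat’ and ‘cats’ end up in unrelated columns and how most entries are zero.

```python
# A rough illustration of one-hot / count representations and their pitfalls.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the cats sat on the mats",
]

# binary=True records presence/absence, akin to stacking one-hot vectors per token type
vectorizer = CountVectorizer(binary=True)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # 'cat' and 'cats' are separate, unrelated columns
print(X.toarray())                         # mostly zeros as the vocabulary grows
```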

TF-IDF

Term Frequency–Inverse Document Frequency (TF-IDF for short) is another mathematical technique, which penalizes words that appear frequently across the corpus. While this distils the representation towards rarer, more distinctive words, it still does not capture elements of meaning. Read more on TF-IDF here.
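For illustration, here is a small sketch (assuming scikit-learn; the toy corpus is mine) of TF-IDF weighting, where a ubiquitous word like ‘the’ is down-weighted relative to a rare, distinctive word like ‘shorting’.

```python
# A minimal TF-IDF sketch: words frequent across the corpus get lower weights.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "traders were shorting the stock today",
    "the cat sat on the mat",
    "the dog sat on the mat",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

vocab = tfidf.get_feature_names_out()
first_doc = dict(zip(vocab, X.toarray()[0].round(2)))
print(first_doc)  # 'shorting' receives a higher weight than the ubiquitous 'the'
```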

Distributed representations of text

To mitigate the issues above, new approaches based on distributed representations of text started to emerge.

This was motivated by the quote,

‘You shall know a word by the company it keeps’

- J.R. Firth

Take the example of the word “shorting”. In a generic sense, “shorting” is not commonly used, but in the financial domain, phrases such as “shorting a stock” are commonplace. Hence, words started to be represented in terms of the context they appear in. Techniques emerged that captured the co-occurrence of words, i.e., which words co-occurred with one another within a set of documents, and represented this in matrix form, as below.

Image courtesy: NVIDIA Deep Learning Institute; a matrix of tokens where each entry indicates that a token co-occurs with another in a set of documents

Representations like the above help to capture the semantic relationships of tokens rather than just the syntactic ones. For example, comparing “The cat sat on the mat” with “The __ sat on the mat”, such representations would tell us that ‘cat’ and ‘dog’ often occur in this context, as opposed to, say, ‘cat’ and ‘computer’.
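Below is a minimal sketch (plain Python plus NumPy; the toy corpus and window size are my own) of how such a co-occurrence matrix can be built with a sliding context window. ‘cat’ and ‘dog’ end up with similar rows because they appear in similar contexts.

```python
# Build a word co-occurrence matrix over a toy corpus with a fixed context window.
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the mat",
]
window = 2  # how many neighbours on each side count as "context"

vocab = sorted({w for sentence in corpus for w in sentence.split()})
index = {w: i for i, w in enumerate(vocab)}
M = np.zeros((len(vocab), len(vocab)), dtype=int)

for sentence in corpus:
    words = sentence.split()
    for i, word in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if i != j:
                M[index[word], index[words[j]]] += 1

print(vocab)
print(M)  # the rows for 'cat' and 'dog' look alike: they share the same contexts
```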

There is, however, a need for compact and computationally efficient representations. Dimensionality reduction techniques were needed to achieve this, and also to obtain robust notions of distance that expose the information captured by the distributional representation. This would also ensure that similar documents end up closer to each other.

LSA: Latent Semantic Analysis

Latent Semantic Analysis decomposes the co-occurrence matrix into sub-components using mathematical techniques such as truncated SVD, where k denotes the number of largest singular values retained.

Image courtesy: NVIDIA Deep Learning Institute
Image courtesy: https://en.wikipedia.org/wiki/File:Topic_model_scheme.webm. The video shows how LSA brings documents containing words with similar meaning closer to each other, achieving the two goals of grouping similar words together and grouping similar documents together.
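As a rough sketch of the idea (assuming scikit-learn; the documents below are illustrative), LSA can be approximated by running truncated SVD on a term-document (here TF-IDF) matrix, yielding k-dimensional document vectors.

```python
# A minimal LSA sketch: truncated SVD over a TF-IDF matrix, keeping k components.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "the cat sat on the mat",
    "a kitten rested on the rug",
    "interest rates and stock markets moved today",
]

X = TfidfVectorizer().fit_transform(docs)           # sparse term-document matrix
lsa = TruncatedSVD(n_components=2, random_state=0)  # k = 2 largest singular values
doc_vectors = lsa.fit_transform(X)

print(doc_vectors)  # documents sharing vocabulary end up closer in the reduced space
```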

Drawbacks of LSA

It is still computationally expensive: modern NLP co-occurrence matrices can have entries running into the billions. Choosing k is a hyperparameter-tuning task, and determining which k is adequate raises the question of how much information one can afford to drop. Stability at extreme values of k (too small or too large) is also an issue.

Neural Network based approaches: Word2Vec, GloVe

The AlexNet moment in computer vision, along with the rise in compute, meant neural networks could now be leveraged for NLP tasks, something that was not feasible in the early 90s (when autoencoders came to light but could not be trained at scale due to compute constraints).

Word2Vec, put forth by Mikolov et al., was one of the simplest conceivable neural networks: a single layer, no non-linearities, just weight and bias terms. The objective was to learn the embedding layer and discard everything else once the training task was done. This embedding layer acts as a look-up table of token representations.

Image courtesy: NVIDIA Deep Learning Institute; Word2Vec leverages a simple NN to learn an embedding layer whose length is a hyperparameter.

This led to the two common approaches called Continuous Bag of Words (predict a word from its surrounding context) and Skip-gram (given the input word, predict the surrounding words). The embeddings learnt from the textual sequence could encode various dimensions of meaning, a clear upgrade from the era of one-hot encodings. A key property of the Word2Vec algorithm is that it is unsupervised (or more precisely, self-supervised). This means that it can be trained with huge corpora and generate very useful embedding vectors.
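As a rough sketch (assuming the gensim library; the toy sentences and hyperparameters are mine), a skip-gram Word2Vec model can be trained and its learnt look-up table queried as follows. Real models are of course trained on far larger corpora.

```python
# Train a tiny skip-gram Word2Vec model and inspect the learnt embeddings.
from gensim.models import Word2Vec

sentences = [
    "the cat sat on the mat".split(),
    "the dog sat on the mat".split(),
    "the dog chased the cat".split(),
]

model = Word2Vec(
    sentences,
    vector_size=50,  # embedding length, a hyperparameter
    window=2,        # context window size
    min_count=1,
    sg=1,            # 1 = skip-gram, 0 = CBOW
    epochs=100,
)

print(model.wv["cat"][:5])                # embedding looked up from the learnt table
print(model.wv.similarity("cat", "dog"))  # cosine similarity between two embeddings
```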

GloVe came next, which enabled learning the embeddings directly by merging the ability to tap into co-occurrence (statistical) relations with a neural network's capability to scale across different corpora. By learning vectors based on co-occurrence, it not only started to learn synonyms, but meaningful representations that could even identify antonyms. The team behind GloVe has done path-breaking work in the NLP space. Do check out Stanford's CS 224N course on NLP, which is an excellent resource for learning NLP.
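For a feel of what pre-trained embeddings capture, here is a small sketch (assuming gensim and its downloadable data; the identifier below refers to a 50-dimensional GloVe model fetched on first use):

```python
# Load pre-trained GloVe vectors and probe semantic relationships.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")  # downloads a small pre-trained model

print(glove.most_similar("cat", topn=3))    # nearest neighbours in embedding space
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# The analogy king - man + woman typically lands near 'queen'.
```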

Now that text representations can be obtained, how do we leverage them?

“Attention is all you need”

With the neural network approach leading to efficient representations, it was natural to extend them to the actual modelling task at hand, be it machine translation, sentiment analysis, etc. Vanilla RNNs were the first approach tried, but with text data being inherently sequential, longer sequences led to problems of vanishing and exploding gradients. LSTMs offset this to an extent by using the concepts of a memory cell and gates. However, the text still had to be processed sequentially, and with growing corpora of text data, sequential processing was a huge bottleneck.

Enter the idea of attention.

There is a common notion that the average human attention span is around 8 seconds. Yet when humans write text, their attention span over the text itself is remarkably long: we routinely make a co-reference to an entity mentioned much earlier, after a long stretch of tokens.

E.g. “Harris liked eating at the cafe typically in the mornings as that meant he could finish his jog at the park and then have a good meal”

In the above sentence, notice how we as humans refer back to an earlier entity, Harris, much later in the sequence ("he"). The challenge for neural networks is to capture this dependency without losing track of everything that occurs in between.

The paper "Attention Is All You Need" leverages this intuition and puts forward the technique called ‘attention’ to take in information from an entire span of tokens and preserve it. This link is a brilliant illustration of the attention mechanism; it helped me understand what attention is all about.

The paper introduced the foundations of what is now famously known as the transformer model, built around self-attention and multi-head attention. Transformers can be trained significantly faster than either a CNN or an RNN because they lend themselves well to parallelism, and more variations continue to be developed. Learn more about the transformer architecture here.
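At the core of the mechanism is scaled dot-product attention, attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. Here is a minimal NumPy sketch of that single equation (toy shapes, random data), not of a full transformer:

```python
# Scaled dot-product attention: each output is a weighted mix of the values V,
# with weights given by how strongly each query matches each key.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # query-key similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V, weights

seq_len, d_k = 4, 8  # a toy "sentence" of 4 token vectors
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))

out, attn = scaled_dot_product_attention(Q, K, V)
print(attn.round(2))  # each row sums to 1: how much each token attends to every other
```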

In a nutshell, transformers are made of encoder and decoder blocks, which are connected but do not share any weights. Because all tokens in a sequence are processed at once rather than one after another, the architecture allows for parallel processing, as seen in this brilliant illustration from https://jalammar.github.io/illustrated-transformer/

Image courtesy: https://jalammar.github.io/illustrated-transformer/

There is no doubt that transformers were a revolutionary introduction to the NLP space, which later led to Google releasing BERT (Bidirectional Encoder Representations from Transformers), bringing significant advancements to the NLP realm post 2018.
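For a sense of how accessible this has become, here is a minimal sketch (assuming the Hugging Face transformers library is installed; the model name is a commonly used fine-tuned checkpoint, downloaded on first use) that runs sentiment analysis with a BERT-family model:

```python
# Sentiment analysis with a pre-trained BERT-family (DistilBERT) model.
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

print(classifier("The workshop was rigorous but well worth attending."))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]
```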

There were undoubtedly many more learnings on the evolution of NLP from this workshop, but I'll wrap up this post with some key takeaways.

1. There have been a lot of improvements post the RNN era, all the way to transformers, thanks to increased compute and larger datasets.

2. The process of training such models for cutting-edge NLP tasks is still inherently prone to human bias, and that needs to be handled with the utmost diligence no matter what domain one is working in.

3. Training models of the size of BERT, GPT, etc. needs to take into consideration the environmental footprint they leave behind.

Image courtesy: https://www.technologyreview.com/2019/06/06/239031/training-a-single-ai-model-can-emit-as-much-carbon-as-five-cars-in-their-lifetimes/

4. Latency is often critical in business applications, especially in the medical domain, so post-training optimisations need to be applied to the model. Some techniques that can help here are model inference code optimisations, model compression, quantization and knowledge distillation (a small sketch of one of these follows below).
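As one illustrative example of such post-training optimisation (assuming PyTorch; the small model below is just a stand-in, not a real NLP model), dynamic quantization stores the weights of linear layers as int8:

```python
# Post-training dynamic quantization: int8 weights for Linear layers.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 2))  # stand-in classifier head

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 768)
print(quantized(x))  # same interface, smaller weights, typically faster CPU inference
```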

Thanks and Cheers,

Kishan
