Achieving State-Of-The-Art Results In Natural Language Processing

Part 2: ELMo, BERT and MT-DNN

Laura Mitchell
Bumble Tech
Apr 24, 2019

In my previous article I covered some of the fundamental concepts in NLP. This time, I will introduce ELMo, BERT and MT-DNN, architectures that leverage these concepts and have recently achieved some state-of-the-art (SOTA) results on NLP tasks.

Embeddings from Language Models (ELMo)

An introduction to ELMo can be found in this paper. ELMo aims to provide improved word representations for NLP tasks by producing multiple embeddings for a single word, one for each context it appears in. For example, the word “minute” has multiple meanings (it is a homonym), so ELMo represents each occurrence with a different embedding, whereas models such as GloVe would give every instance the same representation regardless of its context.

ELMo uses a bidirectional language model (biLM) to learn both the words and their linguistic context. At each word, the internal states from the forward and backward passes are concatenated to produce an intermediate word vector. It is this bidirectional nature that gives the model a sense not only of the next word in a sentence but also of the words that came before it.
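To make this concrete, here is a minimal PyTorch sketch (illustrative dimensions, not the actual ELMo weights or implementation) of how a bidirectional recurrent encoder produces one intermediate vector per token by concatenating the forward and backward states:

```python
# Minimal sketch: a bidirectional LSTM concatenates forward and backward
# hidden states to give one intermediate vector per token.
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128

embedding = nn.Embedding(vocab_size, embed_dim)
# bidirectional=True runs a forward and a backward LSTM over the sequence
bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

token_ids = torch.randint(0, vocab_size, (1, 7))   # one sentence of 7 tokens
states, _ = bilstm(embedding(token_ids))

# states has shape (1, 7, 2 * hidden_dim): for every token, the forward and
# backward hidden states are concatenated into one intermediate word vector.
print(states.shape)  # torch.Size([1, 7, 256])
```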

Another feature of ELMo is that its language model comprises multiple layers, forming a multilayer RNN. The intermediate word vector produced by layer 1 is fed up to layer 2, and so on. The more layers the internal states pass through, the more abstract the semantics they represent, such as topic and sentiment; by contrast, lower layers capture less abstract features, such as short phrases or parts of speech.
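ELMo’s standard recipe then combines the outputs of these layers with a learned, softmax-normalised weighting per downstream task. The sketch below illustrates that combination with made-up tensors:

```python
# Illustrative sketch of combining the outputs of multiple biLM layers into a
# single ELMo-style representation via a learned, softmax-normalised weighting.
import torch
import torch.nn as nn

seq_len, dim, num_layers = 7, 256, 3          # e.g. char-CNN layer + 2 biLSTM layers
layer_outputs = [torch.randn(1, seq_len, dim) for _ in range(num_layers)]

scalar_weights = nn.Parameter(torch.zeros(num_layers))  # learned per downstream task
gamma = nn.Parameter(torch.ones(1))                     # learned global scale

w = torch.softmax(scalar_weights, dim=0)
elmo_vector = gamma * sum(w[i] * layer_outputs[i] for i in range(num_layers))
print(elmo_vector.shape)  # torch.Size([1, 7, 256])
```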

In order to compute the word embeddings that get fed into the first layer of the biLM, ELMo uses a character-based CNN: the input is built purely from combinations of characters within a word. This has two key benefits (see the sketch after this list):

  • It is able to form representations of words external to the vocabulary it was trained on. For example, the model could determine that “Mum” and “Mummy” are somewhat related before even considering the context in which they are used. This is particularly useful for us at Badoo as it can help detect misspelled words through context.
  • It continues to perform well when it encounters a word that was absent from the training dataset.
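Here is a simplified sketch of such a character-level CNN (the character ids and dimensions are made up for illustration). Because the word vector is built purely from characters, related spellings such as “Mum” and “Mummy” naturally end up sharing features:

```python
# A simplified character-level CNN, in the spirit of the one ELMo uses to build
# its context-independent word representations from characters alone.
import torch
import torch.nn as nn

num_chars, char_dim, num_filters, kernel_size = 262, 16, 32, 3

char_embedding = nn.Embedding(num_chars, char_dim)
conv = nn.Conv1d(char_dim, num_filters, kernel_size)

# "mummy" as a sequence of (illustrative) character ids, shape (1 word, 5 chars)
char_ids = torch.tensor([[12, 20, 12, 12, 24]])

x = char_embedding(char_ids).transpose(1, 2)   # (1, char_dim, chars_in_word)
features, _ = torch.relu(conv(x)).max(dim=2)   # max-pool over character positions
print(features.shape)  # torch.Size([1, 32]) — a word vector built purely from characters
```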

Bidirectional Encoder Representations from Transformers (BERT)

BERT, introduced in a recent paper by Google, is built on the Transformer, an attention-based architecture that learns contextual relations between the words in a text. Unlike models such as ELMo, which read the text sequentially in each direction (left-to-right and right-to-left), here the entire sequence of words is read at once: one could actually describe BERT as non-directional.

Essentially, BERT is a trained stack of Transformer encoders, where the output of one encoder is passed up to the next.

At each encoder, self-attention is applied; this lets the encoder look at the other words in the input sentence as it encodes each specific word, helping it to learn the correlations between them. The results then pass through a feed-forward network.
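The sketch below shows the core of that self-attention step for a single attention head (illustrative dimensions only; the real encoder uses multiple heads plus residual connections and layer normalisation):

```python
# A bare-bones sketch of (single-head) scaled dot-product self-attention, the
# operation each encoder applies so that every token can attend to every other.
import math
import torch
import torch.nn as nn

seq_len, dim = 7, 64
x = torch.randn(1, seq_len, dim)               # token representations for one sentence

W_q, W_k, W_v = (nn.Linear(dim, dim) for _ in range(3))

q, k, v = W_q(x), W_k(x), W_v(x)
scores = q @ k.transpose(-2, -1) / math.sqrt(dim)   # how much each word attends to the others
weights = torch.softmax(scores, dim=-1)
attended = weights @ v                              # context-aware representation per token

print(attended.shape)  # torch.Size([1, 7, 64]) — then fed through a feed-forward network
```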

BERT was pre-trained on large amounts of unlabelled text (including Wikipedia) and uses masked language modelling rather than sequential (next-word) modelling during training. It masks 15% of the tokens in each sequence and tries to predict their original values from the surrounding context. This involves the following steps (see the sketch after this list):

  1. Adding a classification layer on top of the encoder output.
  2. Multiplying the output vectors by the embedding matrix, transforming them into the vocabulary dimension.
  3. Calculating the probability of each word in the vocabulary with softmax.
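Here is a simplified sketch of that prediction head (the real one also applies an activation and layer normalisation; the dimensions and token ids are illustrative), including the detail that the loss is only computed at the masked positions:

```python
# Sketch of a masked-LM prediction head: a linear layer over the encoder output,
# projection onto the vocabulary via the embedding matrix, then a softmax.
import torch
import torch.nn as nn

vocab_size, hidden_dim, seq_len = 30522, 768, 9

embedding = nn.Embedding(vocab_size, hidden_dim)        # shared input embedding matrix
transform = nn.Linear(hidden_dim, hidden_dim)           # 1. classification layer on the encoder output

encoder_output = torch.randn(1, seq_len, hidden_dim)    # stand-in for the final encoder states

hidden = torch.tanh(transform(encoder_output))
logits = hidden @ embedding.weight.T                    # 2. multiply by the embedding matrix -> vocabulary dimension
probs = torch.softmax(logits, dim=-1)                   # 3. probability of each word in the vocabulary
print(probs.shape)  # torch.Size([1, 9, 30522])

# The loss is computed only at the masked positions (here, say, tokens 2 and 5):
masked_positions = torch.tensor([2, 5])
true_ids = torch.tensor([1037, 2158])                   # the original (now hidden) token ids
loss = nn.functional.cross_entropy(logits[0, masked_positions], true_ids)
```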

Because the BERT loss function only takes into consideration the predictions at the masked positions, the model converges more slowly than directional models. This drawback, however, is offset by its increased awareness of context.

Multi-Task Deep Neural Networks (MT-DNN)

MT-DNN extends BERT to achieve even better results on NLP problems. It uses multi-task (parallel) learning rather than purely sequential transfer learning: mini-batches from different tasks are interleaved, the loss for each batch is computed with that task’s own output layer, and the resulting update is applied to the shared model.
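The toy sketch below (invented tasks and dimensions, with a plain feed-forward network standing in for the shared Transformer) illustrates the general idea: task-specific heads sit on top of a shared encoder, and mini-batches from different tasks are interleaved so that every update also refines the shared representation:

```python
# Toy multi-task learning: two task-specific heads share one encoder, and
# mini-batches from the two tasks are interleaved during training.
import torch
import torch.nn as nn

hidden_dim = 128
shared_encoder = nn.Sequential(nn.Linear(300, hidden_dim), nn.ReLU())  # stand-in for the shared Transformer layers
heads = {
    "sentiment": nn.Linear(hidden_dim, 2),      # e.g. single-sentence classification
    "similarity": nn.Linear(hidden_dim, 1),     # e.g. pairwise text similarity (regression)
}
params = list(shared_encoder.parameters()) + [p for h in heads.values() for p in h.parameters()]
optimizer = torch.optim.Adam(params, lr=1e-4)

for task, batch, target in [
    ("sentiment", torch.randn(8, 300), torch.randint(0, 2, (8,))),
    ("similarity", torch.randn(8, 300), torch.randn(8, 1)),
]:
    output = heads[task](shared_encoder(batch))
    loss = (nn.functional.cross_entropy(output, target) if task == "sentiment"
            else nn.functional.mse_loss(output, target))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                            # updates both the head and the shared encoder
```

Sharing the encoder means each task effectively regularises the others, which is a large part of where the multi-task setup gets its extra generalisation.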

As with BERT, MT-DNN tokenises the input sentences and transforms them into the initial word, segment and position embeddings. A bidirectional Transformer encoder is then used to learn the contextual word embeddings.
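A minimal sketch of that input step, in which the initial representation of each token is the sum of its word, segment and position embeddings (the ids are purely illustrative):

```python
# Sketch of building the initial input representation as the sum of token,
# segment and position embeddings.
import torch
import torch.nn as nn

vocab_size, hidden_dim, max_len = 30522, 768, 512

token_embed = nn.Embedding(vocab_size, hidden_dim)
segment_embed = nn.Embedding(2, hidden_dim)       # sentence A vs. sentence B
position_embed = nn.Embedding(max_len, hidden_dim)

token_ids = torch.tensor([[101, 7592, 2088, 102, 2129, 2024, 2017, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)

inputs = token_embed(token_ids) + segment_embed(segment_ids) + position_embed(position_ids)
print(inputs.shape)  # torch.Size([1, 8, 768]) — fed into the Transformer encoder
```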

Architecture of MT-DNN (Liu et al., 2019)

Summary

So, now you know a bit more about the latest developments in the field of NLP that have led to the achievement of some SOTA results.

I hope this series of articles has motivated you to find out more about how deep-learning architectures can be applied to NLP tasks and gives you some inspiration as to where they could drive innovations in your workplace.

If you’ve enjoyed reading this and are keen to work on similarly interesting projects, we are looking to expand our Data Science team at Badoo, so please do get in touch! 🙂

If you have any suggestions/feedback, feel free to comment below.
