Does a Computer Understand Me?

Natural Language Processing — Advances in 2018

Sana Tariq · OPUS · Jan 2, 2019



The field of natural language processing (NLP) made huge strides in 2018.

A total of 49,412 publications went live, of which 30,203 were research articles.

Worldwide revenue, including hardware, software, and services, reached $1.9 billion.

Major conferences (ACL, COLING, and EMNLP) attracted scientists contributing to areas such as machine learning for NLP, machine translation, NLP applications, sentiment analysis, and social media.

Institutes like the Allen Institute for AI and Google AI were hard at work, producing advances like ELMo and BERT, respectively.

So where does the field stand currently?

Let’s take a step back and talk about the concept of embeddings.

Computers understand numbers, which is why image and audio processing are easier than text-based analysis. Pixel intensities for image data and spectral density coefficients for audio data provide meaningful information through their numerical representations, making it easier for computers to parse the data.

Language analysis, as in NLP, proves trickier because assigning arbitrary numbers to the words “tree” and “chair” would offer no information about their relationship to each other. This is where the concept of word embeddings comes in: words or phrases from the vocabulary are mapped to vectors of real numbers based on their relationships to each other.

word2vec

One such model is word2vec, a two-layer neural network that processes text; it is a mathematical way of capturing similarity. The input is a large body of text and the output is a set of vectors, where words that share a common context end up mathematically closer to each other than words that are unrelated.
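To make this concrete, here is a minimal sketch of training word2vec on a toy corpus with the gensim library (the corpus, the parameter values, and the gensim 4.x API are my assumptions for illustration; real models are trained on billions of words):

```python
# Minimal word2vec sketch using gensim (4.x API assumed).
# The toy corpus below is hypothetical and far too small for useful vectors.
from gensim.models import Word2Vec

corpus = [
    ["the", "tree", "grows", "in", "the", "forest"],
    ["the", "chair", "stands", "in", "the", "kitchen"],
    ["a", "tall", "tree", "gives", "shade"],
    ["she", "sits", "on", "the", "wooden", "chair"],
]

model = Word2Vec(
    sentences=corpus,
    vector_size=50,   # dimensionality of the word vectors
    window=3,         # context window size
    min_count=1,      # keep every word in this tiny corpus
    epochs=50,
)

# Each word is now a vector of real numbers.
print(model.wv["tree"][:5])                   # first 5 dimensions of the "tree" vector
print(model.wv.similarity("tree", "chair"))   # cosine similarity between the two words
```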

[Figure: example graph of word vectors]

Let’s look at the classic example of mapping the word “France” against other countries and measuring the cosine similarity between the word vectors. The output ranges from 0 to 1, where 1 equals complete overlap and 0 equals no overlap. (Mapping France with France would equal 1.)
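For reference, the similarity score behind those numbers is the cosine of the angle between two vectors: their dot product divided by the product of their lengths. A small sketch, using short made-up vectors in place of real, higher-dimensional embeddings:

```python
# Cosine similarity between two word vectors:
#   cos(u, v) = (u . v) / (|u| * |v|)
# The three-dimensional vectors below are invented purely for illustration;
# real word2vec embeddings typically have 100 to 300 dimensions.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

france = np.array([0.9, 0.1, 0.3])   # hypothetical vector for "France"
spain  = np.array([0.8, 0.2, 0.3])   # hypothetical vector for "Spain"
chair  = np.array([0.1, 0.9, 0.0])   # hypothetical vector for "chair"

print(cosine_similarity(france, france))  # 1.0: complete overlap
print(cosine_similarity(france, spain))   # close to 1: related words
print(cosine_similarity(france, chair))   # much lower: unrelated words
```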


However, such a model produces a single embedding per word, regardless of context. For a model that understands an entire body of text and generates a contextual representation of each word based on the other words in the sentence, we require more sophisticated methods.

Enter ELMo…

[Image: Elmo versus ELMo]

Embeddings from Language Models (ELMo)

ELMo takes word embeddings a step further by using a deep neural network for training. It embeds words into real vector space and is bidirectional in nature, meaning it predicts each word in a sequence using knowledge of both the words that come before and the words that follow.

Using ELMo as a pre-trained embedding significantly reduces the amount of training data needed. But even though ELMo is a step up from word2vec, its bidirectionality is still shallow: it simply concatenates, i.e. joins, the left-to-right and right-to-left representations, so the representation as a whole isn’t built from both directions simultaneously.
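As a rough illustration of what that concatenation means (a simplified toy sketch, not ELMo’s actual multi-layer architecture; the sizes and random inputs are arbitrary): one LSTM reads the sentence left to right, another reads it right to left, and each word’s representation is simply the two outputs glued together.

```python
# Toy sketch of ELMo-style "shallow" bidirectionality in PyTorch.
# Two independent LSTMs read the sequence in opposite directions, and
# their hidden states are concatenated per token; the two directions
# never see each other while the representations are being built.
import torch
import torch.nn as nn

embed_dim, hidden_dim, seq_len = 16, 32, 5   # toy sizes, chosen arbitrarily

forward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
backward_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

# Stand-in for the embedded tokens of one sentence (batch of 1, 5 tokens).
tokens = torch.randn(1, seq_len, embed_dim)

fwd_out, _ = forward_lstm(tokens)                          # left-to-right pass
bwd_out, _ = backward_lstm(torch.flip(tokens, dims=[1]))   # right-to-left pass
bwd_out = torch.flip(bwd_out, dims=[1])                    # re-align to the original word order

# Each token's representation is just the two directions joined end to end.
contextual = torch.cat([fwd_out, bwd_out], dim=-1)
print(contextual.shape)   # torch.Size([1, 5, 64])
```

Because the two LSTMs never exchange information while reading, each direction’s representation is built in isolation, and that is exactly the shallowness BERT addresses.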

This is where BERT improves things.

Bidirectional Encoder Representations from Transformers (BERT)

BERT also uses bidirectional training but allows for representations to be built from both left-to-right and right-to-left simultaneously. To do so, BERT uses a masking technique to ensure that the word being predicted doesn’t accidentally “see itself” in a multi-layer model.

For example, let’s say I wanted to predict the words “friend” and “together” in the sentences below.

Original: She is meeting a friend. They are going to see a movie together.

Input: She is meeting a [Hide1]. They are going to see a movie [Hide2].

If BERT could simply see the word it is trying to predict, that would defeat the purpose of the prediction. Instead, it uses a mask to hide those words and then conditions on the surrounding words in both directions to predict the masked ones.
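You can try this masked prediction yourself with a pretrained BERT, for example through the Hugging Face transformers library (the library and model name here are my choice, not something from the original post; the pretrained weights download on first run):

```python
# Masked word prediction with a pretrained BERT via Hugging Face transformers.
# BERT's own mask token is "[MASK]"; the article's [Hide1]/[Hide2] play the same role.
from transformers import pipeline

unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Predict the hidden word in each sentence, one mask at a time.
for sentence in [
    "She is meeting a [MASK].",
    "They are going to see a movie [MASK].",
]:
    predictions = unmasker(sentence, top_k=3)
    print(sentence)
    for p in predictions:
        print(f"  {p['token_str']}  (score {p['score']:.3f})")
```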

BERT is the first successful pre-training of a deep neural network using masking, even though the underlying idea, the Cloze task, has been around since 1953!

Today, these pretrained models are open source, cut down the need for time-consuming pretraining from scratch, and allow task-specific customization for various NLP and machine learning applications such as machine translation (think Google Translate), automatic summarization (hello, next generation of SparkNotes!), speech recognition (Alexa and Siri, anyone?), and many more.
