What You Need to Know About Natural Language Processing

Our monthly analysis on machine learning trends

This post was originally sent as our monthly newsletter about trends in machine learning and artificial intelligence.

Not so long ago, it seemed like all the really impressive applications of deep learning were occurring in computer vision. Face recognition. Surgical devices. Self-driving cars. Other applications, like natural language processing (NLP), appeared to be a few steps behind.

This was partly the result of there being more initial interest and research in computer vision and partly a reflection of the complexity of human language. There are somewhere around a million words in English and an infinite number of possible sentences. Meanwhile, if you want to really understand language, you need to be able to make sense of the long-term relationships between words, both within and across those sentences. There’s a reason it took human language so long to evolve. You can imagine the difficulties involved in engineering all of this from scratch.

Over the last couple of years, though, something pretty exciting has been happening in the world of deep learning. NLP has started to catch up to computer vision. Suddenly, we’re approaching human-level accuracy on language tasks that machines could barely handle a decade ago. Grammatical error correction is almost there. Machine translation has begun to match human translators in some respects (and is at least making a go at competing more broadly). Meanwhile, the state of the art for other key NLP tasks such as summarization, question answering, and text classification is continually improving.

How did NLP make these leaps? What are its current frontiers and future possibilities? And how is this all going to affect commercial applications as well as our daily lives? We’ll be addressing each of these questions in this newsletter. After all, as the philosopher Ludwig Wittgenstein said, “The limits of my language mean the limits of my world.”

The Computational Language Instinct

We often take it for granted just how complex human language is. Not only does it incorporate a vast array of hierarchies and dependencies, but it also tends to be full of metaphors, abstractions, and sometimes even outright lies. So what exactly does language look like from the perspective of an algorithm?

For the first half of the past decade, many NLP approaches involved a major simplifying procedure: ignore the fact that language is sequential and just treat the words in a document as a collection without any specific order. We call this the bag-of-words model. While it threw away almost all the complex interactions and dependencies that occur between words, it also had some undeniable benefits. It made computation vastly easier and it was still pretty effective if you wanted to classify documents or cluster them based on the distribution of words.
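To make the idea concrete, here is a minimal sketch of a bag-of-words representation. The helper name `bag_of_words`, the whitespace tokenization, and the two example sentences are all assumptions made for illustration; real pipelines usually tokenize more carefully.

```python
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count words; word order is discarded."""
    return Counter(text.lower().split())


doc_a = "the dog chased the cat"
doc_b = "the cat chased the dog"

# Two sentences with different meanings end up with identical bags of words.
print(bag_of_words(doc_a) == bag_of_words(doc_b))  # True
print(bag_of_words(doc_a)["the"])                  # 2
```

Because only the counts survive, the model cannot tell who chased whom, which is exactly the kind of information this approach throws away.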

Treating language this way has clear limitations, however. For one thing, every word is equally different from (or, conversely, similar to) every other word. We know this isn’t actually how language works. Some words can be substituted for one another in a sentence and others can’t. Consequently, the next big innovation in NLP was context.
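One way to see the "equally different" problem is to give every word its own dimension (a one-hot vector) and measure distances. This is only a sketch: the three-word vocabulary and the helper names `one_hot` and `distance` are made up for illustration.

```python
import numpy as np

# Toy vocabulary, chosen purely for illustration.
vocab = ["cat", "dog", "banana"]


def one_hot(word):
    """Return a vector with a 1 in the word's own dimension and 0 elsewhere."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec


def distance(a, b):
    """Euclidean distance between the one-hot vectors of two words."""
    return float(np.linalg.norm(one_hot(a) - one_hot(b)))


# "cat" is exactly as far from "dog" as it is from "banana".
print(distance("cat", "dog"))
print(distance("cat", "banana"))
```

Both distances come out the same, so nothing in this representation says that "cat" and "dog" have more in common than "cat" and "banana".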

Back in 2013, a team at Google came up with word2vec, a relatively simple and computationally efficient model for adding context to representations of words. Here are the basics: Run a lot of text through a shallow neural network (just one hidden layer) and, for each word in the input, learn to predict the other words nearby. Once the model has learned to do this well, you don’t actually do anything with the predictions themselves. Instead, you take the weights from the hidden layer, where each row of the weight matrix corresponds to a different word in your vocabulary, and use those rows as dense vectors for representing each word. These vectors encode lots of really useful information. Part of this is that similar words (those that show up in similar contexts) end up with comparable vector representations. Plus, these embeddings also end up encoding compelling syntactic and semantic relationships.
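The two halves of that recipe can be sketched in a few lines: building the (word, nearby word) training pairs, and then reading a word's vector out of the hidden-layer weight matrix. This is not the original word2vec code. The function names, the window size, and the randomly initialized matrix `W` are assumptions for illustration, and the training loop that would actually adjust `W` is left out.

```python
import numpy as np


def skipgram_pairs(tokens, window=2):
    """Pair each word with every other word at most `window` positions away."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


tokens = "the quick brown fox".split()
pairs = skipgram_pairs(tokens, window=1)
print(pairs)
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'),
#  ('brown', 'quick'), ('brown', 'fox'), ('fox', 'brown')]

# One row of hidden-layer weights per vocabulary word.
# Random values stand in for weights that training would learn.
vocab = {word: idx for idx, word in enumerate(sorted(set(tokens)))}
embedding_dim = 3
rng = np.random.default_rng(0)
W = rng.normal(size=(len(vocab), embedding_dim))


def embed(word):
    """Look up a word's dense vector: just the matching row of W."""
    return W[vocab[word]]


print(embed("fox").shape)  # (3,)
```

After training, the prediction layer is discarded and only the rows of `W` are kept as the word embeddings.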

word2vec embeddings encode information about syntax and semantics via the distance between points in vector space.
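Cosine similarity is one common way to compare embedding vectors. The sketch below uses hand-made three-dimensional vectors, not real word2vec output, simply to show how the comparison works; the words and numbers are invented for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


# Hand-made toy vectors, NOT learned embeddings.
toy_vectors = {
    "cat": np.array([0.9, 0.8, 0.1]),
    "dog": np.array([0.8, 0.9, 0.2]),
    "car": np.array([0.1, 0.2, 0.9]),
}

sim_cat_dog = cosine_similarity(toy_vectors["cat"], toy_vectors["dog"])
sim_cat_car = cosine_similarity(toy_vectors["cat"], toy_vectors["car"])

# Vectors pointing in similar directions score higher.
print(sim_cat_dog > sim_cat_car)  # True
```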

Getting dense vector representations of words has become an essential first step for most NLP tasks. But it’s just the beginning of what made deep learning on text feasible.

Let’s take a quick look at some other developments:

  • There was the use of recurrent neural networks (RNNs), a deep learning architecture designed specifically to deal with sequential inputs.
  • Next came bidirectional RNNs, which instead of just processing the words in a sentence from left to right, also moved from right to left, allowing later words to help disambiguate the meaning of earlier words and phrases.
  • Then there was the realization that standard RNNs had inherent limitations when it came to dealing with language, since they weren’t very good at “remembering” previous words from earlier in the sentence (or from previous sentences).
  • That led to the development of memory cells such as gated recurrent units (GRUs) and long short-term memory cells (LSTMs). These additions were able to account for the long-term dependencies between words by maintaining and propagating useful information across the different time steps of an RNN, while “forgetting” less useful information (helping to solve what’s called the vanishing gradient problem).

An illustration of a bidirectional LSTM network.
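The developments listed above can be combined into a small sketch of a bidirectional LSTM. This is an illustrative NumPy version under simplifying assumptions, not a production implementation: the weights are random rather than trained, the layer sizes are arbitrary, and names like `lstm_step` and `bidirectional_lstm` are made up for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev; x] to four gate pre-activations."""
    n = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:n])            # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])        # input gate: how much new information to write
    o = sigmoid(z[2 * n:3 * n])    # output gate: how much memory to expose
    g = np.tanh(z[3 * n:4 * n])    # candidate values to write into memory
    c = f * c_prev + i * g         # updated cell state (the "memory")
    h = o * np.tanh(c)             # hidden state passed to the next time step
    return h, c


def run_lstm(inputs, params, hidden_size):
    """Run the LSTM over a sequence and collect the hidden state at each step."""
    W, b = params
    h = np.zeros(hidden_size)
    c = np.zeros(hidden_size)
    outputs = []
    for x in inputs:
        h, c = lstm_step(x, h, c, W, b)
        outputs.append(h)
    return outputs


def bidirectional_lstm(inputs, params_fwd, params_bwd, hidden_size):
    """Read the sequence left to right and right to left, then join the results."""
    fwd = run_lstm(inputs, params_fwd, hidden_size)
    bwd = run_lstm(inputs[::-1], params_bwd, hidden_size)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]


rng = np.random.default_rng(0)
input_size, hidden_size, seq_len = 4, 3, 5


def init_params():
    """Random, untrained weights; a real model would learn these."""
    W = rng.normal(scale=0.1, size=(4 * hidden_size, hidden_size + input_size))
    b = np.zeros(4 * hidden_size)
    return W, b


inputs = [rng.normal(size=input_size) for _ in range(seq_len)]
outputs = bidirectional_lstm(inputs, init_params(), init_params(), hidden_size)
print(len(outputs), outputs[0].shape)  # 5 (6,)
```

Each output vector joins what the forward pass knows about the words so far with what the backward pass knows about the words still to come, which is how later words can help disambiguate earlier ones.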

Finally, and perhaps most exciting of all, attention arrived on the scene. By adding attention, models not only had the ability to remember important information from one time step to another (already accomplished thanks to LSTMs and GRUs), but they could now take the whole output of the RNN (the outputs of all the different time steps combined together) and only focus on those parts that were most relevant to the task at hand. Attention accomplishes this by creating a probability distribution over the words, directing the model to look at the words that are most salient.
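As a rough sketch of that mechanism, the snippet below scores every time step of an RNN's output, turns the scores into a probability distribution with a softmax, and takes the weighted sum. Dot-product scoring against a single `query` vector is an assumption made here for simplicity (other scoring functions exist), and the random arrays stand in for real RNN outputs and learned parameters.

```python
import numpy as np


def softmax(x):
    """Turn arbitrary scores into positive weights that sum to 1."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attend(rnn_outputs, query):
    """rnn_outputs: (seq_len, hidden); query: (hidden,)."""
    scores = rnn_outputs @ query       # one relevance score per time step
    weights = softmax(scores)          # probability distribution over the words
    context = weights @ rnn_outputs    # weighted sum, dominated by salient steps
    return context, weights


rng = np.random.default_rng(0)
rnn_outputs = rng.normal(size=(5, 3))  # stand-in for 5 time steps of RNN output
query = rng.normal(size=3)             # stand-in for a learned query vector

context, weights = attend(rnn_outputs, query)
print(weights.shape, context.shape)    # (5,) (3,)
```

The `weights` vector is the probability distribution over words; the larger a word's weight, the more it contributes to `context`.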

This has turned out to be a really important innovation for pushing NLP to its recent cutting-edge capabilities. We’ll get back to attention in a bit, as well as going into more detail about the research that has grown out of these developments. Before doing that though, it’s worth first taking a step back and clarifying why all of this is important.

So Why Does Any of This Matter?

In the previous section, we showed how NLP moved from processing unordered collections of words to developing really complex models that could account for long-term syntactic and semantic dependencies. It’s easy to underestimate just how enormous a jump this was. So where exactly has all of this gotten us?

For one thing, it helps businesses understand customers better or serve them more seamlessly. Having more powerful models of language means that tasks that computers could barely handle a few years ago are now feasible. You can mine and summarize user comments, measure their sentiment, extract named entities to see what other products and companies are being discussed and how they’re related, translate between different languages, automate question answering with chatbots, and classify documents and responses based on various metrics of similarity and difference. The list goes on.

Most compelling of all, there are now individual models being developed that can generalize to accomplish all of these tasks, potentially a vast reduction in the time and effort previously required when each task needed its own model. Here at integrate.ai, we’re exploring how the architectures used to make sense of language can help us identify actionable signals in all sorts of time series data so enterprises can move to a probabilistic operating model.

The Current State of NLP Research

The two developments that have had the most impact on NLP in the last couple of years are attention and more versatile embeddings. In fact, as we’ll see shortly, the two have recently even started to converge.

If you’re wondering just how powerful attention is, the answer became clear last year when a team at Google released a paper demonstrating a new kind of network that entirely did away with standard RNN and CNN architectures, instead just having attention do all the heavy lifting.

An example of self-attention, in which each word attends to all the other words in the same sentence (darker lines indicate stronger attention connections). From the original Transformer network paper.

Whereas in the past, attention was usually used in combination with an RNN/LSTM architecture, the aptly named Transformer network was able to get state-of-the-art results on machine translation tasks while avoiding the multiple time steps that RNNs require during the encoding phase. Not only did this make the network more computationally efficient to run, but it also allowed for computing on longer spans of words (over long sequences, the effectiveness of LSTMs begins to degrade). Getting these kinds of results with just attention indicated that it was an even more powerful approach than researchers initially realized.
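Here is a minimal sketch of scaled dot-product self-attention, the core operation in the Transformer. It is deliberately stripped down: a single attention head, no positional encodings, no masking, and random matrices standing in for learned projections. The dimensions and variable names are assumptions for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Row-wise softmax: each row becomes a probability distribution."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model). Every word attends to every word in the sentence."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (seq_len, seq_len) pairwise relevance
    weights = softmax(scores, axis=-1)  # row i: how word i spreads its attention
    return weights @ V, weights


rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 5
X = rng.normal(size=(seq_len, d_model))  # stand-in for 4 word embeddings
Wq = rng.normal(size=(d_model, d_k))     # random stand-ins for learned projections
Wk = rng.normal(size=(d_model, d_k))
Wv = rng.normal(size=(d_model, d_k))

output, weights = self_attention(X, Wq, Wk, Wv)
print(output.shape, weights.shape)       # (4, 5) (4, 4)
```

Note that all positions are handled in a few matrix multiplications rather than one time step after another, which is where the efficiency gain over RNNs comes from.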

As we saw earlier with word2vec, creating dense vector representations of words has become an essential building block for most NLP tasks. Along the way, there have also been other effective approaches to computing these embeddings, such as GloVe and FastText. However, an even bigger leap forward in embeddings occurred this year with the release of ELMo (Embeddings from Language Models). Unlike these earlier approaches, ELMo employs a deep neural network (more specifically, a bidirectional LSTM) to create more complex representations of words, ones which encode a lot more information about context, syntax, and semantics, in turn disambiguating different uses of the same word. Just incorporating these embeddings, without changing any other aspects of existing systems, immediately provided a pretty incredible boost to overall performance.

Meanwhile, earlier this month, an even more promising approach to generating embeddings hit the scene: BERT (Bidirectional Encoder Representations from Transformers). BERT (and yes, you are noticing an endearing trend in acronym naming conventions) swapped out ELMo’s LSTM for a Transformer network, bringing us back to that attention and embeddings convergence mentioned earlier.

Using a Transformer to compute embeddings had actually already been implemented by OpenAI a few months ago. However, by changing the training objective along with various other details, BERT has managed to attain some pretty spectacular results, garnering some well-deserved excitement. The downstream applications of these improvements are as broad as the NLP field itself.

Summing Up

The general consensus is that NLP is now on the verge of some very big things. Between attention, extremely versatile embeddings, and more transferable and generalizable models, it’s starting to feel like NLP is no longer playing catch-up with computer vision. With each innovation in the field, the gap between computers and humans is narrowing a bit. The result will almost certainly be a revolution in understanding for companies and customers, as well as for our own abilities to make sense of our increasingly text-based lives.