Make Language Processing Natural Again

Sesame Street and the Transformers are not threats to humanity

Yosef Ardhito
Data Folks Indonesia
8 min read · Oct 30, 2020


Anyone who tries to catch up with the world of Natural Language Processing (NLP) may notice something strange. Researchers in the field casually borrow Sesame Street characters, such as Bert and Elmo, and put them side by side with an image of Optimus Prime. What is happening here?

In NLP, BERT is an abbreviation for Bidirectional Encoder Representations from Transformers. What was that? Do not worry, we will get back to it. And ELMo? Embeddings from Language Models. I agree those names sound like unnatural concepts. To be honest, I am not even sure why NLP researchers use Sesame Street character names. The trick is to ignore the weird naming for a while. You will see that the ideas are actually easy to follow.

Why is it important for more people to understand these NLP concepts? Recently, an NLP model called GPT-3 (finally, not a Sesame Street character name, but still related) has taken the internet by storm due to its astonishing performance. I admit it is convincing (check out this short movie script). Yet many people took it to a whole new level and even started calling it a threat to humanity. Spoiler alert: it is not even close. Making it look magical may be beneficial for marketing purposes, but people tend to overestimate the impact in the long run. When you understand what is really going on, you will find no reason to worry.

Let me take you on a journey through four hand-picked NLP breakthroughs: word embedding, sequence models, attention models, and language modeling. My intention is to explain them as simply as possible, without daunting formulas or alien terminology. I promise.

Word Embedding

Let’s start with what I believe is the most impactful breakthrough in NLP. Word embedding successfully answers one of the toughest questions: how should we represent words in computers? We know computers are good with numbers, so let’s turn words into numbers! But what kind of numbers?

What embedding proposes is to represent words with multiple numbers, each specifying a different characteristic of a concept. Let’s take pink and elephant as examples. Pink can have a high number for (the characteristic of) being a color and a very low number for being an animal, while elephant should be the other way around. Now, if we have the embeddings for other words such as red and panda, computers can compare the embeddings and understand that pink is more similar to red than to panda. Expand the characteristics as much as you want, and you have what we currently call an embedding.

An example of three word embeddings with two numbers each. The numbers are usually between 0 and 1, but not exactly 0 or 1, because of noise and to avoid over-confidence.
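To make this concrete, here is a tiny sketch in Python. The vectors and the two "characteristics" are invented purely for illustration (real embeddings have hundreds of learned dimensions), but the comparison works the same way: closer numbers mean more similar words.

```python
import numpy as np

# Toy two-dimensional embeddings: [how "color-like", how "animal-like"].
# The exact values are invented for illustration only.
embeddings = {
    "pink":     np.array([0.9, 0.1]),
    "red":      np.array([0.8, 0.2]),
    "elephant": np.array([0.1, 0.9]),
    "panda":    np.array([0.2, 0.8]),
}

def similarity(a, b):
    # Cosine similarity: close to 1 means the two words point in the same "direction".
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(similarity(embeddings["pink"], embeddings["red"]))    # high: both are colors
print(similarity(embeddings["pink"], embeddings["panda"]))  # lower: a color vs an animal
```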

Seeing how successful word embedding had become, researchers started to use a similar technique all over the place: sentence embedding, paragraph embedding, topic embedding… basically anything. In the end, the main idea remains the same: when you see the word “embedding”, just think about a bunch of numbers representing a concept.

Sequence Models

Embedding is the de facto solution used by many researchers to represent words, but how about sentences? The first light on answering that came when we realized something about our language: a pink elephant is correct, yet a elephant pink is not. Why? Because in language, sequence matters.

Before explaining further how sequence plays a role, we have to visit the meaning of “model” in the NLP sense. Generally, a model is a simpler version of something else. For NLP, we create models to complete natural language-related tasks, mimicking how humans solve them. For instance, we can build an NLP model to summarize a book or chat with a person. You can think of models as tiny (and much dumber) brains.

Similar to humans, those brains need to learn. In the early years, the common approach was to feed NLP models a simple list of which words exist in the input. Coming back to a pink elephant, the model would just learn that there are three words: a, pink, and elephant. With this approach, the ordering information is lost. As expected, the performance was nowhere near human level.
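For the curious, here is a minimal sketch of that “list of words” approach (often called a bag of words). The sentences are toy examples of mine; notice how the two orderings end up with exactly the same representation, so the order is lost.

```python
from collections import Counter

def bag_of_words(sentence):
    # Keep only which words appear and how often; the order is thrown away.
    return Counter(sentence.lower().split())

print(bag_of_words("a pink elephant"))
print(bag_of_words("a elephant pink"))
print(bag_of_words("a pink elephant") == bag_of_words("a elephant pink"))  # True: the model cannot tell them apart
```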

The big leap happened when researchers started to incorporate sequence information into the model. Rather than just a list of words, the model knows which word is first, which is second, and so forth. As an example with a pink elephant, when the model sees the word elephant, it has learned that the preceding word is pink and the beginning is a. Additionally, sequence models work much better when combined with embedding, like a match made in heaven: we can take the corresponding embedding of each word and feed them sequentially for the model to learn.

Sequence models consume word embeddings one by one from front to back (or sometimes back to front). By doing so, they can learn word positions as additional information.
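Here is a rough sketch, in plain Python, of how a simple sequence model (a recurrent network) consumes the embeddings one at a time. The weights and embeddings are random stand-ins for what a real model would learn; the point is only the front-to-back loop that carries a “memory” forward.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_size, hidden_size = 2, 4

# Random stand-ins for the weights a real model would learn from data.
W_input = rng.normal(size=(hidden_size, embedding_size))
W_hidden = rng.normal(size=(hidden_size, hidden_size))

sentence = ["a", "pink", "elephant"]
embeddings = {word: rng.normal(size=embedding_size) for word in sentence}  # toy embeddings

hidden = np.zeros(hidden_size)  # the model's "memory" of what it has read so far
for word in sentence:           # consume the sentence front to back, one word at a time
    hidden = np.tanh(W_input @ embeddings[word] + W_hidden @ hidden)

print(hidden)  # a summary of the whole sentence, with the order baked in
```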

Attention Models

After torturing sequence models for years with all kinds of text, researchers started to notice that performance gets worse as the text gets longer. Take a look at these two simple sentences:

  1. Please take the root of the number
  2. Please take the root of the tree

In both sentences, the word root is surrounded by similar words, but the most important words are the last ones. With sequence models, we imply that the most important words tend to be close by, which is not always the case. As it turns out, sequence matters because it provides context, but the most important context can be far apart in a long sentence.

To cover longer sentences, we should increase the model’s attention span. Rather than consuming the words one by one in order, the model is given a chunk of text at once, along with the corresponding word positions. Subsequently, we let the model determine which words are important (i.e., which words to pay attention to). As a bonus, computers prefer reading by chunks too, because their tiny brains can read multiple chunks at once. Therefore, the attention mechanism not only brings more context to the model, it also learns faster.

Instead of consuming words one by one, attention models read a chunk of text alongside the word positions (shown as the numbers above the embedding values). The different arrow sizes represent the attention paid to each word. For instance, the thicker arrows for pink and elephant mean they are more relevant than a.
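Below is a bare-bones sketch of the attention idea: every word in the chunk scores every other word, and the scores become weights that say where to look. Real transformers add learned projections, multiple attention heads, and position encodings on top of this, so treat the numbers as illustrative only.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
words = ["a", "pink", "elephant"]
X = rng.normal(size=(len(words), 4))     # toy embeddings, one row per word (the whole chunk at once)

scores = X @ X.T / np.sqrt(X.shape[1])   # how relevant every word is to every other word
weights = softmax(scores)                # each row sums to 1: "how much attention to pay"
context = weights @ X                    # each word becomes a weighted mix of the whole chunk

for word, row in zip(words, weights):
    print(word, np.round(row, 2))        # bigger numbers = thicker arrows in the figure
```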

Currently, attention-based models dominate NLP tasks due to their performance. The most famous implementation of the attention model is the transformer, which consists of attention models with extra steps. Remember what the T in BERT stands for? Yes, Transformers. In other words, BERT is a variant of attention-based models. The difference among various attention-based models lies in what the model does before and after the attention steps. Can you guess what the T in GPT-3 stands for?

Language Modeling

It is tempting to end with attention models, but there is one more breakthrough that is rather underappreciated in NLP. As I mentioned above, a model’s goal is to solve a task. Language Modeling (LM) is a type of task where a model has to predict missing words from a given text. The model learns by comparing its predictions with the correct words and then adjusting accordingly for the future.

Language modeling is the task of predicting the correct word given the surrounding words (context). The model learns by adjusting the correct word’s probability. In this illustration, the model predicts 5% for pink, which is the correct word, so it adjusts to give pink a higher probability the next time it encounters a similar context.
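To see what “adjusting the correct word’s probability” means in practice, here is a tiny sketch. The scores are made up; training nudges them so that the loss below shrinks, which is the same as giving pink a higher probability next time.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

vocab = ["pink", "red", "panda", "tree"]
scores = np.array([0.2, 1.5, 0.3, -0.5])      # made-up model scores for "a ___ elephant"
probs = softmax(scores)
print(dict(zip(vocab, np.round(probs, 2))))   # the correct word, pink, gets a low probability at first

loss = -np.log(probs[vocab.index("pink")])    # large when pink's probability is small
print(loss)                                   # training adjusts the scores so this number shrinks
```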

To learn how to solve the defined task, we have to provide the model with examples. If you want to build a chatbot, you need to collect an abundance of questions and their corresponding answers for the model to learn from; for summarization, long texts and their sample summaries. This is the essence of training a model: the more data you provide, the better its performance tends to get.

LM is special because of the minimal human involvement required for preparing the data. While other tasks need collected question-answer pairs or written summaries, LM needs nothing but the text itself. The model can remove words at random by itself and then try to predict them. Now, imagine all the available text on the internet. That is our LM dataset.
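Here is a quick sketch of how training examples can be made from raw text alone, with no human labeling: hide a few words at random and keep the hidden words as the answers the model should predict. The sentence and the 30% masking rate are arbitrary choices for illustration.

```python
import random

def make_lm_example(text, mask_rate=0.3):
    # Hide a few words at random; the hidden words become the answers to predict.
    words = text.split()
    masked, answers = [], []
    for position, word in enumerate(words):
        if random.random() < mask_rate:
            masked.append("[MASK]")
            answers.append((position, word))
        else:
            masked.append(word)
    return " ".join(masked), answers

text = "the pink elephant walked slowly through the tall grass near the river"
print(make_lm_example(text))   # a different subset of words is hidden on every run
```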

Apparently, when a model has read an abundance of text, it has a pretty good idea of how words relate to each other. The breakthrough came from realizing how we can leverage this knowledge for other NLP tasks. Consequently, we do not need to train a model from scratch every time we have a new task. We can use pretrained models, e.g. BERT or GPT trained for LM, and provide far fewer examples. If you think about it, this approach is similar to how humans work. Once we learn a new language, we can use that knowledge for all kinds of language tasks. We do not need to start all over again for every task.
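As an illustration of reusing a pretrained language model, here is a short sketch assuming the Hugging Face transformers library is installed; the library and model name are my example choices, not something from this article. A BERT model pretrained on LM can fill in a masked word out of the box, before we show it a single task-specific example.

```python
# Assumes the Hugging Face `transformers` library is installed (pip install transformers);
# the library and model name are illustrative choices, not part of the original article.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")  # downloads a BERT pretrained on LM

# BERT has never seen this exact sentence, yet it can guess the hidden word
# because it has read an abundance of text during pretraining.
for prediction in fill("A pink [MASK] is walking through the jungle."):
    print(prediction["token_str"], round(prediction["score"], 3))
```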

Coming back to ELMo, which stands for Embeddings from Language Models, you should have a rough idea of what it is now. As the name states, it is a method for learning word embeddings based on the language modeling task. If two words tend to be interchangeable in multiple contexts, their meanings should be similar. And how do we define similar? That is correct: their embedding numbers should be close to each other. It is not that complicated, is it?

We tend to be afraid of what we do not understand

Ten years ago we were not even sure how to represent a word. Now we have computers answering philosophical questions. Computers have successfully fooled us by writing pieces that are indistinguishable from human writing. After reading the four concepts, you should understand how: by computing probabilities based on surrounding words, learning which ones to pay attention to, and then placing one additional word at a time. Just like humans. We write by picking the best words to capture our intention, one word at a time.

Making NLP sound sophisticated and out of reach is not the way. NLP is far from solved and there is still plenty of work to do. If we can help a wider audience understand, we can get more hands to chip in. Fingers crossed.

Further Reading

If you want to read more and do not mind seeing some Greek letters, check out these amazing blogs with prettier visualizations.

Word Embedding and language modeling: On word embeddings, The state of Transfer Learning in NLP

Sequence Models: Visualizing memorization in RNNs, Understanding LSTM Networks

Attention Models: Understanding BERT

Image sources: the pink elephant, Bert, and Bumblebee.
