What is NLP?
Natural Language Processing (NLP) is a method of extracting meaning from human language. It draws on research from computer science, linguistics, and artificial intelligence, and it often involves performing a task automatically or making a job easier for a human to do.
NLP in real life
You will have already encountered many kinds of NLP in action. Here are a few examples:
- Searching for something online and having Google complete your search query for you
- Voice assistants such as Alexa, Siri, Cortana, and Google Assistant, which let you ask questions and receive an intelligent response
- Spelling and grammar correction
- Translating from one language to another, say English to Spanish, automatically and without the need for human intervention
So, does a computer actually understand language? Well, not really. We need to provide the language to the model in a way that the machine can process it. In practice, that means converting the words into numbers. In machine learning, we often represent this by vectors, so before we look at some examples, let’s do a quick refresher on vectors.
A vector is a quantity that has magnitude and direction, commonly represented by a line segment.
If we make up a two-word phrase “king ring”, we could represent this as a vector with 2 dimensions — one for each word, “king” and “ring”. As there is a count of one for each word, our coordinates are (1,1).
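This count-based representation can be sketched in a few lines of Python (a minimal illustration, not a production vectorizer):

```python
from collections import Counter

def count_vector(phrase, vocabulary):
    """Count how often each vocabulary word appears in the phrase."""
    counts = Counter(phrase.lower().split())
    return [counts[word] for word in vocabulary]

# One dimension per word; "king ring" contains each word once.
print(count_vector("king ring", ["king", "ring"]))  # [1, 1]
```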
Of course, this is an oversimplified example. We now move to multiple dimensions, where the words rather than the phrase, are the vectors. Here are some examples of how to translate words into a numeric format that machines can read.
The first example is called one-hot encoding. Let’s expand on our earlier phrase and change it to a sentence: “The king put on the ring”. We now have the words: “the”, “king”, “put”, “on”, “ring”. The word “the” features twice. We can show this sentence as a matrix where each word position in the sentence is a column and each unique word is a row. Every time a word occurs at a position, we mark that cell with a 1; all other entries are zero. This is illustrated in the diagram below:
The word king would be represented by a vector [0,1,0,0,0,0].
However, this method is inefficient. It creates a very large matrix. Even in this simple example, you can see many more zeros than ones. We call this a sparse vector space, as most of the values are zero, with only a few ones scattered amongst them. Imagine a larger text document with thousands of words; now we’d have a vector space where less than 0.1% of the elements have any value.
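A minimal sketch of this one-hot matrix, following the row/column layout described above (unique words as rows, sentence positions as columns):

```python
def one_hot_matrix(sentence):
    """One row per unique word (in order of first appearance),
    one column per word position in the sentence."""
    words = sentence.lower().split()
    vocab = list(dict.fromkeys(words))  # unique words, order preserved
    matrix = [[1 if w == v else 0 for w in words] for v in vocab]
    return vocab, matrix

vocab, matrix = one_hot_matrix("The king put on the ring")
print(vocab)                         # ['the', 'king', 'put', 'on', 'ring']
print(matrix[vocab.index("king")])   # [0, 1, 0, 0, 0, 0]
```

Note how the row for “king” matches the vector in the text, and how most entries in the matrix are zero.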
Each word gets a unique number
The second method is to assign a unique number to each word. Our example sentence could be represented by: [1,2,3,4,1,5]. This is considered a dense vector, as all the elements have values. It is a more efficient use of resources.
However, this method also has some issues. First, the numbers we have assigned are arbitrary, so they don’t reflect any relationships between the words. Second, when we model the data, the model learns a weight for each word. Because there is no relationship between the similarity of the encodings and the similarity of the words, the learned weights are not meaningful.
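The integer-encoding scheme can be sketched as follows (assigning IDs in order of first appearance, as in the example above):

```python
def integer_encode(sentence):
    """Assign each unique word the next available integer, starting at 1."""
    ids = {}
    encoded = []
    for word in sentence.lower().split():
        if word not in ids:
            ids[word] = len(ids) + 1  # arbitrary ID; carries no meaning
        encoded.append(ids[word])
    return encoded

print(integer_encode("The king put on the ring"))  # [1, 2, 3, 4, 1, 5]
```

Every element holds a value, so the vector is dense, but the IDs encode nothing about the words themselves.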
The third method is called word embeddings. Word embeddings capture information about word meaning and location. They are called embeddings because they map the information in the original text into a few numbers, encoding a large amount of information into something more concise and meaningful. The values in word embedding models are learned during model training, rather than being assigned by programmers.
Word embeddings are a translation of a high-dimensional vector into a low-dimensional space. Each word is represented by a vector with many dimensions. This is significantly fewer dimensions than what we would require for a sparse word representation, where the dimension count could potentially be in the millions.
Higher-dimensional word embeddings capture more detailed relationships between words but require more data to train. Often, we can’t interpret the meaning of each dimension — they are inferred from the data. These dimensions are called latent dimensions. It is the distances between words in the embedding space that are meaningful, rather than numeric values along any given dimension.
Above is a diagram of our example sentence in a 4-dimensional vector space (each column is a dimension). The values themselves aren’t important, but the similarity between words is. In this embedding space, the vectorized representation of the word king would be [1.1, -1.0, -0.8, -0.1].
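Because distances between embedding vectors carry the meaning, similarity is usually measured geometrically, for example with cosine similarity. A small sketch (the vector for “king” is taken from the example above; the other values are made up for illustration, not learned by any model):

```python
import math

# Toy 4-dimensional embeddings; only "king" comes from the text above,
# the rest are invented so that related words point in similar directions.
embeddings = {
    "king":  [1.1, -1.0, -0.8, -0.1],
    "queen": [1.0, -0.9, -0.7,  0.8],
    "ring":  [-0.9, 0.7,  1.2,  0.3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Related words score higher than unrelated ones
print(cosine_similarity(embeddings["king"], embeddings["queen"]))
print(cosine_similarity(embeddings["king"], embeddings["ring"]))
```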
Word Embeddings Visualization
To help illustrate the concept of word embeddings we can use a visual with words as coordinates in 3-dimensions. Distance and direction reveal the meaning and similarity of words.
In the diagram above, the word embeddings show that the words ‘king’ and ‘queen’ have the same relationship as ‘man’ has to ‘woman’, that is, male-female. We also see relationships between verb tenses, and between countries and their capitals. Notice too that Asian countries and European countries are grouped together. What is remarkable about these relationships is that the model generated them solely from the text, not from any programmed instructions like “Ankara is the capital of Turkey”.
Perhaps most fascinating of all is that we can perform word vector arithmetic. Adding the dimensions for “king”, subtracting those for “man”, and adding those for “woman” gets us very close to the word vector for “queen”. Likewise, we can do the same for countries and cities.
king - man + woman = queen
Tokyo - Japan + Germany = Berlin
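A sketch of this arithmetic with toy 2-dimensional vectors, hand-picked so the male/female offset is consistent (in a real model these values would be learned from text, and the vocabulary would be vastly larger):

```python
# Hand-crafted toy vectors: the second dimension encodes "royalty",
# the first encodes "male". Purely illustrative, not learned values.
vectors = {
    "king":  [0.9, 0.9],
    "man":   [0.9, 0.1],
    "woman": [0.1, 0.1],
    "queen": [0.1, 0.9],
    "ring":  [0.5, -0.5],
}

def analogy(a, b, c):
    """Return the vocabulary word closest to a - b + c, excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    def sq_dist(v):
        return sum((x - t) ** 2 for x, t in zip(v, target))
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return min(candidates, key=lambda w: sq_dist(candidates[w]))

print(analogy("king", "man", "woman"))  # queen
```

Real systems compute the same kind of nearest-neighbour search over learned embeddings, usually with cosine similarity instead of Euclidean distance.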
All about the context
How does all this magic happen? The model calculates the meaning of a word from the context of the words surrounding it.
“You shall know a word by the company it keeps.” Firth, J. (1957). Studies in linguistic analysis.
There are two ways to determine meaning from context. The first method is called Continuous Bag-of-Words (CBOW). CBOW looks at the surrounding words and tries to predict our target word. In our sample sentence, it would try to predict “king” from “The ___ put on the ring”.
Skip-gram works in the opposite way to CBOW. It uses the word as an input and tries to predict the words around it.
Both of these methods are used in the word2vec model. According to the authors of word2vec, CBOW is faster, while skip-gram does a better job with infrequent words.
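The difference between the two objectives is easiest to see in the training pairs each one generates. A minimal sketch (window size and pairing scheme simplified; real word2vec adds subsampling, negative sampling, and dynamic windows):

```python
def training_pairs(sentence, window=2):
    """Generate (context, target) pairs for CBOW and
    (target, context) pairs for skip-gram from one sentence."""
    words = sentence.lower().split()
    cbow, skipgram = [], []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        cbow.append((context, target))          # context predicts target
        for c in context:
            skipgram.append((target, c))        # target predicts each context word
    return cbow, skipgram

cbow, skipgram = training_pairs("the king put on the ring")
print(cbow[1])       # (['the', 'put', 'on'], 'king')
print(skipgram[:3])
```

For “king”, CBOW gets one example (predict “king” from its neighbours), while skip-gram gets one example per neighbour, which is part of why it handles rare words better.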
In a future blog, we’ll explore how all these pieces work together in practice.