Word2Vec, or How Computers Learned to Talk Like We Do!
If you have been researching Natural Language Search or Natural Language Processing (NLP), you may have heard of word2vec. If not, we have you covered.
Word2Vec is a machine learning technique that has been around since 2013, courtesy of Tomas Mikolov and his data science team at Google. It’s used to train a computer to learn about your language (vocabulary, expressions, etc.) using a corpus (content library). If you aren’t a data scientist, this article digs into what you need to know to be conversant with the data scientists in your life.
How Words Are Encoded in Your Brain
Word embeddings are the vectors created to represent words, their context, and the relationships between words. Let's first take a step back and consider how we learn as individuals.
Ever have someone come up to you with a question — and you have no idea what they are talking about? Chances are they have not given you enough context to understand what they really need. Maybe they are using different words than what you are used to, or maybe they are using words that have different meanings in a different time and place.
We can try to decode what a person means by asking questions until we finally winnow down what they want.
Understanding natural language is not a trivial matter. As humans, we don't realize how incredibly complicated language is, which is part of why teaching computers to process it is so hard.
Structural Relationships Between Words / Context
Words rarely have meaning in isolation. If I say “leopard print” — it might be referring to the indentation of the animal’s paw on a given surface, or maybe it’s a description of a pattern that replicates a leopard’s skin. Then again, it might be a poster of the magnificent beast itself.
Without any context, it’s hard to know what it means. That’s where word embeddings come in. They show the particular relationship certain words have in relation to others in the same phrase or sentence.
When paired with clothing or accessories — we understand that “leopard print” is describing colors and markings. When paired with sizes like 24 x 36, we understand that it is a poster, and when we use it with words like sand or mud, we know that we are tracking the animal.
Semantic Relationships Between Words
Your brain also understands semantic relationships between words.
For instance, it understands that word pairs like "king" and "queen," "blue" and "yellow," or "running" and "jogging" each have a special relationship. Pairs like "king" and "running," or "queen" and "yellow," don't. Our brain also understands that the semantic relationship between "shoes" and "socks" is different from, say, the relationship between "shoes" and "sandals," or "shoes" and "running."
Your brain knows that the word “queen” has a certain relationship with the word “England” that it doesn’t have with the word “California,” (unless we are talking about Gen Zs in California talking to each other). The word “toast” has a relationship with the word “French” that it doesn’t have with the word “Spanish.”
Now imagine the area of your brain that contains language as a well-organized storage space. Words like “French” and “toast” would be located closer together in that space than the words “toast” and “Spanish.”
Target Words, Word Vectors, Vector Representation
Finally, every word evokes a set of associations that are partially shared by all speakers of the language and partially a result of your personal experiences/geography/etc.
For example, your associations with the word “milk” might be white, 1%, breakfast, cereal, cow, almond, and if you have experience with dairy allergies, danger, ambulance, anaphylaxis.
This ability to draw these associations is the result of complex neurological computations honed over the long course of evolution.
Word Embeddings
Now that you have an idea of the complexity of human language, you know that it's not enough to just feed computers the dictionary definition of words and hope for the best.
Yet NLP had to start somewhere, and this is where it started. Since computers understand only numbers, in order to teach them natural language, we have to represent words numerically. For a long time, this was done by representing each target word as a string of zeros with a single 1 in a position unique to that word. This method is called "one-hot vector encoding."
So if you have a vocabulary of four words, they would be represented like this:
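To make that concrete, here is a minimal sketch in Python with numpy, using the four words from the next paragraph ("dog," "cat," "play," "Rome"). The snippet is purely illustrative; it isn't taken from any particular NLP library.

```python
import numpy as np

# A toy vocabulary of four words.
vocab = ["dog", "cat", "play", "rome"]

# One-hot encoding: every word becomes a vector of zeros with a single 1
# in the position reserved for that word.
one_hot = {word: np.eye(len(vocab), dtype=int)[i] for i, word in enumerate(vocab)}

for word, vector in one_hot.items():
    print(f"{word:>4} -> {vector}")
#  dog -> [1 0 0 0]
#  cat -> [0 1 0 0]
# play -> [0 0 1 0]
# rome -> [0 0 0 1]
```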
This computational linguistics method creates a unique representation for each target word and therefore helps the system easily distinguish “dogs” from “cats” and “play” from “Rome.” But there are two problems with it.
First (you guessed it), this method has no way of encoding the relationships between words that we as humans take for granted. It has no way of knowing that DOG and CAT are similar in a way that DOG and ROME are not, or that CAT and PLAY have a special relationship that CAT and ROME do not.
Second, this vector representation is problematic if you have a very large corpus. What if your vocabulary is 10,000 words instead of four? Each word would need a 10,000-dimensional vector made up almost entirely of zeros. Many machine learning models don't work well with this type of data, and it makes training the model much harder.
Today, both these problems are solved with the help of a modern NLP technique called word embeddings. Word embedding is the process of converting words into dense, much shorter vectors. This process assigns each word a value along several different dimensions, creating a dense vector that isn't just a string of 0s but a set of actual coordinates in an abstract space.
[For the geeks out there: a sparse vector is one where most of the values are zero, as in the one-hot-vector method above. A dense vector is one where most of the values are non-zero.]
As an example, let's see how this method might encode the meaning of a few words, such as "cat," "aardvark," and "duvet."
Each word can be assigned a value between 0 and 1 along several different dimensions, for example, "animal," "fluffiness," "dangerous," and "spooky."
Each semantic feature is a single dimension in an abstract multidimensional semantic space and is represented as its own axis. It is possible to create word vectors using anywhere from 50 to 500 dimensions (or any other number of dimensions really…).
Each word is then given specific coordinates within that space based on its specific values on the features in question. The good news is, this is not a manual job. The computer assigns these coordinates based on how often it “sees” the co-occurrences of words.
For example, the words “cat” and “aardvark” are very close on the “animal” axis, but are far from each other on the scale of fluffiness, and the words “cat” and “duvet” are similar on the scale of fluffiness but not on any other scale.
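Here is a rough sketch of that idea, with hand-picked (and entirely made-up) values along the four dimensions mentioned above. In a real model the dimensions are learned from co-occurrence statistics rather than labeled by a person.

```python
import numpy as np

# Toy dense vectors along four hand-labeled dimensions:
# [animal, fluffiness, dangerous, spooky].
# The values are illustrative guesses, not the output of a real model.
embeddings = {
    "cat":      np.array([0.95, 0.90, 0.30, 0.35]),
    "aardvark": np.array([0.95, 0.10, 0.20, 0.30]),
    "duvet":    np.array([0.05, 0.85, 0.01, 0.02]),
}

def cosine_similarity(a, b):
    """1.0 means the two vectors point in exactly the same direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "cat" and "aardvark" score higher together (both animals) than
# "cat" and "duvet" (they only share fluffiness).
print(cosine_similarity(embeddings["cat"], embeddings["aardvark"]))
print(cosine_similarity(embeddings["cat"], embeddings["duvet"]))
```

Cosine similarity, which measures the angle between two vectors, is a common way to quantify how close two words sit in this space.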
Word embedding algorithms excel at encoding a variety of semantic relationships between words. Synonyms (words that have a similar meaning) will be located very close to each other.
[The counterpart is that antonyms are often also very close in that same space. That's how Word2vec works: words that appear in the same contexts (and antonyms usually do) are mapped to the same area of the space.]
Other semantic relationships between words, for example, hyponymy (a subtype relationship, e.g. “spoon” is a hyponym of “cutlery”) will also be encoded.
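If you want to poke at these neighborhoods yourself, gensim ships with a downloader for several sets of publicly available pretrained vectors. The model below is a small GloVe model (a related embedding technique), chosen purely for its size; it is not the model discussed in this article.

```python
import gensim.downloader as api

# Downloads a small set of pretrained GloVe word vectors on first use.
vectors = api.load("glove-wiki-gigaword-50")

# Nearest neighbors mix synonyms, related terms, hyponyms/hypernyms,
# and often antonyms, because all of them show up in similar contexts.
print(vectors.most_similar("spoon", topn=5))
print(vectors.most_similar("hot", topn=5))  # expect words like "cold" and "warm" near the top
```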
This method also helps establish the relationship between specific target words, for example
- A leopard print was seen in the mud.
- That’s a beautiful leopard print on your coat.
- I bought a leopard print with a black frame.
The system will encode that the target phrase "leopard print" appears within sentences that also contain the words "mud," "coat," and "frame." (This is called a word window, and older models like Word2vec use a small one, typically 3–5 words on either side of the target.) By looking at what comes before and after a target word, the computer learns additional information about each word and can locate it as precisely as possible in the abstract vector space.
So to summarize: under this method, words are analyzed to see how often they appear near each other (co-occurrence). Word embedding algorithms are thus capable of capturing the context of a word in a document, its semantic and syntactic similarity to other words, its relationships with other words, and so on.
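As a sketch of how this looks in practice, the snippet below trains a tiny word2vec model on just the three example sentences using gensim (gensim 4.x API). The parameter values are illustrative, a real model would need far more text, and for simplicity the text is split on whitespace, so "leopard" and "print" are separate tokens rather than a single phrase.

```python
from gensim.models import Word2Vec

sentences = [
    "a leopard print was seen in the mud".split(),
    "that's a beautiful leopard print on your coat".split(),
    "i bought a leopard print with a black frame".split(),
]

model = Word2Vec(
    sentences=sentences,
    vector_size=50,  # number of dimensions in the embedding space
    window=3,        # context words considered on each side of the target
    min_count=1,     # keep every word, even ones seen only once
    sg=1,            # skip-gram: predict context words from the target word
)

print(model.wv["print"])               # the learned 50-dimensional vector for "print"
print(model.wv.most_similar("print"))  # its nearest neighbors in this (tiny) corpus
```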
By assigning similar words similar vector representations, this method is able to encode the meaning of words and cluster them according to categories, such as cities, food, countries, and so on.
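One rough way to see this clustering for yourself is to project a few pretrained vectors down to two dimensions. The sketch below uses gensim's downloader and scikit-learn's PCA; the model name and word list are illustrative choices, not taken from the original figure.

```python
import gensim.downloader as api
from sklearn.decomposition import PCA

vectors = api.load("glove-wiki-gigaword-50")  # small pretrained GloVe vectors

words = ["paris", "london", "rome", "france", "italy", "england",
         "pizza", "sushi", "bread"]

# Squash the 50-dimensional vectors down to 2 coordinates for printing or plotting.
coords = PCA(n_components=2).fit_transform([vectors[w] for w in words])

for word, (x, y) in zip(words, coords):
    print(f"{word:>8}: ({x:6.2f}, {y:6.2f})")
# Cities, countries, and foods each tend to land near their own cluster of points.
```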
Over the next couple of months, we will delve into more machine learning models that come into play with deep learning and NLP.
Originally published at https://www.coveo.com on May 24, 2022.