Word Representations in Natural Language Processing
Why Word Representations
Whilst mastering natural language is easy for humans, it is something that computers have not yet fully achieved. Humans understand language in a variety of ways: for example, by looking a word up in a dictionary, or by associating it with the words around it in the same sentence in a meaningful way. Computers and computer programs, however, are not human-like and need different methods to understand human language. This post discusses several such methods for representing words and sentences in a form computers can work with. Word representation is one of the basic concepts in Natural Language Processing.
Different types of word representations
The following text file is used as our training data, and we will implement the models in Python.
data.txt
Whilst mastering natural language is easy for humans, it is something that computers have not yet been able to achieve. Humans understand language through a variety of ways for example this might be through looking up it in a dictionary, or by associating it with words in the same sentence in a meaningful way. However, computers and computer programs are yet to be fully human like and they require other different methods to understand human language. Therefore, This post will discuss those different methods for representing words and sentences so that the computers can understand the language. Word representations is one of the basic concepts in Natural Language Processing.
Dictionary Lookup
The simplest approach is to assign each unique word an ID; whenever the computer needs a word, it can look up the word : ID mapping in this dictionary.
Code
import pandas as pd
import numpy as np

data = open("data.txt", "r")
words = data.read().split(" ")
uniqueWords = list(set(words))
dataframe = pd.DataFrame({'word': uniqueWords})
dataframe['ID'] = np.arange(len(dataframe))
dictionary = pd.Series(dataframe.ID.values,index=dataframe.word).to_dict()
print(dictionary)
Output
{'computer': 0, 'Whilst': 1, 'to': 2, 'post': 3, 'representing': 4, 'can': 5, 'computers': 6, 'yet': 7, 'this': 8, 'words': 9, 'have': 10, 'by': 11, 'human': 12, 'language': 13, 'sentence': 14, 'ways': 15, 'might': 16, 'fully': 17, 'language.': 18, 'Language': 19, 'require': 20, 'and': 21, 'same': 22, 'of': 23, 'other': 24, 'This': 25, 'mastering': 26, 'that': 27, 'is': 28, 'not': 29, 'something': 30, 'easy': 31, 'it': 32, 'basic': 33, 'humans,': 34, 'meaningful': 35, 'a': 36, 'through': 37, 'example': 38, 'dictionary,': 39, 'Natural': 40, 'programs': 41, 'natural': 42, 'one': 43, 'different': 44, 'they': 45, 'be': 46, 'However,': 47, 'been': 48, 'achieve.': 49, 'the': 50, 'for': 51, 'with': 52, 'are': 53, 'methods': 54, 'variety': 55, 'in': 56, 'like': 57, 'discuss': 58, 'will': 59, 'sentences': 60, 'understand': 61, 'up': 62, 'associating': 63, 'way.': 64, 'looking': 65, 'so': 66, 'Humans': 67, 'or': 68, 'able': 69, 'Processing.': 70, 'concepts': 71}
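With such a mapping in hand, encoding a sentence to IDs and decoding it back is a pair of dictionary lookups. A minimal sketch (the word-to-ID dictionary here is a small hypothetical subset, not the full output above):

```python
# Hypothetical subset of a word -> ID mapping
dictionary = {'natural': 0, 'language': 1, 'is': 2, 'easy': 3}
# Reverse mapping for decoding IDs back to words
reverse = {word_id: word for word, word_id in dictionary.items()}

sentence = "natural language is easy"
encoded = [dictionary[w] for w in sentence.split(" ")]
decoded = " ".join(reverse[i] for i in encoded)
```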
One-Hot Encoding
A one hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. After that, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.
Code
import pandas as pd

data = "Whilst mastering natural language is easy for humans"
words = data.split(" ")
uniqueWords = list(set(words))
pd.set_option('display.max_columns', None) # or 1000
pd.set_option('display.max_rows', None) # or 1000
pd.set_option('display.max_colwidth', None) # -1 is deprecated in newer pandas versions
encoded = pd.get_dummies(uniqueWords)
print(encoded)
Output
Whilst easy for humans is language mastering natural
0 0 0 0 0 1 0 0
0 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0
0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1
0 1 0 0 0 0 0 0
0 0 0 0 0 0 1 0
0 0 1 0 0 0 0 0
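The same encoding can be built directly with NumPy: the one-hot vector for a word with ID i is simply row i of an identity matrix. A minimal sketch, assuming the unique words above:

```python
import numpy as np

words = ["Whilst", "easy", "for", "humans", "is", "language", "mastering", "natural"]
word_to_id = {w: i for i, w in enumerate(words)}

# Row i of the identity matrix is the one-hot vector for the word with ID i
one_hot = np.eye(len(words), dtype=int)

vec = one_hot[word_to_id["language"]]  # 1 at index 5, 0 everywhere else
```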
Even though the above methods are simple to implement, the context of the words is lost. The relations between a word and its surrounding words, phrases and paragraphs are discarded, so the computer can only identify words, never learn their meaning. We therefore need more advanced models that derive relations between a word and its contextual words.
Word2Vec
Word2vec is a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2vec uses two architectures:
CBOW (Continuous Bag of Words)
The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer's size is the number of dimensions in which we want to represent the current word present at the output layer.
Skip gram
Skip-gram predicts the surrounding context words within a specific window given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer's size is the number of dimensions in which we want to represent the current word present at the input layer.
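To make the two architectures concrete, here is a small sketch that generates training pairs from a token list with a symmetric window. This is illustrative preprocessing only, not Gensim's internal code:

```python
def skipgram_pairs(tokens, window=2):
    """For each position, emit (center_word, context_word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_pairs(tokens, window=2):
    """For each position, emit (context_words, center_word) pairs."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, center))
    return pairs

tokens = "natural language is easy".split()
```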
We could implement our own word2vec model from scratch, but since that is a bit complex, we will use Gensim, an open-source library.
Install gensim with pip: pip install gensim. Note that you need a C compiler installed on your computer for gensim to work.
Code
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4, sg=1)  # in gensim >= 4.0, size is renamed to vector_size
model.save("word2vec.model")

data = open("data.txt", "r")
words = data.read().split(" ")
uniqueWords = list(set(words))

model = Word2Vec.load("word2vec.model")
model.train([uniqueWords], total_examples=1, epochs=1)
vector = model.wv['computer'] # numpy vector of a word
print(vector)
In the initialization above, the parameter sg=1 means that skip-gram is used for word2vec; with sg=0, CBOW is used.
Output
[ 4.9868715e-03 -1.8589090e-03 3.0031594e-04 2.9146925e-03
-3.2452017e-03 -7.4311241e-04 -1.9145171e-03 -5.4530974e-04
4.6573239e-03 1.1992530e-04 4.7853105e-03 1.7248350e-03
3.5876739e-03 3.8889768e-03 -5.2998489e-04 -1.4166962e-03
-4.3162632e-05 2.4357813e-03 -3.8080951e-03 3.2026408e-04
4.5342208e-03 2.2210747e-03 -4.1628005e-03 -2.9482227e-04
1.4657559e-03 6.7928270e-04 3.9288746e-03 -6.6122646e-04
2.6685249e-03 4.8840034e-04 1.2085927e-04 3.0190896e-03
-7.6547149e-04 1.5170782e-04 -4.8838048e-03 4.1416250e-03
2.9358426e-03 2.3107675e-03 3.2836150e-03 7.1993755e-04
-4.4702408e-03 4.2963913e-03 2.5023906e-03 1.7557575e-03
-2.6511985e-03 -3.3939728e-03 -2.2241778e-03 -4.5135348e-05
4.9574287e-03 3.7588372e-03 -1.3408092e-03 -4.9382579e-03
4.3825228e-03 -1.6619477e-03 -1.6158121e-03 4.9568298e-03
3.9215768e-03 4.5300648e-03 3.0360357e-03 -4.8058927e-03
4.3477896e-03 -2.0503579e-03 -3.2363960e-03 3.6514697e-03
3.6383464e-03 4.6341033e-03 1.7352304e-03 -1.9575742e-03
-4.8500290e-03 4.5880494e-03 4.2294217e-03 4.8814295e-04
-2.4637496e-03 -1.2094491e-03 -5.1839469e-04 -1.6737969e-03
-1.5651825e-03 3.5457567e-03 -3.4070832e-03 -1.0688258e-03
1.6415080e-03 -4.7911871e-03 -3.2562783e-03 -4.6291049e-03
-4.7947471e-03 3.7898158e-03 1.3356151e-03 -1.7311573e-03
2.5905482e-03 4.4452478e-03 -1.7256130e-03 1.6168016e-03
-3.4941530e-03 3.2339687e-03 -2.1139446e-03 -1.6573383e-03
3.3507459e-03 -3.8317447e-03 1.1735468e-03 2.6007600e-03]
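Once words are vectors, comparing meanings becomes geometry: Gensim exposes this through model.wv.similarity, which is essentially the cosine of the angle between two word vectors. A minimal sketch of that computation with NumPy (the three toy vectors are made up for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b, in [-1, 1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings
king = np.array([0.5, 0.8, 0.1])
queen = np.array([0.45, 0.82, 0.15])
car = np.array([-0.7, 0.1, 0.6])
```
A word is always maximally similar to itself, and related words should score higher than unrelated ones.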
Problems with word2vec
In the above implementation, word2vec treats each word in a corpus as an atomic entity and generates a vector per word. Therefore, if your model hasn't encountered a word before, it has no idea how to interpret it or how to build a vector for it, and you are forced to use a random vector, which is far from ideal. This is particularly an issue in domains like Twitter, where you have a lot of noisy and sparse data, with words that may have been used only once or twice in a very large corpus. FastText, an extension of the word2vec model, solves this problem.
Word Vectorization with fastText
fastText is a library for learning word representations and text classification, created by Facebook's AI Research (FAIR) lab. fastText makes pretrained models available for more than 157 languages. fastText treats each word as being composed of character n-grams, so the vector for a word is the sum of the vectors of its character n-grams. For example, the vector for the word "apple" is the sum of the vectors of the n-grams "<ap", "app", "appl", "apple", "apple>", "ppl", "pple", "pple>", "ple", "ple>", "le>".
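The boundary-marked n-grams can be reproduced in a few lines of Python. The helper below is an illustrative sketch of the idea; fastText's own defaults and edge cases may differ:

```python
def char_ngrams(word, n_min=3, n_max=5):
    """Character n-grams of a word with boundary markers, plus the full word."""
    marked = "<" + word + ">"
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(marked) - n + 1):
            grams.append(marked[i:i + n])
    grams.append(marked)  # the whole marked word is kept as its own feature
    return grams
```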
We shall try a Python library to play around with fastText. Install fasttext with pip; as a prerequisite you will have to install cython as well.
pip install cython
pip install fasttext
Code
import fasttext

model = fasttext.skipgram('data.txt', 'model')  # newer versions of the package use fasttext.train_unsupervised('data.txt', model='skipgram')
print(model.words)
We can train a fastText model with the skip-gram or CBOW algorithm as above. This will generate two files, model.vec and model.bin. The file model.vec contains the words in our dataset and their vector representations, whereas model.bin contains all the n-grams in our dataset and their vectors.
Output
{'However,', 'one', 'This', 'concepts', 'the', '', 'Language', 'Natural', 'humans,', 'so', 'language', 'in', 'words', 'can', 'understand', 'example', 'not', 'they', 'it', 'through', 'something', 'meaningful', 'and', 'basic', 'will', 'or', 'programs', 'achieve.', 'looking', 'representing', 'dictionary,', 'for', 'sentences', 'easy', 'Whilst', 'computers', 'discuss', 'be', 'sentence', 'is', 'a', 'fully', 'other', 'human', 'post', 'been', 'able', 'Humans', 'of', 'require', 'to', 'might', 'are', 'way.', 'variety', 'associating', 'this', 'same', 'like', 'by', 'methods', 'computer', 'natural', 'mastering', 'have', 'that', 'ways', 'up', 'yet', 'Processing.', 'language.', 'different', 'with'}
You can print the vector for a given word as below.
print(model['king'])
Output
[0.0027474919334053993, 0.0005356628098525107, 0.0018502300372347236, 0.0019693425856530666, 0.0016810859087854624, 5.2087707445025444e-05, 0.0018433697987347841, 0.0016153681790456176, -0.002230857964605093, -0.0011919416720047593, -0.0005365013494156301, -0.001287790248170495, -0.0005530542111955583, -0.002137718955054879, -0.0026757328305393457, -4.165512655163184e-05, 0.00331459054723382, -0.0012807429302483797, 0.0016897692112252116, -0.0004742142336908728, -0.00032369382097385824, -0.0037999653723090887, 0.00035349707468412817, -0.0005173433455638587, -0.0028595952317118645, 0.001419696374796331, 0.0019000013126060367, -0.0010566430864855647, 0.0015126612270250916, 0.005284277256578207, -0.0021161744371056557, 0.003028977895155549, 0.0022042596247047186, -0.0009013907983899117, 0.00024343356199096888, 0.0022169938310980797, 0.0015560443280264735, -0.0009531681425869465, 0.0005139008280821145, -0.0023698394652456045, 0.0008563402225263417, 0.0025476037990301847, 0.0008231972460635006, 0.0013018669560551643, 0.00041914713801816106, -0.0019356505945324898, 0.0008381576626561582, 0.0024166000075638294, 0.0023253299295902252, 0.0017737143207341433, 0.002373612718656659, -5.2668156058643945e-06, 0.0016419965540990233, -0.0008965937304310501, 0.002588749397546053, 0.00048569004866294563, 0.0009559484315104783, -0.003205464454367757, -0.0013440767070278525, 0.0014162956504151225, -0.0007057305774651468, -0.0017468031728640199, 0.0016367752104997635, -0.001270016306079924, 0.0023948214948177338, -0.0028532990254461765, -0.0016449828399345279, 0.0013536224141716957, 0.0036318846978247166, -0.0023201259318739176, 3.820220081252046e-05, 0.0003642759402282536, -0.0035634085070341825, -0.002077018143609166, 0.0030095563270151615, -0.000969761167652905, -0.0006986369844526052, -0.00021727499552071095, 2.108465378114488e-05, 0.001741308020427823, 0.0022944060619920492, -0.0012303885305300355, -0.003918013535439968, 0.0012680593645200133, -0.0021364684216678143, 
0.001119954395107925, 2.959575613203924e-05, -0.0017336745513603091, -0.0016722858417779207, -0.0013483710354194045, 0.0004776633868459612, 0.0016805606428533792, -0.00017760173068381846, -0.0007585645071230829, -0.002412130357697606, 0.0005328738479875028, 0.0016983768437057734, -0.0001990617747651413, -0.0016818158328533173, -0.0009510386153124273]
This vector represents the meaning of the word. We did not have the word "king" in our dataset, yet we could still generate a vector for it. This is where fastText stands out from other algorithms: for an unseen word, fastText can generate the vector from its n-grams. As mentioned above, the model.bin file generated during training contains the vectors for the n-grams, and the vector for an unseen word is computed as the average of all the n-gram vectors of that word (mostly starting from 3-grams). Further, we can calculate the vector for a sentence as the average of all the word vectors in the sentence.
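The averaging scheme for sentences can be sketched with NumPy; word_vectors below is a hypothetical lookup standing in for model[word]:

```python
import numpy as np

# Hypothetical 3-dimensional vectors standing in for fastText output
word_vectors = {
    "language": np.array([0.2, 0.4, 0.6]),
    "is": np.array([0.0, 0.2, 0.4]),
    "fun": np.array([0.4, 0.6, 0.8]),
}

def sentence_vector(sentence):
    # Average the vectors of the words in the sentence
    vectors = [word_vectors[w] for w in sentence.split(" ")]
    return np.mean(vectors, axis=0)

vec = sentence_vector("language is fun")
```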
Since our dataset is very small, the models will not be accurate and the vectors may not represent the meaning of the given word. fastText provides pretrained vectors for more than 157 languages. These models have been trained on huge text corpora, so the generated vector representations can be very accurate. You can download the pretrained vectors from the fastText website and then use fasttext.load_model to load the pretrained model.
model = fasttext.load_model('model.bin')
print(model.words)   # list of words in dictionary
print(model['king']) # get the vector of the word 'king'