Word Representations in Natural Language Processing

Kabilesh Kumararatnam
Published in Tech-Sauce · 8 min read · Jun 21, 2019

Why Word Representations

Whilst mastering natural language is easy for humans, it is something that computers have not yet been able to achieve. Humans understand language in a variety of ways: for example, by looking a word up in a dictionary, or by associating it with the other words in the same sentence in a meaningful way. Computers and computer programs, however, are not yet fully human-like, and they require different methods to understand human language. This post will therefore discuss those different methods for representing words and sentences so that computers can understand language. Word representations are one of the basic concepts in Natural Language Processing.

Different types of word representations

The following text file is used as our data for training and we will implement the models using Python.

data.txt

Whilst mastering natural language is easy for humans, it is something that computers have not yet been able to achieve. Humans understand language through a variety of ways for example this might be through looking up it in a dictionary, or by associating it with words in the same sentence in a meaningful way. However, computers and computer programs are yet to be fully human like and they require other different methods to understand human language. Therefore, This post will discuss those different methods for representing words and sentences so that the computers can understand the language. Word representations is one of the basic concepts in Natural Language Processing.

Dictionary Lookup

The simplest approach is to assign each unique word an ID; whenever the computer needs it, it can go through this dictionary and look up the word-to-ID mapping.

Code

import pandas as pd
import numpy as np

# Read the training text and split it into words.
with open("data.txt", "r") as data:
    words = data.read().split(" ")
uniqueWords = list(set(words))

# Assign each unique word an integer ID and build a word -> ID dictionary.
dataframe = pd.DataFrame({'word': uniqueWords})
dataframe['ID'] = np.arange(len(dataframe))
dictionary = pd.Series(dataframe.ID.values, index=dataframe.word).to_dict()
print(dictionary)

Output

{'computer': 0, 'Whilst': 1, 'to': 2, 'post': 3, 'representing': 4, 'can': 5, 'computers': 6, 'yet': 7, 'this': 8, 'words': 9, 'have': 10, 'by': 11, 'human': 12, 'language': 13, 'sentence': 14, 'ways': 15, 'might': 16, 'fully': 17, 'language.': 18, 'Language': 19, 'require': 20, 'and': 21, 'same': 22, 'of': 23, 'other': 24, 'This': 25, 'mastering': 26, 'that': 27, 'is': 28, 'not': 29, 'something': 30, 'easy': 31, 'it': 32, 'basic': 33, 'humans,': 34, 'meaningful': 35, 'a': 36, 'through': 37, 'example': 38, 'dictionary,': 39, 'Natural': 40, 'programs': 41, 'natural': 42, 'one': 43, 'different': 44, 'they': 45, 'be': 46, 'However,': 47, 'been': 48, 'achieve.': 49, 'the': 50, 'for': 51, 'with': 52, 'are': 53, 'methods': 54, 'variety': 55, 'in': 56, 'like': 57, 'discuss': 58, 'will': 59, 'sentences': 60, 'understand': 61, 'up': 62, 'associating': 63, 'way.': 64, 'looking': 65, 'so': 66, 'Humans': 67, 'or': 68, 'able': 69, 'Processing.': 70, 'concepts': 71}
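With the dictionary built above, looking a word up is a plain key access (the exact IDs depend on the set ordering of the run):

print(dictionary['computer'])   # 0 in the run above
print(dictionary['language'])   # 13 in the run above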

One-Hot Encoding

A one-hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. Each integer value is then represented as a binary vector that is all zeros except at the index of the integer, which is marked with a 1.

Code

import pandas as pd

data = "Whilst mastering natural language is easy for humans"
words = data.split(" ")
uniqueWords = list(set(words))

# Show all rows and columns when printing the dataframe.
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)  # use -1 in older pandas versions

encoded = pd.get_dummies(uniqueWords)  # one binary column per unique word
print(encoded)

Output

   Whilst  easy  for  humans  is  language  mastering  natural
0       0     0    0       0   0         1          0        0
1       0     0    0       1   0         0          0        0
2       1     0    0       0   0         0          0        0
3       0     0    0       0   1         0          0        0
4       0     0    0       0   0         0          0        1
5       0     1    0       0   0         0          0        0
6       0     0    0       0   0         0          1        0
7       0     0    1       0   0         0          0        0
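The same encoding can also be built manually by following the two steps described above: map each unique word to an integer, then build a vector of zeros with a single 1 at that integer's index. A minimal sketch with numpy (the words are sorted only to make the ordering deterministic):

import numpy as np

data = "Whilst mastering natural language is easy for humans"
uniqueWords = sorted(set(data.split(" ")))
wordToInt = {word: i for i, word in enumerate(uniqueWords)}   # step 1: word -> integer
identity = np.eye(len(uniqueWords), dtype=int)                # step 2: one one-hot row per integer
print(wordToInt["natural"], identity[wordToInt["natural"]])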

Even though the above methods are simple to implement, the context of the words is lost. That is, the relations between a word and its surrounding words, phrases and paragraphs are discarded, so the computer can only identify words rather than learn their meanings. We therefore need more advanced models that derive relations between a word and its contextual words.

Word2Vec

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words. Word2vec utilizes two architectures:

CBOW (Continuous Bag of Words)

The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The size of the hidden layer is the number of dimensions in which we want to represent the current word present at the output layer.

Center word and Context words
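As a rough sketch of the CBOW idea (not Gensim's actual implementation), the embeddings of the context words are averaged in the hidden layer and projected back onto the vocabulary to score the current word. The sizes, weights and word IDs below are illustrative only:

import numpy as np

vocab_size, embedding_dim = 72, 100                     # illustrative sizes only
W_in = np.random.rand(vocab_size, embedding_dim)        # input (embedding) weights
W_out = np.random.rand(embedding_dim, vocab_size)       # output weights
context_ids = [5, 9, 14, 27]                            # IDs of the context words in the window
hidden = W_in[context_ids].mean(axis=0)                 # hidden layer: average of context embeddings
scores = hidden @ W_out                                 # one score per vocabulary word
probabilities = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
print(probabilities.argmax())                           # index of the predicted current word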

Skip gram

Skip-gram predicts the surrounding context words within a specific window given the current word. The input layer contains the current word and the output layer contains the context words. The size of the hidden layer is the number of dimensions in which we want to represent the current word present at the input layer.

CBOW and Skip-gram models
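Conversely, skip-gram turns each sentence into (current word, context word) training pairs. A small sketch of how such pairs can be generated for a window size of 2 (a hypothetical helper for illustration, not part of Gensim):

def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))   # (current word, context word)
    return pairs

print(skipgram_pairs("mastering natural language is easy".split()))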

We can implement our own word2vec model like the one here, but since that would be a bit complex, we will use Gensim, an open-source library.

Install Gensim with pip: pip install gensim. You need to have a C compiler installed on your computer for Gensim's fast (Cython-optimised) training to work.

Code

from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

# Train a skip-gram model (sg=1) on Gensim's small built-in corpus.
# Note: in Gensim 4+ the 'size' parameter is called 'vector_size'.
path = get_tmpfile("word2vec.model")
model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4, sg=1)
model.save("word2vec.model")

# Continue training with the words from our own data file.
with open("data.txt", "r") as data:
    words = data.read().split(" ")
uniqueWords = list(set(words))

model = Word2Vec.load("word2vec.model")
model.train([uniqueWords], total_examples=1, epochs=1)
vector = model.wv['computer']  # numpy vector of a word
print(vector)

In the above code, the initialization parameter sg=1 denotes that skip-gram is used for word2vec; if sg=0, CBOW is used.

Output

[ 4.9868715e-03 -1.8589090e-03  3.0031594e-04  2.9146925e-03
-3.2452017e-03 -7.4311241e-04 -1.9145171e-03 -5.4530974e-04
4.6573239e-03 1.1992530e-04 4.7853105e-03 1.7248350e-03
3.5876739e-03 3.8889768e-03 -5.2998489e-04 -1.4166962e-03
-4.3162632e-05 2.4357813e-03 -3.8080951e-03 3.2026408e-04
4.5342208e-03 2.2210747e-03 -4.1628005e-03 -2.9482227e-04
1.4657559e-03 6.7928270e-04 3.9288746e-03 -6.6122646e-04
2.6685249e-03 4.8840034e-04 1.2085927e-04 3.0190896e-03
-7.6547149e-04 1.5170782e-04 -4.8838048e-03 4.1416250e-03
2.9358426e-03 2.3107675e-03 3.2836150e-03 7.1993755e-04
-4.4702408e-03 4.2963913e-03 2.5023906e-03 1.7557575e-03
-2.6511985e-03 -3.3939728e-03 -2.2241778e-03 -4.5135348e-05
4.9574287e-03 3.7588372e-03 -1.3408092e-03 -4.9382579e-03
4.3825228e-03 -1.6619477e-03 -1.6158121e-03 4.9568298e-03
3.9215768e-03 4.5300648e-03 3.0360357e-03 -4.8058927e-03
4.3477896e-03 -2.0503579e-03 -3.2363960e-03 3.6514697e-03
3.6383464e-03 4.6341033e-03 1.7352304e-03 -1.9575742e-03
-4.8500290e-03 4.5880494e-03 4.2294217e-03 4.8814295e-04
-2.4637496e-03 -1.2094491e-03 -5.1839469e-04 -1.6737969e-03
-1.5651825e-03 3.5457567e-03 -3.4070832e-03 -1.0688258e-03
1.6415080e-03 -4.7911871e-03 -3.2562783e-03 -4.6291049e-03
-4.7947471e-03 3.7898158e-03 1.3356151e-03 -1.7311573e-03
2.5905482e-03 4.4452478e-03 -1.7256130e-03 1.6168016e-03
-3.4941530e-03 3.2339687e-03 -2.1139446e-03 -1.6573383e-03
3.3507459e-03 -3.8317447e-03 1.1735468e-03 2.6007600e-03]
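For comparison, an otherwise identical call with sg=0 trains a CBOW model instead (again on Gensim's small built-in corpus):

cbow_model = Word2Vec(common_texts, size=100, window=5, min_count=1, workers=4, sg=0)
print(cbow_model.wv['computer'])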

Problems with word2vec

In the above implementation, word2vec treats each word in the corpus as an atomic entity and generates a vector for each word. Therefore, if your model has not encountered a word before, it will have no idea how to interpret it or how to build a vector for it. You are then forced to use a random vector, which is far from ideal. This can be a particular issue in domains like Twitter, where you have a lot of noisy and sparse data, with words that may only have been used once or twice in a very large corpus. fastText, an extension of the word2vec model, addresses this problem.
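To see this concretely, continuing from the Gensim model trained above and using a word that never appeared in its training data (the word choice is just an example):

try:
    vector = model.wv['smartphone']   # 'smartphone' was never seen during training
except KeyError as error:
    print(error)                      # Gensim raises KeyError for out-of-vocabulary words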

Word Vectorization with fastText

fastText is a library for learning word representations and text classification, created by Facebook's AI Research (FAIR) lab. fastText makes available pretrained models for more than 157 languages. fastText treats each word as being composed of character n-grams, so the vector for a word is the sum of the vectors of these character n-grams. For example, the word vector for “apple” is a sum of the vectors of the n-grams “<ap”, “app”, “appl”, “apple”, “apple>”, “ppl”, “pple”, “pple>”, “ple”, “ple>”, “le>”.
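A sketch of how such character n-grams can be enumerated in Python (fastText adds the boundary markers < and > and, by default, uses n-gram lengths 3 to 6; this is only an illustration, not the library's internal code):

def char_ngrams(word, min_n=3, max_n=6):
    token = "<" + word + ">"                       # add boundary markers
    ngrams = set()
    for n in range(min_n, max_n + 1):
        for i in range(len(token) - n + 1):
            ngrams.add(token[i:i + n])
    return ngrams

print(char_ngrams("apple"))   # includes '<ap', 'app', 'appl', 'apple', 'pple>', 'le>', ...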

We shall try a Python library to play around with fastText. Install fasttext with pip. As a prerequisite, you will have to install Cython as well.

pip install cython
pip install fasttext

Code

import fasttext
model = fasttext.skipgram('data.txt', 'model')  # train a skip-gram model on data.txt
print(model.words)

We can train a fastText model with the skipgram or cbow algorithm as above. This will generate two files, model.vec and model.bin. The file model.vec contains the words in our dataset and their vector representations, whereas model.bin contains all the n-grams in our dataset and their vectors.
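model.vec is a plain text file: the first line gives the vocabulary size and vector dimension, and each following line holds a word and its vector components, so it can be inspected with ordinary Python. A minimal sketch, assuming the training above produced model.vec:

with open("model.vec", "r", encoding="utf-8") as vecFile:
    vocabCount, dimension = vecFile.readline().split()   # header: vocabulary size and vector dimension
    for line in vecFile:
        parts = line.split()
        word, vector = parts[0], [float(x) for x in parts[1:]]
        print(word, vector[:3], "...")                   # show the first few components per word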

Output

{'However,', 'one', 'This', 'concepts', 'the', '', 'Language', 'Natural', 'humans,', 'so', 'language', 'in', 'words', 'can', 'understand', 'example', 'not', 'they', 'it', 'through', 'something', 'meaningful', 'and', 'basic', 'will', 'or', 'programs', 'achieve.', 'looking', 'representing', 'dictionary,', 'for', 'sentences', 'easy', 'Whilst', 'computers', 'discuss', 'be', 'sentence', 'is', 'a', 'fully', 'other', 'human', 'post', 'been', 'able', 'Humans', 'of', 'require', 'to', 'might', 'are', 'way.', 'variety', 'associating', 'this', 'same', 'like', 'by', 'methods', 'computer', 'natural', 'mastering', 'have', 'that', 'ways', 'up', 'yet', 'Processing.', 'language.', 'different', 'with'}

You can print the vector for a given word as below.

print(model['king'])

Output

[0.0027474919334053993, 0.0005356628098525107, 0.0018502300372347236, 0.0019693425856530666, 0.0016810859087854624, 5.2087707445025444e-05, 0.0018433697987347841, 0.0016153681790456176, -0.002230857964605093, -0.0011919416720047593, -0.0005365013494156301, -0.001287790248170495, -0.0005530542111955583, -0.002137718955054879, -0.0026757328305393457, -4.165512655163184e-05, 0.00331459054723382, -0.0012807429302483797, 0.0016897692112252116, -0.0004742142336908728, -0.00032369382097385824, -0.0037999653723090887, 0.00035349707468412817, -0.0005173433455638587, -0.0028595952317118645, 0.001419696374796331, 0.0019000013126060367, -0.0010566430864855647, 0.0015126612270250916, 0.005284277256578207, -0.0021161744371056557, 0.003028977895155549, 0.0022042596247047186, -0.0009013907983899117, 0.00024343356199096888, 0.0022169938310980797, 0.0015560443280264735, -0.0009531681425869465, 0.0005139008280821145, -0.0023698394652456045, 0.0008563402225263417, 0.0025476037990301847, 0.0008231972460635006, 0.0013018669560551643, 0.00041914713801816106, -0.0019356505945324898, 0.0008381576626561582, 0.0024166000075638294, 0.0023253299295902252, 0.0017737143207341433, 0.002373612718656659, -5.2668156058643945e-06, 0.0016419965540990233, -0.0008965937304310501, 0.002588749397546053, 0.00048569004866294563, 0.0009559484315104783, -0.003205464454367757, -0.0013440767070278525, 0.0014162956504151225, -0.0007057305774651468, -0.0017468031728640199, 0.0016367752104997635, -0.001270016306079924, 0.0023948214948177338, -0.0028532990254461765, -0.0016449828399345279, 0.0013536224141716957, 0.0036318846978247166, -0.0023201259318739176, 3.820220081252046e-05, 0.0003642759402282536, -0.0035634085070341825, -0.002077018143609166, 0.0030095563270151615, -0.000969761167652905, -0.0006986369844526052, -0.00021727499552071095, 2.108465378114488e-05, 0.001741308020427823, 0.0022944060619920492, -0.0012303885305300355, -0.003918013535439968, 0.0012680593645200133, -0.0021364684216678143, 0.001119954395107925, 2.959575613203924e-05, -0.0017336745513603091, -0.0016722858417779207, -0.0013483710354194045, 0.0004776633868459612, 0.0016805606428533792, -0.00017760173068381846, -0.0007585645071230829, -0.002412130357697606, 0.0005328738479875028, 0.0016983768437057734, -0.0001990617747651413, -0.0016818158328533173, -0.0009510386153124273]

This vector represents the meaning of the word. We did not have the word “king” in our dataset; however, we could still generate a vector for it. This is where fastText stands out from other algorithms: for an unseen word, fastText can generate a vector from its n-grams. As mentioned above, the model.bin file that we generated through training contains the vectors for the n-grams. The vector for an unseen word is obtained by averaging all the n-gram vectors of that word (mostly starting from 3-grams). Further, we can calculate the vector for a sentence by averaging all the word vectors in the sentence.
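A minimal sketch of that sentence-level averaging, assuming the fastText model loaded above and using numpy (the function name is just for illustration):

import numpy as np

def sentence_vector(model, sentence):
    # Average the word vectors of all words in the sentence.
    vectors = [model[word] for word in sentence.split()]
    return np.mean(vectors, axis=0)

print(sentence_vector(model, "computers understand language"))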

Since our dataset was very small, the models will not be accurate and the vectors may not represent the meaning of a given word well. fastText provides pre-trained vectors for more than 157 languages. These models have been trained on huge text corpora, and the generated vector representations can be very accurate. You can download the pre-trained vectors from here. Then you can use fasttext.load_model to load the pre-trained model.

model = fasttext.load_model('model.bin')
print(model.words)    # list of words in dictionary
print(model['king'])  # get the vector of the word 'king'
