One Hot encoding of text data in Natural Language Processing.
One of the most interesting applications of Machine Learning and Deep Learning is found in the field of Natural Language Processing (NLP). Many NLP tasks involve working with texts and sentences, which are understood as sequences of words. But the neural networks at the heart of such models require their input as tensors or vectors whose elements are numbers. So how is data present in the form of text fed as input to such a neural network model? One method that enables us to do this, discussed below, is called one hot encoding.
In one hot encoding, every word (and even every symbol) that is part of the given text data is written as a vector consisting only of 1s and 0s. So a one hot vector is a vector whose elements are only 1 and 0. Each word is encoded as its own unique one hot vector, which allows a word to be identified by its vector and vice versa: no two words have the same one hot representation. For example, the sketch below shows a one hot encoding of the words in a short sentence.
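A minimal sketch (the sentence here is an assumed example, chosen so that both ‘The’ and ‘the’ appear):

```python
# One hot vectors for the words of a short sentence.
words = ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']

for position, word in enumerate(words):
    vector = [0] * len(words)
    vector[position] = 1          # a single 1 at the word's own position
    print(f"{word:>3} -> {vector}")

# The -> [1, 0, 0, 0, 0, 0, 0]
# cat -> [0, 1, 0, 0, 0, 0, 0]
# sat -> [0, 0, 1, 0, 0, 0, 0]
#  on -> [0, 0, 0, 1, 0, 0, 0]
# the -> [0, 0, 0, 0, 1, 0, 0]
# mat -> [0, 0, 0, 0, 0, 1, 0]
#   . -> [0, 0, 0, 0, 0, 0, 1]
```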
Notice that in the sketch above the words ‘The’ and ‘the’ have different encodings, implying they are treated as different words. Thus we represent every word and symbol in the text data as a unique one hot vector whose elements are numerical (1 and 0). Since one word is represented as a vector, the list of words in a sentence can be represented as an array of vectors, that is, a matrix; and if we have a list of sentences whose words are one hot encoded, the result is an array whose elements are matrices. So we end up with a three-dimensional tensor which can be fed to the neural network.
Now let’s check its implementation in Python from scratch, using the numpy library.
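The code below is a sketch of that implementation. The exact sample sentences are not fixed by the text, so the ones used here are assumptions, chosen to match the counts discussed later: 11 words in total, only 10 of them unique (since ‘has’ repeats), and ‘Neptune’ receiving index 7.

```python
import numpy as np

# Assumed sample sentences (chosen so that 'has' repeats and
# 'Neptune' is the first word of the second sentence).
samples = ['The solar system has 8 planets', 'Neptune has 14 moons !']

token_index = {}   # will map each unique word (key) to an index (value)
counter = 0        # counts the key-value pairs added so far

for sample in samples:
    for current_word in sample.split():        # split the sentence into words
        if current_word not in token_index:
            token_index[current_word] = counter + 1   # indices start at 1
            counter += 1

print(token_index)
# {'The': 1, 'solar': 2, 'system': 3, 'has': 4, '8': 5,
#  'planets': 6, 'Neptune': 7, '14': 8, 'moons': 9, '!': 10}
```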
We begin with our set of samples consisting of two sentences. Next we create an empty Python dictionary for storing our words (keys) and their corresponding indices (values), followed by a counter set to 0 to count the number of key-value pairs in the dictionary. The first for loop iterates over the sentences, while the nested for loop iterates over each word of the selected sentence, which split() returns as a list of strings. Then, if the value of the current_word variable is not already in the token_index dictionary, we add it and assign it an index equal to the value of the counter variable plus one, so that our indices start from 1 instead of 0, and we increment the counter by 1.
Thus we get a dictionary like the one printed above. Notice that there are 11 words in samples but only 10 indices, because ‘has’ is repeated, and that even symbols and numbers (which are represented as strings) have an index assigned to them, since they too are text characters separated by spaces. The maximum index in the dictionary is therefore 10.
Next, we create a tensor consisting of 0s as its elements.
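Continuing the sketch (max_length is the number of words in the longest sentence, which is 6 for the assumed samples):

```python
# One matrix per sentence, one row per word position, one column per
# possible index (plus one, because the indices start at 1, not 0).
max_length = max(len(sample.split()) for sample in samples)   # 6 here

results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
print(results.shape)   # (2, 6, 11)
```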
The result is a three-dimensional tensor which has two matrices (the number of elements in ‘samples’) as its elements. Each of those matrices has max_length rows (here 6) and max(token_index.values()) + 1 columns (here 10 + 1 = 11). Thus we now have a tensor of shape (2, 6, 11).
Now we create the one hot representation by filling in the previous tensor of zeros.
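A sketch of the filling loop, continuing with the same assumed samples:

```python
for i, sample in enumerate(samples):         # i: which sentence (which matrix)
    # enumerate over the words, keeping at most max_length of them
    for j, current_word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(current_word)   # the word's index, e.g. 7 for 'Neptune'
        results[i, j, index] = 1.               # mark that word's column in row j
```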
In the first for loop, we consider each sentence along with its index in the enumerate object returned by enumerate(samples).
The enumerate object for the words is converted to a list (as in the snippet above), and the inner for loop iterates over its elements, which are (index, word) pairs. In the next line we look up the value of each word (key) in token_index; for example, the value of ‘Neptune’ is 7 in token_index. Then we set the element at positional index (i, j, index) in the tensor of zeros equal to 1. Consider ‘Neptune’ again, counting rows and columns from 0: since it belongs to the second sentence, i = 1, so it lies in the second matrix of the resulting tensor; since it is the first word of that sentence, it sits in the zeroth row of that matrix; and since its value in token_index is 7, the 7th column of that row is set to 1.0.
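We can check this for the assumed samples:

```python
print(results[1, 0])     # the row for 'Neptune': a 1.0 in column 7
# [0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0.]
```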
The result is a one hot representation of the sentences in our samples: a tensor of shape (2, 6, 11).
So we have represented our data in ‘samples’ as one hot vectors.