Text Preprocessing for NLP Part — 3

Sanjithkumar
11 min read · Sep 18, 2023

In the last blog, we briefly covered the different text vectorization techniques and walked through the Count Vectorization method in detail with an example. So I hope whoever is reading this has some understanding of the need for vectorization and these preprocessing steps.

In today’s blog I will talk about word embeddings, more specifically about Word2Vec, which simply stands for Word-To-Vector and is one of the most popular methods used by developers in the field of Natural Language Processing.

Word2Vec:

Word2Vec is a popular natural language processing (NLP) technique used for word embedding, which is the process of converting words or phrases into numerical vectors. It was developed by Tomas Mikolov and his team at Google in 2013. Word2Vec has been influential in various NLP tasks such as text classification, sentiment analysis, machine translation, and more, as it captures the semantic relationships between words by representing them in a continuous vector space.

Word2vec is not a singular algorithm, rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through word2vec have proven to be successful on a variety of downstream natural language processing tasks.

  • Traditional NLP models, like bag-of-words (BoW) and one-hot encoding, represent words as discrete symbols, which do not capture semantic relationships between words.
  • Word2Vec aims to represent words in a continuous vector space, where words with similar meanings are close to each other in this space.
  • Word2Vec models are trained on large text corpora.
  • The core idea is to learn word embeddings by optimizing a neural network’s weights through a process called backpropagation.
  • The neural network has an embedding layer that transforms words into vectors and a hidden layer that learns to predict context words or target words based on the input.

Two kinds of algorithms generally come up when it comes to word2vec; they are as follows:

  1. Skip-Gram algorithm
  2. Continuous Bag of words

Skip-Gram algorithm:

fig 2: skip-gram model architecture (link)

The Skip-Gram algorithm is a model trained to predict the context given a word. Let’s look at an example:

“Consistency is important than hard work”

Now the general workflow for the skip-gram algorithm is as follows:

step 1: Target word: “Consistency”, then decide the context vector: [“is”]

step 2: Target word: “is”, then decide the context vector: [“Consistency”, “important”]

step 3: Target word: “important”, then decide the context vector: [“is”, “than”] and so on…

Now, can we be sure that a word is represented only by the word immediately preceding it and the word immediately succeeding it? Sometimes more than two words in a sentence can represent the target word. For “important”, the context vector could also be [“consistency”, “is”, “than”, “hard”], and to many of you this option might seem more reasonable, since the context words “is” and “than” alone don’t provide any reliable context. This is where the “window size” comes in: it lets us take into consideration the n words before and after a target word. So with a window size of 2, the context vector for the target word “important” is [“consistency”, “is”, “than”, “hard”] (a small sketch of this windowing appears after step 4).

step 4: Now you train a model that uses the target word to predict the context words, after one-hot encoding every single word. The model is a shallow neural network with one input layer, one hidden layer and one output layer, as shown in fig 2. Your new word vectors are derived from the weight matrix W of shape V×N between the input (target vector) layer and the hidden layer.
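
To make the windowing in steps 1-4 concrete, here is a minimal sketch of how (target, context) pairs can be generated with a sliding window. The generate_pairs helper and its window_size default are my own illustration, not the implementation we will build later in this blog.

def generate_pairs(sentence, window_size=2):
    # split the sentence into lowercase tokens
    tokens = sentence.lower().split()
    pairs = []
    for i, target in enumerate(tokens):
        # take up to window_size words on each side of the target
        start = max(0, i - window_size)
        end = min(len(tokens), i + window_size + 1)
        context = [tokens[j] for j in range(start, end) if j != i]
        pairs.append((target, context))
    return pairs

for target, context in generate_pairs("Consistency is important than hard work"):
    print(target, "-->", context)
# e.g. important --> ['consistency', 'is', 'than', 'hard']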

Continuous Bag of Words algorithm:

fig 3: continuous bag of words (CBOW) model architecture (link)

Continuous Bag of Words is conceptually just the opposite of skip-gram: here we try to predict the target word from a list of context words. So for our example, the input will be [“consistency”, “is”, “than”, “hard”] and we will need to predict “important” from it.

We will see about this method in a future blog…

Embedding space / Vector Space:

  • After training, each word is represented as a high-dimensional vector in a continuous space.
  • Words with similar meanings or contexts will have vectors that are closer in this space.
  • You can perform mathematical operations on word vectors (e.g., vector addition and subtraction) to capture semantic relationships. For example, king - man + woman ≈ queen.
fig 1. Vector Space

The main concept behind Word2Vec is that the resulting embeddings (which are nothing but vectors) maintain the relationships between words after training. When projected onto a continuous vector space, related words lie closer to each other than unrelated words.

For Example:

Take the words refrigerator, oven and concrete. The word refrigerator is more closely related to oven than it is to concrete. Suppose they have the following vectors or embeddings:

refrigerator = [-0.4,0.6]

oven = [-0.2,0.6]

concrete = [0.3,-0.5]

It is clear that refrigerator is closer to oven than it is to concrete; you can see the same thing in the figure above, which represents a continuous vector space. A quick way to check this numerically is cosine similarity, as in the sketch below.
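
Cosine similarity measures how aligned two vectors are (close to 1 means very similar). The snippet below uses the toy 2-D vectors given above, not real Word2Vec output, and the values in the comments are approximate.

import numpy as np

def cosine_similarity(a, b):
    # cosine of the angle between the two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

refrigerator = np.array([-0.4, 0.6])
oven = np.array([-0.2, 0.6])
concrete = np.array([0.3, -0.5])

print(cosine_similarity(refrigerator, oven))      # ~0.97, very similar
print(cosine_similarity(refrigerator, concrete))  # ~-1.0, very dissimilar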

Main advantages over bag of words:

  • Word2Vec embeddings capture semantic relationships, allowing NLP models to perform better on various tasks.
  • It reduces the dimensionality of the feature space, making it computationally efficient.
  • It generalizes well to different NLP tasks without task-specific feature engineering.

Code Implementation of Word2Vec:

We will go step by step through an implementation of word2vec in Python using NumPy and Keras:

The first step is to create a Word2Vec class that holds and prepares the data we are going to use. Its important attributes are as follows:

  1. word_to_index is a dictionary used to keep track of words and their corresponding indices.
  2. index_to_word is the inverse of word_to_index, it stores the index as keys and the words as values.
  3. count is a counter used to keep track of the words and populate index_to_word and word_to_index dictionaries.
  4. word_count is the total number of words (tokens) in the processed text.
  5. vocab is a list of all the tokens in the text, in the order they appear (the unique words are tracked by word_to_index).
import os
import random
import re

import numpy as np
import tensorflow as tf


class Word2Vec:

    def __init__(self, input_file_path, stop_words=None):
        self.input_file_path = input_file_path
        self.word_count = None
        self.count = 0
        self.stop_words = stop_words
        self.word_to_index = {}
        self.index_to_word = {}
        self.vocab = []

        self.data = self._read_file(self.input_file_path)
        self._Prepare_data_utils(self.data)
        # Optionally cap the vocabulary so the one-hot vectors stay small, e.g.:
        # self.vocab = self.vocab[:1000]
        # self.word_count = len(self.vocab)

    def process(self, window_size):
        # build and return (target_vectors, context_vectors, labels)
        return self._generate_training_data(window_size)

The _read_file method is a private function that reads the input file and performs all the preprocessing steps we saw in the first blog of this series.

The _Prepare_data_utils method populates the attributes listed above.

Considering our example: “Consistency is important than hard work”

  1. word_to_index = {“consistency” : 0, “is”: 1, “important” : 2, “than” : 3, “hard” : 4, “work” : 5}
  2. index_to_word = {0 : “consistency” , 1: “is”, 2: “important” , 3: “than” , 4 : “hard” , 5: “work”}
  3. vocab = [“consistency”, “is”, “important”, “than”, “hard”, “work”]
  4. word_count = 6
  def _read_file(self, input_file_path):
        if os.path.exists(input_file_path):
            with open(input_file_path) as f:
                file_contents = f.read()
            data = []
            # split into rough sentences and keep only alphabetic words
            for sent in file_contents.split('.'):
                sent = re.findall("[A-Za-z]+", sent)
                new_sent = ''
                for words in sent:
                    if self.stop_words is not None:
                        # keep words longer than one character that are not stop words
                        if len(words) > 1 and words not in self.stop_words:
                            new_sent = new_sent + ' ' + words
                        continue
                    if len(words) > 1:
                        new_sent = new_sent + ' ' + words
                data.append(new_sent)
            return data
        else:
            raise Exception("File Path Does Not Exist")

  def _Prepare_data_utils(self, data):
        for sent in data:
            for word in sent.split():
                word = word.lower()
                self.vocab.append(word)  # every token, in order
                if word not in self.word_to_index:
                    self.word_to_index[word] = self.count
                    self.index_to_word[self.count] = word
                    self.count += 1
        self.word_count = len(self.vocab)

One-hot encoding is a way of representing text as vectors of zeros and ones. The following method converts a target word into a one-hot vector and its context words into a single (multi-hot) context vector:

Considering our example:

When Target word: “is”, then decide the context vector: [“Consistency”, “important”]

The Target Vector would be [0,1,0,0,0,0]. Since in the dictionary word_to_index, the index of the word “is” is 1.

The Context Vector would be [1,0,1,0,0,0], since “Consistency” corresponds to index 0 and “important” to index 2. (Note that this differs from standard one-hot encoding, where you would use two separate vectors to represent “Consistency” ([1,0,0,0,0,0]) and “important” ([0,0,1,0,0,0]); to avoid memory constraints I resorted to a slightly modified multi-hot version, which is also popular.)

  def _one_hot_encode(self, target_word, context_words):
        target_vector = np.zeros(len(self.vocab))
        context_vector = np.zeros(len(self.vocab))
        target_index = self.word_to_index.get(target_word)
        for word in context_words:
            context_index = self.word_to_index.get(word)
            context_vector[context_index] = 1
        target_vector[target_index] = 1
        return target_vector, context_vector
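
For instance, if a Word2Vec instance were built over just the six-word example sentence (a hypothetical setup; the vectorizer used later in this blog is built from a much larger file), the call would produce exactly the vectors described above:

# illustrative only: assumes word_to_index matches the example dictionary above
target_vec, context_vec = vectorizer._one_hot_encode("is", ["consistency", "important"])
print(target_vec)   # [0. 1. 0. 0. 0. 0.]
print(context_vec)  # [1. 0. 1. 0. 0. 0.]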

The _generate_training_data method is the main function that performs the skip-gram data preparation; it generates positive and negative samples with labels for training. Let’s take just one word, “important”, from our example and look at how the data is generated.

Let window_size be 2. For a negative sample, the target_word is “important” and the context_words are window_size * 2 words sampled at random from the vocabulary, i.e. 4 random words (since our example has only 6 words this is hard to illustrate here, but on a huge corpus the sampled words will almost always be unrelated to the target word); this pair gets a label of 0. For a positive sample with window_size 2, the target_word is “important”, its context words are [“consistency”, “is”, “than”, “hard”], and the label is 1.

Words like “is”, “a”, “an” and “the” can be ignored, as they don’t convey any context on their own; the stop_words argument of the Word2Vec class exists for exactly this purpose, as shown below.
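
A hypothetical usage (the stop-word set and file name here are my own; the run later in this blog does not pass a stop-word list):

# drop a few common function words while building the vocabulary
stop_words = {"is", "a", "an", "the", "than"}
vectorizer = Word2Vec("my_corpus.txt", stop_words=stop_words)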

Here is an example set for a negative sample from the program.

target_vector = [1. 0. 0. 0. 0. 0. 0. 0. ... 0. 0. 0. 0.]   (one-hot, shape (1000,))
context_vector = [1. 0. 0. 1. 0. 0. 0. 0. ... 0. 0. 0. 0.]   (multi-hot, shape (1000,))
label = 0
  def _generate_training_data(self, window_size, gen_negative_data=True):
        target_vectors, context_vectors, labels = [], [], []

        # negative samples: randomly sampled context words, label 0
        if gen_negative_data:
            for index, word in enumerate(self.vocab):
                target = word
                context_words = random.sample(self.vocab, window_size * 2)
                target_vector, context_vector = self._one_hot_encode(target, context_words)
                labels.append(0)
                target_vectors.append(target_vector)
                context_vectors.append(context_vector)

        # positive samples: the real context window around each target, label 1
        for index, word in enumerate(self.vocab):
            target = word
            context_words = []
            if index == 0:
                context_words = [self.vocab[idx] for idx in range(index + 1, index + 1 + window_size)]
            elif index == self.word_count - 1:
                context_words = [self.vocab[idx] for idx in range(index - 1, index - 1 - window_size, -1)]
            else:
                # right side of the target
                for idx in range(index + 1, index + 1 + window_size):
                    if idx < len(self.vocab):
                        context_words.append(self.vocab[idx])
                        continue
                    break
                # left side of the target
                for idx in range(index - 1, index - 1 - window_size, -1):
                    if idx >= 0:
                        context_words.append(self.vocab[idx])
                        continue
                    break
            target_vector, context_vector = self._one_hot_encode(target, context_words)
            labels.append(1)
            target_vectors.append(target_vector)
            context_vectors.append(context_vector)

        return np.array(target_vectors), np.array(context_vectors), np.array(labels)

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
vectorizer = Word2Vec(path_to_file)
Autotune = tf.data.AUTOTUNE
target_vectors,context_vectors,labels = vectorizer.process(2)
data = tf.data.Dataset.from_tensor_slices(((target_vectors,context_vectors),labels))
data = data.cache().shuffle(5000).batch(1000).prefetch(Autotune)

In the above code I simply downloaded a small text dataset (the Shakespeare corpus used in the TensorFlow tutorials), generated the training set and then converted it into a tf.data.Dataset object.

The code below defines a basic custom Keras Word2Vec model used to obtain the weights. It is important to understand that we are not trying to classify anything here; what we are actually after is a set of optimal weights that properly represent each word based on its context.

class Word2VecModel(tf.keras.Model):

    def __init__(self, vocab_size, emb_dim):
        super(Word2VecModel, self).__init__()
        self.target_embedding = tf.keras.layers.Embedding(vocab_size, emb_dim,
                                                          name="embedding_1")
        self.context_embedding = tf.keras.layers.Embedding(vocab_size, emb_dim,
                                                           name="embedding_2")
        self.flatten = tf.keras.layers.Flatten()
        self.dense = tf.keras.layers.Dense(1, activation="sigmoid")

    def call(self, x):
        target, context = x
        word_em1 = self.target_embedding(target)
        word_em2 = self.context_embedding(context)
        # combine the two embeddings (the classic word2vec objective uses a dot
        # product here; this simplified model adds them instead)
        dots = tf.math.add(word_em1, word_em2)
        dots = self.flatten(dots)
        dots = self.dense(dots)
        return dots

Here, for simplicity, I trained the model on 1000 words, each represented by an embedding of size 120, i.e. the size of the hidden layer.

my_model = Word2VecModel(1000, 120)
my_model.compile(optimizer='adam',
                 loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),
                 metrics=['accuracy'])
my_model.fit(data, epochs=20)

Now, if you look closely at the weights I am retrieving, you will see that they are the weights between the input target_vector and the embedding layer named “embedding_1”.

weights = my_model.get_layer('embedding_1').get_weights()[0]
array([[-0.02310475, -0.00200859, -0.03876656, ..., -0.00859825,
0.00285395, -0.00862287],
[-0.00410956, -0.05728572, -0.01264317, ..., -0.00449263,
-0.01423586, -0.04030272],
[ 0.00954651, 0.00488069, 0.04626249, ..., 0.03898971,
0.01442141, 0.00423387],
...,
[ 0.0266178 , 0.00356406, -0.04178107, ..., -0.01018769,
0.00633867, -0.02287768],
[-0.03295443, -0.00057896, 0.00699287, ..., 0.0445875 ,
0.04261236, 0.01465371],
[-0.00072701, 0.03105271, -0.01963968, ..., -0.04909495,
0.00587721, -0.00573009]], dtype=float32)
weights.shape
(1000, 120)

If you look at the shape you will see that there are 1000 words, each represented by a 120-dimensional vector.

Now, to see what the skip-gram word2vec embedding for each word looks like, check the output of the code below.

Each word has its own vector of shape (120,).

for i in range(3):
    print(f"{vectorizer.vocab[i]} --> {weights[i]}")
first --> [-0.02310475 -0.00200859 -0.03876656  0.02102066 -0.02332496 -0.01372258
0.02566614 -0.01525349 0.02167136 0.00359732 0.02660094 -0.00380347
-0.00711286 -0.00277696 0.02213457 0.01442237 0.02705166 -0.02413151
0.02124317 0.02058548 -0.02252698 -0.04370767 -0.01270716 0.01362769
0.03234105 0.00991848 -0.00896527 0.02926731 0.00911664 -0.02375735
-0.02751724 0.02804366 -0.01344621 -0.00629256 -0.02606178 0.00215797
-0.02266966 -0.00143781 0.03241628 0.01380005 0.03881672 -0.02476944
-0.02332647 -0.03774424 -0.02213713 -0.02122323 0.00072905 -0.00204702
-0.01274652 0.0066981 0.01298382 0.01617453 -0.00806122 -0.02190874
-0.01311323 0.01249751 -0.00210295 0.02601673 0.02048907 -0.01612479
-0.02513399 -0.01583612 -0.00866913 -0.0075117 -0.02043742 0.01486086
-0.00831461 -0.02407211 -0.0084334 -0.00706029 0.01396245 -0.01701266
-0.04689959 0.0137794 -0.01268356 0.01844781 0.03714399 -0.03092293
-0.00867312 -0.01278079 0.0373552 0.00378405 -0.00805337 0.0254508
-0.04463262 0.0230299 -0.03902265 0.02405916 0.01490924 0.03035023
0.00807145 0.02136353 0.00621097 0.04742548 0.02268744 0.02508656
0.03214198 0.0141999 -0.0036273 -0.00944887 0.04548135 0.03261822
-0.00450574 0.00553209 0.00874641 -0.03537415 -0.014837 -0.01257267
-0.01155591 -0.03443324 -0.00371254 -0.02028203 -0.00969815 -0.04486717
0.0162504 -0.00207139 -0.00789557 -0.00859825 0.00285395 -0.00862287]
citizen --> [-0.00410956 -0.05728572 -0.01264317 0.05890079 0.03213564 -0.04856756
0.006817 0.00971984 0.0548711 0.04948877 -0.00601344 0.02023276
0.04218708 -0.0359037 0.05394344 -0.00359714 -0.05305861 0.04892066
0.02037724 0.02198882 0.06779727 -0.00096199 -0.06675839 -0.02995355
0.02225785 0.0413808 -0.02291583 0.02196655 0.00351348 -0.00966354
0.00346691 0.03316642 0.03224444 -0.03004048 0.03474271 0.02987783
0.01351574 0.00788232 -0.01221279 -0.01864956 0.02042673 -0.00307584
0.00866549 0.03599598 0.03743691 -0.03006205 -0.0081626 -0.02358445
-0.06093378 -0.036452 -0.03515859 0.00775395 -0.03216971 -0.00438268
0.03621898 0.02627047 -0.0324512 -0.03177933 0.00123988 0.0288799
-0.02864826 0.02137098 0.05381231 0.00259382 -0.00639871 -0.03912453
0.01063962 0.04047365 -0.00257662 0.06156946 0.02030049 0.02971134
0.01004253 -0.05250796 -0.00025513 0.00608454 -0.00320571 -0.01324324
0.01657553 -0.00181611 -0.00954415 0.01849543 -0.03822352 0.02481043
-0.00207916 -0.02524566 -0.00162204 -0.02674341 0.02574393 -0.01599589
0.01372321 -0.01040668 -0.01154426 -0.03452919 0.05554885 -0.06467807
-0.00843349 0.00691917 0.04216848 -0.01049247 -0.03223878 0.06233244
0.00865326 -0.02177565 0.0159393 0.04086575 -0.03640402 0.02182319
0.01537229 -0.02400628 -0.0137303 -0.0340028 0.00032738 -0.01440961
0.02334342 -0.04427146 -0.026365 -0.00449263 -0.01423586 -0.04030272]
before --> [ 0.00954651 0.00488069 0.04626249 -0.03917541 -0.01564524 -0.04950831
0.03916881 0.0173671 -0.03409598 0.02075822 -0.0088353 -0.00020568
0.01904334 -0.01663901 -0.00493728 0.03888226 0.03318102 0.01797071
0.01435292 0.02863533 -0.03691536 -0.04302084 0.03476962 0.01542181
-0.0298581 -0.04130291 0.04644031 -0.02762159 0.00344064 -0.03833158
-0.02139124 0.00529104 -0.02112808 -0.01681819 -0.00425576 0.01943007
-0.03092628 -0.03074777 -0.02779267 -0.03174068 -0.03388404 -0.00064671
-0.00131323 -0.03669352 0.0045494 0.04583145 0.010047 -0.00487123
0.03789339 0.04309889 0.04931576 0.00815551 -0.01019833 -0.04489645
0.02264648 0.02798642 0.03864842 -0.00089893 0.00685357 -0.01606183
0.00441351 0.01041739 -0.01554643 0.00725418 -0.02612382 -0.00924844
0.01256322 0.03827821 0.03541933 0.01579515 0.0481034 0.0444251
-0.02436733 0.02675876 0.04412528 0.0033959 0.01202037 0.04412648
-0.04102162 -0.00228269 0.00515808 0.00444824 0.01954957 0.04822811
-0.03481095 -0.02920536 -0.00457232 -0.0274297 0.04339688 0.0199146
-0.01350059 -0.00362799 -0.02872139 0.03262046 0.02076607 0.02602286
0.03260423 0.033685 -0.00192542 0.04619341 0.02314458 -0.00936342
0.02051053 -0.00566466 -0.03193802 -0.04142316 0.04088009 0.02074022
0.04763401 -0.04489685 -0.00120794 -0.03320307 -0.0157941 -0.02380283
-0.02146627 0.04743249 -0.0318869 0.03898971 0.01442141 0.00423387]
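
With the weights and the vectorizer in hand, a natural next step (not part of the run above, so treat it as a sketch) is to ask which words ended up closest to a given word, reusing the same vocab[i] <--> weights[i] pairing as in the print loop:

def most_similar(word, top_k=5):
    # position of the word's vector, mirroring the vocab[i] <-> weights[i] pairing above;
    # assumes the word appears among the first 1000 tokens of the corpus
    idx = vectorizer.vocab.index(word)
    query = weights[idx]
    # cosine similarity between the query vector and every row of the weight matrix
    norms = np.linalg.norm(weights, axis=1) * np.linalg.norm(query)
    sims = (weights @ query) / np.maximum(norms, 1e-8)
    best = np.argsort(-sims)[1:top_k + 1]  # skip the word itself
    return [(vectorizer.vocab[i], float(sims[i])) for i in best]

print(most_similar("citizen"))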

In this blog you saw a comprehensive approach to word2vec using the skip-gram method. The implementation here is meant for understanding, not for deployment, since it can be heavily optimized with powerful frameworks like TensorFlow or PyTorch; you can look into TensorFlow’s comprehensive and optimized explanation and code, and you can also look into this article, which provides a much simpler explanation. There are many other ways to vectorize text; we will see the others in future blogs. Till then, I hope this helps you.
