Text PreProcessing For NLP Part — 4

Sanjithkumar
11 min read · Sep 25, 2023


Photo by Chris Ried on Unsplash

In the last blog, I went through the implementation of the Word2Vec text vectorization method using the Skip-Gram algorithm. In this blog I will continue with Word2Vec using the Continuous Bag Of Words (CBOW) algorithm and also implement CBOW in Python.

If you are new to this field, I recommend reading my previous blogs first for a basic understanding. Here I will give a short outline of what Word2Vec is and why it is needed.

Word2Vec:

Word2Vec is a popular natural language processing (NLP) technique used for word embedding, which is the process of converting words or phrases into numerical vectors. It was developed by Tomas Mikolov and his team at Google in 2013. Word2Vec has been influential in various NLP tasks such as text classification, sentiment analysis, machine translation, and more, as it captures the semantic relationships between words by representing them in a continuous vector space.

Word2vec is not a singular algorithm, rather, it is a family of model architectures and optimizations that can be used to learn word embeddings from large datasets. Embeddings learned through word2vec have proven to be successful on a variety of downstream natural language processing tasks.

  • Traditional NLP models, like bag-of-words (BoW) and one-hot encoding, represent words as discrete symbols, which do not capture semantic relationships between words.
  • Word2Vec aims to represent words in a continuous vector space, where words with similar meanings are close to each other in this space (a small sketch of this idea follows this list).
  • Word2Vec models are trained on large text corpora.
  • The core idea is to learn word embeddings by optimizing a neural network’s weights through a process called backpropagation.
  • The neural network has an embedding layer that transforms words into vectors and a hidden layer that learns to predict context words or target words based on the input.
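
As a rough illustration of what "close in vector space" means (using hand-made toy vectors, not a trained model), cosine similarity is the usual way to measure it:

import numpy as np

# Toy, hand-made "embeddings" purely for illustration; a real Word2Vec
# model learns these vectors from a large corpus.
embeddings = {
    "king":  np.array([0.90, 0.80, 0.10, 0.20]),
    "queen": np.array([0.85, 0.75, 0.20, 0.25]),
    "pizza": np.array([0.10, 0.20, 0.90, 0.80]),
}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["king"], embeddings["queen"]))  # ~0.99, very similar
print(cosine_similarity(embeddings["king"], embeddings["pizza"]))  # ~0.33, not similar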

Two algorithms generally come up when it comes to Word2Vec:

  1. Skip-Gram algorithm
  2. Continuous Bag of words

In this blog we will focus on Continuous Bag Of Words…

Continuous Bag Of Words:

fig 1: CBOW architecture

Word2Vec is a popular algorithm used in natural language processing (NLP) to represent words as continuous-valued vectors in a way that captures their semantic meaning. CBOW, which stands for “Continuous Bag of Words,” is one of the two main architectures used in Word2Vec, with the other being Skip-gram.

Here’s a brief explanation of the CBOW model in Word2Vec:

  1. Main Objective: The main goal of Word2Vec using CBOW is to learn word embeddings, which are dense vector representations of words. These embeddings should capture the semantic relationships between words based on their co-occurrence patterns in a given text corpus.
  2. Architecture:
  • CBOW uses a neural network architecture. It typically has three layers: an input layer, a hidden layer, and an output layer.
  • The input to the model is a context window of words surrounding a target word. For example, if the context window size is 2 and we are trying to predict the target word “eat” in the sentence “I love to eat pizza,” the input context is [“love”, “to”, “pizza”].
  • Each word in the context window is represented as a one-hot encoded vector (a vector of all zeros except for a single 1 at the index corresponding to the word’s position in the vocabulary).
  • These one-hot encoded vectors are averaged to create a single input vector for the neural network (a small sketch of this step follows this list).
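
Here is a small sketch of that step, using the “I love to eat pizza” example and a toy vocabulary defined only for illustration:

import numpy as np

# Toy vocabulary for the example sentence; the index order is arbitrary.
vocab = ["i", "love", "to", "eat", "pizza"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1
    return vec

# Context of the target word "eat" with a window size of 2.
context = ["love", "to", "pizza"]

# Average the one-hot context vectors to form the CBOW input.
cbow_input = np.mean([one_hot(w) for w in context], axis=0)
print(cbow_input)  # [0. 0.333... 0.333... 0. 0.333...]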

3. Training Objective:

  • The objective of the CBOW model is to predict the target word based on the averaged context vectors. This is done by training the neural network to minimize the difference between the predicted word probabilities and the actual target word’s probability distribution.
  • In other words, CBOW learns to predict a word given its context, and in doing so, it learns to capture the semantic relationships between words.

4. Word Embeddings:

  • Once the CBOW model is trained, the hidden layer weights that connect the context input layer to the hidden layer are used as the word embeddings.
  • These word embeddings are dense vectors with continuous values, and they represent words in a high-dimensional vector space. Words with similar meanings or usage patterns tend to have similar vector representations, which is why Word2Vec is valuable for NLP tasks.

Now let us understand CBOW with an example:

“Deep learning is a subfield of machine learning.”

We want to train a CBOW model with a context window size of 2 and negative sampling. Here’s how the input and output data might look:

Vocabulary:

  • deep, learning, is, a, subfield, of, machine

Step 1: Sliding the Window: We slide the context window over the text to create training samples (a short code sketch follows the list below):

  • Training Sample 1: Input context: [“deep”, “learning”, “a”, “subfield”], Target word: “is”
  • Training Sample 2: Input context: [“deep”, “is”, “a”], Target word: “learning”
  • Training Sample 3: Input context: [“learning”, “is”, “subfield”, “of”], Target word: “a”
  • Training Sample 4: Input context: [ “is”, “a”, “of”, “machine”], Target word: “subfield”
  • Training Sample 5: Input context: [“a”, “subfield”, “machine”, “learning”], Target word: “of”
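
Here is the small sketch of the sliding-window step mentioned above, keeping the second occurrence of “learning” as its own token:

sentence = "deep learning is a subfield of machine learning".split()
window_size = 2

for index, target in enumerate(sentence):
    # Take up to window_size words on each side of the target position.
    left = sentence[max(0, index - window_size):index]
    right = sentence[index + 1:index + 1 + window_size]
    print(f"context: {left + right} -> target: {target}")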

Step 2: Conversion to Numerical Vectors:

  • OneHot Encoding for “deep” = [1,0,0,0,0,0,0]
  • OneHot Encoding for “learning” = [0,1,0,0,0,0,0]
  • OneHot Encoding for “is” = [0,0,1,0,0,0,0]
  • OneHot Encoding for “a” = [0,0,0,1,0,0,0]
  • OneHot Encoding for “subfield” = [0,0,0,0,1,0,0]
  • OneHot Encoding for “of” = [0,0,0,0,0,1,0]
  • OneHot Encoding for “machine” = [0,0,0,0,0,0,1]

Step 3: Input Data Format: For Training Sample 1, the input data would be:

  • Input context vector: [1,1,0,1,1,0,0] (the element-wise sum of the one-hot vectors of the context words). This context vector corresponds to the target word “is”, whose one-hot vector is [0,0,1,0,0,0,0].

Step 4: Output Data Format: For each training sample, we create positive and negative samples.

  • Positive Sample for Training Sample 1: [[0,0,1,0,0,0,0], 1] (1 indicates it’s a positive example)
  • Negative Samples for Training Sample 1: [[0,0,0,0,0,1,0], 0], [[0,0,0,0,0,0,1], 0] (0 indicates they are negative examples)

So, for Training Sample 1, our input data is [1,1,0,1,1,0,0], and the output data consists of one positive sample ([[0,0,1,0,0,0,0], 1]) and two negative samples ([[0,0,0,0,0,1,0], 0] and [[0,0,0,0,0,0,1], 0]).
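
As a short sketch (with toy helpers defined only for this example), Training Sample 1 laid out as arrays looks like this:

import numpy as np

vocab = ["deep", "learning", "is", "a", "subfield", "of", "machine"]
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[word_to_index[word]] = 1
    return vec

# Multi-hot context vector for ["deep", "learning", "a", "subfield"].
context_vector = sum(one_hot(w) for w in ["deep", "learning", "a", "subfield"])
print(context_vector)     # [1 1 0 1 1 0 0]

# Positive pair: the true target "is" with label 1.
print(one_hot("is"), 1)   # [0 0 1 0 0 0 0] 1
# Negative pair: a randomly drawn non-target word, e.g. "of", with label 0.
print(one_hot("of"), 0)   # [0 0 0 0 0 1 0] 0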

Similar to last time, I will be using negative sampling for better results.

Implementation of CBOW:

This code uses the same preprocessing steps as the previous blog, so if you have any queries about this part, check out the previous blog or comment below…

Preprocessing steps:

import os
import random
import re

import numpy as np
import tensorflow as tf


class Word2Vec:

    def __init__(self, input_file_path, stop_words=None):
        self.input_file_path = input_file_path
        self.word_count = None
        self.count = 0
        self.stop_words = stop_words
        self.word_to_index = {}
        self.index_to_word = {}
        self.vocab = []

        self.data = self._read_file()
        self._Prepare_data_utils(self.data)
        # Keep only the first 500 tokens so the one-hot size matches the
        # model dimensions (vocab size 500) used later in this blog.
        self.vocab = self.vocab[:500]
        self.word_count = len(self.vocab)

    def process(self, window_size):
        return self._generate_training_data(window_size)

    def _read_file(self):
        if os.path.exists(self.input_file_path):
            with open(self.input_file_path) as f:
                file_contents = f.read()
            data = []
            for sent in file_contents.split('.'):
                sent = re.findall("[A-Za-z]+", sent)
                new_sent = ''
                for words in sent:
                    if self.stop_words is not None:
                        if len(words) > 1 and words not in self.stop_words:
                            new_sent = new_sent + ' ' + words
                        continue
                    if len(words) > 1:
                        new_sent = new_sent + ' ' + words
                data.append(new_sent)
            return data
        else:
            raise Exception("File Path Does Not Exist")

    def _Prepare_data_utils(self, data):
        # Build the token list plus the word<->index lookup tables.
        for sent in data:
            for word in sent.split():
                word = word.lower()
                self.vocab.append(word)
                if word not in self.word_to_index:
                    self.word_to_index[word] = self.count
                    self.index_to_word[self.count] = word
                    self.count += 1
        self.word_count = len(self.vocab)

    def _one_hot_encode(self, target_word, context_words):
        # One-hot vector for the target word and a multi-hot vector
        # for all of its context words.
        target_vector = np.zeros(len(self.vocab))
        context_vector = np.zeros(len(self.vocab))
        target_index = self.word_to_index.get(target_word)
        for word in context_words:
            context_index = self.word_to_index.get(word)
            context_vector[context_index] = 1
        target_vector[target_index] = 1
        return target_vector, context_vector

    def _generate_training_data(self, window_size, gen_negative_data=True):
        target_vectors, context_vectors, labels = [], [], []

        # Negative samples: pair every target word with a random "context".
        if gen_negative_data:
            for index, word in enumerate(self.vocab):
                target = word
                context_words = random.sample(self.vocab, window_size * 2)
                target_vector, context_vector = self._one_hot_encode(target, context_words)
                labels.append([0])
                target_vectors.append(target_vector)
                context_vectors.append(context_vector)

        # Positive samples: pair every target word with its true context window.
        for index, word in enumerate(self.vocab):
            target = word
            context_words = []
            if index == 0:
                context_words = [self.vocab[idx] for idx in range(index + 1, index + 1 + window_size)]
            elif index == self.word_count - 1:
                context_words = [self.vocab[idx] for idx in range(index - 1, index - 1 - window_size, -1)]
            else:
                # right side of the window
                for idx in range(index + 1, index + 1 + window_size):
                    if idx < len(self.vocab):
                        context_words.append(self.vocab[idx])
                        continue
                    break
                # left side of the window
                for idx in range(index - 1, index - 1 - window_size, -1):
                    if idx >= 0:
                        context_words.append(self.vocab[idx])
                        continue
                    break
            target_vector, context_vector = self._one_hot_encode(target, context_words)
            labels.append([1])
            target_vectors.append(target_vector)
            context_vectors.append(context_vector)

        return np.array(target_vectors), np.array(context_vectors), np.array(labels)

The explanation for the above code is already provided in the previous blog, so please feel free to check it out…

path_to_file = tf.keras.utils.get_file('shakespeare.txt', 'https://storage.googleapis.com/download.tensorflow.org/data/shakespeare.txt')
vectorizer = Word2Vec(path_to_file)
Autotune = tf.data.AUTOTUNE
window_size = 2
target_vectors,context_vectors,labels = vectorizer.process(window_size)

Here we use the popular Shakespeare text dataset as before. The Word2Vec class takes in the data and prepares all the necessary utilities. Once we call the process function with an appropriate window_size, we get back the training data: target_vectors contains the one-hot encodings of the target words, context_vectors contains the corresponding context encodings (both true and false contexts, as discussed in the example), and labels contains 1s and 0s for the positive and negative examples.

data = tf.data.Dataset.from_tensor_slices((context_vectors,(target_vectors,labels)))
data = data.cache().shuffle(5000).batch(500).prefetch(Autotune)

We create the dataset by treating the context_vectors as inputs and the target_vectors and labels as outputs.
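
As a quick sanity check (a sketch, assuming the 500-word vocabulary and the batch size of 500 used here), you can peek at one batch to confirm the (context, (target, label)) structure:

# Inspect a single batch from the pipeline.
for context_batch, (target_batch, label_batch) in data.take(1):
    print(context_batch.shape)  # (500, 500): multi-hot context vectors
    print(target_batch.shape)   # (500, 500): one-hot target vectors
    print(label_batch.shape)    # (500, 1):  1 = positive, 0 = negative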

def Word2VecCBOW_Model(Vocab_size, Hidden_dim):
    Inp_Layer = tf.keras.layers.Input((Vocab_size,), name="input_layer")
    Embedding_Layer = tf.keras.layers.Embedding(Vocab_size, Hidden_dim, name="Embedding_Layer_1")(Inp_Layer)
    Comm_Hidden_Layer = tf.keras.layers.Dense(128, activation="relu", name="Common_hidden")(Embedding_Layer)

    # Target prediction branch
    Target_Hidden = tf.keras.layers.Dense(64, activation="relu", name="Target_hidden")(Comm_Hidden_Layer)
    Reg = tf.keras.layers.Dropout(0.1, name="Regularization_1")(Target_Hidden)
    Target = tf.keras.layers.Dense(Vocab_size, name="Target_Out")(Reg)

    # Positive/negative label prediction branch
    Label_Hidden = tf.keras.layers.Dense(64, activation="relu", name="Label_hidden")(Comm_Hidden_Layer)
    Reg2 = tf.keras.layers.Dropout(0.1, name="Regularization_2")(Label_Hidden)
    Label = tf.keras.layers.Dense(1, name="Label_Out")(Reg2)

    CBOWWord2Vec = tf.keras.models.Model(inputs=Inp_Layer, outputs=[Target, Label])

    return CBOWWord2Vec

This is the model we will be using. It is not a Sequential model but a model built with the Keras functional API. Here is a breakdown of the architecture.

  1. Input Layer: The model starts with an input layer that takes as input a one-hot encoded vector of size Vocab_size, where each dimension corresponds to a word in the vocabulary.
  2. Embedding Layer: Next, there’s an embedding layer (Embedding_Layer_1) that converts the one-hot encoded input into dense word embeddings. It maps each word to a lower-dimensional space of size Hidden_dim. These embeddings are trainable parameters.
  3. Common Hidden Layer: The embedded vectors are passed through a common hidden layer (Common_hidden) with 128 units and ReLU activation. This layer learns common representations for both the target and context words.
  4. Target Prediction Branch:
  • Target_Hidden: This branch takes the output of the common hidden layer and passes it through another hidden layer (Target_hidden) with 64 units and ReLU activation.
  • Reg: A dropout layer with a rate of 0.1 is applied for regularization.
  • Target: The final layer of this branch is a dense layer with Vocab_size units, one for each word in the vocabulary. It outputs raw logits over the vocabulary, and the softmax is applied inside the loss function (softmax cross-entropy with logits) when predicting the target word given the context.

5. Label Prediction Branch:

  • Label_Hidden: This branch also takes the output of the common hidden layer and passes it through another hidden layer (Label_hidden) with 64 units and ReLU activation.
  • Reg2: A dropout layer with a rate of 0.1 is applied for regularization.
  • Label: The final layer of this branch is a dense layer with a single unit, since this is a binary classification task. It outputs a logit indicating whether the context is a positive or negative example; the sigmoid is applied inside the loss function (sigmoid cross-entropy with logits).

6. Model Construction:

The model is constructed using the Keras functional API. The input is the context vector fed to Inp_Layer, and the outputs are the Target and Label branches.

Return: The function returns the constructed CBOW Word2Vec model as a Keras Model object.
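
To sanity-check the layer shapes before training, you can instantiate the function with the sizes used later in this blog (a 500-word vocabulary and 120-dimensional embeddings) and print a summary. Note that because the Embedding layer is applied along the full length-Vocab_size input vector, its output (and everything downstream) carries an extra axis of size Vocab_size:

# A quick shape check; 500 and 120 are the sizes used later in this blog.
cbow = Word2VecCBOW_Model(500, 120)
cbow.summary()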

class Word2VecModel(tf.keras.models.Model):

    def __init__(self, my_model, **kwargs):
        super().__init__(**kwargs)
        self.model = my_model

    def compile(self, optimizer, Target_loss, Label_loss, **kwargs):
        super().compile(**kwargs)
        self.optimizer = optimizer
        self.Target_loss = Target_loss
        self.Label_loss = Label_loss

    def train_step(self, batch, **kwargs):
        x, y = batch
        with tf.GradientTape() as tape:
            Target, Label = self.model(x, training=True)
            batch_targetloss = self.Target_loss(tf.cast(y[0], tf.float32), Target[0])
            batch_labelloss = self.Label_loss(tf.cast(y[1], tf.float32), Label[0])
            total_loss = batch_targetloss + batch_labelloss

        gradients = tape.gradient(total_loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(gradients, self.model.trainable_variables))

        return {"total_loss": total_loss, "Target_loss": batch_targetloss, "Label_loss": batch_labelloss}

    def test_step(self, batch, **kwargs):
        x, y = batch
        # No gradient tape and training=False during evaluation.
        Target, Label = self.model(x, training=False)

        batch_targetloss = self.Target_loss(tf.cast(y[0], tf.float32), Target[0])
        batch_labelloss = self.Label_loss(tf.cast(y[1], tf.float32), Label[0])
        total_loss = batch_targetloss + batch_labelloss
        return {"total_loss": total_loss, "Target_loss": batch_targetloss, "Label_loss": batch_labelloss}

    def call(self, inp, **kwargs):
        return self.model(inp, **kwargs)

Here we define a custom training step that uses two loss functions, Target_loss and Label_loss:

Target_loss = softmax cross-entropy with logits (tf.nn.softmax_cross_entropy_with_logits)

Label_loss = sigmoid cross-entropy with logits (tf.nn.sigmoid_cross_entropy_with_logits)

Now, the difference between sigmoid cross-entropy with logits and Keras's BinaryCrossentropy is as follows.

tf.nn.sigmoid_cross_entropy_with_logits() and tf.keras.losses.BinaryCrossentropy() are not the same functions, although they are both related to binary classification tasks and involve calculating binary cross-entropy loss.

tf.nn.sigmoid_cross_entropy_with_logits():

  • This function is typically used in the context of TensorFlow’s low-level operations. It calculates the binary cross-entropy loss between predicted logits and target labels when the predictions are not yet converted to probabilities using a sigmoid activation.
  • You provide it with logits (the output of your model) and target labels (usually 0 or 1). It applies the sigmoid function internally to convert logits into probabilities and then calculates the binary cross-entropy loss.

tf.keras.losses.BinaryCrossentropy():

  • This is a high-level loss function provided by the Keras API in TensorFlow. It is used for binary classification tasks and, by default, assumes that your model’s final layer applies a sigmoid activation. You pass it the true labels and the predicted probabilities.
  • It does not require you to provide logits (unless you construct it with from_logits=True); by default you pass the model’s output after the sigmoid activation (a short comparison sketch follows).
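
To make the relationship concrete, here is a small sketch (with made-up labels and logits) showing that the two agree once BinaryCrossentropy is told it is receiving logits; the only remaining difference is the default mean reduction:

import tensorflow as tf

labels = tf.constant([[1.0], [0.0], [1.0]])
logits = tf.constant([[2.0], [-1.0], [0.5]])

# Element-wise loss on raw logits (sigmoid applied internally).
low_level = tf.nn.sigmoid_cross_entropy_with_logits(labels=labels, logits=logits)

# Keras loss configured to accept logits; reduces to the mean by default.
high_level = tf.keras.losses.BinaryCrossentropy(from_logits=True)(labels, logits)

print(tf.reduce_mean(low_level).numpy())  # matches the value below
print(high_level.numpy())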

We calculate the total loss as the sum of the two losses and backpropagate the gradients…

model = Word2VecCBOW_Model(500, 120)
model = Word2VecModel(model)

def target_loss(y_true, y_pred):
    return tf.nn.softmax_cross_entropy_with_logits(y_true, y_pred)

def label_loss(y_true, y_pred):
    return tf.nn.sigmoid_cross_entropy_with_logits(y_true, y_pred)

These are the custom loss functions we pass to compile.

model.compile(optimizer = tf.keras.optimizers.Adam(learning_rate = 0.001),Target_loss = target_loss,Label_loss=label_loss,metrics = ["accuracy"])
model.fit(data,epochs = 100)

Compiling and training the model…

Epoch 1/100
2/2 [==============================] - 0s 90ms/step - total_loss: 5.7537 - Target_loss: 5.0593 - Label_loss: 0.6945
Epoch 2/100
2/2 [==============================] - 0s 69ms/step - total_loss: 5.7347 - Target_loss: 5.0406 - Label_loss: 0.6942
Epoch 3/100
2/2 [==============================] - 0s 70ms/step - total_loss: 5.7525 - Target_loss: 5.0606 - Label_loss: 0.6919
...

Once the model is trained, we can extract the embeddings for each word from the embedding layer, as seen in the architecture.

embeddings = model.model.get_layer("Embedding_Layer_1").get_weights()
embeddings[0].shape
(500, 120)

As we can see, we have 500 embeddings, each of length 120, since I considered only the first 500 words for training. You can iterate through the embeddings to assign each weight vector to its corresponding word, or simply store them for later use…

for i, j in enumerate(list(embeddings[0])):
    print(f"{vectorizer.vocab[i]}-->{j}")
first-->[ 0.03838825 -0.05927904  0.07840646  0.04125115 -0.08230995  0.07507766
0.09762581 0.045667 -0.09867255 0.0624503 0.07419965 -0.09086895
0.05858266 -0.10045303 0.09924387 0.03329634 0.08240353 -0.07684124
-0.05686071 -0.04822601 0.07820345 0.05972499 0.01935342 0.02556623
0.09234457 -0.01956867 0.04068892 -0.06160046 -0.07195235 0.03452663
-0.09757455 0.09166995 0.04848677 -0.08157872 -0.04222412 0.08523004
-0.04002318 0.08769921 0.07150321 -0.05587499 -0.09701591 -0.09103511
0.05098839 0.04793097 0.06309173 0.00383381 0.05290581 -0.08663704
-0.0557266 -0.04753168 0.05469995 0.03938609 -0.07055417 0.03451018
0.08698063 0.06349916 0.05222734 -0.06417726 0.08591545 -0.03372167
-0.04212867 -0.07595556 -0.06894468 -0.04489912 -0.06312299 0.05461016
0.07891735 -0.06174817 -0.04056623 -0.03440571 -0.0170647 -0.04761111
0.03354085 0.09936614 0.08427736 0.07575633 0.04813368 0.04192816
0.05055092 -0.04907617 -0.07640123 -0.06295482 -0.0681948 0.08776137
-0.06388512 -0.08279042 0.05266503 -0.04543241 0.06314327 0.04884192
-0.05241648 0.04910112 -0.06750758 0.07345134 0.08590982 0.07206131
0.07744514 0.06920582 -0.07456467 -0.0846033 0.04707657 -0.075852
0.06855246 0.05725192 -0.08208453 0.02843974 -0.04571832 0.06862565
-0.0751134 -0.05559238 -0.03920883 0.09439072 0.07039765 0.04190402
-0.01932508 0.06382774 0.04579802 0.08365129 0.03688437 0.06809138]
...

You can also optimize training further by using hierarchical softmax…

Conclusion:

In the last blog and this one, we covered the two main methods used in Word2Vec: Skip-Gram and CBOW. In the upcoming blogs I will talk about other text vectorization methods. If you find anything explained above hard to comprehend, feel free to comment below.

Happy Learning!
