Skipgram implementation from scratch — PyTorch

May 23, 2023

Word2Vec (Source: https://monkeylearn.com/blog/word-embeddings-transform-text-numbers/)

Use cases for Natural Language Processing have grown rapidly in recent years, and word embeddings have revolutionized the way we understand and represent language in computational models. Among the various techniques used for word embeddings, skip-gram has emerged as a powerful approach for capturing rich semantic relationships between words. By leveraging the context in which words appear within a corpus, skip-gram enables us to create dense and meaningful vector representations that capture the nuances of word semantics. In this article, we will explore some basics of the skip-gram method and then implement it from scratch using PyTorch in Python.

Before jumping into the implementation, let's go over a few concepts that will make the rest of the article easier to follow.

Cross-Entropy Loss

In the realm of natural language processing and word embedding, cross-entropy loss plays a crucial role in training skip-gram models. Cross-entropy loss is a measure used to quantify the dissimilarity between the predicted probabilities and the true distribution of words in a given context.

Cross-Entropy loss penalizes the model for poor prediction as it calculates the negative log of the predicted probability of the observed class. To understand this better, let’s plot a Cross-Entropy loss graph for a list of predicted probabilities.

#import the libraries
import numpy as np
import matplotlib.pyplot as plt

#create a list of 100 prob between 0.01 and 0.99
predicted_probs = np.linspace(0.01, 0.99, 100)

#calculate the cross entropy for each probability
cross_entropy_loss = -np.log(predicted_probs) #ce takes the predicted probability of the observed class.

#plot the curve cross entropy curve
plt.plot(predicted_probs, cross_entropy_loss)
plt.xlabel('Predicted Probability for Correct Class')
plt.ylabel('Cross-Entropy Loss')
plt.title('Cross-Entropy Loss vs. Predicted Probability')
plt.grid(True)
plt.show()

As the above graph shows, the loss is nearly 0 for good predictions, but it explodes as the predicted probability of the correct class approaches 0.

One more point to note is that Cross-Entropy loss is equivalent to a combination of LogSoftmax and Negative Log Likelihood loss (This is implemented below).

#Import the libraries
import torch
import torch.nn as nn
import torch.nn.functional as F

# Define output logits and target labels
output_logits = torch.tensor([[1.2, 0.5, 2.1]])
target_labels = torch.tensor([2])

# Applying CrossEntropyLoss
criterion = nn.CrossEntropyLoss()
loss_ce = criterion(output_logits, target_labels)

# Applying Softmax and NLLLoss separately
softmax_output = F.softmax(output_logits, dim=1)
loss_nll = F.nll_loss(torch.log(softmax_output), target_labels)

print("Cross Entropy Loss: {}".format(loss_ce.item()))
print("LogSoftmax + Negative Log Likelihood Loss: {}".format(loss_nll.item()))

As we can see, both the losses are equal.
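
To double-check the numbers above, here is a small manual sketch (reusing the same logits) that computes the loss directly as the negative log of the softmax probability assigned to the target class; it should print the same value as both PyTorch losses.

#manual cross-entropy check (illustrative, reusing the logits from above)
import numpy as np

logits = np.array([1.2, 0.5, 2.1])
target = 2

softmax = np.exp(logits) / np.exp(logits).sum()   #convert logits to probabilities
manual_ce = -np.log(softmax[target])              #negative log prob of the target class

print("Manual Cross-Entropy Loss: {}".format(manual_ce))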

Understanding Skipgram Model

Now, let’s spend some time understanding the intuition behind the skip-gram model.

As we all know, the skip-gram model uses the center word in a sequence to predict the surrounding words, also known as the context words. So, let's see how we create the context and the target words.

It might not be difficult to recollect instances where you had to read a complete sentence to grasp the meaning of a particular word in it. The same intuition underlies the skip-gram and CBOW models. Now, let's understand how we create the training dataset for skip-gram.

Let’s take a sentence, for example.

“The quick brown fox jumps over a lazy dog”

As a starting point, we define the number of context words (2 in this example) to be taken before and after the center word that the skip-gram model will predict from. Then, we slide the window to the next center word until we reach the end of the sequence. This happens as shown below.

Centre and the surrounding words

In the above figure, the word colored in blue is the center word which will be used to predict the surrounding or the context words highlighted in red. Here, we have a total of 5 sets of center and surrounding words.

Now, to create the training data, we pair the center word with each of its context words for every set of center and surrounding words. Considering only the first two sets, we get the data below.

Training Data

The above data is ready and can be fed into the model for training.
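
To make this concrete, here is a small illustrative sketch (separate from the training pipeline we build later) that generates the (center, context) pairs for the example sentence with 2 context words on each side:

#illustrative only: build (center, context) pairs for the example sentence
sentence = "The quick brown fox jumps over a lazy dog".lower().split()
window = 2   #number of context words on each side of the center word

pairs = []
for i in range(window, len(sentence) - window):   #5 center positions
    center = sentence[i]
    context = sentence[i - window:i] + sentence[i + 1:i + window + 1]
    pairs.extend((center, c) for c in context)

#pairs for the first two center words, "brown" and "fox"
print(pairs[:8])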

Skipgram Implementation in PyTorch

As we move to the skip-gram implementation, the rest of the blog will focus on the various helper functions defined to train the model, and on visualizing the model and its training in TensorBoard.

To begin with, let us define some helper functions for processing and tokenizing the text, as well as for creating the vocabulary and the word indices used to convert a sequence of text into a sequence of numbers. Let's look at the helper functions one by one.

Note: The functions have been named so that they are largely self-explanatory.

#model config
def model_config(model):
    """
    This function contains the model configuration.
    """
    model_config = {
        "model_name": model,
        "train_batch_size": 2,
        "val_batch_size": 2,
        "shuffle": True,
        "learning_rate": 5e-4,
        "epochs": 500,
        "train_steps": 1,
        "val_steps": 1,
        "checkpoint_frequency": 1,
        "model_dir": "weights/{}".format(model),
    }
    return model_config


def get_model(model_name):
    """
    This function returns the class of the model.
    """
    if model_name == "cbow":
        return CBOW_Model
    elif model_name == "skipgram":
        return SkipGram_Model
    else:
        raise ValueError("Choose from either cbow or skipgram")

The above function simply defines the configs of the model. This includes the model name, batch size, learning rate, epochs, etc.

#nltk is used for tokenization (make sure the required tokenizer data, e.g. 'punkt', is downloaded)
import nltk

#creating the vocab
def tokenizer(text, min_occurance):
    """
    This function simply tokenizes the sentence into lowercased tokens.

    min_occurance: used to filter out low-occurring words
    """
    tokens = nltk.word_tokenize(text.lower())
    vocab = []
    for token in tokens:
        if tokens.count(token) > min_occurance:
            vocab.append(token)
    return vocab

#creating the word_index
def word_indices(text, special_tokens_list=None):
    """
    This function returns the word indices and the unique words in the text, i.e. the vocab
    of the provided text. It converts the text into a dict with tokens as the keys
    and index numbers as the values.
    """
    if special_tokens_list is None:
        tokens = list(set(tokenizer(text, 0)))
        updated_vocab = {value: index for index, value in enumerate(sorted(tokens))}   #token -> index
        return updated_vocab, tokens
    else:
        tokens = list(set(tokenizer(text, 0))) + special_tokens_list
        special_vocab = {value: index for index, value in enumerate(sorted(tokens))}   #token -> index
        return special_vocab, tokens

#convert the words to numbers
def token_to_number(text, word_index):
    """
    This function converts a given text to a sequence of numbers.

    ex: This is my house
    Output: [23, 45, 3, 6]  #these are the indices of the respective words in the vocab
    """
    text_token_ids = [word_index[token] if token in word_index.keys() else word_index["<unk>"]
                      for token in tokenizer(text, 0)]
    return text_token_ids

The final function, token_to_number, converts a sequence of tokens/words into a sequence of numbers based on the index of each word in the word indices we defined. Let's run this function to see the output.

A sequence of tokens to a sequence of numbers

Here, we define the vocabulary of the model based on the text that is given as input and also create word indices in which we assign an index to each word in the vocabulary.

Note: The function also accepts a list of special tokens if required. For example, an <unk> token is assigned to any word that is not present in the vocabulary. These special tokens are added to the vocabulary and the word indices along with the other words.
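
To see how these helpers behave together, here is a small illustrative run on a toy sentence (the exact indices depend entirely on the text you pass in):

#illustrative run of the helper functions on a toy sentence
toy_text = "The quick brown fox jumps over a lazy dog"
toy_word_index, toy_vocab = word_indices(toy_text, ["<unk>"])

print(toy_word_index)
#{'<unk>': 0, 'a': 1, 'brown': 2, 'dog': 3, 'fox': 4, 'jumps': 5, 'lazy': 6, 'over': 7, 'quick': 8, 'the': 9}

print(token_to_number("the quick fox", toy_word_index))   #[9, 8, 4]
print(token_to_number("the quick cat", toy_word_index))   #[9, 8, 0] -> 'cat' falls back to <unk>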

Let’s now create the input and output for the model.

def skipgram(batch, word_index, MAX_SEQUENCE_LENGTH, skipgram_context_words):

    input_tensor, output_tensor = [], []

    for text in batch:
        text = text.strip()
        text_token_ids = token_to_number(text, word_index)

        #skip sentences that are too short to fill a full window
        if len(text_token_ids) < skipgram_context_words * 2 + 1:
            continue

        if len(text_token_ids) > MAX_SEQUENCE_LENGTH:
            text_token_ids = text_token_ids[:MAX_SEQUENCE_LENGTH]

        #slide the window over the sequence
        for ids in range(len(text_token_ids) - skipgram_context_words * 2):

            sequence = text_token_ids[ids: (ids + skipgram_context_words * 2 + 1)]

            #the middle word of the window is the model input (center word),
            #the remaining words are the outputs to be predicted
            context = sequence.pop(skipgram_context_words)
            target = sequence

            for output in target:
                output_tensor.append(output)
                input_tensor.append(context)

    target_tensor = torch.tensor(output_tensor, dtype=torch.long)
    context_tensor = torch.tensor(input_tensor, dtype=torch.long)

    return context_tensor, target_tensor

The above function takes the batch of sentences as input with other parameters and returns the input and the output tensors required by the skip-gram model.

Let’s visualize the output of the skip-gram function.

#visualise the output of the skipgram function
input_tensor, output_tensor = skipgram(
    ["dewy and cool beneath my feet, and the sky above was a beautiful shade of pale blue",
     "In the distance, I could see a small cluster of buildings, and as I got closer, I realized it was a quaint little village"],
    word_index, MAX_SEQUENCE_LENGTH, skipgram_context_words)

print("Input: {}\nOutput: {}".format(input_tensor, output_tensor))

Note: We have taken the number of surrounding words to be considered as 4.

Input and output for the skip-gram model

Since we have taken the number of surrounding words as 4, each center word in the input tensor is repeated against its 8 different surrounding words (4 on each side) in the output tensor.
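
As a quick illustrative sanity check on the tensors returned above: with 4 surrounding words on each side, every center word should appear 8 times in the input tensor, once for each of its context words.

#sanity check (illustrative)
assert len(input_tensor) == len(output_tensor)
print(input_tensor[:8])    #the first center word repeated 8 times
print(output_tensor[:8])   #the 8 words surrounding that center word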

Now, we will define a data loader that will load the data in batches which will be given as input to the above skip-gram function.


#PyTorch data utilities used below
from torch.utils.data import Dataset, DataLoader

#creating the dataloader
class word2vec_dataset(Dataset):
    def __init__(self, text):
        self.data = text
        self.word_index, self.vocab = word_indices(self.data, ["<unk>"])
        self.vocab_len = len(self.vocab)
        self.split = self.data.split("\n")

    def __getitem__(self, idx):
        data = self.split[idx]
        return data

    def __len__(self):
        return len(self.split)


def dataloader_vocab(text_doc, model, MAX_SEQUENCE_LENGTH, context_words, params, vocab=None):

    if model == "cbow":
        my_dataset = word2vec_dataset(text_doc)
        data_params = params
        dataloader = DataLoader(my_dataset,
                                collate_fn=lambda x: cbow(x, my_dataset.word_index, MAX_SEQUENCE_LENGTH, context_words),
                                **data_params)
        if not vocab:
            return dataloader, my_dataset.vocab
        else:
            return dataloader, vocab

    elif model == "skipgram":
        my_dataset = word2vec_dataset(text_doc)
        data_params = params
        dataloader = DataLoader(my_dataset,
                                collate_fn=lambda x: skipgram(x, my_dataset.word_index, MAX_SEQUENCE_LENGTH, context_words),
                                **data_params)
        if not vocab:
            return dataloader, my_dataset.vocab
        else:
            return dataloader, vocab
    else:
        raise ValueError("Choose a model between cbow and skipgram")

The above function defines a word2vec_dataset class that inherits the Dataset class of PyTorch to create a dataset. This class contains a __getitem__ function which returns the data by index after splitting the text by new lines.

Next, we define the data loader function, which takes the dataset created by word2vec_dataset and loads the data in batches based on the batch size specified in the data params provided as input. It then uses the collate function, i.e. the skip-gram function in our case, to produce the input and output tensors.

Define the model

Last, but not least, we create model classes and define the layers in each of the models. Here, we will work with the SkipGram_Model class.

EMBED_MAX_NORM = 1
EMBED_DIMENSION = 300


class CBOW_Model(nn.Module):
    """
    Implementation of CBOW model described in paper:
    https://arxiv.org/abs/1301.3781
    """
    def __init__(self, vocab_size: int):
        super(CBOW_Model, self).__init__()
        self.embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=EMBED_DIMENSION,
            max_norm=EMBED_MAX_NORM,   #keeps the embedding vectors' norms bounded
        )
        self.linear = nn.Linear(
            in_features=EMBED_DIMENSION,
            out_features=vocab_size,
        )
        #note: no softmax layer is needed here because nn.CrossEntropyLoss
        #applies log-softmax internally

    def forward(self, inputs_):
        x = self.embeddings(inputs_)
        x = x.mean(axis=1)
        x = self.linear(x)
        return x


class SkipGram_Model(nn.Module):
    """
    Implementation of Skip-Gram model described in paper:
    https://arxiv.org/abs/1301.3781
    """
    def __init__(self, vocab_size: int):
        super(SkipGram_Model, self).__init__()
        self.embeddings = nn.Embedding(
            num_embeddings=vocab_size,
            embedding_dim=EMBED_DIMENSION,
            max_norm=EMBED_MAX_NORM,   #keeps the embedding vectors' norms bounded
        )
        self.linear = nn.Linear(
            in_features=EMBED_DIMENSION,
            out_features=vocab_size,
        )

    def forward(self, inputs_):
        x = self.embeddings(inputs_)
        x = self.linear(F.relu(x))
        return x

Let’s see the different layers we have defined in our skip-gram architecture and their shapes. For that, we have to first define different parameters and load the model.

#define the model config
config = model_config('skipgram')

#define the dataloader parameters
params = {
    'batch_size': config['train_batch_size'],
    'shuffle': True,
    'num_workers': 0
}

#define the dataloader and the vocab
data_l, vocab = dataloader_vocab(text_doc, config['model_name'], MAX_SEQUENCE_LENGTH, cbow_context_words, params, None)

#define the model and the model class
model_class = get_model(config['model_name'])
model = model_class(len(vocab))
model

We get the below model output:

skip-gram model

Now, let’s understand the shape of the different layers of the model.

#get the layers and their shapes
for name, params in model.named_parameters():
    print(name, ":", params.shape)

Skipgram model layers

As we can see, we have defined 2 layers, an embedding layer, and a linear layer.

Note: 321 is the size of the vocabulary and 300 is the EMBED_DIMENSION

Now, we will load a single batch into the model. We can load a batch by running one iteration on the data loader.

#tensorboard summary writer
from torch.utils.tensorboard import SummaryWriter

#load a single batch
inputs, _ = next(iter(data_l))
print("Shape of the input: {}".format(inputs.shape))

#initiate a summary writer from the tensorboard library
tb_writer = SummaryWriter('runs/skipgram_implementation')

#add the model graph on the tensorboard
tb_writer.add_graph(model, inputs)

Input size

Now, we can run the magic command in Python to get the model graph in the tensorboard.

%load_ext tensorboard
%tensorboard --logdir runs/ --port 8000

After running the above command, we get a beautiful visualization on tensorboard. (A gif of the same is below)

Skipgram model architecture

As you can see, we have an input layer that takes the input of the above shape i.e. [272] and gives the final output of size 272 X 321. This is because, for each word, given as input to the model, we have a probability distribution equal to the size of the vocabulary.
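
To make the "probability distribution" point concrete, here is a small illustrative check; the softmax below is applied only for inspection, since during training nn.CrossEntropyLoss handles it internally.

#illustrative check: softmax turns each row of logits into a distribution over the vocab
with torch.no_grad():
    logits = model(inputs)               #shape: [272, 321]
    probs = F.softmax(logits, dim=1)

print(probs.shape)                        #torch.Size([272, 321])
print(probs.sum(dim=1)[:5])               #each row sums to (approximately) 1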

Now, without further ado, let's create the final training and validation functions to train the model.

import os
import numpy as np
import json
import torch


class Trainer:
    """Main class for model training"""

    def __init__(
        self,
        model,
        epochs,
        train_dataloader,
        train_steps,
        val_dataloader,
        val_steps,
        checkpoint_frequency,
        criterion,
        optimizer,
        device,
        model_dir,
        model_name,
    ):
        self.model = model
        self.epochs = epochs
        self.train_dataloader = train_dataloader
        self.train_steps = train_steps
        self.val_dataloader = val_dataloader
        self.val_steps = val_steps
        self.criterion = criterion
        self.optimizer = optimizer
        self.checkpoint_frequency = checkpoint_frequency
        self.device = device
        self.model_dir = model_dir
        self.model_name = model_name

        self.loss = {"train": [], "val": []}
        self.model.to(self.device)

    def train(self):
        for epoch in range(self.epochs):
            self._train_epoch(epoch)
            self._validate_epoch(epoch)
            print(
                "Epoch: {}/{}, Train Loss={:.5f}, Val Loss={:.5f}".format(
                    epoch + 1,
                    self.epochs,
                    self.loss["train"][-1],
                    self.loss["val"][-1],
                )
            )

            if self.checkpoint_frequency:
                self._save_checkpoint(epoch)

    def _train_epoch(self, epoch):
        self.model.train()
        running_loss = []

        for i, batch_data in enumerate(self.train_dataloader, 1):
            inputs = batch_data[0].to(self.device)
            labels = batch_data[1].to(self.device)
            outputs = self.model(inputs)
            loss = self.criterion(outputs, labels)
            loss.backward()
            self.optimizer.step()
            self.optimizer.zero_grad()

            running_loss.append(loss.item())

            if i == self.train_steps:
                break

        epoch_loss = np.mean(running_loss)
        tb_writer.add_scalar('Training loss', epoch_loss, epoch)
        self.loss["train"].append(epoch_loss)

    def _validate_epoch(self, epoch):
        self.model.eval()
        running_loss = []

        with torch.no_grad():
            for i, batch_data in enumerate(self.val_dataloader, 1):
                inputs = batch_data[0].to(self.device)
                labels = batch_data[1].to(self.device)

                outputs = self.model(inputs)
                loss = self.criterion(outputs, labels)

                running_loss.append(loss.item())
                if i == self.val_steps:
                    break

        epoch_loss = np.mean(running_loss)
        tb_writer.add_scalar('Validation loss', epoch_loss, epoch)
        self.loss["val"].append(epoch_loss)

    def _save_checkpoint(self, epoch):
        """Save model checkpoint to `self.model_dir` directory"""
        epoch_num = epoch + 1
        if epoch_num % self.checkpoint_frequency == 0:
            model_path = "checkpoint_{}.pt".format(str(epoch_num).zfill(3))
            model_path = os.path.join(self.model_dir, model_path)
            torch.save(self.model, model_path)

    def save_model(self):
        """Save final model to `self.model_dir` directory"""
        model_path = os.path.join(self.model_dir, "model.pt")
        torch.save(self.model, model_path)

    def save_loss(self):
        """Save train/val loss as json file to `self.model_dir` directory"""
        loss_path = os.path.join(self.model_dir, "loss.json")
        with open(loss_path, "w") as fp:
            json.dump(self.loss, fp)

Note: We have also added the summary writer in the training function to log both the training and validation loss which will be visualized once the training is completed.

import os
import torch
import torch.nn as nn
import torch.optim as optim


def train(config):

    os.makedirs(config["model_dir"])

    train_params = {
        'batch_size': config['train_batch_size'],
        'shuffle': True,
        'num_workers': 0
    }

    val_params = {
        'batch_size': config['val_batch_size'],
        'shuffle': True,
        'num_workers': 0
    }

    #dataloaders
    train_dataloader, vocab = dataloader_vocab(text_doc, config['model_name'], MAX_SEQUENCE_LENGTH, cbow_context_words, train_params, None)
    val_dataloader, vocab = dataloader_vocab(text_doc, config['model_name'], MAX_SEQUENCE_LENGTH, cbow_context_words, val_params, vocab)

    vocab_size = len(vocab)
    print(f"Vocabulary size: {vocab_size}")

    model_class = get_model(config['model_name'])
    model = model_class(vocab_size=vocab_size)
    criterion = nn.CrossEntropyLoss()

    optimizer = optim.Adam(model.parameters(), lr=config['learning_rate'])

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    trainer = Trainer(
        model=model,
        epochs=config['epochs'],
        train_dataloader=train_dataloader,
        train_steps=config["train_steps"],
        val_dataloader=val_dataloader,
        val_steps=config["val_steps"],
        criterion=criterion,
        optimizer=optimizer,
        checkpoint_frequency=config["checkpoint_frequency"],
        device=device,
        model_dir=config["model_dir"],
        model_name=config["model_name"],
    )

    trainer.train()
    print("Training finished.")
    print("Model artifacts saved to folder:", config["model_dir"])

    return model, vocab_size

Once we have defined the final training and validation functions, we can run the above function to train our skip-gram model.

#setting the config
config_model = model_config("skipgram")

#train the model and save the artifacts
model, vocab_size = train(config_model)
tb_writer.close()

Finally, let’s visualize our training and validation loss that we logged during training on the tensorboard summary writer (run the same tensorboard command).

Training Loss
Validation loss

As we can see, with every additional epoch, both the training loss and validation loss are decreasing which tells us that we have successfully trained our skipgram model from scratch using PyTorch in Python.
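
It is also worth noting where the learned embeddings end up: they are simply the weights of the model's embedding layer. Below is a minimal illustrative sketch that pulls them out and compares two words with cosine similarity; it assumes the word_index mapping built earlier is still in scope, and the two words used are hypothetical examples that would need to exist in your corpus.

#illustrative: extract the trained embeddings and compare two words
embeddings = model.embeddings.weight.detach().cpu()   #shape: [vocab_size, EMBED_DIMENSION]

def word_vector(word):
    #fall back to the <unk> vector for out-of-vocabulary words
    return embeddings[word_index.get(word, word_index["<unk>"])]

vec_a, vec_b = word_vector("sky"), word_vector("blue")   #hypothetical words
cosine = F.cosine_similarity(vec_a.unsqueeze(0), vec_b.unsqueeze(0))
print("Cosine similarity: {}".format(cosine.item()))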

This brings us to the end of this blog. I have tried to be as detailed as possible to make it easy to follow. Still, if you would like to go through the full code, including how the CBOW model is implemented, feel free to visit the Git repo linked below.

Repo: https://github.com/mayank1903/word2vec_implementation_from_scratch

Also, you can read more of my articles/blogs. Some of the links are added below:

  1. Anomaly detection in Images — Using feature descriptors: https://medium.com/farmart-blog/anomaly-detection-in-images-using-feature-descriptors-c7124669bdfb
  2. Profiling — Key to Python code optimization: https://medium.com/farmart-blog/profiling-key-to-python-code-optimisation-ff635679d185

I will be back with more such interesting blogs and to help the community at large, until then,

Happy Reading ;)
