How to train a word2vec model using the gensim library

Pushpendu Das · Published in The Startup · Jun 29, 2020 · 8 min read

What is word embedding?

Word embedding is a way of representing text so that machine learning algorithms can understand it. More precisely, a word embedding is a learned representation of text in which each word is mapped to a vector that captures its context in a document, its semantic and syntactic similarity to other words, and its relations to the rest of the words in the corpus.

This approach holds the key to solving natural language processing problems with machine learning and deep learning algorithms. Essentially, a word embedding is a vector representation of words: by measuring the distance or position between vectors, we can compare the meaning of words, sentences, paragraphs or documents.

Several word embedding techniques have been introduced so far. Generally speaking, they fall into two categories -

  • Static Word Embedding
  • Contextual Word Embedding

1. Static Word Embedding: Traditional methods such as Skip-Gram and Continuous Bag-of-Words learn static embeddings by training lookup tables that translate words into dense vectors. Static embeddings are directly useful for solving lexical semantics tasks.

Problem: They can't resolve ambiguities for polysemous words. When a word carries different meanings in two different sentences, Skip-Gram and CBOW methods fail to distinguish them.

2. Contextual Word Embedding: Contextualised word embeddings aim to capture word semantics in different contexts, addressing polysemy and the context-dependent nature of words. Models such as LSTMs and Bi-directional LSTMs are used to obtain the vector form of words.

Word2Vec Model: Word2Vec is a method for constructing such a static word embedding. It is achieved using two methods, Skip-Gram and Continuous Bag of Words (CBOW), with the help of neural networks. It was developed by Tomas Mikolov at Google in 2013.

Why do we need them?

Let’s say we have below sentences.

“I love coding in Jupyter.” and “I enjoy coding in Pycharm”.

Both sentences are very close to each other. If we build an exhaustive vocabulary (let's call it V), it will be V = {I, love, enjoy, coding, in, Jupyter, Pycharm}. If we go with one-hot encoding, we won't capture the actual meaning: love, enjoy and coding are each treated as equally different from one another, even though love and enjoy are very close in meaning.
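To make that limitation concrete, here is a minimal sketch of one-hot encoding over this toy vocabulary (the ordering of the vocabulary is an arbitrary assumption, purely for illustration):

# One-hot encoding of the toy vocabulary V (purely illustrative)
vocab = ["I", "love", "enjoy", "coding", "in", "Jupyter", "Pycharm"]
one_hot = {w: [1 if i == j else 0 for j in range(len(vocab))] for i, w in enumerate(vocab)}
print(one_hot["love"])   # [0, 1, 0, 0, 0, 0, 0]
print(one_hot["enjoy"])  # [0, 0, 1, 0, 0, 0, 0]
# Every pair of distinct one-hot vectors is equally far apart, so "love" and "enjoy"
# look no more related than "love" and "Pycharm".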

Our goal is for words with similar context to occupy close spatial positions. Mathematically, the cosine distance identifies how close two words are: for the most similar words the cosine distance will be close to 0 (the cosine of the angle between their vectors close to 1), and for the most dissimilar words it will be close to 1.
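As a quick illustration of cosine distance, here is a minimal sketch with made-up three-dimensional vectors (the numbers are assumptions, not real embeddings):

# Cosine distance between two illustrative word vectors
import numpy as np

def cosine_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v_love, v_enjoy, v_truck = [0.9, 0.1, 0.3], [0.85, 0.15, 0.25], [0.1, 0.9, 0.0]
print(cosine_distance(v_love, v_enjoy))  # small value -> similar words
print(cosine_distance(v_love, v_truck))  # larger value -> dissimilar words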

How does Word2Vec work?

The Word2Vec model can be trained with either of two algorithms -

  • Skip-gram
  • CBOW (Continuous Bag of Words)

CBOW Model: This method takes the context of each word as the input and tries to predict the word corresponding to that context. In the process of predicting the target word, we learn its vector representation. The input consists of multiple words, as determined by the window size (e.g. 5 words), but the output is a single word.

Architecturally, CBOW feeds the context words into the input layer, averages their projections in the hidden layer, and predicts the target word at the output layer.

Skip-Gram model: This is the other algorithm for Word2Vec. It is just the opposite of the CBOW model: the model takes one word as input, but returns multiple context words, as determined by the window size.
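To make the two directions concrete, here is a minimal sketch of how (input, output) training pairs could be formed from a toy sentence for each algorithm (the helper names and window size are illustrative assumptions, not gensim internals):

# Illustrative (input -> output) training pair generation
sentence = ["i", "love", "coding", "in", "jupyter"]
window = 2

def skipgram_pairs(tokens, window):
    # one centre word in, one context word out (one pair per context word)
    pairs = []
    for i, centre in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((centre, tokens[j]))
    return pairs

def cbow_pairs(tokens, window):
    # multiple context words in, one centre word out
    pairs = []
    for i, centre in enumerate(tokens):
        context = [tokens[j] for j in range(max(0, i - window), min(len(tokens), i + window + 1)) if j != i]
        pairs.append((context, centre))
    return pairs

print(skipgram_pairs(sentence, window)[:4])  # [('i', 'love'), ('i', 'coding'), ('love', 'i'), ...]
print(cbow_pairs(sentence, window)[:2])      # [(['love', 'coding'], 'i'), (['i', 'coding', 'in'], 'love')]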

Implementation of word Embedding with Gensim Word2Vec Model:

Here, I will explain step by step how to train a word2vec model using Gensim. I collected the dataset from the Kaggle platform; it comes from the Myers-Briggs Type Indicator public data. It contains two columns, type and posts. "type" defines 16 personality types and "posts" contains comments from individuals of those 16 personality types.

1. Data Loading and Data Description

# Loading the dataset
import pandas as pd

data_df = pd.read_csv("data/mbti_1.csv")
data_df.head()

2. Data Cleaning and Pre-processing

After loading the data, we need to check whether it contains any NA values; if there are NA values in the dataset, we will drop them.

# Removing NA values from the dataframe
def data_na_value_cleaning(data):
    print("\nBefore cleaning, Data Shape : ", data.shape)
    print("\nBefore removing Null values: ----------------")
    print(data.isna().sum())

    data.dropna(inplace=True)
    data.reset_index(inplace=True, drop=True)

    print("After removing Null values: ----------------")
    print(data.isna().sum())
    print("\nAfter cleaning, Data Shape : ", data.shape)

    return data
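The post does not show the call itself, so here is how the function would presumably be applied to the loaded dataframe (a sketch, reusing the data_df variable from the loading step):

# Applying the NA-value cleaning to the loaded dataframe
data_df = data_na_value_cleaning(data_df)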

As our data does not have any NA values, no rows were removed. Now we check whether there are any duplicate values in the dataset.

# Removing duplicate values
def duplicate_content_removal(data, col, ini_row):
    print("\nBefore removing duplicates, number of data was : ", ini_row)
    duplicate_count = data[col].duplicated().sum()
    print("\nNumber of Duplicates: ", duplicate_count)

    description_data = data[col].drop_duplicates()
    cleaned_row = len(description_data)

    if (ini_row - cleaned_row) > 0:
        print("\nTotal data reduction : ", (ini_row - cleaned_row))
        print("\nAfter removing duplicates, number of data is :", cleaned_row)
    else:
        print("\nDataset doesn't contain any duplicate data.")

    return list(description_data)

posts = duplicate_content_removal(data_df, 'posts', data_df.shape[0])

Now, as part of the cleaning process, we will remove links and punctuation. For better training we will also keep only alphabetic words, discarding numbers and alphanumeric tokens. We will remove stopwords for better model understanding and accuracy.

import re

def remove_link_punc(string):
    # removing links
    temp_string = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', ' ', string)

    # removing everything except a-z English letters
    regex = re.compile('[^a-zA-Z]')
    temp_string = regex.sub(' ', temp_string)

    # removing extra spaces
    clean_string = re.sub(' +', ' ', temp_string).lower()

    return clean_string
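A quick illustration of what the helper does on a made-up string (the example text is an assumption, only there to show the effect):

# Links, digits and punctuation are replaced by spaces, and the text is lower-cased
print(remove_link_punc("Loving Python3 coding!! See https://example.com :)"))
# prints something like: 'loving python coding see '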

Next, the data_cleaning method:

# Imports and helpers needed below (the original post does not show these;
# NLTK's English stopword list and WordNetLemmatizer are assumed)
from tqdm import tqdm
from nltk.tokenize import sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def data_cleaning(content):
    sentences = []
    for idx in tqdm(range(len(content))):
        if content[idx] != "":
            # Sentence tokenization using the NLTK library
            for each_sent in sent_tokenize(str(content[idx])):
                if each_sent != "":
                    temp_sent = []
                    # Removing links and punctuation
                    each_sent = remove_link_punc(each_sent.lower())

                    # Removing stopwords and applying lemmatization
                    for each_word in each_sent.split():
                        if each_word not in stop_words and len(each_word) >= 3:
                            temp_sent.append(lemmatizer.lemmatize(each_word))

                    # Only keeping sentences whose word list has at least 5 tokens
                    if len(temp_sent) >= 5:
                        sentences.append(temp_sent)

    return sentences

sent_corpus = data_cleaning(posts)

Let's look at some statistics about the number of words in each sentence.

# Sentence word-count stats
from collections import Counter

len_count = []
for l in sent_corpus:
    len_count.append(len(l))

print("Total number of Sentences : ", len(len_count))
word_sent_df = pd.DataFrame(sorted(Counter(len_count).items()), columns=["No of Words in each Sentence", "No of sentences"])
word_sent_df.head(10)

After cleaning and preprocessing, the data looks like a list of words for each sentence, where each inner list represents one sentence. Gensim's word2vec requires this 'list of lists' format for training, where every document is contained in a list and every list contains the list of tokens of that document.
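For illustration, the prepared corpus looks roughly like this (the tokens shown here are made up, not actual rows of the MBTI data):

# A peek at the prepared corpus; each inner list is one tokenized, cleaned sentence
sent_corpus[:2]
# e.g. [['love', 'coding', 'jupyter', 'notebook', 'everyday'],
#       ['enjoy', 'reading', 'personality', 'type', 'discussion']]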

3. Model Training

Now we will train the word2vec model using the Gensim library on our own dataset.

# Importing and training the Word2Vec model
from gensim.models import Word2Vec

model = Word2Vec(sentences=sent_corpus, size=200, window=4, min_count=1, workers=4)

sentences : where we pass our prepared dataset, which is sent_corpus

size : dimension of the generated vector form of each word; by default the size is 100

window : maximum distance between the current and predicted word within a sentence; the default value is 5

min_count : ignores all words whose frequency is less than min_count; the default value is 5. As we want to include every word in the corpus, the value we provide is 1.

workers : number of worker threads used to train the model; the default value is 3

sg : used to choose the training algorithm: 1 for skip-gram, 0 for CBOW. By default CBOW is used for training; a skip-gram variant is sketched below.
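If we wanted skip-gram instead of the default CBOW, the call would presumably look like this (a sketch; only the sg flag changes, the other values mirror the call above):

# Hypothetical skip-gram variant of the training call above
sg_model = Word2Vec(sentences=sent_corpus, size=200, window=4, min_count=1, workers=4, sg=1)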

On my system it took around 38 seconds for 395,702 sentences. Training time depends on your system's specifications.

4. Vector form of a word

After model training is complete, we can obtain the vectorized form of each word. There are two methods to get the vector form. The shape of the vector will be 200, as per the given size.

# Get the vector form of the word king
model.wv.get_vector('king')

Or

# Another way to get the vector form of a word
model.wv.word_vec('king')
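As a quick sanity check (assuming 'king' survived the preprocessing and is in the vocabulary, as the examples below also assume), the vector's dimensionality should match the size parameter:

# The vector shape matches the size parameter used at training time
model.wv.get_vector('king').shape  # (200,)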

5. Similarity and distance between two words

We will provide two lists of words: 'king' and 'male' in one list, and 'queen' and 'female' in the other, and try to find out how similar the two lists of words are.

# Similarity between two lists of words
model.wv.n_similarity(['king', 'male'], ['queen', 'female'])

Here we will look at the distance between two words. Similar words have a smaller distance.

# Distance between two words
model.wv.distance('king', 'queen')
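For reference, gensim's distance is simply one minus the cosine similarity, so the two views are interchangeable:

# distance(w1, w2) is 1 - similarity(w1, w2)
print(model.wv.similarity('king', 'queen'))
print(1 - model.wv.distance('king', 'queen'))  # same value, up to floating-point rounding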

If we want to get the words most similar to a given word, we can use the code below.

# Words similar to king
model.wv.similar_by_word('king')

6. Other techniques using Word2Vec

# Odd one out from a list of words
model.wv.doesnt_match(["king", "george", "stephen", "truck"])

This helps us find the odd word out in a list.

# Word pairs evaluation
model.wv.evaluate_word_pairs('data/SimLex-999/SimLex-999_2.txt')

This helps us evaluate the model on word pairs from the SimLex-999 similarity benchmark.

# Word analogies from a list of words
model.wv.evaluate_word_analogies('data/questions-words.txt')

This is useful for evaluating word analogies (e.g. king : man :: queen : woman).

7. Loading a pre-trained model

If we want to use a pre-trained model, that can be achieved using the code below.

# Loading the Google pre-trained model
from gensim import models

models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)

Here I loaded the Google pre-trained model, which can be downloaded from the link. This model is also very powerful and was trained on a huge dataset.
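To actually query the loaded vectors, the returned object needs to be kept in a variable; a minimal sketch (the google_model name is mine, and it assumes the same file path as above):

# Keeping the loaded vectors and querying them
google_model = models.KeyedVectors.load_word2vec_format('data/GoogleNews-vectors-negative300.bin', binary=True)
google_model.most_similar('king', topn=5)  # five nearest neighbours of 'king' in the Google News vectors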

For more details about the implementation, you can have a look at my code on GitHub.

Thanks for reading. Please let me know if you have any questions or doubts; I will be happy to answer them.
