Published in Analytics Vidhya

(Image credit: mortentolboll.weebly)

Word2Vec in Natural Language Processing

Overview

In the previously discussed approaches, Bag of Words (BOW) and TF-IDF, semantic information is not stored. BOW gives equal weight to every word in the corpus, whereas TF-IDF gives more importance to uncommon words.

Semantics means that the order and relationship of words in a sentence matter. For example, in the sentence "He is going to college", the ordering of the words carries meaning.

There is also a risk of overfitting with BOW and TF-IDF.

Word2Vec addresses both of these issues.

Introduction & Working of Word2Vec

1. In Word2Vec, each word is represented as a vector of 32 or more dimensions instead of a single number.

2. In Word2Vec, the semantic information and the relationships between different words are also preserved.

Visual Representation of Word2Vec:

(Image credit: researchgate.net)

In the picture above, words are shown in three dimensions. Since the words "newspaper" and "magazine" are related, their vectors lie close together (the distance between them is very small), whereas the unrelated word "biking" has a completely different vector representation. So we can say that semantic information and a relationship exist between the words "newspaper" and "magazine".
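The idea of "nearby vectors" can be made concrete with cosine similarity. The sketch below uses made-up 3-dimensional vectors (not taken from any trained model) purely to illustrate the geometry:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: close to 1.0 means similar direction."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional vectors (illustrative values only)
newspaper = np.array([0.90, 0.80, 0.10])
magazine  = np.array([0.85, 0.75, 0.15])
biking    = np.array([0.10, 0.20, 0.90])

print(cosine_similarity(newspaper, magazine))  # high: related words
print(cosine_similarity(newspaper, biking))    # low: unrelated words
```

Word2Vec uses exactly this notion of closeness when it reports "similar" words.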

Steps to create Word2Vec

Step 1: Tokenize the sentences in the corpus.

Step 2: Create histograms of word frequencies.

Step 3: Take the most frequent words from the corpus.

Step 4: Create a matrix with all the unique words, which captures the relationships between words based on their co-occurrences.

Let's do it in Python with the Gensim library.

Import all the required libraries:

Corpus/Paragraph:
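The article's original paragraph is not shown in this extract; any multi-sentence text works. The stand-in below mentions "freedom" so the later steps match the article's example:

```python
# Stand-in corpus; the article's original paragraph is not reproduced here.
paragraph = ("Freedom is the right of every nation. We must protect freedom "
             "of thought and freedom of speech. Nations that value freedom "
             "grow stronger. People everywhere fight for their freedom.")
```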

Clean the Paragraph above:
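A minimal cleaning pass, one reasonable reading of the missing screenshot: lowercase the text, drop digits and bracketed reference markers, and collapse repeated whitespace (punctuation is kept so sentences can still be split):

```python
import re

paragraph = ("Freedom is the right of every nation. We must protect "
             "freedom of thought and freedom of speech.")  # stand-in text

text = paragraph.lower()
text = re.sub(r'\[[0-9]*\]', ' ', text)   # remove reference markers like [1]
text = re.sub(r'\d', ' ', text)           # remove digits
text = re.sub(r'\s+', ' ', text).strip()  # collapse whitespace
print(text)
```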

Sentences will look like this:

Use a tokenizer to split sentences into words, and stopwords to remove frequently occurring unwanted words (like "the", "he", "she", "we", "is", etc.):

Finally, use Word2Vec:

Vector representation of each word. Each word, here "freedom", is represented in 100 dimensions.

Just to show you all 100 dimension values of the word "freedom":

array([ 1.1273683e-03, -3.3408357e-03,  3.1538198e-03,  1.6635142e-03,
-1.5097707e-04, -2.3732651e-03, -5.5288838e-04, 4.2790128e-03,
-1.6280514e-03, -2.5043052e-03, -1.2443076e-03, 3.7798416e-03,
2.4867388e-03, 4.6275570e-03, -2.0362085e-03, -1.1484585e-03,
-3.0532812e-03, 1.7743394e-03, -1.1969920e-03, 1.6191329e-03,
3.2258648e-03, -1.5515186e-03, -3.7306850e-04, 4.0565613e-03,
-4.5308433e-03, 2.7869337e-03, -2.6286333e-03, -1.4752239e-03,
-3.0462523e-03, -7.1018201e-04, 4.0662824e-03, 2.4954581e-03,
-4.1038552e-03, -2.8832494e-03, -2.1366167e-03, -4.3516876e-03,
-1.2155144e-03, 4.9223285e-03, 1.2021879e-03, 1.9537236e-03,
2.6177356e-03, 3.5373569e-03, -4.1266498e-03, -7.0183648e-04,
-3.5120137e-03, 1.4333301e-03, -2.7147203e-03, 1.5479618e-03,
3.6891426e-03, 3.7910854e-03, -1.3579437e-04, -3.6631080e-03,
5.8001833e-04, 1.0410204e-03, 3.0223157e-03, 1.0503514e-03,
-4.8348093e-03, -7.5404608e-04, -2.5279538e-03, 4.6469667e-03,
3.5378032e-03, 4.7412640e-03, -2.0815984e-03, -4.1108266e-03,
-4.5497515e-03, -2.0349291e-03, 4.8185606e-03, -3.5920267e-03,
2.0674071e-03, 1.9790779e-03, -3.9039373e-03, 3.4050874e-03,
3.8651349e-03, 3.6706368e-03, 4.2692507e-03, -3.9807847e-03,
7.2977535e-05, 2.1913229e-03, 2.3057887e-03, -1.3587050e-04,
-4.7944724e-03, 1.2130835e-03, -1.8126203e-03, -2.1072873e-03,
-2.2353262e-03, -2.9427181e-03, 6.3250802e-04, 5.5979716e-04,
3.3508011e-03, -9.0776308e-04, -4.8847585e-03, 1.9552025e-03,
-2.2549990e-03, -4.4488683e-03, 1.2665773e-03, -1.4139886e-03,
1.4697905e-03, -4.8091747e-03, 3.0768490e-03, 6.0989073e-04],
dtype=float32)

Finding words similar to the word "freedom":

So we can see that Word2Vec captures semantic and relationship information, and we are able to find words related to a given word, in this example "freedom".

Conclusion: Word2Vec is very useful for finding semantic and relationship information between words, and it is heavily used in AI and NLP applications such as survey responses, comment analysis, recommendation engines, and more.

Please write your queries & comments and share your feedback.

Hope you like my article. Please hit Clap 👏 (50 times) to motivate me to write further.

Want to connect :

Linked In : https://www.linkedin.com/in/anjani-kumar-9b969a39/

If you like my posts here on Medium and would like me to continue this work, consider supporting me on Patreon.

TF-IDF Link

Bag of Words Link

Anjani Kumar

Data Science, ML & NLP Enthusiast