Word2Vec in Natural Language Processing
As discussed in the previous topics, the Bag of Words (BOW) and TF-IDF approaches do not store semantic information. BOW gives equal weight to every word in the corpus, whereas TF-IDF gives more importance to uncommon words.
Semantic information means that the order of the words in a sentence, and the relations between them, are important. For example, in the sentence “He is going to college”, the ordering of the words matters.
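To make the point about lost word order concrete, here is a minimal sketch using only the Python standard library. The helper name bag_of_words and the shuffled sentence are my own illustrations, not part of any library.

```python
from collections import Counter


def bag_of_words(sentence: str) -> Counter:
    """Count each lowercase token; word positions are discarded."""
    return Counter(sentence.lower().split())


original = "He is going to college"
shuffled = "College is going to he"  # same words, different (meaningless) order

# Both sentences produce exactly the same bag of words,
# so a BOW model cannot tell them apart.
print(bag_of_words(original))
print(bag_of_words(original) == bag_of_words(shuffled))
```

Since both sentences map to the same counts, anything that depends on word order is invisible to a pure BOW representation.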
There is also a chance of overfitting with BOW and TF-IDF.
The solution to both of the above issues is Word2Vec.
Introduction & Working of Word2Vec
1. In Word2Vec, each word is represented as a vector of 32 or more dimensions instead of a single number.
2. In Word2Vec, the semantic information and the relations between different words are also preserved.
Visual Representation of Word2Vec:
In the picture above, we have tried to show words in 3 dimensions. Since the words “newspaper” and “magazine” are related, their vectors lie close together (the distance between the two vectors is very small), whereas the word “biking”, which is not related to the other two, has a completely different vector representation. So we can say that semantic information and a relation exist between the words “newspaper” and “magazine”.
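The idea of "nearby" vectors can be sketched with cosine similarity, which is a common way to compare word vectors. The three 3-dimensional vectors below are hypothetical numbers I made up for illustration; they are not taken from a trained model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 3-D word vectors, invented for illustration only.
vectors = {
    "newspaper": [0.9, 0.8, 0.1],
    "magazine": [0.8, 0.9, 0.2],
    "biking": [0.1, 0.2, 0.9],
}

related = cosine_similarity(vectors["newspaper"], vectors["magazine"])
unrelated = cosine_similarity(vectors["newspaper"], vectors["biking"])

print(f"newspaper vs magazine: {related:.3f}")
print(f"newspaper vs biking:   {unrelated:.3f}")
```

With these made-up numbers, the related pair scores much higher than the unrelated pair, which is the behaviour the picture is meant to show.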
Steps to create Word2Vec
Step 1: Tokenize the sentences in the corpus.
Step 2: Create histograms.
Step 3: Take the most frequent words from the corpus.
Step 4: Create a matrix with all the unique words. It also shows the relations between the words based on their occurrences.
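The four steps above can be sketched in plain Python as follows. This is only an illustration of the steps as listed, on a hypothetical mini-corpus, with a window size and vocabulary size that I chose arbitrarily. Note that Gensim's Word2Vec itself learns its vectors by training a shallow neural network (CBOW or skip-gram) rather than by storing a count matrix like this one.

```python
import re
from collections import Counter, defaultdict

# Hypothetical mini-corpus, invented for illustration.
corpus = [
    "I read the newspaper every morning",
    "I read a magazine every evening",
    "She goes biking every weekend",
]

# Step 1: tokenize each sentence into lowercase words.
tokenized = [re.findall(r"[a-z]+", sentence.lower()) for sentence in corpus]

# Step 2: build a histogram (word -> number of occurrences).
histogram = Counter(word for sentence in tokenized for word in sentence)

# Step 3: keep the most frequent words (8 is an arbitrary choice).
vocab = [word for word, _ in histogram.most_common(8)]
vocab_set = set(vocab)

# Step 4: count how often two vocabulary words appear near each other
# (within 2 positions, also an arbitrary choice).
window = 2
cooc = defaultdict(Counter)
for sentence in tokenized:
    for i, word in enumerate(sentence):
        if word not in vocab_set:
            continue
        start, end = max(0, i - window), min(len(sentence), i + window + 1)
        for j in range(start, end):
            if j != i and sentence[j] in vocab_set:
                cooc[word][sentence[j]] += 1

for word in vocab:
    print(word, dict(cooc[word]))
```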
Let's do it with Python and the Gensim library.
Import all the required libraries:
Clean the paragraph above:
Sentences will look like this:
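Here is a minimal sketch of the cleaning step and the resulting sentences, using only the standard re module. The paragraph below is a short stand-in that I wrote for illustration (the paragraph used in the original run is not reproduced here), and the helper names clean_text and split_sentences are hypothetical.

```python
import re

# Stand-in paragraph, invented for illustration.
paragraph = """Freedom is the power to act, speak, or think as one wants [1].
Liberty and   independence are closely related ideas!"""


def clean_text(text: str) -> str:
    """Lowercase the text and strip everything except letters and sentence marks."""
    text = re.sub(r"\[[0-9]*\]", " ", text)   # drop reference markers such as [1]
    text = text.lower()
    text = re.sub(r"[^a-z.!?\s]", " ", text)  # keep letters and sentence punctuation
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip()


def split_sentences(text: str) -> list:
    """Split cleaned text into sentences on '.', '!' and '?'."""
    return [part.strip() for part in re.split(r"[.!?]+", text) if part.strip()]


sentences = split_sentences(clean_text(paragraph))
for sentence in sentences:
    print(sentence)
# freedom is the power to act speak or think as one wants
# liberty and independence are closely related ideas
```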
Use a tokenizer to split the sentences into words, and use stopwords to remove frequently occurring unwanted words (like the, he, she, we, is, etc.):
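A minimal sketch of this step, assuming the sentences are already cleaned and lowercased. To keep it self-contained I use a simple whitespace split and a small hand-written stopword set; both are my own simplifications. In practice a library such as NLTK (word_tokenize and its English stopword list) is a common choice, and its list is much longer.

```python
# Hypothetical cleaned sentences, invented for illustration.
sentences = [
    "freedom is the power to act speak or think as one wants",
    "he said that liberty and independence are related ideas",
]

# Small hand-written stopword set (an assumption for this sketch).
STOPWORDS = {"the", "he", "she", "we", "is", "are", "to", "or", "as", "and", "that", "one"}


def tokenize(sentence: str) -> list:
    """Split on whitespace and drop stopwords."""
    return [word for word in sentence.split() if word not in STOPWORDS]


tokenized = [tokenize(sentence) for sentence in sentences]
for tokens in tokenized:
    print(tokens)
# ['freedom', 'power', 'act', 'speak', 'think', 'wants']
# ['said', 'liberty', 'independence', 'related', 'ideas']
```

The result is a list of token lists, which is the input shape Gensim's Word2Vec expects.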
Finally, use Word2Vec:
Vector representation of each word. Each word, here “freedom”, is represented in 100 dimensions.
Just to show you all 100 dimension values of the word “freedom”:
array([ 1.1273683e-03, -3.3408357e-03, 3.1538198e-03, 1.6635142e-03,
-1.5097707e-04, -2.3732651e-03, -5.5288838e-04, 4.2790128e-03,
-1.6280514e-03, -2.5043052e-03, -1.2443076e-03, 3.7798416e-03,
2.4867388e-03, 4.6275570e-03, -2.0362085e-03, -1.1484585e-03,
-3.0532812e-03, 1.7743394e-03, -1.1969920e-03, 1.6191329e-03,
3.2258648e-03, -1.5515186e-03, -3.7306850e-04, 4.0565613e-03,
-4.5308433e-03, 2.7869337e-03, -2.6286333e-03, -1.4752239e-03,
-3.0462523e-03, -7.1018201e-04, 4.0662824e-03, 2.4954581e-03,
-4.1038552e-03, -2.8832494e-03, -2.1366167e-03, -4.3516876e-03,
-1.2155144e-03, 4.9223285e-03, 1.2021879e-03, 1.9537236e-03,
2.6177356e-03, 3.5373569e-03, -4.1266498e-03, -7.0183648e-04,
-3.5120137e-03, 1.4333301e-03, -2.7147203e-03, 1.5479618e-03,
3.6891426e-03, 3.7910854e-03, -1.3579437e-04, -3.6631080e-03,
5.8001833e-04, 1.0410204e-03, 3.0223157e-03, 1.0503514e-03,
-4.8348093e-03, -7.5404608e-04, -2.5279538e-03, 4.6469667e-03,
3.5378032e-03, 4.7412640e-03, -2.0815984e-03, -4.1108266e-03,
-4.5497515e-03, -2.0349291e-03, 4.8185606e-03, -3.5920267e-03,
2.0674071e-03, 1.9790779e-03, -3.9039373e-03, 3.4050874e-03,
3.8651349e-03, 3.6706368e-03, 4.2692507e-03, -3.9807847e-03,
7.2977535e-05, 2.1913229e-03, 2.3057887e-03, -1.3587050e-04,
-4.7944724e-03, 1.2130835e-03, -1.8126203e-03, -2.1072873e-03,
-2.2353262e-03, -2.9427181e-03, 6.3250802e-04, 5.5979716e-04,
3.3508011e-03, -9.0776308e-04, -4.8847585e-03, 1.9552025e-03,
-2.2549990e-03, -4.4488683e-03, 1.2665773e-03, -1.4139886e-03,
1.4697905e-03, -4.8091747e-03, 3.0768490e-03, 6.0989073e-04],
Finding words similar to the word “freedom”:
So we can see that Word2Vec gives us semantic and relationship information, and that we are able to find related words, in this example for the word “freedom”.
Conclusion: Word2Vec is very useful for finding semantic and relationship information between words, and it is heavily used in AI and NLP applications such as survey response analysis, comment analysis, recommendation engines, and more.
Please write your queries & comments and share your feedback.