Bengali Text Visualization Using Word2Vec

Published in

Analytics Vidhya

4 min read · Sep 6, 2019

Text visualization is an important part of text analysis and text mining.
Several techniques can be used to visualize text, and one of them is plotting words by their vectors. Representing words as vectors (word embedding) is a very common approach. Its main idea is to learn a vector for each word from the words that surround it, so that words used in similar contexts get similar vectors. This helps to solve text-related problems later.

I have been working on Bangla text for some days. Showing Bengali text in a plot was a bit of a challenge, because it does not work the way it does for many other languages. Bengali characters are not rendered correctly in a plot by default, so a specific Bengali font is needed to display each character. Removing punctuation from Bengali text is another challenge, because much of the punctuation cannot be removed without matching it by its Unicode code points.

For this reason, I explain below how to remove punctuation from Bengali text and how to show Bengali text on a plot using Word2Vec.

Required Library

We use libraries to reduce the complexity of coding; they let us do things easily and quickly. Here we use popular libraries, all of which have been widely used for other languages before.

gensim: An open-source library for natural language processing, written in Python and Cython. Its main uses include text analysis, Word2Vec, and Doc2Vec.
sklearn: A versatile and effective machine learning library. Its main uses include data preprocessing, model selection, classification, regression, and clustering.
matplotlib: A plotting library for 2D and 3D plots that works with NumPy arrays. Histograms, scatter plots, and bar charts are examples of what it can draw.

Step1:

After importing the gensim library, we import Word2Vec. Word2Vec is a two-layer neural network used for word embedding. It takes a large text document as input and returns a vector for each word; these vectors are multidimensional. Then we import PCA (Principal Component Analysis) from sklearn, because it converts a set of correlated values into a smaller set of uncorrelated components, which lets us reduce the word vectors to two dimensions for plotting. Next we import pyplot from matplotlib, which displays the output as a graph. Since matplotlib's default font does not render Bengali text, the font manager also has to be imported from matplotlib.

import gensim
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
from matplotlib import pyplot
import matplotlib.font_manager as fm

Step2:

Take a long text from the user and put it in a variable (t1).

t1=input("Enter Text:")

Step3:

The text needs to be preprocessed before it is passed to Word2Vec. At the preprocessing stage we use regular expressions to remove extra whitespace, the Bengali full stop (।, U+0964), and punctuation. This requires importing the regular expression module (re) first. Bengali symbols are not directly supported here, so they are removed from the text using their Unicode code points. The sub() function replaces each matched substring with a space.

import re

# Raw strings keep the regex escapes intact; re understands \uXXXX itself.
whitespace = re.compile(r"[\s\u0020\u00a0\u1680\u180e\u202f\u205f\u3000\u2000-\u200a]+", re.UNICODE)
bangla_fullstop = "\u0964"  # Bengali full stop (danda)
punctSeq = r"['\"“”‘’]+|[.?!,…]+|[:;]+"
punc = r"[(),$%^&*+={}\[\]:\"|\'\~`<>/,¦!?½£¶¼©⅐⅑⅒⅓⅔⅕⅖⅗⅘⅙⅚⅛⅜⅝⅞⅟↉¤¿º;-]+"

t1 = whitespace.sub(" ", t1).strip()  # collapse runs of whitespace
t1 = re.sub(punctSeq, " ", t1)        # quotes and sentence punctuation
t1 = re.sub(bangla_fullstop, " ", t1)
t1 = re.sub(punc, " ", t1)            # remaining symbols
print(t1)
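The preprocessing in this step can also be wrapped in a single function so it is easy to reuse and test. The following is a minimal sketch under some assumptions: the function name clean_bengali_text and the sample sentence are only for illustration, and the punctuation class is a shortened version of the patterns used in this step.

```python
import re

# For str patterns, \s already covers Unicode whitespace such as U+00A0.
WHITESPACE = re.compile(r"\s+")
# Bengali full stop (U+0964) plus a shortened set of common punctuation.
PUNCTUATION = re.compile(r"[\u0964.,!?;:'\"“”‘’()\[\]{}-]+")


def clean_bengali_text(text):
    """Replace punctuation with spaces, then collapse whitespace."""
    text = PUNCTUATION.sub(" ", text)
    return WHITESPACE.sub(" ", text).strip()


sample = "আমি বাংলায় গান গাই। তুমি কি গান গাও?"
print(clean_bengali_text(sample))
```

Replacing punctuation before collapsing whitespace means no double spaces are left behind, although split() in the next step would tolerate them anyway.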

Step4:

After the text has been processed, split it into words with the split() function and put the resulting word list inside another list, because Word2Vec expects a list of tokenized sentences.

doc=[t1.split()]
print(doc)
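This step puts the whole text into a single word list, so Word2Vec sees it as one long sentence. If the input contains several sentences, one possible variation is to split on the Bengali full stop first, so that each sentence becomes its own word list. The sketch below shows that variation rather than what the code above does; split_sentences is a hypothetical helper, and it has to run on the raw text, before the full stop is removed in Step 3.

```python
import re


def split_sentences(text):
    """Split raw text on sentence-ending marks and tokenize each sentence."""
    # Sentence boundaries: Bengali full stop (U+0964), '?' and '!'.
    parts = re.split(r"[\u0964?!]+", text)
    # Tokenize on whitespace and drop empty sentences.
    return [part.split() for part in parts if part.split()]


raw = "আমি ভাত খাই। তুমি কি ভাত খাও? সে মাছ খায়।"
docs = split_sentences(raw)
print(docs)
```

The resulting list of word lists has the same shape as doc, so it could be passed to Word2Vec in its place.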

Step5:

The model has to be trained on the input text before the word vectors can be used. Here we use the gensim Word2Vec model. Setting min_count=1 keeps every word, even those that occur only once. For training we pass the tokenized text, the number of sentences in it (total_examples), and the number of training epochs.

model = Word2Vec(doc, min_count=1)
# model.epochs replaces the deprecated model.iter attribute
model.train(doc, total_examples=len(doc), epochs=model.epochs)
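Word2Vec learns a word's vector from the words that appear near it. To make that idea concrete, here is a small sketch that lists the (center, context) pairs a window of size 1 would produce for one tokenized sentence. It only illustrates the idea of a context window: gensim's own training code is more involved, and context_pairs is a hypothetical helper, not a gensim function.

```python
def context_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


tokens = "আমি বাংলায় গান গাই".split()
for center, context in context_pairs(tokens, window=1):
    print(center, "->", context)
```

With a very short input text there are only a few such pairs, which is why the vectors become more meaningful as the text gets longer.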

Step6:

The vectors for all words in the model's vocabulary are placed in a variable (x). PCA() is then created with n_components=2, so each vector is reduced to two dimensions. Finally, fit_transform() fits the PCA to the vectors and returns their two-dimensional coordinates.

# gensim 3.x API; in gensim 4.x, wv.vocab was replaced by wv.key_to_index
x = model.wv[model.wv.vocab]
pca = PCA(n_components=2)
result = pca.fit_transform(x)
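To give a feel for what the PCA step does, here is a dependency-free sketch of the idea: center the vectors, find the two directions with the most variance, and project onto them. It is not how sklearn implements PCA (sklearn uses an SVD-based solver), the sign of each axis may differ from sklearn's output, and the input numbers are made up rather than real word vectors. The names pca_2d and top_component are hypothetical helpers.

```python
import math


def top_component(cov, iterations=200):
    """Approximate the dominant eigenvector of a symmetric matrix (power iteration)."""
    n = len(cov)
    vec = [1.0 / math.sqrt(n)] * n  # assumes this start is not orthogonal to the answer
    for _ in range(iterations):
        nxt = [sum(cov[r][c] * vec[c] for c in range(n)) for r in range(n)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:  # no variance left in any direction
            break
        vec = [v / norm for v in nxt]
    return vec


def pca_2d(rows):
    """Project rows (lists of floats) onto their first two principal components."""
    n, dim = len(rows), len(rows[0])
    means = [sum(row[c] for row in rows) / n for c in range(dim)]
    centered = [[row[c] - means[c] for c in range(dim)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(dim)]
           for a in range(dim)]
    components = []
    for _ in range(2):
        vec = top_component(cov)
        eigenvalue = sum(vec[a] * cov[a][b] * vec[b]
                         for a in range(dim) for b in range(dim))
        components.append(vec)
        # Deflate: remove the variance already explained by this component.
        cov = [[cov[a][b] - eigenvalue * vec[a] * vec[b] for b in range(dim)]
               for a in range(dim)]
    return [[sum(r[c] * comp[c] for c in range(dim)) for comp in components]
            for r in centered]


vectors = [[2.0, 0.0, 1.0], [0.0, 1.0, 3.0], [4.0, 1.0, 0.0], [1.0, 3.0, 2.0]]
for point in pca_2d(vectors):
    print(point)
```

Each input row comes back as a pair of coordinates, which is the same shape that the scatter plot in the next step expects.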

Step7:

To show the vectors on a plot, we draw a scatter plot at this stage. The words in the model's vocabulary are collected into the words variable. Matplotlib's default font cannot display Bangla text, so a Bengali font has to be loaded through FontProperties; we have used the Bengali font kalpurush.ttf, which must be available at the given path. A for loop over the word list then annotates each point with its Bengali word at its x and y coordinates. Finally, the show() function displays the plot.

pyplot.scatter(result[:, 0], result[:, 1])
words = list(model.wv.vocab)
prop = fm.FontProperties(fname='kalpurush.ttf')
for i, word in enumerate(words):
    pyplot.annotate(word, xy=(result[i, 0], result[i, 1]), fontproperties=prop)
pyplot.show()

Github: https://github.com/AbuKaisar24/Bengali-Text-Visualization-Using-Word2Vec