Text clusterization using Python and Doc2vec

Alexey Ermolaev
3 min read · Aug 25, 2018


Let’s imagine you have a bunch of text documents from your users and you want to get some insights from them. For example, if you run a marketplace, you may have millions of reviews about goods. Another possible case is that users create text documents with your service every day, and you want to classify these documents into groups and then suggest the predicted types back to the users. Sounds cool, doesn’t it?

The problem is that you don’t know the document types in advance: there may be anywhere from 10 to thousands of possible classes. And, of course, you don’t want to label them manually. Fortunately, we can use simple Python code to cluster these documents and then analyze the predicted clusters.

What is clustering?

Clustering is an unsupervised technique for grouping similar items together. For texts, we can create embeddings of the whole text corpus and then compare the vectors of individual sentences or documents (depending on which embedding you used) with cosine similarity.
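To make “compare with cosine similarity” concrete, here is a minimal sketch (the vectors below are made-up toy numbers, not real embeddings):

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings" of two texts
doc_a = np.array([0.2, 0.9, 0.1])
doc_b = np.array([0.25, 0.8, 0.05])
print(cosine_similarity(doc_a, doc_b))  # close to 1.0 -> similar texts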

OK, but what is a text embedding? An embedding is a learned representation of text in which words (or documents) with similar meanings get similar vectors. This way of representing words and documents is considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

Let’s write some code

First, let’s import all the necessary libraries:

import pandas as pd
import numpy as np
import gensim
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from gensim.models import Doc2Vec

Next, suppose we have a .csv file where our text documents are saved.

train = pd.read_csv('train.csv')

Now we have a train dataset that we can use for creating text embeddings. Since each item in our case is a whole text rather than a single sentence, we will use document-level embeddings: Doc2vec.

First, let’s prepare our data. I assume that all text information is stored in the text column of our dataset. Doc2vec requires the input to be prepared in a specific way, so let’s write some simple code for it.

LabeledSentence1 = gensim.models.doc2vec.TaggedDocument

all_content_train = []
j = 0
for em in train['text'].values:
    # TaggedDocument expects a list of tokens, so split the raw string
    all_content_train.append(LabeledSentence1(em.split(), [j]))
    j += 1
print("Number of texts processed:", j)
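As a quick sanity check, you can look at one prepared item (the exact tokens will of course depend on your data):

print(all_content_train[0])
# e.g. TaggedDocument(words=['first', 'review', 'text'], tags=[0])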

Then let’s define and train our model (this may take some time, depending on your hardware):

# vector_size was called "size" in older gensim versions
d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10, min_count=500,
                    workers=7, dm=1, alpha=0.025, min_alpha=0.001)
# Continue training; the learning rate decays from alpha to min_alpha
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count, epochs=10)
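Once trained, the model can also embed previously unseen texts via infer_vector. A minimal sketch (the example sentence is made up):

new_doc = "this product is great and arrived quickly".split()
vector = d2v_model.infer_vector(new_doc)
print(vector.shape)  # (100,), same dimensionality as vector_size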

Now we have trained embeddings, and it’s time to cluster them.

kmeans_model = KMeans(n_clusters=4, init='k-means++', max_iter=100)
# One 100-dimensional vector per training document
# (docvecs.doctag_syn0 in older gensim versions)
X = d2v_model.dv.vectors
kmeans_model.fit(X)
labels = kmeans_model.labels_.tolist()

# Project the 100-dimensional vectors down to 2D for plotting
pca = PCA(n_components=2).fit(X)
datapoint = pca.transform(X)

import matplotlib.pyplot as plt
%matplotlib inline
plt.figure()
label1 = ["#FFFF00", "#008000", "#0000FF", "#800080"]
color = [label1[i] for i in labels]
plt.scatter(datapoint[:, 0], datapoint[:, 1], c=color)

# Mark the cluster centroids, projected into the same 2D space
centroids = kmeans_model.cluster_centers_
centroidpoint = pca.transform(centroids)
plt.scatter(centroidpoint[:, 0], centroidpoint[:, 1], marker='^', s=150, c='#000000')
plt.show()

Here I chose 4 clusters, and the resulting plot looks like this:

It’s easy to see that our data splits into fairly well-separated clusters.
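I picked k=4 by eye here, but since the number of document types isn’t known in advance, it’s worth scoring a few candidate values of k before settling on one. Here is a minimal sketch using the silhouette score (higher is better); this snippet is an addition to the pipeline above:

from sklearn.metrics import silhouette_score

X = d2v_model.dv.vectors
for k in range(2, 10):
    # Fit K-means for each candidate k and score the resulting clustering
    candidate = KMeans(n_clusters=k, init='k-means++', max_iter=100).fit(X)
    print(k, silhouette_score(X, candidate.labels_))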
