Unsupervised Text Clustering using Natural Language Processing (NLP)

ROHITH RAMESH
5 min read · Sep 30, 2019


What is Supervised Learning and Unsupervised Learning?

Machine Learning models that learn from labelled examples are called ‘Supervised Learning’ models.

· Classification (Target values are discrete classes)

· Regression (Target values are continuous values)

Finding structure in unlabelled data is called ‘Unsupervised Learning’.

· Find groups of similar instances in the data (Clustering)

· Finding unusual patterns (Outlier detection)

Here in this article, we are going to look at Unsupervised Learning with respect to clustering.

So, what is Clustering exactly?

Grouping similar data together is called Clustering, and it is achieved by calculating the distance between the points.

There are two types of clustering that are predominantly used.

· K-means Clustering

· Hierarchical Clustering

Here, we will look at K-means Clustering.

What is K-means Clustering?

K-means groups similar data points together to discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset. The algorithm identifies k centroids and then allocates every data point to the nearest cluster. The ‘means’ in K-means refers to averaging of the data, that is, finding the centroid.

Here, the initial value of K, i.e. the total number of centroids, is defined by us.

Example with 3 centroids , K=3
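As a minimal sketch of the idea (using scikit-learn on made-up 2-D points, not the article's data), K-means with K=3 might look like:

```python
from sklearn.cluster import KMeans
import numpy as np

# Toy 2-D points forming three loose groups (illustrative data only)
X = np.array([[1, 2], [1, 4], [0, 2],
              [8, 8], [9, 9], [8, 9],
              [20, 1], [21, 2], [20, 0]])

# K=3: the algorithm finds 3 centroids and assigns each point to the nearest one
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)            # cluster index for each point
print(kmeans.cluster_centers_)   # the 3 centroids (per-cluster means)
```

Each centroid is simply the mean of the points assigned to it, which is where the name "K-means" comes from.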

Note: This project is based on Natural Language Processing (NLP).

Now, let us quickly run through the steps of working with the text data.

Step 1: Import the data (incident data for server-related issues).

Looking at the data, it consists of comments entered by people for particular incident tickets. Each line contains different text, and what we are concerned with is the actual problem description for the server.

Here we need the help of regular expressions to extract the exact content that we need.

Raw Dataset

Step 2: Regular expressions to extract the content from the raw dataset.

Kindly refer to the link below:

https://github.com/rohithramesh1991/Content-Extraction

Function to extract the required content
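The actual extraction function lives in the linked repo; a hypothetical sketch of the idea, assuming each raw comment embeds the server issue after a "Description:" tag (an invented format for illustration), could look like:

```python
import re

# Hypothetical sketch: pulls the problem description out of a raw ticket
# comment. The real pattern depends on the actual ticket format and is
# in the linked Content-Extraction repo.
def extract_description(comment):
    match = re.search(r"Description:\s*(.+?)(?:\n|$)", comment, re.IGNORECASE)
    return match.group(1).strip() if match else ""

raw = "Ticket #4521 | Description: Server CPU utilization above 95%\nAssigned to: ops"
print(extract_description(raw))  # → Server CPU utilization above 95%
```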

Step 3: Text Preprocessing, which is nothing but cleaning the data.

  1. Remove punctuation.
  2. Remove stopwords.
  3. Remove additional spaces and digits.
  4. Lemmatize the text.
  5. Return the cleaned list.

https://github.com/rohithramesh1991/Text-Preprocessing
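The steps above can be sketched as follows. The linked Text-Preprocessing repo uses NLTK's stopword list and WordNet lemmatizer; here a small hand-rolled stopword set and a crude suffix rule stand in so the example stays self-contained:

```python
import re
import string

# Illustrative stand-ins for NLTK's stopword list and lemmatizer
STOPWORDS = {"the", "is", "are", "on", "a", "an", "and", "for", "to", "of"}

def lemmatize(word):
    # Crude stand-in for a real lemmatizer: strip a plural "s"
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def clean_text(text):
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))  # 1. punctuation
    text = re.sub(r"\d+", " ", text)                                  # 3. digits
    tokens = text.split()                              # also collapses extra spaces
    tokens = [t for t in tokens if t not in STOPWORDS]                # 2. stopwords
    return [lemmatize(t) for t in tokens]              # 4. lemmatize, 5. return list

print(clean_text("The servers  are down on  host 42!"))  # → ['server', 'down', 'host']
```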

Creating vocabulary after cleaning the text

Step 4: Vectorization is nothing but creating a vector of words called the vocabulary.

A TFIDF Vectorizer is used to create the vocabulary. TFIDF is the product of how frequent a word is in a document (term frequency) and how unique that word is with respect to the entire corpus (inverse document frequency). The ngram_range parameter helps create a vocabulary of one-word, two-word, or longer phrases depending on the requirement.

TFIDF Vectorization

Step 5: Kmeans Clustering

K-Means with K=60
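As a runnable sketch of this step (a toy corpus stands in for the cleaned comments, and a small K replaces the K=60 used on the full dataset):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy comments: three server-CPU issues and three database issues
docs = [
    "server cpu high", "server cpu utilization", "server load high",
    "database timeout", "database connection error", "database query slow",
]

# Vectorize, then cluster the TFIDF rows; K=2 here, K=60 in the article
tfidf = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(tfidf)

print(km.labels_)  # which cluster each comment was assigned to
```

Inspecting the comments that land in each cluster (or the highest-weighted terms per centroid) is how the clusters get interpreted afterwards.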

Step 6: Find the Optimal Number of Clusters.

So, how do we decide on the total number of optimal clusters?

A good cluster is one where the distance between the points within the cluster is small, and the distance between the centroids of two different clusters is large.

And we evaluate the optimal number of clusters using a few predominant methods:

· Elbow method

· Average Silhouette method

· Gap Statistic

Here, we will look at the Elbow Method.

The “elbow” method helps select the optimal number of clusters by fitting the model with a range of values for K. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point. The total WSS (within-cluster sum of squares) measures the compactness of the clustering, and we want it to be as small as possible. The elbow method looks at the total WSS as a function of the number of clusters: one should choose a number of clusters so that adding another cluster doesn’t improve the total WSS much.

The total WSS is defined as:

total WSS = W(C_1) + W(C_2) + … + W(C_K), where W(C_k) = Σ (x_i − μ_k)² over all points x_i in cluster C_k

where C_k is the kth cluster, W(C_k) is the within-cluster variation, and μ_k is the centroid of C_k. The total within-cluster sum of squares (WSS) measures the compactness of the clustering, and we want it to be as small as possible.

How it calculates :

  • For each k, calculate the total within-cluster sum of square (wss).
  • Plot the curve of wss according to the number of clusters k.
  • The location of a bend (knee) in the plot is generally considered as an indicator of the appropriate number of clusters.
To Find Optimal Number of Clusters
Elbow Plot
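These steps can be sketched as follows, computing the total WSS (scikit-learn exposes it as `inertia_`) for a range of k on a toy corpus; on the real data the range of k would be much larger:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned incident comments
docs = [
    "server cpu high", "server cpu load", "server memory high",
    "database timeout", "database error", "database slow query",
    "network latency spike", "network packet loss",
]
X = TfidfVectorizer().fit_transform(docs)

# For each k, fit K-means and record the total within-cluster sum of squares
wss = []
for k in range(1, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss.append(km.inertia_)

for k, w in zip(range(1, 6), wss):
    print(k, round(w, 3))
# Plotting wss against k and looking for the bend (knee) gives the elbow.
```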

Try different ranges of k values to find the optimal number of clusters.

For the implementation code, kindly refer to https://github.com/rohithramesh1991/Unsupervised-Text-Clustering

Thank you.
