Text classification using k-means

dennis ndungu
4 min read · Sep 16, 2018


Clustering is an unsupervised learning technique, meaning it works on data that has no labels tagging the observations with prior identifiers. The absence of a y-label (category information) is what makes it unsupervised. One of the most popular and simplest clustering algorithms is k-means, which is still widely used even though it was first published in 1955.


The k-means algorithm first places k seed points and then groups the observations into k clusters around them. The optimal number of clusters can be estimated with the elbow point method (a sketch of it appears later in this post). Below I will demonstrate how to use the k-means algorithm to cluster headlines into different categories using Python.

The data set used is obtained from Kaggle, and the link is here:

THE BASIC STEPS ARE:

  • First, of course, we need to download the data set from the link above and import it into the notebook of your choice.
  • Second comes the most important stage of the whole classification: data cleansing. In this case it simply involves putting all the headlines into one format (here the headlines were already all lower case and contained no punctuation marks).
  • Removing all the insignificant words in the observations. These are referred to as stop words; examples include that, these, in, etc. nltk (the Natural Language Toolkit) provides a ready-made list of them.
  • Stemming is also good practice, as it reduces a word to its root form. For example, the word detained can be reduced to detain or an even simpler form.

Stemming also helps to reduce the sparsity of the data. The data can be termed sparse due to the large number of infrequent words present.

This step loops over the range of records, 0 to 1103665. I initialized the variable stemmed_array to the ‘headline_text’ column, which holds the text to be clustered. The split() function breaks each headline down into a list of words. Every word that is not in the stop word list is stemmed by calling the ps object created from PorterStemmer, and the list of words is then joined back together using the join() function. The resulting sentence is appended to the blank list that was initialized before the stemming process.
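
Since the original screenshot is not reproduced here, the following is a minimal sketch of that step; the file name abcnews-date-text.csv and the output list name dataset are assumptions, not taken from the original code:

```python
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# nltk.download('stopwords')  # run once if the stop word corpus is missing
data = pd.read_csv('abcnews-date-text.csv')  # assumed file name for the Kaggle data

ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

stemmed_array = data['headline_text']  # the column to be clustered
dataset = []                           # the blank list; name is hypothetical
for i in range(0, 1103665):            # range of records from the post
    words = stemmed_array[i].split()   # break the headline into a list of words
    words = [ps.stem(w) for w in words if w not in stop_words]
    dataset.append(' '.join(words))    # rejoin and append the cleaned sentence
```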

  • The next step is to create a vector representation of the words. This relies on tokenization, the process of taking each distinct word in the observations and making a column (feature) for it.

The vectorizer can be either a count vectorizer or a TF-IDF vectorizer. A count vectorizer simply counts the frequency of a word in each observation, while a TF-IDF vectorizer weights that frequency against how common the word is across all observations, so words that appear everywhere contribute less to a headline's vector.
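
As a minimal sketch with scikit-learn, reusing the dataset list built above (TF-IDF shown; CountVectorizer is the drop-in alternative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
# from sklearn.feature_extraction.text import CountVectorizer  # for raw counts instead

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(dataset)  # sparse matrix: one row per headline, one column per word
```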

  • Finally we can apply the k-means algorithm to the vectorized data set. This is simply done as shown below:
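
The original screenshot is not reproduced here; below is a minimal sketch, where the choice of k = 5 clusters is an assumption:

```python
from sklearn.cluster import KMeans

k = 5  # assumed value; the elbow method sketched below can guide the real choice
model = KMeans(n_clusters=k, random_state=42)
model.fit(X)

# Print the ten terms closest to each centroid to interpret the clusters
terms = vectorizer.get_feature_names_out()
order = model.cluster_centers_.argsort()[:, ::-1]
for i in range(k):
    print('Cluster %d:' % i, [terms[j] for j in order[i, :10]])
```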

The main part that could seem difficult to understand is the last few lines of the sketch above, which print the terms nearest each centroid. The k-means algorithm itself works by repeatedly adjusting the centroids: every vector is assigned to its nearest centroid using Euclidean distance as the metric, and after every iteration each centroid is recalculated as the mean of the points assigned to it, until the clusters stabilize.
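
For the elbow point method mentioned earlier, here is a minimal sketch reusing the X matrix from above: fit k-means for a range of k values and look for the bend where the inertia curve flattens.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

inertias = []
for k in range(1, 11):
    # On the full 1.1M headlines this is slow; a random sample of rows speeds it up
    inertias.append(KMeans(n_clusters=k, random_state=42).fit(X).inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia (within-cluster sum of squares)')
plt.show()
```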

  • One can also classify a sentence of their own, as shown below, to identify the cluster most applicable to the test data (the sentence you fill in).
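
A minimal sketch; the test sentence is a made-up example, and it must pass through the same stemming and vectorization as the training data before prediction:

```python
test = 'police investigate crash on highway'  # hypothetical test sentence
test_stemmed = ' '.join(ps.stem(w) for w in test.split() if w not in stop_words)
test_vec = vectorizer.transform([test_stemmed])
print(model.predict(test_vec))  # index of the nearest cluster
```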

There you have it: your text classification model is done :)
