What are they talking about? Topic Identification with Python

James Malcolm
Sep 4, 2018 · 5 min read

This article explores the process of using clustering techniques in Python to identify topics within a corpus of text, such as emails or news articles.

We start with basic text cleaning, then prepare the text data for the model, and finally build a model using sklearn.

For this example we’re going to be using sklearn’s 20 newsgroups dataset. Because these methods are unsupervised, that is, they don’t rely on labelled data, they can be extended to topic identification of other text data such as emails, tweets, books, etc.

We can get our training and test data as below:

from sklearn.datasets import fetch_20newsgroups
import pandas as pd
categories = [
'talk.politics.mideast',
'rec.motorcycles']
data_train = fetch_20newsgroups(subset='train', categories=categories)
data_test = fetch_20newsgroups(subset='test', categories=categories)

Data Cleaning

When printing the data, you’ll notice that the output is messy, with formatting issues and filler words that we want to remove. When cleaning any text data I try to keep these principles in mind:

  • We want to remove formatting characters (such as: \n, \r, etc)
  • We want to remove stopwords (such as: a, the, to, etc)
  • We don’t want to remove anything that carries meaning

Bearing this in mind, we’re going to use NLTK to remove stopwords and some straightforward regular expressions to remove formatting.

import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')

df = pd.DataFrame({'col': data_train.data})  # makes a dataframe with our training data
df['col'] = df['col'].str.lower()  # lowercase everything
df['col'] = df['col'].str.replace(r'\W', ' ', regex=True)  # strip formatting remarks and punctuation (\n, \r, ., ? etc)
stop = set(stopwords.words('english'))  # NLTK's English stopword list
df['col'] = df['col'].str.split()  # this splits words by space
df['col'] = df['col'].apply(lambda x: ' '.join(item for item in x if item not in stop))  # remove stopwords and rejoin

Finally. The Fun Part…

Now that all that data gathering and cleaning is out of the way we can finally begin to test our model.

Regardless of what algorithm we choose, we can’t feed the model our cleaned text and expect an output. Instead we need a way to represent the text in numerical format.

The easiest method is to take the word count of each document and represent each document as a vector of counts, as in the code below. We can do this easily in sklearn by using CountVectorizer.

There are other methods, such as TF-IDF, word2vec, etc. For the purposes of this article we won’t explore these options in depth, however TF-IDF is used in the full code attached (a short sketch follows below).

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()  # initialise the vectoriser
X = vectorizer.fit_transform(df['col'])

The output X is a document-term matrix, which you can mentally visualise as a giant spreadsheet, where the rows are the observations (each article) and each column represents a word.

The values within this matrix represent the word frequencies.
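
As a rough sketch of the TF-IDF alternative mentioned above (illustrative only; the exact settings in the attached full code may differ), sklearn’s TfidfVectorizer is a near drop-in replacement for CountVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF down-weights words that appear in nearly every document,
# which often gives clustering a cleaner signal than raw counts.
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['col'])  # same layout as X, different weights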

The model this time, I promise…

Choosing algorithms largely depends on your goal, and the type of data you have.

In this case, our data is unlabelled as we have a bunch of news articles and we want to identify topics and label the articles with the newly found topics.

Taking a look at sklearn’s wonderful diagram on choosing the correct algorithm, we can see that we’re in the Clustering bubble, and for this example we’ll be using the classic K-Means.

What algorithm do I need? Sklearn’s wonderful diagram

K-Means Clustering works by assigning data points to a centroid based on feature similarity. A great GIF demonstrating K-Means can be found here. In short, K-Means Clustering works as follows:

  • K-Means estimates an initial position for each centroid (randomly guessed, or specified by you)
  • Each data point is assigned to its nearest centroid
  • Each centroid’s location is updated by taking the mean of all data points assigned to it
  • Steps 2 and 3 repeat until the assignments stop changing (or a maximum number of iterations is reached)

The result is that you’re left with data points labelled according to their similarity to each other.
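
To make those steps concrete, here’s a minimal sketch of the same loop in plain NumPy (illustrative only; we’ll use sklearn’s KMeans in practice, and this toy version ignores edge cases such as empty clusters):

import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid.
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: move each centroid to the mean of the points assigned to it.
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):  # stop once the centroids settle
            break
        centroids = new_centroids
    return labels, centroids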

In our case, we’re trying to identify articles that are similar to each other so we can infer the topic of each article. Let’s try it out with our example.

from sklearn.cluster import KMeans

k_clusters = 2  # number of centroids we want
model = KMeans(n_clusters=k_clusters, max_iter=100, n_init=1)
model.fit(X)

There we have it, our model has now been fitted to the data. You’ll notice that in the code above we have to specify how many clusters we want. In this example, we know we imported two types of news articles: Middle East and Motorcycles, so it’s natural to guess that there’ll be 2 topics.

However, in most cases you won’t know how many topics there are, so choosing the best fit relies on trying the algorithm over a number of cluster counts and comparing the results.
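
One way to make that comparison (not part of the original walkthrough, so treat it as a suggestion) is to fit the model for a range of cluster counts and compare a metric such as the silhouette score:

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Fit K-Means for several candidate values of k and compare silhouette scores
# (closer to 1 means tighter, better-separated clusters).
for k in range(2, 7):
    candidate = KMeans(n_clusters=k, max_iter=100, n_init=1, random_state=42).fit(X)
    print("k=%d: silhouette=%.3f" % (k, silhouette_score(X, candidate.labels_)))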

Now that we’ve built a model and grouped the data, we can use our model to label the training data used, and also perform prediction on test data or any new data going forward to categorise the topic.
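
For example (a small sketch, reusing the vectorizer and the data_test set loaded earlier), the fitted model’s labels_ attribute gives a cluster for every training article, and predict handles anything new:

# Attach the learned cluster labels to the training dataframe.
df['cluster'] = model.labels_

# Vectorise the held-out test articles with the same vectoriser, then predict their clusters.
X_test = vectorizer.transform(data_test.data)
test_clusters = model.predict(X_test)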

Results!

Now, let’s take a look at our results with the following piece of code.

print("Top terms per cluster:")
order_centroids = model.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names()
for i in range(true_k):
print("Cluster %d:" % i),
for ind in order_centroids[i, :10]:
print(' %s' % terms[ind]),
print
print("\n")
print("Prediction")
Y = vectorizer.transform(["This Motorbike has the best chain"])# Enter phrase to predict
prediction = model.predict(Y)
print(prediction)
Y = vectorizer.transform(["Turkey is close to Israel"])
prediction = model.predict(Y)
print(prediction)
-- Output: --
Top terms per cluster:
Cluster 0: bike it one get like ride would know go udont
Cluster 1: israel armenian isra arab jew peopl turkish would said kill
Prediction
[0]
[1]

There you have it! A relatively straightforward implementation of K-Means to extract topics. Hopefully this article acts as a good starting point for your own analysis and for more in-depth work on tuning the parameters.

As I mentioned earlier, take a look at the GitHub code for the full version and try it out yourself in Python.

Keep reading for further updates, where we’ll look at statistical topic modelling right through to using Deep Learning to label text data.

Feel free to leave any comments or questions below.
