News Article Clustering Using Unsupervised Learning

Austin L.E. Krause
5 min read · Aug 2, 2019



Last week I made a post about an extractive text summarization tool I built with Python using NLTK and cosine similarity scores. That article can be found HERE. This week’s blog will focus on another piece of that project, where I use unsupervised learning algorithms to cluster news articles and then supervised learning algorithms to classify recent articles.

The Data

The data I am using is roughly 97k news articles from the years 2013–2017, ranging from roughly 2k to 15k characters in length. I store the dataset in a pandas DataFrame for analysis; a preview is shown below.

The Algorithm

The algorithm I have chosen for this task is K-Means clustering, which groups data points into a fixed number of clusters by assigning each point to its nearest ‘centroid’. This algorithm is a good choice because it works well with unlabeled data (like text); however, it can be pretty computationally expensive. A brief summary of K-Means Clustering can be seen here.

A simple preview of K-Means on a 2-D graph.
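
To make the assign-and-update loop concrete, here is a minimal sketch of K-Means in plain Python on a handful of 2-D points. It is not the code from this project: the toy points are made up, and passing the starting centroids in explicitly is an assumption made to keep the example deterministic (library implementations usually choose them randomly or with k-means++).

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm for 2-D points; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its assigned points.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                xs, ys = zip(*members)
                updated.append((sum(xs) / len(xs), sum(ys) / len(ys)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:  # no centroid moved, so we have converged
            break
        centroids = updated
    return centroids, labels

# Hypothetical toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each pass costs one distance computation per point per centroid, which is where the computational expense comes from on a large, high-dimensional text dataset.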

Text Preprocessing

In order to prepare the news articles for K-Means, there are a number of preprocessing steps that need to be taken to clean the data.

The steps I have taken are:

  1. Eliminate non-English articles
  2. Tokenize, lemmatize, and stem words within each article
  3. Concatenate words of each record into a string and convert the entire pandas series into a list of strings
  4. Perform Count Vectorization and remove stop words
  5. Transform list of strings with TF-IDF transformation
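
Steps 4 and 5 can be sketched with scikit-learn, assuming its `CountVectorizer` and `TfidfTransformer` classes are the tools in play (the post does not name the library). The three sample strings are invented stand-ins for the cleaned article strings from step 3, and the tokenizing, lemmatizing, and stemming from step 2 are skipped here.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Hypothetical stand-ins for the list of cleaned article strings (step 3).
docs = [
    "The senate passed the health care reform",
    "Stock market investors watched oil prices",
    "The supreme court judge ruled on the appeal",
]

# Step 4: count vectorization, dropping scikit-learn's built-in English stop words.
count_vectorizer = CountVectorizer(stop_words="english")
counts = count_vectorizer.fit_transform(docs)

# Step 5: reweight the raw counts with TF-IDF (rows are L2-normalized by default).
tfidf = TfidfTransformer().fit_transform(counts)

print(counts.shape == tfidf.shape)            # True: same sparse matrix layout
print("the" in count_vectorizer.vocabulary_)  # False: removed as a stop word
```

The result is a sparse matrix with one row per article and one column per remaining term, which is the form K-Means needs as input.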

To speed up training, I cut each article down to its first 100 words before clustering. This rests on my personal assumption that an article’s main idea will most likely come up early in the text. After fitting a number of K-Means models, I settled on the model with 12 clusters; however, this number was chosen somewhat arbitrarily, and I still need to optimize it. The size of each individual cluster is shown below.
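
As a rough sketch of the truncate-then-cluster step, assuming scikit-learn: the helper name `first_n_words`, the toy articles, and the use of 2 clusters instead of 12 are all illustrative choices, not the project's code. `TfidfVectorizer` is used as a shortcut that combines count vectorization and the TF-IDF transformation in one object.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def first_n_words(text, n=100):
    """Keep only the first n whitespace-separated words of an article."""
    return " ".join(text.split()[:n])

# Hypothetical toy articles, each 160 words long so truncation has an effect.
articles = [
    "senate vote campaign congress " * 40,
    "senate congress vote law " * 40,
    "stock market price investor " * 40,
    "market bank stock oil " * 40,
]
truncated = [first_n_words(article) for article in articles]

X = TfidfVectorizer(stop_words="english").fit_transform(truncated)

# The post settles on 12 clusters; 2 is used here only because the toy set is tiny.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Number of articles assigned to each cluster.
cluster_sizes = Counter(kmeans.labels_.tolist())
print(cluster_sizes)
```

Counting the labels like this is a quick way to spot a cluster that is far larger than the rest.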

Looking At The Clusters

Now let’s look at the top words in each cluster:

  • Cluster 0: state, new, year, president, people, nation, one, unit, country, govern
  • Cluster 1: Trump, president, Donald, would, white, house, campaign, American, Washington, administration, nation
  • Cluster 2: Trump, republican, party, Donald, democrat, presidential, senate, candidate, GOP, voter, nominee
  • Cluster 3: one, year, new, first, world, game, live, week, people, make, get, say, work, show
  • Cluster 4: Clinton, Hillary, Trump, democrat, campaign, Sanders, presidential, election, emails, Bernie, support
  • Cluster 5: Trump, Russia, investigation, intelligence, Comey, election, director, Putin, Flynn
  • Cluster 6: school, student, university, education, year, teacher, class, week, graduate
  • Cluster 7: court, supreme, justice, judge, rule, federal, senate, law, appeal, Obama, legal
  • Cluster 8: republican, care, health, house, bill, Trump, act, senate, Obamacare, president, insurance, reform
  • Cluster 9: please, story, great, need, write, continue, step, block, display, extend, part, idea
  • Cluster 10: company, year, percent, U.S., market, billion, bank, price, rate, stock, investor, share, report, oil
  • Cluster 11: police, state, attack, kill, North Korea, Islam, Syria, president, military, force
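
One common way to produce a list like this is to sort each K-Means centroid by weight and read off the highest-weighted terms. The sketch below assumes scikit-learn and invented sample documents; the post does not say how its own top words were extracted, and `top_terms` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents covering two topics.
docs = [
    "senate vote congress law",
    "congress senate election vote",
    "stock market bank price",
    "market stock investor price",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Map each column index of the TF-IDF matrix back to its term.
index_to_term = {index: term for term, index in vectorizer.vocabulary_.items()}

def top_terms(centroid, n=3):
    """Return the n terms with the largest weight in a cluster centroid."""
    order = np.argsort(centroid)[::-1][:n]
    return [index_to_term[int(i)] for i in order]

for cluster_id, centroid in enumerate(kmeans.cluster_centers_):
    print(f"Cluster {cluster_id}: {', '.join(top_terms(centroid))}")
```

Because a centroid is the mean TF-IDF vector of its members, its heaviest terms are the ones that are frequent across the cluster, which is also why a broad cluster ends up with generic top words.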

Analyzing The Clusters

Looking through the article clusters, a number of things stand out:

  • Clusters 0, 1 and 2 seem to be mainly political, and it looks like clusters 1 and 2 lean towards articles regarding the Republican election campaign. Cluster 0 seems to be quite broad politically.
  • Cluster 3 looks extremely broad as well, and it is also the largest cluster BY FAR. This could be because the dataset contains a large number of articles covering a wide range of topics. After testing my classification model, it looks like most sports articles end up being classified as cluster 3.
  • Cluster 4 is quite strong, and it is mainly based on articles about the Democratic Party and Hillary Clinton
  • Cluster 5 is specifically related to articles written about Russian meddling in the 2016 election
  • Cluster 6 shows a strong relation to articles written about education
  • Cluster 7 is highly related to the federal court system
  • Cluster 8 looks to be primarily about political issues such as health care, tax reform, etc.
  • Cluster 9 is another cluster with a wide range of topics that don’t seem to generalize to a small number of ideas
  • Cluster 10 is clearly made up of articles regarding financial markets
  • Cluster 11 is the most impressive to me; it seems to be built around police, military and foreign conflicts

Further Clustering

After noticing how large cluster #3 was relative to the others, I decided to re-cluster the rows within it. As it turns out, each sub-cluster created from cluster #3 centered on Donald Trump and the election, so the top words for cluster #3 were misleading. Because the sub-clusters did not separate into distinct topics, I decided not to include the re-clustering in my model and left cluster #3 as is. The cluster remains broad, but it seems to be centered on politics as well.

Classifying New Articles

Finally, I used the fitted clusters to add a new column to the dataframe holding each news article’s assigned cluster. With these labels in place, I am able to use supervised learning algorithms to classify new articles. I trained a number of different algorithms for this task, and the results can be seen below.

The overall top performer was XGBoost, which classified articles into their correct cluster with a test accuracy of 75.6%.
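
The cluster-then-classify workflow can be sketched as follows. To keep the example dependent on scikit-learn only, it swaps in `LogisticRegression` where the post reports XGBoost, so this is a stand-in rather than the model behind the 75.6% figure. The word pools and generated documents are invented for illustration.

```python
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical toy corpus: every 3-word combination from two topic word pools.
politics = ["senate", "vote", "congress", "law", "campaign", "election"]
finance = ["stock", "market", "bank", "price", "investor", "oil"]
docs = [" ".join(c) for words in (politics, finance) for c in combinations(words, 3)]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Unsupervised step: the cluster assignments become the training labels.
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised step: learn to predict the cluster from the TF-IDF features.
X_train, X_test, y_train, y_test = train_test_split(
    X, cluster_labels, test_size=0.25, random_state=0
)
classifier = LogisticRegression(max_iter=1000).fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, classifier.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")

# A new article must go through the same fitted vectorizer before prediction.
new_article = ["investor sells bank stock as oil price falls"]
predicted_cluster = classifier.predict(vectorizer.transform(new_article))[0]
print(f"Predicted cluster: {predicted_cluster}")
```

Note that accuracy here measures agreement with the K-Means assignments, not with any human-made topic labels.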

Going forward, I will look to optimize the number of clusters and bring in more non-political articles to adjust for the heavy class imbalance.
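
One possible way to approach the cluster-count question is to compare silhouette scores across several values of k. This is a suggestion, not something the project has done; the sketch assumes scikit-learn, and the three topic word pools are invented toy data.

```python
from itertools import combinations

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

# Hypothetical toy corpus: every 2-word combination from three topic word pools.
topics = [
    ["senate", "vote", "congress", "law"],
    ["stock", "market", "bank", "price"],
    ["school", "student", "teacher", "class"],
]
docs = [" ".join(pair) for words in topics for pair in combinations(words, 2)]

# Dense array is fine for a toy corpus; a full dataset would stay sparse or be sampled.
X = TfidfVectorizer().fit_transform(docs).toarray()

# Fit one model per candidate k and record its mean silhouette score.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print(f"Highest silhouette at k={best_k}")
```

The silhouette score ranges from -1 to 1, with higher values meaning tighter, better-separated clusters. On a dataset of roughly 97k articles, the `sample_size` argument of `silhouette_score` keeps the pairwise-distance cost manageable.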

Have any advice to take this project further? I’d love to hear from you! Feel free to connect with me on LinkedIn and check out the source code on my GitHub.

Cheers.


Austin L.E. Krause

Data Scientist at Trusted Media Brands | Weekend Web Hacker