News document clustering using Python (latent semantic analysis)

Abhijeet Khangarot
Published in Kuzok · Mar 24, 2019

This post was published in mc.ai

Result after clustering 10000 documents (each dot represents a document)

TLDR: News documents clustering using latent semantic analysis. Used LSA and K-means algorithms to cluster news documents and visualized the results using UMAP (Uniform Manifold Approximation and Projection).

Based on the frequency (tf-idf) of important words in the news documents, the documents are clustered so that related documents share the same colour, as can be seen in the screenshots at the end. The colour is decided by k-means (run separately on the data, assigning each document an integer cluster label), while the actual position of each document (each document is a dot on the graph) is obtained by applying LSA. Seeing the k-means colours land close together in the LSA layout verifies the results obtained by k-means.

In this article, I will explain how to cluster and find similar news documents from a set of news articles using latent semantic analysis (LSA), compare the results obtained by LSA with those obtained by k-means, and visualise the data using UMAP.

Latent Semantic Analysis is a technique for creating a vector representation of a document. Having a vector representation of a document gives you a way to compare documents for their similarity by calculating the distance between the vectors. This in turn means you can do handy things like classifying documents to determine which of a set of known topics they most likely belong to.
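To make this concrete, here is a minimal sketch (with made-up 3-dimensional vectors, not the real LSA output built later in this post) of how two document vectors can be compared using cosine similarity:

import numpy as np

# two made-up document vectors (in practice these come from LSA)
doc_a = np.array([0.2, 0.7, 0.1])
doc_b = np.array([0.25, 0.6, 0.05])

# cosine similarity: close to 1.0 means the documents point in a similar direction
cosine = doc_a.dot(doc_b) / (np.linalg.norm(doc_a) * np.linalg.norm(doc_b))
print(cosine)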

You can see the final code on GitHub.

Data Reading

First, we import some necessary modules:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sys
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
# from bs4 import BeautifulSoup as Soup
# Beautiful soup can be used to parse xml/html data, depends on the type of data
import json

Sample data available to me:

{"id": "b2e89334-33f9-11e1-825f-dabc29fd7071", "article_url": "https://www.washingtonpost ........ "source": "The Washington Post"}
{"id": "749ec5b2-32f5-11e1-825f-dabc29fd7071", "article_url": "https://www.washingtonpos ........ "source": "The Washington Post"}
{"id": "69654742-33d7-11e1-825f-dabc29fd7071", "article_url": "https://www.washingtonpost ........ "source": "The Washington Post"}
{"id": "d5966ad2-33f9-11e1-825f-dabc29fd7071", "article_url

This data contains more than 50,000 such records, one JSON object per line. The following code loads the file and stores the article text in a list of strings:

file = "filename.txt"
with open(file) as f:
    content = f.readlines()
# each line of the file is one JSON object; parse every line into a dict
data = [json.loads(x.strip()) for x in content]
# print(data)

# preprocessing ////////////////////////////////
content_list = []
for i in data:
    string_content = ""
    if "contents" in i:
        for all in i["contents"]:
            if "content" in all:
                string_content = string_content + str(all["content"])
    content_list.append(string_content)

content_list contains the complete data in a list of strings. So if there are 45000 articles, content_list has 45000 strings.

Data Preprocessing

Now we will use the python pandas library to apply some preprocessing techniques. To start with, we will try to clean our text data as much as possible. The idea is to remove the punctuation, numbers, and special characters all in one step using the regex replace("[^a-zA-Z#]", " "), which replaces everything except letters (and "#") with a space. Then we will remove shorter words because they usually don't contain useful information. Finally, we will make all the text lowercase to nullify case sensitivity.

news_df = pd.DataFrame({'document': content_list})

# removing everything except alphabets (regex=True so the pattern is treated as a regular expression)
news_df['clean_doc'] = news_df['document'].str.replace("[^a-zA-Z#]", " ", regex=True)
# removing null fields
news_df = news_df[news_df['clean_doc'].notnull()]
# removing short words
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: ' '.join([w for w in x.split() if len(w) > 3]))
# make all text lowercase
news_df['clean_doc'] = news_df['clean_doc'].apply(lambda x: x.lower())

Now we will remove the stop words from the data. First, I load NLTK's list of English stop words. Stop words are words like "a", "the", or "in" which don't convey significant meaning.

stop_words = stopwords.words('english')
stop_words.extend(['span', 'class', 'spacing', 'href', 'html', 'http', 'title', 'stats', 'washingtonpost'])
# data is from washingtonpost, and it occurs heavily in every article, thus added to stop words

# tokenization
tokenized_doc = news_df['clean_doc'].apply(lambda x: x.split())
# remove stop-words
tokenized_doc = tokenized_doc.apply(lambda x: [item for item in x if item not in stop_words])
# print(tokenized_doc)

# de-tokenization: join the tokens of each document back into a single string
detokenized_doc = [' '.join(tokens) for tokens in tokenized_doc]
# print(detokenized_doc)

Applying Tf-idf to create Document-Term Matrix

Now we have our data ready. We will apply a tf-idf vectoriser to create a document-term matrix. tf-idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection.

Tf-idf value is directly proportional to the number of appearances of a word in a document, and is offset by the number of documents containing that word, which compensates for the fact that some words appear more commonly than others in general.
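As a toy illustration of the idea (sklearn's TfidfVectorizer, used below, applies a smoothed variant of this formula and normalises the rows, so its exact numbers will differ), tf-idf for a word in a document can be computed by hand like this:

import math

docs = [
    "election results announced today",
    "football results announced",
    "election campaign continues",
]
word, doc = "election", docs[0]

tf = doc.split().count(word) / len(doc.split())   # term frequency within one document
df = sum(1 for d in docs if word in d.split())    # number of documents containing the word
idf = math.log(len(docs) / df)                    # inverse document frequency
print(tf * idf)                                   # higher = word is more distinctive for this document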

We will use sklearn’s TfidfVectorizer to create a tf-idf matrix with 10,000 terms.

from sklearn.feature_extraction.text import TfidfVectorizer  # tfidf vectorizer of scikit learn

vectorizer = TfidfVectorizer(stop_words=stop_words, max_features=10000, max_df=0.5, use_idf=True, ngram_range=(1, 3))
X = vectorizer.fit_transform(detokenized_doc)
print(X.shape)  # check shape of the document-term matrix
terms = vectorizer.get_feature_names()  # on scikit-learn >= 1.0 use vectorizer.get_feature_names_out()

(X.shape gives (22618, 10000): number of documents = 22618, number of terms = 10000.)

ngram_range: this just means I’ll look at unigrams, bigrams and trigrams. See n-grams.
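A quick way to see what ngram_range=(1,3) actually generates is to run the vectorizer's own analyzer on a short made-up phrase (the exact list depends on the stop words passed in):

analyzer = vectorizer.build_analyzer()
print(analyzer("prime minister resigns today"))
# roughly: unigrams first, then bigrams, then trigrams, e.g.
# ['prime', 'minister', 'resigns', 'today', 'prime minister', 'minister resigns',
#  'resigns today', 'prime minister resigns', 'minister resigns today']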

This document-term matrix will be used in LSA, as well as for applying k-means for clustering the documents.

Clustering text documents using k-means

In this step we will cluster the text documents using k-means algorithm. K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.
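For intuition, the assignment/update loop at the heart of k-means can be sketched in a few lines of numpy on toy 2-D data (scikit-learn's KMeans, used below, adds smarter initialisation, convergence checks and more):

import numpy as np

points = np.random.rand(100, 2)   # toy data: 100 points in 2-D
K = 3
centroids = points[np.random.choice(len(points), K, replace=False)]

for _ in range(10):   # fixed number of iterations, for simplicity
    # assignment step: each point goes to its nearest centroid
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # update step: each centroid moves to the mean of its assigned points
    centroids = np.array([points[labels == k].mean(axis=0) for k in range(K)])

print(labels[:10])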

Example results of k-means

This clustering is being used purely for plotting purposes here.

from sklearn.cluster import KMeans

num_clusters = 10
km = KMeans(n_clusters=num_clusters)
km.fit(X)
clusters = km.labels_.tolist()

The clusters list will be used for plotting. It contains labels from 0 to 9, assigning each document to one of the 10 clusters. Read more about scikit-learn's k-means here. Based on the values in clusters, we will give each document a unique colour at plotting time. This will become clearer as we move forward.
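As an optional sanity check (not part of the original pipeline), you can peek at the highest-weighted terms in each k-means cluster centre to get a feel for what the clusters are about:

import numpy as np

order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]   # term indices sorted by weight, per cluster
for c in range(num_clusters):
    top_terms = [terms[i] for i in order[c, :7]]
    print("Cluster", c, ":", ", ".join(top_terms))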

Topic Modeling

The next step is to represent each and every term and document as a vector. We will use the document-term matrix and decompose it into multiple matrices. This is basically the LSA part.

We will use sklearn's randomized_svd to perform the matrix decomposition. You need some knowledge of LSA and Singular Value Decomposition (SVD) to follow the part below. SVD is basically a factorisation of a matrix.

In the definition of SVD, an original matrix A is approximated as a product UΣV*, where U and V have orthonormal columns, and Σ is non-negative diagonal.

Read more about LSA here and scikit’s randomized_svd here.
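Before applying it to the large, sparse tf-idf matrix, the SVD identity is easy to check on a small random matrix with plain numpy (this snippet is only illustrative and is not part of the clustering pipeline):

import numpy as np

A = np.random.rand(5, 4)
U, sigma, VT = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(sigma) @ VT))   # True: the factors reproduce A

# keeping only the top-k singular values gives a low-rank approximation of A,
# which is exactly what LSA does with k "concepts"
k = 2
A_k = U[:, :k] @ np.diag(sigma[:k]) @ VT[:k, :]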

Here k = number of topics/concepts, m = number of documents, and n = number of words (the dimensions of the matrices in the SVD).

# applying lsa //////////////////////////////
from sklearn.utils.extmath import randomized_svd

U, Sigma, VT = randomized_svd(X, n_components=10, n_iter=100,
                              random_state=122)

# printing the concepts
for i, comp in enumerate(VT):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:7]
    print("Concept " + str(i) + ": ")
    for t in sorted_terms:
        print(t[0])
    print(" ")

Here, U, Sigma and VT are the three factors obtained by decomposing the matrix X (the document-term matrix from tf-idf). U is the document-concept matrix, VT is the concept-term matrix, and Sigma holds the singular values, i.e. the diagonal of the concept-concept matrix (randomized_svd returns it as a 1-D array).
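A quick shape check makes these roles explicit (the document count below assumes the 22618 × 10000 matrix from earlier; your numbers will depend on the data):

print(U.shape)      # (22618, 10)  -> document-concept matrix
print(Sigma.shape)  # (10,)        -> singular values, one per concept
print(VT.shape)     # (10, 10000)  -> concept-term matrix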

In the above code, we take 10 concepts/topics (n_components=10) and then print the seven top-weighted terms of each concept, which gives a rough idea of what each concept is about.

Topics Visualization

To find out how distinct our topics are, we should visualize them. Of course, we cannot visualize more than 3 dimensions, but techniques like PCA and t-SNE can help us project high-dimensional data down to two or three dimensions. Here we will use a relatively new technique called UMAP (Uniform Manifold Approximation and Projection).

import umap

X_topics = U * Sigma  # scale each document's concept coordinates by the singular values
embedding = umap.UMAP(n_neighbors=100, min_dist=0.5, random_state=12).fit_transform(X_topics)

plt.figure(figsize=(7, 5))
plt.scatter(embedding[:, 0], embedding[:, 1],
            c=clusters,
            s=10,  # size
            edgecolor='none'
            )
plt.show()

Here, I have used c=clusters, which colours each document's dot according to its k-means cluster. Read the official docs of UMAP to learn more about all the parameters.

Here, I am showing the output of 2500 news articles:

Another output with 10,000 news articles:

Here, each dot represents a document and the colours represent the different clusters found using k-means. Our LSA model seems to have done a good job: documents with the same colour (assigned by k-means) end up grouped together in the LSA + UMAP layout. Feel free to play around with the parameters of UMAP to see how the plot changes its shape.
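One simple way to experiment (purely a suggestion, not from the original post) is to re-fit the embedding for a few combinations of n_neighbors and min_dist and compare the plots side by side:

import itertools

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, (nn, md) in zip(axes.ravel(), itertools.product([15, 100], [0.1, 0.5])):
    emb = umap.UMAP(n_neighbors=nn, min_dist=md, random_state=12).fit_transform(X_topics)
    ax.scatter(emb[:, 0], emb[:, 1], c=clusters, s=10, edgecolor='none')
    ax.set_title("n_neighbors=%d, min_dist=%.1f" % (nn, md))
plt.show()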
