Gensim LDA Topic Modeling for Article Discovery

Using Machine Learning to Create a Covid-19 Research Tool

Werbenschmidty
Analytics Vidhya
6 min read · Dec 27, 2020


Plate Notation for LDA Model (https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation)

Table of Contents:

  1. Project Overview
  2. Imports
  3. Text Cleaning Method
  4. Creating the Pandas Data Frame
  5. Training the LDA Model
  6. Creating the Prompt Corpora
  7. Computing the Topic Distributions
  8. Retrieving Related Articles
  9. Validation
  10. Next Steps and Helpful Links

Project Overview:

The purpose of this project is to use LDA topic modeling to find scientific journal articles related to a prompt. I coded this in a notebook provided by Kaggle and used a dataset of thousands of scientific journal articles, also provided by Kaggle. By the end of this tutorial you will be able to input a paragraph of information and retrieve articles of a similar nature.

Imports:

These are all the imports that I used in the project. Most are self-explanatory; others will be discussed as they come up.
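Since the original notebook isn't embedded here, below is a rough reconstruction of the imports the rest of the walkthrough assumes (gensim, pandas, NumPy, SciPy, and the standard library); treat the exact list as an assumption rather than the original code.

```python
import os
import re
import json
import random

import numpy as np
import pandas as pd

from gensim import corpora
from gensim.models import LdaModel
from gensim.parsing.porter import PorterStemmer
from gensim.parsing.preprocessing import STOPWORDS

from scipy.spatial.distance import jensenshannon
```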

Text Cleaning Method:

As people, we automatically parse the words on the screen and don't let punctuation or formatting change how we read them. Cleaning parses out unnecessary noise like punctuation, numbers, and symbols so that only words are left.

Using Python's re library, we can write regular expressions that find unwanted patterns in strings and remove them. Check out https://regex101.com to test your own.
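As a concrete example, a minimal cleaning method along these lines (the exact regular expressions in the original notebook may differ) could look like this:

```python
def clean_text(text):
    """Remove noise so that only lowercase words remain (one possible cleaning recipe)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)      # drop punctuation, numbers, and symbols
    text = re.sub(r"\s+", " ", text).strip()   # collapse repeated whitespace
    return text
```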

Creating the Pandas Data Frame:

The Pandas library provides an easy-to-use, intuitive data structure. All we have to do is populate a dictionary object and call pd.DataFrame().

Collecting all filepaths

Kaggle stores the articles as JSON files in nested subdirectories. The easiest way to access them is to add the full filepath of each .json file to a list and then repeatedly call json.load() on the opened files.
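A sketch of the filepath collection, assuming the dataset lives under the usual Kaggle input directory (the root path is an assumption; point it at wherever your copy of the dataset sits):

```python
# Walk the Kaggle input directory and collect every .json article path.
root_dir = "/kaggle/input"

json_paths = []
for dirpath, _, filenames in os.walk(root_dir):
    for name in filenames:
        if name.endswith(".json"):
            json_paths.append(os.path.join(dirpath, name))

print(f"Found {len(json_paths)} articles")
```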

There is a lot going on in this piece of code, but it's actually not too complicated. We are just retrieving the data from each JSON file, cleaning, stemming, and tokenizing it, and then adding the result to the dictionary (a sketch of the whole loop appears at the end of this section).

Stemming parses the string and removes endings so that only the stem of each word is left. The sentence "runners running fast run faster than runners running slow" becomes something like "run run fast run faster than run run slow" (the exact output depends on the stemmer). This helps because as people we understand the semantic connection between runners, running, and run, but for ML algorithms we have to make that connection a little more obvious.

Tokenization makes every separate word its own token. This format makes it easy to count word frequencies and to create dictionaries.

Finally, we remove any tokens that are contained in the STOPWORDS list. Stopwords are words so common that they provide almost no meaningful contribution to the model. Gensim provides a list of the most common ones that can be used by simply importing it.

At this step, all of our readable scientific articles have become not much more than blocks of incoherent almost-words.
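Here is a sketch of the whole loading and preprocessing loop described above. The JSON field names ("metadata"/"title" and "body_text") follow the CORD-19 layout and, like the helper names, are assumptions rather than the original code:

```python
stemmer = PorterStemmer()

def preprocess(text):
    """Clean, tokenize, stem, and remove stopwords."""
    tokens = clean_text(text).split()
    return [stemmer.stem(tok) for tok in tokens if tok not in STOPWORDS]

# Populate a dictionary of columns, then hand it to pandas.
data = {"title": [], "tokens": []}

# The walkthrough trains on a random sample of 2000 articles.
for path in random.sample(json_paths, min(2000, len(json_paths))):
    with open(path) as f:
        article = json.load(f)
    title = article.get("metadata", {}).get("title", "")
    body = " ".join(section["text"] for section in article.get("body_text", []))
    data["title"].append(title)
    data["tokens"].append(preprocess(title + " " + body))

df = pd.DataFrame(data)
```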

Training the LDA Model:

To train the LDA model we need four things: the id2word dictionary, the corpus, the number of topics, and the number of passes.

Id2word Dictionary: This dictionary assigns an id to every distinct word in the collection of texts that is inputted.

Corpus: The corpus is the frequency of every word in the dictionary for every document that the model will be trained on.

Number of Topics: The number of topics is the number of probability clusters the model should segregate the words into. For this model we will go with a relatively high number of topics: 150.

Number of Passes: The number of times the training model will run through the entire corpus updating probabilities.
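A minimal training sketch using those four ingredients (the number of passes shown here is an assumption; tune it against your run time):

```python
# Build the id2word dictionary and the bag-of-words corpus, then train the model.
id2word = corpora.Dictionary(df["tokens"])
corpus = [id2word.doc2bow(tokens) for tokens in df["tokens"]]

lda_model = LdaModel(
    corpus=corpus,
    id2word=id2word,
    num_topics=150,   # number of probability clusters
    passes=10,        # passes over the full corpus; this value is an assumption
    random_state=42,
)
```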

Creating the Prompt Corpora:

Every prompt has a subdirectory full of relevant articles, so we can write a method to collect and combine their titles. Then we can concatenate that with the original prompt to create a final corpus that we can use as a comparison when looking for other related articles.

Then we have to clean, stem and tokenize the strings.
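One possible version of that method, assuming each prompt's relevant articles sit in their own subdirectory of JSON files (the directory layout and function name are assumptions):

```python
def build_prompt_tokens(prompt_text, prompt_dir):
    """Combine the prompt with the titles of its known-relevant articles, then preprocess."""
    titles = []
    for name in os.listdir(prompt_dir):
        if name.endswith(".json"):
            with open(os.path.join(prompt_dir, name)) as f:
                article = json.load(f)
            titles.append(article.get("metadata", {}).get("title", ""))
    combined = prompt_text + " " + " ".join(titles)
    return preprocess(combined)
```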

Computing the Topic Distributions:

First we create a helper method that returns the topic distribution for a given article. Then we call it on each article and save the results to a list that we can add to the data frame as a new column.

What exactly is a topic? In this specific case a topic is a probability distribution of words. Related words will have higher probabilities for similar topics.

These aren't complete topic distributions because they don't account for all 150 topics (gensim drops topics whose probability falls below a small threshold). So we repeat the process with the other helper method, create_full_vectors, which creates the full probability vectors.
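A sketch of both helpers. The name create_full_vectors comes from the article; get_topic_distribution is a hypothetical name for the first helper:

```python
def get_topic_distribution(tokens):
    """Return the (topic_id, probability) pairs the model reports for one document.
    Gensim drops topics below a small probability threshold, so this list is usually short."""
    bow = id2word.doc2bow(tokens)
    return lda_model.get_document_topics(bow)


def create_full_vectors(topic_dist, num_topics=150):
    """Expand the sparse (topic_id, probability) pairs into a dense vector over all topics."""
    vec = np.zeros(num_topics)
    for topic_id, prob in topic_dist:
        vec[topic_id] = prob
    return vec


df["topic_vector"] = [create_full_vectors(get_topic_distribution(t)) for t in df["tokens"]]
```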

Retrieving Related Articles:

To compare articles we will use the Jensen-Shannon distance provided by the SciPy library. The lower the distance between two articles, the more related they are.
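A minimal retrieval sketch using SciPy's jensenshannon function to rank the articles against a prompt vector (the helper name is an assumption):

```python
def most_related(prompt_tokens, df, top_n=10):
    """Rank articles by Jensen-Shannon distance to the prompt's topic vector
    (smaller distance means more related)."""
    prompt_vec = create_full_vectors(get_topic_distribution(prompt_tokens))
    distances = [jensenshannon(prompt_vec, vec) for vec in df["topic_vector"]]
    return df.assign(distance=distances).sort_values("distance").head(top_n)
```

Calling most_related(build_prompt_tokens(prompt_text, prompt_dir), df) would then return the ten closest articles for that prompt.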

Validation:

Check out the results and see how related the top documents are to the prompt. For our model we only used a random sample of 2,000 articles, so we might miss out on some relevant articles.

Next Steps:

This model can be altered to improve performance in a couple different ways.

  • Perform some EDA and choose more specific stopwords to filter out. You can do this by plotting a histogram of how many documents contain a given word at least x times. If a word appears at least x times in around 80% of documents, it's probably safe to remove.
  • Test out different numbers of topics to train on. We used 150, but see how well 100 or 50 topics work out.
  • Train and test on more than 2000 documents. This will give you a more accurate model, and it will also give you a larger bank of documents that could hold more relevant articles.
  • Try messing around with different forms of input prompts. Maybe short, keyword-loaded queries will work better than large clumps of loosely relevant information?
  • Bigrams and trigrams: in addition to creating a dictionary entry for every unique word, also create one for every unique pair of words and every unique three-word sequence (see the sketch below). You can get more information out of every document, but beware the training times.
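For the last point, gensim's Phrases model can merge frequent word pairs into single tokens before the dictionary is built; a rough sketch (min_count and threshold are assumptions to tune):

```python
from gensim.models.phrases import Phrases, Phraser

# Learn frequent word pairs from the token lists and merge them into single tokens,
# e.g. turning a common pair like "viral" "load" into one token "viral_load".
bigram_model = Phraser(Phrases(df["tokens"], min_count=5, threshold=10.0))
df["tokens"] = [bigram_model[tokens] for tokens in df["tokens"]]
# Running the Phrases step again on the bigrammed tokens would also capture trigrams.
```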
