Topic modeling of NASA’s documents using Latent Dirichlet Allocation (LDA)

Sanjoy Basu
5 min read · Jul 16, 2018


Introduction:

Topic modeling is a suite of algorithms that aim to discover thematic information in large archives of documents. Topic modeling algorithms are statistical methods that analyze the words of the original texts to discover the themes.
Latent Dirichlet Allocation (LDA) is one such topic modeling algorithm, developed by Dr David M Blei (Columbia University), Andrew Ng (Stanford University) and Michael Jordan (UC Berkeley). LDA is an unsupervised machine learning technique in the field of Natural Language Processing.
In this article I will try to develop the intuition behind LDA without getting into the mathematical details. We will then use Python to do topic modeling of publicly available documents from NASA's website.

Intuition:

The intuition behind LDA is that documents exhibit multiple topics. Dr Blei and his team used the article “Seeking Life’s Bare (Genetic) Necessities” to illustrate LDA. They color-coded the words that appeared in the document.

The group of words in each color code represents a topic. A topic can be formally defined as a distribution over a vocabulary. For example, in the above document the distribution of words such as computer, prediction and computational, marked in blue, is about the topic data analysis. Similarly, genes, genome and sequencing, marked in yellow, form the distribution of the topic genetics, and so on. So this article is a blend of multiple topics. If Dr Blei and his team took the time to highlight every word (excluding stop words) in the article, we would see that this article blends genetics, data analysis, and evolutionary biology in different proportions. LDA casts this intuition into a probabilistic model of texts.

LDA assumes that the topics are specified before any data has been generated. LDA assumes each document arises as follows:

  1. Choose a distribution over topics (the bar chart on the right of Blei’s figure).
  2. Choose a topic from that distribution (represented by the yellow, blue, and pink buttons).
  3. Choose a word from the corresponding topic’s distribution over the vocabulary. For example, the yellow distribution will pick words related to genetics. Repeat this for every word in the document.
  4. Turning to a new document, choose a new distribution over topics (perhaps neuroscience and data analysis), and then choose its words as in the previous steps.

The population of topics remains the same from document to document; the idea behind LDA is that how much each document exhibits those topics changes. So a document can be viewed as a mixture model of topics. The machine learning problem is that we don’t observe this structure; we assume the structure is already there. The algorithm reveals the structure by inferring the values of the hidden (latent) variables, such as the topic proportions of each document and the distribution over the vocabulary for each topic.
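To make this generative story concrete, here is a minimal simulation sketch. The vocabulary and per-topic word distributions are made-up illustrations, not values from Blei’s paper:

import numpy as np

rng = np.random.default_rng(0)

# hypothetical vocabulary and per-topic word distributions (each row sums to 1)
vocab = ['gene', 'genome', 'computer', 'prediction', 'organism', 'life']
topics = np.array([
    [0.40, 0.40, 0.00, 0.00, 0.10, 0.10],  # "genetics"
    [0.00, 0.10, 0.50, 0.40, 0.00, 0.00],  # "data analysis"
    [0.10, 0.00, 0.00, 0.00, 0.40, 0.50],  # "evolutionary biology"
])

def generate_document(n_words, alpha=0.5):
    # step 1: choose this document's distribution over topics
    theta = rng.dirichlet(alpha * np.ones(len(topics)))
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)     # step 2: choose a topic
        w = rng.choice(len(vocab), p=topics[z])  # step 3: choose a word from it
        words.append(vocab[w])
    return theta, words

theta, doc = generate_document(10)
print(np.round(theta, 2), doc)

LDA inference runs this story in reverse: given only the documents, it recovers the per-document theta and the per-topic word distributions.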

Topic Modeling using Python:

We start by importing the Python libraries we need.

import pandas as pd
import numpy as np
# NLTK
from nltk.tokenize import RegexpTokenizer
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import re
# Visualization
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import matplotlib
%matplotlib inline
import seaborn as sns
# OS
import os
# Gensim
from gensim import corpora, models
import pyLDAvis.gensim

As with any data, we have to preprocess the text before we model it. Our preprocessing steps include tokenizing and eliminating stop words. Before we pull the data, we will create a function for preprocessing so that the data is processed as we ingest it.

def text_process(text):
    '''
    Takes in a string of text, then performs the following:
    1. Tokenizes and removes punctuation
    2. Removes stopwords
    3. Returns a list of the cleaned text
    '''
    if pd.isnull(text):
        return []
    # tokenizing and removing punctuation
    tokenizer = RegexpTokenizer(r'\w+')
    text_processed = tokenizer.tokenize(text)

    # removing any stopwords
    text_processed = [word.lower() for word in text_processed
                      if word.lower() not in stopwords.words('english')]

    # drop the stray 'b' token left over from byte-string markers, if present
    try:
        text_processed.remove('b')
    except ValueError:
        pass
    return text_processed
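
A quick sanity check on a sample sentence (the expected output assumes NLTK’s English stop-word list has been downloaded):

print(text_process('The Mars rover is exploring the red planet'))
# -> ['mars', 'rover', 'exploring', 'red', 'planet']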

We will now ingest the data stored locally and preprocess it using the above function.

docs_dir = './docs/'
docs_loc = os.listdir(docs_dir)
texts = []
for docs in docs_loc:
    with open(docs_dir + docs) as doc:
        content = doc.read()
        texts.append(text_process(content))

Before we model, let us get a glimpse of what our data looks like.
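
The original post showed a screenshot here; a minimal equivalent peek at the ingested data:

# how many documents did we ingest, and what do the tokens look like?
print(len(texts), 'documents')
print(texts[0][:20])  # first 20 tokens of the first document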

Now we will fit our LDA model to the data.

pyLDAvis.enable_notebook()
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = models.ldamodel.LdaModel(corpus,                    # pass in our corpus
                                    id2word=dictionary,        # matches each word to its "number" or "spot" in the dictionary
                                    num_topics=3,              # number of topics T to find
                                    passes=5,                  # number of passes through corpus; similar to number of epochs
                                    minimum_probability=0.01)  # only include topics above this probability threshold

Now it is time to visualize how LDA modeled the topics. As can be seen from the above code, we passed the parameter num_topics as 3, so our model will give us 3 topics. We will go through the visualization of each topic.
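
The interactive visualizations below were rendered with pyLDAvis; a short sketch of the calls that produce them, plus a plain-text summary of the topics:

# print the top words of each learned topic
for topic in ldamodel.print_topics(num_topics=3, num_words=10):
    print(topic)

# render the interactive inter-topic distance map in the notebook
pyLDAvis.gensim.prepare(ldamodel, corpus, dictionary)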

We see that words such as ozone, atmosphere, flight, and emission tell us that our documents contain a topic on research related to the ozone layer in Earth’s atmosphere.

Next, let us see what topic 2 tells us about our documents.

We see that the distribution of words such as flights, supersonic, commercial, and sonic tells us this topic relates to aeronautical engineering. Finally, we will visualize topic 3.

A distribution of words such as mars, mission, and rover reveals the presence of a topic related to Mars exploration.
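
Since each document is a mixture of these topics, we can also inspect the per-document topic proportions; a short sketch (the printed proportions are illustrative, not actual results):

# topic mixture of an individual document, as (topic_id, proportion) pairs
print(ldamodel.get_document_topics(corpus[0]))
# e.g. [(0, 0.83), (2, 0.16)] -- proportions vary by document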

Conclusion:

Topic modeling with algorithms like LDA is unsupervised machine learning because it discovers hidden (latent) topics in a corpus of documents. Topic modeling finds applications in publishing, content recommendation, and any domain that handles large numbers of documents, such as legal and scientific research.

Reference

http://www.cs.columbia.edu/~blei/talks/Blei_ICML_2012.pdf
