Unsupervised Topic Modeling Using Natural Language Processing (NLP)

Crystal Huang · Published in Nerd For Tech · 6 min read · Jun 1, 2021

For the fifth module’s project at Metis Data Science Bootcamp, we were tasked with building unsupervised learning models for structure finding, topic modeling, and/or a recommendation system in any domain of interest, using data that is primarily textual.

For my project, I applied Latent Dirichlet Allocation (LDA) for topic modeling on a large collection of textual data, and I was amazed at how powerful it can be.

Before jumping into the project, I’d like to share some helpful articles here instead of reinventing the wheel:

  • Natural Language Processing (NLP)
  • Topic Modeling

Disclaimer: I am new to machine learning and also to blogging. So, if there are any mistakes, please do let me know. All feedback is appreciated.

Backstory and Project Goal

Because of the rapid increase in scientific literature around COVID-19, it is hard to keep up with the newest publications. Scientists and researchers are overwhelmed by the volume and struggle to find articles that are relevant to their work. In addition, the limitations of scientific conferences make it even harder to collaborate and stay up to date. Yet staying aware of ongoing research is crucial for finding relevant publications.

So, the goal of this project is to build an unsupervised NLP model (topic modeling and/or a recommendation system) that helps researchers navigate the current surge of papers about COVID-19, find what is relevant to their work, and uncover hidden semantic relationships.

Data

CORD-19 https://www.semanticscholar.org/cord19/download

The data used in this project came from the COVID-19 Open Research Dataset, which is a collection of over 500,000 scholarly articles about the novel coronavirus for use by the global research community.

I took the subset of articles from January 2020 to May 2021, which is about 260,000 articles, and used the articles’ abstracts as the text for this project.
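A minimal sketch of this filtering step, assuming the standard CORD-19 `metadata.csv` columns (`cord_uid`, `publish_time`, `abstract`); the rows here are made-up stand-ins so the snippet runs standalone instead of reading the real file:

```python
import pandas as pd

# Hypothetical rows mirroring the CORD-19 metadata.csv schema
meta = pd.DataFrame({
    "cord_uid": ["a1", "b2", "c3"],
    "publish_time": ["2019-11-03", "2020-06-15", "2021-03-02"],
    "abstract": ["Pre-pandemic study ...", "COVID-19 serology ...", None],
})

# Parse dates, keep Jan 2020 - May 2021, and drop articles with no abstract
meta["publish_time"] = pd.to_datetime(meta["publish_time"], errors="coerce")
subset = meta[(meta["publish_time"] >= "2020-01-01")
              & (meta["publish_time"] <= "2021-05-31")].dropna(subset=["abstract"])
print(len(subset))  # 1 -- only the mid-2020 article survives both filters
```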

Tools and Approaches Used

  • Python (pandas, numpy)
  • langdetect
  • regex, string
  • spaCy, scispaCy (“en_core_sci_lg” model for biomedical, scientific, and clinical vocabulary)
  • NLTK
  • scikit-learn
  • Gensim — LDA
  • WordCloud
  • pyLDAvis
  • Streamlit
  • Heroku

Methodology

Text Preprocessing

  1. Filtered out non-English articles with langdetect
  2. Used regex and string for simple cleaning
  3. Used scispaCy (a spaCy package for biomedical, scientific, and clinical text) and its “en_core_sci_lg” model for part-of-speech tagging
  4. Used gensim for phrase detection
  5. Tokenized the text, removed stopwords with NLTK, lemmatized, and kept only nouns, adjectives, verbs, and adverbs
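The simple cleaning in step 2 can be sketched with the standard library alone; this is an illustrative `basic_clean` helper (a name I’m introducing here, not from the project code), assuming the usual lowercase/URL/digit/punctuation scrubbing:

```python
import re
import string

def basic_clean(text: str) -> str:
    """Lowercase, then strip URLs, digits, and punctuation (step 2 of the pipeline)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # drop URLs
    text = re.sub(r"\d+", " ", text)            # drop digits
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

print(basic_clean("COVID-19 cases rose 300% (see https://example.org)."))
# -> "covid cases rose see"
```

The POS filtering and lemmatization in steps 3 and 5 would then run on this cleaned text via spaCy and NLTK.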

Data Transformation

  1. Dictionary (token-to-id mapping) — with gensim.corpora
  2. Corpus — each document as a bag-of-words vector
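To make the transformation concrete, here is a pure-Python illustration of what gensim’s `corpora.Dictionary` and `doc2bow` produce: a token-to-id map, then each document as `(token_id, count)` pairs. The tiny documents are made up for the example:

```python
# Two toy tokenized documents standing in for preprocessed abstracts
docs = [["virus", "vaccine", "trial"], ["vaccine", "dose", "trial", "trial"]]

# Dictionary: assign each unique token an integer id (what corpora.Dictionary does)
token2id = {}
for doc in docs:
    for tok in doc:
        token2id.setdefault(tok, len(token2id))

# Corpus: each document becomes sorted (token_id, count) pairs (what doc2bow does)
corpus = [sorted({(token2id[tok], doc.count(tok)) for tok in doc}) for doc in docs]

print(token2id)   # {'virus': 0, 'vaccine': 1, 'trial': 2, 'dose': 3}
print(corpus[1])  # [(1, 1), (2, 2), (3, 1)] -- 'trial' appears twice
```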

Topic Modeling

I built my topic model using LDA from gensim, since I wanted to discover latent relationships in the corpus.

I first tried 10 topics for my base model, but the topics were either too general or overlapping. While subjective inspection can be useful for evaluating a topic model, it was challenging and time-consuming for a dataset this large. So I used the coherence score to find the optimal number of topics, which turned out to be 28 (coherence score: 0.523 vs. the baseline’s 0.483).
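The model-selection loop amounts to training one LDA per candidate topic count and keeping the count with the highest coherence. In the sketch below, `coherence_for` stands in for training a gensim LDA and calling `CoherenceModel(...).get_coherence()`; the endpoint scores (0.483 at the 10-topic baseline, 0.523 at 28) come from the post, while the intermediate values are hypothetical so the loop runs standalone:

```python
def coherence_for(k: int) -> float:
    # Stand-in for: train LdaModel(num_topics=k) and score it with CoherenceModel.
    # 0.483 and 0.523 are the baseline/final scores reported above; the rest are made up.
    scores = {10: 0.483, 16: 0.501, 22: 0.515, 28: 0.523, 34: 0.510}
    return scores[k]

candidates = [10, 16, 22, 28, 34]
best_k = max(candidates, key=coherence_for)  # pick k maximizing coherence
print(best_k)  # 28, matching the final model
```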

Findings and Insights

Model Interpretation and Visualization

For interpretation, I used pyLDAvis to see the topics’ similarity and relative frequency in the corpus.

A quick explanation of pyLDAvis: there are three important features of a pyLDAvis graph. First, each circle is a topic, and its area represents the topic’s prevalence, so the larger the circle, the more articles are about that topic. Second, the distance between two circles reflects topic similarity: the farther apart they are, the more different the topics. Lastly, the right side lists the most relevant terms, which indicate the meaning of each topic. To learn more about the details of pyLDAvis, here’s a good resource.

At first glance at the pyLDAvis output, there appeared to be some overlap between topics in my model, so I investigated those overlapping groups. I found that even though there is some overlap, the topics are still distinguishable and meaningful. Overall, I’m satisfied with the final model’s performance.

Application Usage

With the LDA model, I assigned each article a dominant topic along with its relevance to that topic, then grouped articles by topic for the recommendation system. This lets researchers look up articles by the topic most related to their work.
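The dominant-topic assignment can be sketched as taking the argmax over each article’s topic distribution. The distributions below are made-up stand-ins for what gensim’s `get_document_topics` returns, and the paper ids are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-article topic distributions: lists of (topic_id, weight)
doc_topics = {
    "paper_a": [(3, 0.71), (12, 0.18), (27, 0.11)],
    "paper_b": [(12, 0.64), (3, 0.30), (5, 0.06)],
}

# Assign each article its highest-weight topic and group articles by that topic
by_topic = defaultdict(list)
for doc_id, dist in doc_topics.items():
    topic, weight = max(dist, key=lambda tw: tw[1])
    by_topic[topic].append((doc_id, weight))

print(dict(by_topic))  # {3: [('paper_a', 0.71)], 12: [('paper_b', 0.64)]}
```

Sorting each topic’s list by weight would then surface the most on-topic articles first in the recommender.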

I used Streamlit to develop the application and deployed it on Heroku with a smaller dataset (due to GitHub’s file size limit).

Here’s the app if you’d like to check it out!

Future Potential

Although the model is far from perfect, it showcases how topic modeling can be used to recommend articles in the same topic space.

More work can definitely be done on NLP preprocessing and on tuning more hyperparameters. I would also like to see how research trends change over time.

The recommender can always be extended: recommending based on keywords, adding filters by author or time period, and so on. It would also be worth trying different models for the recommender, such as a higher-dimensional SVD-style model that may make more precise comparisons between articles when matching them.

References

  1. Brainard, J. 2020. “Scientists are Drowning in COVID-19 Papers. Can New Tools Keep Them Afloat?” Science. https://www.sciencemag.org/news/2020/05/scientists-are-drowning-covid-19-papers-can-new-tools-keep-them-afloat#
  2. Lee, J. 2020. “Benchmarking Language Detection for NLP.” Towards Data Science, Medium. https://towardsdatascience.com/benchmarking-language-detection-for-nlp-8250ea8b67c
  3. Kapadia, S. 2019. “Evaluate Topic Models: Latent Dirichlet Allocation (LDA).” Towards Data Science, Medium. https://towardsdatascience.com/evaluate-topic-model-in-python-latent-dirichlet-allocation-lda-7d57484bb5d0
  4. Tran, K. 2021. “pyLDAvis: Topic Modelling Exploration Tool That Every NLP Data Scientist Should Know.” Neptune Blog. https://neptune.ai/blog/pyldavis-topic-modelling-exploration-tool-that-every-nlp-data-scientist-should-know
  5. Wang, L, et al. 2020. “CORD-19: The COVID-19 Open Research Dataset.” Association for Computational Linguistics. https://www.aclweb.org/anthology/2020.nlpcovid19-acl.1

Takeaways

Overall, I’m impressed by the power of NLP and unsupervised learning. Unsupervised learning is a very open-ended subject: it opens up a lot of possibilities but can be challenging to evaluate and interpret. The coherence score and pyLDAvis were helpful in this case. Still, it’s always good to explore various ways to visualize and evaluate the model.


Thanks for reading :) Hope it was interesting and insightful to you.

You can find my project work on my GitHub repo.
