Clean up your topic models with Watson NLP

Alexander Lang
3 min read · Sep 6, 2022


scikit-learn + Watson = Club Topicana!

I like the topic modelling capabilities of scikit-learn, especially Non-Negative Matrix Factorization, aka NMF. Many data scientists use LDA, but I found NMF to produce more clearly distinguishable topics when the underlying documents share some key terms (for example, social media posts about the same brand or product).

But be it LDA or NMF, the quality of topic models improves by reducing the number of words the algorithm has to consider. Stopword filtering is a start, but with Watson NLP, you can do even better!

Consider the excellent topic modelling example from scikit-learn. Using the Kullback-Leibler divergence on a newsgroup dataset, NMF already produces some fairly good topics. But some topics contain terms like don or ve, or a lot of numbers, which makes the topics harder to understand.

Topic Modeling based on scikit-learn’s newsgroups dataset

These topics are the result of running the following parts of the topic modelling example:
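What follows is a condensed sketch of those parts, with the document sample size, feature count and NMF settings taken from the scikit-learn example (the alpha_W and alpha_H regularization arguments assume scikit-learn 1.0 or later):

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

n_samples, n_features, n_components = 2000, 1000, 10

# Load the newsgroups corpus, stripping headers, signatures and quoted replies
data, _ = fetch_20newsgroups(
    shuffle=True,
    random_state=1,
    remove=("headers", "footers", "quotes"),
    return_X_y=True,
)
data_samples = data[:n_samples]

# TF-IDF features, filtered with the built-in English stopword list
tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95, min_df=2, max_features=n_features, stop_words="english"
)
tfidf = tfidf_vectorizer.fit_transform(data_samples)

# NMF with the generalized Kullback-Leibler divergence
nmf = NMF(
    n_components=n_components,
    random_state=1,
    beta_loss="kullback-leibler",
    solver="mu",
    max_iter=1000,
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=0.5,
).fit(tfidf)
```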

Now, let’s “clean up” this model. Fortunately, the Syntax block of Watson NLP integrates easily into scikit-learn. It allows me to

  • Limit my analysis to words that have a particular part-of-speech, such as nouns or adjectives. For my topic models, I often use only nouns or proper nouns.
  • Use the lemma of a word, instead of its surface form that appears in the text. This way, both singular (“mouse”) and plural forms (“mice”) of a word get mapped to the same form (“mouse”). Unlike word stems that stemming algorithms produce, lemmas are actual words that users of the topic model immediately understand.

To use Watson NLP in scikit-learn, I first create the custom tokenization function watson_tokenizer in my Notebook:
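A sketch of that function, assuming the stock English syntax model syntax_izumo_en_stock and the dictionary form of the syntax prediction (the exact field names may differ between watson_nlp versions):

```python
import watson_nlp

# Load the stock English syntax model once, outside the tokenizer,
# so it is not re-loaded for every document
syntax_model = watson_nlp.load("syntax_izumo_en_stock")

def watson_tokenizer(text):
    # Tokenize, lemmatize and part-of-speech tag the text
    syntax_result = syntax_model.run(
        text, parsers=("token", "lemma", "part_of_speech")
    )
    # Keep only nouns and proper nouns; prefer the lemma over the surface form
    return [
        token["lemma"] if token.get("lemma") else token["span"]["text"]
        for token in syntax_result.to_dict()["tokens"]
        if token["part_of_speech"] in ("POS_NOUN", "POS_PROPN")
    ]
```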

This function receives the text to tokenize. It then runs Watson NLP’s syntax model for English to obtain the individual tokens, the lemma of each token, and its part-of-speech tag. Then it collects the tokens tagged as Noun or Proper Noun. If a token has a lemma, it returns the lemma; otherwise, it returns the surface form. Running it yields

Analyzing fairly intelligent rodents
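For illustration, a hypothetical call on that phrase (assuming the tagger treats Analyzing as a verb, fairly as an adverb and intelligent as an adjective, only the noun survives, reduced to its lemma):

```python
watson_tokenizer("Analyzing fairly intelligent rodents")
# Expected output (hypothetical): ['rodent']
```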

To integrate watson_tokenizer into scikit-learn, I specify it as tokenizer when creating the scikit-learn Vectorizer that is used to pre-process the text for topic modeling.
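A sketch of that wiring; passing token_pattern=None just silences scikit-learn’s warning that the default token pattern goes unused once a custom tokenizer is set:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer(
    max_df=0.95,
    min_df=2,
    max_features=n_features,
    tokenizer=watson_tokenizer,  # Watson NLP instead of the stopword list
    token_pattern=None,          # default pattern is unused with a custom tokenizer
)
```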

Then, I fit the NMF model based on the new TF-IDF values:
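The fitting step itself is unchanged from the sketch above:

```python
from sklearn.decomposition import NMF

# Recompute the TF-IDF matrix with the Watson NLP tokenizer, then refit NMF
tfidf = tfidf_vectorizer.fit_transform(data_samples)

nmf = NMF(
    n_components=n_components,
    random_state=1,
    beta_loss="kullback-leibler",
    solver="mu",
    max_iter=1000,
    alpha_W=0.00005,
    alpha_H=0.00005,
    l1_ratio=0.5,
).fit(tfidf)
```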

The change is minimal — but the result is arguably better:

Topic Modelling of newsgroups, using noun lemmas from Watson NLP

To repeat: both results were achieved by running the scikit-learn topic modelling example within IBM Watson Studio, on the same document sample (2000 documents), with the same number of features (1000). All I changed was using a custom tokenizer instead of the scikit-learn stopword list.

This approach works not only for NMF topic modelling: I can use Watson NLP’s syntax block as a custom tokenizer in every scikit-learn Vectorizer that accepts a tokenizer attribute. This way, I can also improve LDA models, document clustering or text classification with the higher-quality text features from Watson NLP. However, for text classification, I recommend using Watson NLP all the way. :-)

So, to give your topic models a nicer tan, go to IBM Cloud Pak for Data as a Service, launch a notebook with the NLP environment — and Welcome to Club T(r)opicana!


Alexander Lang

Architect in the IBM Watson Studio Team. Experience in Data Science, NLP and Social Media Analytics