Hierarchical Topic Modeling Using Watson NLP

Published in IBM Data Science in Practice · 6 min read · Nov 8, 2022

What is Topic Modeling?
Topic modeling is an unsupervised machine learning technique that converts unstructured content into a structured form by detecting word and phrase patterns and grouping similar documents together. It clusters groups of words and similar expressions that best characterize a set of documents. Topic modeling is also known as topic extraction, topic analysis, and topic detection.

Watson NLP Topic Modeling supports a hierarchical clustering algorithm, which allows the model to determine the ideal number of clusters without human intervention. It also describes each topic with keywords, which makes the topics easier to interpret. In this blog, you will see, step by step, how to analyze consumer financial data and build a topic model using the Watson NLP library.

1. Collecting the dataset

As an example, we use consumer financial complaint data from the Consumer Complaint Database.
Once you have downloaded this dataset, you can upload it to your Watson Studio instance: go to the Assets tab and drop the data files there.

Once the dataset has been uploaded to the Watson Studio project, you can read the CSV file into a pandas DataFrame from the notebook.
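A minimal sketch of that read step follows. The inline CSV text stands in for the uploaded file, and the column names other than ‘date_received’ are assumptions made for illustration; in a notebook you would pass the path or file handle of the uploaded asset to pd.read_csv instead.

```python
import io

import pandas as pd

# Stand-in for the uploaded CSV file. The 'company' and 'complaint_text'
# column names are assumed for illustration only.
csv_source = io.StringIO(
    "date_received,company,complaint_text\n"
    "2022-03-01,Company A,Item received was counterfeit\n"
    "2022-03-15,Company B,Refund not credited to my account\n"
    "2022-04-02,Company A,Account balance is wrong\n"
)

# In a notebook, replace csv_source with the path or handle of the uploaded asset.
complaints_df = pd.read_csv(csv_source)

print(complaints_df.shape)
print(complaints_df.columns.tolist())
```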

2. Data processing and Exploratory Data Analysis

To process and better understand this data, we group it by month and by company.

2.1 Month & Year-Wise Data
To group the data by month, we converted the ‘date_received’ column to a datetime type and added two new columns, ‘Month’ and ‘Year’.
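This step might look roughly like the following in pandas. The sample dates are made up, and the exact date format in the real dataset is an assumption.

```python
import pandas as pd

# Toy data; the real 'date_received' values come from the complaint dataset.
df = pd.DataFrame({"date_received": ["2022-03-01", "2022-03-15", "2022-04-02"]})

# Parse the strings into datetime values so that date parts can be extracted.
df["date_received"] = pd.to_datetime(df["date_received"])

# Add the two new columns derived from the parsed dates.
df["Month"] = df["date_received"].dt.month
df["Year"] = df["date_received"].dt.year

print(df)
```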

2.2 Company-Wise Data

To group the data by company, we extracted the “Top 20 Companies” by frequency, that is, by how often each company appears in the data.
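One way to pick the most frequent companies is with value_counts, sketched below. The ‘company’ column name and the helper function are hypothetical; the toy example uses n=2 so the effect is visible on six rows.

```python
import pandas as pd


def filter_top_companies(df, n=20, column="company"):
    """Return the n most frequent companies and the rows that belong to them."""
    # value_counts sorts by frequency, most frequent first.
    top_names = df[column].value_counts().head(n).index.tolist()
    return top_names, df[df[column].isin(top_names)]


# Toy data with an assumed 'company' column.
toy_df = pd.DataFrame({"company": ["A", "B", "A", "C", "A", "B"]})
top_names, top_df = filter_top_companies(toy_df, n=2)

print(top_names)
print(len(top_df))
```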

2.3 Text pre-processing
Our first step is to pre-process the documents in a way that removes distracting signals and makes them easier to process and analyze. This is a standard step in many NLP pipelines. Here we perform stop-word filtering, removal of some common patterns, and lemmatisation. We use Watson NLP's predefined stop-word list, which you can shorten or extend as needed. You can download this list using the ‘download_and_load’ method of the Watson NLP library. Together, these steps clean the data and prepare it for extracting topics from the documents.
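The cleaning idea can be sketched in plain Python as below. This is not the Watson NLP pipeline: the stop-word list is a tiny illustrative stand-in for the library's predefined list, the noise patterns (masked tokens such as "XXXX", digits, punctuation) are assumptions, and lemmatisation is left out because it needs a language model.

```python
import re

# Tiny illustrative stop-word list; the article uses Watson NLP's predefined list.
STOP_WORDS = {"a", "an", "the", "is", "was", "to", "my", "i", "and", "of", "on"}

# Assumed noise patterns: masked tokens like "XXXX", then anything
# that is not a lowercase letter or whitespace.
MASK_PATTERN = re.compile(r"\bx{2,}\b", flags=re.IGNORECASE)
NON_LETTER_PATTERN = re.compile(r"[^a-z\s]")


def clean_text(text, extra_stop_words=()):
    """Lowercase, strip noise patterns, and drop stop words."""
    stop_words = STOP_WORDS | set(extra_stop_words)  # extend the list if needed
    text = MASK_PATTERN.sub(" ", text.lower())
    text = NON_LETTER_PATTERN.sub(" ", text)
    return " ".join(tok for tok in text.split() if tok not in stop_words)


sample = "The refund was sent to XXXX on 03/15, and my account is empty!"
print(clean_text(sample))
```

Passing extra_stop_words lets you extend the list for a specific dataset, for example with words that appear in nearly every complaint.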

3. Topic Modeling

Watson NLP Topic Modeling uses two types of models:
1. Summary Model
2. Hierarchical Clustering Model

A summary model consists of a mapping of words to their occurrences across all the documents. Training is configured through a dictionary of parameters; by changing these parameters, you can change the summary model that is trained.
Summary model train parameters:

With these parameters, we can train a summary model on the documents. The training data must be provided as syntax data, which is produced by the Watson NLP syntax model. The output of the summary model is then passed as input to the hierarchical clustering model.
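To make the idea of a word-to-occurrence mapping concrete, here is a small stand-alone sketch. It only illustrates the concept; it does not reproduce the Watson NLP summary model or its parameters, and the function name and token lists are made up.

```python
from collections import Counter


def build_word_occurrences(tokenized_docs):
    """Map each word to its total count and to the number of documents containing it."""
    term_counts = Counter()  # total occurrences across all documents
    doc_counts = Counter()   # number of documents in which the word appears
    for tokens in tokenized_docs:
        term_counts.update(tokens)
        doc_counts.update(set(tokens))  # count each word once per document
    return term_counts, doc_counts


# Toy pre-tokenized documents standing in for syntax-model output.
docs = [
    ["refund", "account", "refund"],
    ["account", "balance"],
    ["item", "counterfeit", "refund"],
]
term_counts, doc_counts = build_word_occurrences(docs)

print(term_counts.most_common(2))
print(doc_counts["refund"])
```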

Hierarchical clustering model train parameters:

Train params and Hierarchical clustering model
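For intuition on how hierarchical clustering can arrive at a number of clusters without that number being fixed in advance, here is a generic bottom-up (agglomerative) sketch in plain Python. It is not the Watson NLP algorithm: the Jaccard distance, the average linkage, and the max_distance threshold are all assumptions chosen for illustration.

```python
def jaccard_distance(a, b):
    """Distance between two token sets: 0 for identical sets, 1 for disjoint sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def cluster_distance(c1, c2, docs):
    """Average linkage: mean distance over all cross-cluster document pairs."""
    total = sum(jaccard_distance(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_clusters(docs, max_distance=0.7):
    """Merge the closest pair of clusters until the closest pair is too far apart."""
    clusters = [[i] for i in range(len(docs))]  # start with one cluster per document
    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y], docs)
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > max_distance:
            break  # remaining clusters are too dissimilar to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy documents represented as sets of cleaned tokens.
toy_docs = [
    {"refund", "account", "balance"},
    {"refund", "account", "delay"},
    {"item", "counterfeit", "seller"},
    {"item", "counterfeit", "received"},
]
print(agglomerative_clusters(toy_docs))
```

In this sketch the number of clusters falls out of the merge threshold rather than being passed in directly; a stricter threshold yields more, smaller clusters.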

The topic model returns its output in JSON format, which can include the topic name, number of documents, percentage, most important keywords, phrases, sentences, and so on. We converted this output into a pandas DataFrame.

Analyzing the trained topic model, we can see the top 15 topics by percentage, that is, the topics that consumers raise most often about the company.
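A sketch of the conversion and ranking is shown below. The JSON here is hand-written, and its field names are assumptions rather than the actual Watson NLP output schema.

```python
import json

import pandas as pd

# Hypothetical topic-model output; the field names are illustrative assumptions.
topics_json = """
[
  {"topic_name": "account_balance", "num_documents": 50, "percentage": 5.0,
   "keywords": ["account", "balance"]},
  {"topic_name": "refund_delay", "num_documents": 120, "percentage": 12.0,
   "keywords": ["refund", "delay"]},
  {"topic_name": "counterfeit_item", "num_documents": 80, "percentage": 8.0,
   "keywords": ["item", "counterfeit"]}
]
"""

# Each JSON object becomes one row; each key becomes a column.
topics_df = pd.DataFrame(json.loads(topics_json))

# Rank topics by their share of documents and keep at most the top 15.
top_topics = topics_df.sort_values("percentage", ascending=False).head(15)

print(top_topics[["topic_name", "percentage"]])
```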

Save trained model

You can save the trained model using the save() method from the Watson NLP library, or using the project.save_data() function from Watson Studio.

Load trained model

You can load the trained model using the load() method from the Watson NLP library. We can then use this model to predict topics and keywords from text.

Test trained model

You can test the model by using the run() method from the Watson NLP library.

4. Analysis of month-wise company topics

Topic modeling can also be used to extract the most frequent topics by month. We already created a month-wise data frame in the data processing step, so we can now use it to find the most frequent topic each month and see whether there is a pattern.
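Assuming each complaint has already been assigned a topic, the most frequent topic per month can be found with a groupby, as in this sketch. The ‘topic’ column and the sample values are hypothetical.

```python
import pandas as pd

# Toy month-wise data; the 'topic' column is assumed to hold the predicted topic.
monthly_df = pd.DataFrame({
    "Month": [3, 3, 3, 4, 4],
    "topic": ["counterfeit_item", "counterfeit_item", "refund", "refund", "refund"],
})

# Count complaints for every (Month, topic) pair.
counts = (
    monthly_df.groupby(["Month", "topic"])
    .size()
    .reset_index(name="n_complaints")
)

# Within each month, keep the row with the highest complaint count.
top_rows = counts.groupby("Month")["n_complaints"].idxmax()
top_per_month = counts.loc[top_rows].reset_index(drop=True)

print(top_per_month)
```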

For example, the most frequent issues for PayPal in March are “item received counterfeit” and “balance account refund personnel”. This information can be used in the following month, or in any later month: the company can resolve the issue directly rather than investigating it again each time.

Conclusion

We have seen how easily we can identify topics in consumer financial datasets by using Watson NLP. To learn more about topic modeling with Watson NLP, you can download and view the complete code on GitHub.

This topic model can be used to understand pain points and major areas for improvement on a daily, weekly, monthly, or yearly basis. Based on this information, companies can create self-service content or offer direct support to help customers.

You can start your AI journey by browsing and building AI models through a guided wizard here. The IBM Build Lab team is here to work with you on your AI journey. For more information, see the Embeddable AI webpage.

You can also browse the collection of Embeddable AI self-serve assets at Tech Zone and on GitHub.