Authors: Angel Montesdeoca and Ricardo Balduino
The World Economic Forum predicts that, by 2025, 463 exabytes of data will be created each day. To put this in perspective, that’s about 90 times more data than all the words ever spoken by humans! With this much data, it can be daunting, if not impossible, to make sense of it all.
In 2019, IBM decided to meet this challenge by releasing Watson Discovery’s Content Mining — the world’s most powerful AI-powered text mining tool that allows business and technical users to discover hidden insights by analyzing anomalies, trends, and relationships in their documents.
Rather than tell you about all the amazing features and capabilities of Watson Discovery’s Content Mining, we want to show you how you can start finding insights in just a few clicks.
Watson Discovery works with your own business data, but in this blog we will demonstrate its capabilities using a publicly available dataset provided by the Consumer Financial Protection Bureau (CFPB). The CFPB is a U.S. government agency that collects complaints from consumers against banks, lenders, and other financial companies. The CFPB dataset we’ll be using contains nearly 370,000 complaints from 2016 to 2019.
How to Apply Content Mining
Let’s walk through some sample scenarios in Content Mining that can help us find new insights as we drill down into the complaints dataset.
Scenario 1: Identifying top complaints across U.S. companies
Using the “Pairs view” in Content Mining, we can easily identify the top issues filed against companies in the U.S. With the Pairs view, you can compare any two facets and see the relevancy and count of each facet. In the screen below, we selected two facets from the CFPB dataset: Company and Issue. The tool shows us that the top three types of complaints were submitted against the three major credit bureaus in the U.S. (Equifax, Experian, and TransUnion) and relate to incorrect information on the credit report.
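To make the idea of a count and a relevancy score for a facet pair more concrete, here is a minimal Python sketch. It is not how Content Mining computes relevancy; it assumes a simple lift-style ratio (observed pair count divided by the count expected if the two facets were independent) as a stand-in. The records are toy data for illustration only, "Example Bank" is a made-up company, and `pair_stats` is a hypothetical helper name.

```python
from collections import Counter

# Toy records for illustration only; these are not real CFPB counts.
complaints = [
    {"company": "Equifax", "issue": "Incorrect information on credit report"},
    {"company": "Equifax", "issue": "Incorrect information on credit report"},
    {"company": "Experian", "issue": "Incorrect information on credit report"},
    {"company": "TransUnion", "issue": "Incorrect information on credit report"},
    {"company": "Example Bank", "issue": "Managing an account"},
    {"company": "Example Bank", "issue": "Managing an account"},
]


def pair_stats(records, facet_a, facet_b):
    """Return {(a, b): (count, lift)} for every observed pair of facet values.

    lift = observed count / count expected if the two facets were independent.
    This ratio is an assumption made for this sketch, not IBM's formula.
    """
    total = len(records)
    count_a = Counter(r[facet_a] for r in records)
    count_b = Counter(r[facet_b] for r in records)
    count_ab = Counter((r[facet_a], r[facet_b]) for r in records)

    stats = {}
    for (a, b), observed in count_ab.items():
        expected = count_a[a] * count_b[b] / total
        stats[(a, b)] = (observed, observed / expected)
    return stats


stats = pair_stats(complaints, "company", "issue")

# List pairs from most to least frequent, as a pairs table would.
for (company, issue), (count, lift) in sorted(
    stats.items(), key=lambda item: item[1][0], reverse=True
):
    print(f"{company} | {issue} | count={count} | lift={lift:.2f}")
```

A ratio above 1 means the pair shows up more often than the individual facet frequencies alone would suggest, which is the kind of signal that makes a pair worth a closer look.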
Scenario 2: Identifying connections between State complaints
Using the “Connections view” in Content Mining, we can see the relevant relationships between different facets in a graphical format and find correlations that would otherwise be difficult to spot. The Connections view lets us pick a number of facets for visualization; in the screen below, we selected State and Sub-issue. The graphic shows the relevant relationships between them, from which we can easily select the entities of interest, for example, the state of California and its associated sub-issues (in this case, totaling over 95K entries from the dataset). Clicking Analyze More adds that selection to the query, which takes less effort and is more consistent than writing the query manually (Content Mining also allows users to enter the query manually in the top bar if desired).
Scenario 3: Identifying the complaints and companies with the highest negative sentiment
Using the “Sentiment view” in Content Mining, we can analyze the sentiment across any of the available facets. In this scenario, we kept the selection from the previous scenario, the four sub-issue types reported by consumers in California, as can be seen in the query bar at the top of the screen. Now it would be interesting to know the general sentiment of consumers across all issues. In the top-left widget below, we can see the distribution of positive (green), ambivalent (gray), neutral (white), and negative (red) sentiments for each issue. By picking the first issue in the list (Incorrect information on credit report), we can analyze the specific text that conveys the sentiment. For example, by choosing negative phrases in the bottom-left widget, we see that people are complaining about being victims of identity theft and about companies refusing to remove the fraudulently opened accounts.
Scenario 4: Identifying trends and spikes in complaints
Using the “Trends” and “Relevancy” views, we can understand our data from multiple angles. We can break down the structured portion of the data to understand not only the high-level trends of complaints across months or years, but also which companies are receiving the highest number of complaints and why people are complaining (by identifying relevant phrases in the complaints).
In the screen below, we see the Trends view, looking at the trends of Sub-Products on a monthly basis. We can see the Credit Reporting sub-product expanded, and a spike in submissions towards the latter part of 2017. We selected that particular month and added it to the query.
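For readers who want to see the idea behind spotting a spike in a monthly trend, here is a minimal Python sketch. It is not how the Trends view works internally; it assumes a simple rule that flags any month whose complaint count is more than twice the median monthly count. The dates are toy data for illustration only, and `monthly_counts`, `find_spikes`, and the `factor` parameter are hypothetical names.

```python
from collections import Counter
from datetime import date
from statistics import median


def monthly_counts(dates):
    """Count submissions per calendar month, keyed as 'YYYY-MM'."""
    return Counter(d.strftime("%Y-%m") for d in dates)


def find_spikes(counts, factor=2.0):
    """Return the months whose count exceeds factor times the median count.

    The median-based threshold is an assumption made for this sketch.
    """
    baseline = median(counts.values())
    return sorted(month for month, n in counts.items() if n > factor * baseline)


# Toy submission dates for illustration only; not real CFPB data.
toy_dates = (
    [date(2017, 7, day) for day in (3, 12, 25)]
    + [date(2017, 8, day) for day in (1, 9, 30)]
    + [date(2017, 9, day) for day in range(8, 18)]
    + [date(2017, 10, day) for day in (2, 5, 11, 20)]
)

counts = monthly_counts(toy_dates)
spikes = find_spikes(counts)

for month in sorted(counts):
    marker = "  <- spike" if month in spikes else ""
    print(f"{month}: {counts[month]}{marker}")
```

The median is used as the baseline here because, unlike the mean, it is not pulled upward by the very spike we are trying to detect.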
By switching back to the Relevancy view and looking at Companies (in the next screen), we can see that Equifax received the majority of the complaints in that particular month, more than all the other companies combined for the same time period (you may have figured out why that is, but let’s keep drilling down).
Now, we can use phrases identified by Content Mining to find the reason for those complaints. For example, by selecting the available Noun Sequences, we can see phrases like “security breach” and “data breach”. Those were indeed related to the data breach that Equifax suffered in 2017, which exposed millions of customers’ information and caused, among other things, an avalanche of complaints submitted to the CFPB, visualized as a spike in the Trends view above.
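As a rough illustration of how recurring phrases can point to the reason behind a spike, here is a minimal Python sketch. It does not reproduce Content Mining's Noun Sequences, which rely on linguistic analysis; it swaps in plain two-word phrase (bigram) counting with a small stopword list. The complaint texts are invented for illustration, and `top_bigrams` and `STOPWORDS` are hypothetical names.

```python
import re
from collections import Counter

# A tiny stopword list, chosen only for this toy example.
STOPWORDS = {"i", "was", "by", "the", "and", "my", "after", "want"}

# Invented complaint texts for illustration only; not real CFPB narratives.
texts = [
    "I was affected by the data breach and want my file frozen.",
    "After the security breach my information was exposed.",
    "The data breach exposed my social security number.",
]


def top_bigrams(documents, stopwords, k=3):
    """Return the k most frequent two-word phrases that contain no stopword."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z']+", text.lower())
        for first, second in zip(words, words[1:]):
            if first not in stopwords and second not in stopwords:
                counts[f"{first} {second}"] += 1
    return counts.most_common(k)


for phrase, n in top_bigrams(texts, STOPWORDS, k=10):
    print(f"{phrase}: {n}")
```

Running this on the complaints from a single spike month, rather than on the whole dataset, mirrors the drill-down in the scenario: the filter narrows the data first, and the phrase counts then summarize what that subset is about.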
In the field of data and analytics, the adage “a picture is worth a thousand words” could not be truer. Visualizations are a powerful way to find, understand and share insights. We showed just a few ways you can use Watson Discovery’s Content Mining to slice your data from different angles, answer different questions, and find new insights quickly.
Ricardo Balduino is a Data Scientist with the IBM Data Science Elite Team. He holds a Master of Science degree in Software Engineering from San Jose State University, California. Angel Montesdeoca is an IBM Watson Product Manager and works on the Content Mining component of Watson Discovery.
Want to know more?
Here are some other resources dedicated to the Watson Discovery Content Mining capability.