Text Mining Example on Consumer Financial Service Complaints

Lujing Chen
Bite-sized Machine Learning
5 min read · Dec 4, 2018

This blog gives a quick demonstration of a text mining example on the consumer financial complaint data published on the Consumer Financial Protection Bureau (CFPB) website.

This blog uses the text attribute “Consumer complaint narrative” to answer one simple question: what are the top 10 key complaint words that the top 3 credit bureaus (Equifax, Experian, and TransUnion) received?

Extracting those keywords can help financial companies like the credit bureaus, and especially their compliance departments, better target potential risks or issues and eventually control them.

The blog can be broken into three parts:

  1. count the words and get their frequencies
  2. calculate the ratio (between a word’s frequency in an individual agency’s complaints and its frequency in the entire complaint list) and get the words with the highest ratio values
  3. improve the ratio by excluding common words related to the companies’ names

There are 1,171,183 complaints (records) and 18 features (variables), but in this blog we will only use the text attribute “Consumer complaint narrative”. After excluding the rows with a missing consumer complaint narrative, we are left with 345,158 complaints (records).

Before we dive into the analysis, let’s first check one complaint to get a sense of the data.
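A minimal sketch of the setup, assuming the CFPB export has been downloaded locally as complaints.csv (the file name is an assumption; the column name “Consumer complaint narrative” comes from the CFPB data):

```python
import pandas as pd

# Load the CFPB complaints export; 'complaints.csv' is an assumed local file name
df = pd.read_csv('complaints.csv')
print(df.shape)   # roughly (1171183, 18) at the time of writing

# Keep only the rows that actually contain a complaint narrative
df = df.dropna(subset=['Consumer complaint narrative'])
print(df.shape)   # roughly (345158, 18)

# Peek at one complaint to get a sense of the text
print(df['Consumer complaint narrative'].iloc[0])
```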

Okay, looks reasonable. Let’s get started!

Step 1: count the word frequency
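Here is a minimal counting sketch, reusing df from the loading step above. The simple regex tokenization and the exact company spelling in the data are my assumptions, not necessarily what the original code did:

```python
import re
from collections import Counter

def count_words(narratives):
    """Count how often each lowercase word appears across the narratives."""
    counts = Counter()
    for text in narratives:
        counts.update(re.findall(r'[a-z]+', text.lower()))
    return counts

# Word counts for one company versus the entire complaint list;
# the spelling 'EQUIFAX, INC.' in the Company column is an assumption
equifax_counts = count_words(
    df.loc[df['Company'] == 'EQUIFAX, INC.', 'Consumer complaint narrative'])
TOTAL_counts = count_words(df['Consumer complaint narrative'])

print(equifax_counts.most_common(10))
print(TOTAL_counts.most_common(10))
```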

Comparing the top 10 most common words side by side for each company and for the entire complaint list, common words like “the” rank at the top for Equifax, Experian, TransUnion, and the entire complaint list alike.

Instead of finding the common words in the Equifax, Experian, or TransUnion complaints, what we really want are the words that show up far more often in one company’s complaints than in the total complaint list. In other words, which complaint keywords are uniquely concentrated in one company’s complaints?

Step 2: calculate the frequency ratio

To accomplish this, we’ll need to calculate the word usage ratio between each individual company and the entire list. Using “the” as an example: ratio = Equifax_counts[“the”] / (TOTAL_counts[“the”] + 1).

Note: the “+ 1” here is added in case TOTAL_counts for some word is zero, so the division never fails.

By dividing a company’s count of a word by the total count of the same word, we let the company’s unique complaint keywords stand out and suppress the importance of common words like “the”.
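A sketch of this ratio step, reusing equifax_counts and TOTAL_counts from Step 1 (the function name is mine):

```python
def top_ratio_words(company_counts, total_counts, n=10):
    """Rank words by how concentrated they are in one company's complaints."""
    ratios = {
        word: count / (total_counts[word] + 1)  # '+ 1' guards against a zero total count
        for word, count in company_counts.items()
    }
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)[:n]

print(top_ratio_words(equifax_counts, TOTAL_counts))
```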

Comparing side by side again, this time we start to see some differentiation. Unfortunately, the biggest differentiation among the complaints is the company name, which again is not what we are really interested in; we want to find the real financial service issues in those complaints!

But we are already very close to the answer; we just need one more step: leave out the words related to company names.

To accomplish this, we’ll skip the counting when a word is company related.

Step 3: improve the ratio

One simple way is to say that any time we see ‘Equifax’, for example, we skip counting it, so that the word automatically has a zero frequency and therefore does not show up in our most-common list.
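As a sketch, this exact-match exclusion is a one-line guard in the counting loop (COMPANY_WORDS and the function name are my own):

```python
from collections import Counter

# Exact company-name tokens are never counted, so they end up with zero frequency
COMPANY_WORDS = {'equifax', 'experian', 'transunion'}

def count_words_excluding(words):
    counts = Counter()
    for word in words:
        if word in COMPANY_WORDS:
            continue  # skip the exact company name
        counts[word] += 1
    return counts
```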

However, as we can see in the results above, customers misspell company names a lot when writing complaints. So excluding just ‘Equifax’ is not going to get us what we want.

There are two ways of dealing with this:

  1. manually summarize the misspelling patterns
  2. use the FuzzyWuzzy library to implement fuzzy matching

The example below demonstrates how FuzzyWuzzy works. Basically, it calculates a distance (the Levenshtein distance) to measure the difference between two sequences, in our case two words, and turns it into a similarity score. The higher the score, the closer the two sequences are.
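A small demo of the idea, assuming the fuzzywuzzy package is installed (pip install fuzzywuzzy); the exact scores depend on the scorer, so the outputs in the comments are only indicative:

```python
from fuzzywuzzy import fuzz, process

# Similarity score between the company name and a misspelled variant
print(fuzz.ratio('equifax', 'eqifax'))         # a high score, e.g. around 92

# Find the best match for a misspelled word among candidate names
print(process.extract('eqifax', ['equifax']))  # e.g. [('equifax', 92)]
```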

[(‘equifax’, 88)] stands for [(‘match’, score)]

Back to our example: we definitely want to exclude the words with a high fuzzy score, which are the misspelled company names.

The pros and cons of the two approaches mentioned above:

  • manual summarizing is fast, but it can’t catch all the misspelled variants
  • fuzzy matching accurately excludes those misspelled words, but the code can take a very long time to run, since it has to compute a fuzzy score for every word in the complaint list

Therefore, a mixed approach that combines these two options can be more practical and lead to a faster, more accurate result. The idea is: if a word in the Equifax complaints contains a certain string like “eq”, it is far more likely to be a misspelled ‘Equifax’.

Therefore:

  • if the condition, such as containing “eq”, is met, the fuzzy matching function is called.
  • if the condition is not met, simply count the word without any other processing.
  • when the fuzzy matching function is called and the fuzzy score is greater than 85, skip the counting, since the word is probably a misspelled ‘Equifax’, which we don’t care about.

For the implementation, everything else stays the same as shown previously; we only revise the count_word function to add the new logic.
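Here is a sketch of what the revised count_word function could look like; the original was shown as a screenshot, so the signature and details here are my reconstruction of the logic described above, not the author’s verbatim code:

```python
from collections import Counter
from fuzzywuzzy import fuzz

def count_word(words, company_name='equifax', trigger='eq', threshold=85):
    """Count words, skipping likely (mis)spellings of the company name.

    Only words containing the cheap trigger string (e.g. 'eq') pay the cost
    of a fuzzy comparison; all other words are counted directly.
    """
    counts = Counter()
    for word in words:
        if trigger in word and fuzz.ratio(word, company_name) > threshold:
            continue  # probably a misspelled 'Equifax', so skip the counting
        counts[word] += 1
    return counts
```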

Let’s check our new results after applying this logic.

The misspelled words decreased a lot, although not all of them are gone!

Based on this keyword list, we can summarize:

  • Equifax’s complaints are mostly concentrated around TrustedID, Intruders, segmentation, re alleges, cyber attack, and 2013correct.
  • Experian’s complaints are mostly concentrated around Geographical, Credit Works, free credit report, Inquiry, and Delinquency.
  • TransUnion’s complaints are mostly concentrated around 3rd party info, Libellant, LLCConsumer, Inquiry, and Financing.

That’s it. 😃

We answered our question: what are the top 10 key complaint words that the top 3 credit bureaus (Equifax, Experian, and TransUnion) received?

Hopefully this blog demonstrates the power of text mining, even a simple example like this one, in helping compliance departments in the financial industry gain insight into potential risks based on customer text data.

Further Links:

  • Check out the link for the dataset if you want to download it.
  • Check my GitHub to see the complete code.
