Use-case example: TF-IDF for insurance feedback analysis

A look under the hood of "term frequency-inverse document frequency", with examples and pros & cons.

Vojtech Poriz
DataSentics
Dec 8, 2020


Consider the following scenarios: an online marketplace wants to optimize its search engine and extract important information from product descriptions. A football club wants to analyze posts on its blog. A food company wants to analyze Twitter posts about its brand-new snack. An insurance company wants to know what its clients are complaining about in a survey.

How do you solve these problems using data science? What methods do you use?

Just one! You can get away with a single method called TF-IDF. What is it? How do you compute it? What problems can it solve, and where does it fall short?

Continue reading! In this article, I will answer all of these questions and guide you through the whole process of using TF-IDF.

Let’s get practical

Imagine we are the insurance company described above. We sent out a survey to improve our customer experience and immediately got three responses:

‘The online system for reporting insurance claim doesn’t work on the phone.’

‘I was surprised by the speed of resolvement of my insurance claim.’

‘I paid this expensive insurance for 10 years, but you declined my claim? Unbelievable!’

Bag of words

Because machine learning models cannot work with text data directly, we need to convert these responses into some numerical representation. We will start by transforming the text into tokens: individual words.

[ ‘The’, ‘online’, ‘system’, ‘for’, ‘reporting’, ‘insurance’, …]

[‘I’, ‘was’, ‘surprised’, ‘by’, ‘the’, ‘speed’, …]

[‘I’, ‘paid’, ‘this’, ‘expensive’, …]
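As a rough sketch, a minimal (and deliberately naive) tokenizer could look like the snippet below; real pipelines handle punctuation, casing and special characters more carefully:

```python
import re

def tokenize(text):
    # Lowercase the text and keep runs of letters, digits and apostrophes
    return re.findall(r"[a-z0-9']+", text.lower())

print(tokenize("The online system for reporting insurance claim doesn't work on the phone."))
# ['the', 'online', 'system', 'for', 'reporting', 'insurance', 'claim',
#  "doesn't", 'work', 'on', 'the', 'phone']
```

Note that this version lowercases everything, unlike the token lists above.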

Then we create a vocabulary, the set of all distinct words from the responses.

[‘but’, ‘by’, ‘claim’, ‘declined’, ‘doesn’t’, ‘expensive’, … , ’the’, ‘this’, ‘Unbelievable’, ‘was’, ‘work’, ‘years’, ‘you’, ‘10’]

For each response, we count the number of occurrences of each word.

Term frequency table of our responses
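In plain Python, building the vocabulary and the per-response counts takes only a few lines; this sketch reuses the hypothetical tokenize helper from above:

```python
from collections import Counter

responses = [
    "The online system for reporting insurance claim doesn't work on the phone.",
    "I was surprised by the speed of resolvement of my insurance claim.",
    "I paid this expensive insurance for 10 years, but you declined my claim? Unbelievable!",
]

tokenized = [tokenize(r) for r in responses]

# Vocabulary: every distinct word across all responses
vocabulary = sorted({word for doc in tokenized for word in doc})

# Term frequency: occurrences of each vocabulary word per response
tf = [Counter(doc) for doc in tokenized]
for word in vocabulary:
    print(word, [counts[word] for counts in tf])
```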

We have just created a popular text representation: the bag-of-words. What's more, we have also implemented something called TF (term frequency): each word in a response is weighted by the number of its occurrences. Can we use this to highlight important words? Let's try to bold the words with high TF scores.

‘The online system for reporting insurance claim doesn’t work on the phone.’

‘I was surprised by the speed of resolvement of my insurance claim.’

‘I paid this expensive insurance for 10 years, but you declined my claim? Unbelievable!’

That doesn't look very informative or helpful, does it?

Some words in the responses are clearly more important than others, but by simply counting term frequencies, we treat words like 'insurance', 'for', 'expensive' and 'speed' all the same. Of course, the clients mention insurance claims, but we want to highlight the specific problems…

So what do we do about it? We could move all the common words onto a "stop words" list and ignore them altogether. But this would throw away too much information and might actually be harmful.

We are smarter than that: we implement TF-IDF!

TF-IDF

This weird abbreviation stands for term frequency-inverse document frequency. It is a really cool method for information retrieval. Even though it is pretty old, it still works great!

How is TF-IDF different? It treats low-information words in a special way. TF-IDF is basically TF combined with IDF, the inverse document frequency. IDF penalizes words that appear in too many documents, which gives a much better picture of what is most important. You can think of it as a measure of rarity.
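A quick back-of-the-envelope illustration with our three responses, using the plain logarithmic IDF (libraries typically smooth this formula to avoid zeros and divisions by zero):

```python
import math

n_docs = 3
# 'insurance' appears in all three responses -> its IDF is zero
print(math.log(n_docs / 3))  # 0.0, no distinguishing power at all
# 'phone' appears in only one response -> its IDF is high
print(math.log(n_docs / 1))  # ~1.10, the rare word gets boosted
```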

We continue where we left off with the TF example. Let's take a look at our responses through the lens of TF-IDF:

‘The online system for reporting insurance claim doesn’t work on the phone.’

‘I was surprised by the speed of resolvement of my insurance claim.’

‘I paid this expensive insurance for 10 years, but you declined my claim? Unbelievable!’

That looks much better than the previous example! We have highlighted the phone access to the online system, the speed of claim resolution, and the expensive but declined claim. Now we know what to improve!

How is it done?

TF-IDF is really easy to compute. The formula for the TF-IDF of a word in a document is the following:
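In its common textbook form (specific libraries differ slightly in smoothing and normalization), it reads:

$$\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{N}{\mathrm{df}(t)}$$

where tf(t, d) is the number of occurrences of term t in document d, N is the total number of documents, and df(t) is the number of documents that contain t.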

You can always take advantage of the ready-made implementations in PySpark or in a Python library like Sklearn, which make it even easier to use.

TF-IDF computation in PySpark:
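A minimal sketch of the pipeline with pyspark.ml.feature (the DataFrame and column names are illustrative, not prescribed):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

spark = SparkSession.builder.getOrCreate()

responses = spark.createDataFrame([
    ("The online system for reporting insurance claim doesn't work on the phone.",),
    ("I was surprised by the speed of resolvement of my insurance claim.",),
    ("I paid this expensive insurance for 10 years, but you declined my claim? Unbelievable!",),
], ["text"])

# Split each response into tokens (individual words)
words = Tokenizer(inputCol="text", outputCol="words").transform(responses)

# Term frequency: hash each word into a fixed-size count vector
tf = HashingTF(inputCol="words", outputCol="tf").transform(words)

# Inverse document frequency: down-weight words that appear in many responses
tfidf = IDF(inputCol="tf", outputCol="tfidf").fit(tf).transform(tf)

tfidf.select("tfidf").show(truncate=False)
```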

TF-IDF computation in Sklearn:
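And an equally minimal sketch with Sklearn's TfidfVectorizer, printing the three highest-scoring words of each response:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

responses = [
    "The online system for reporting insurance claim doesn't work on the phone.",
    "I was surprised by the speed of resolvement of my insurance claim.",
    "I paid this expensive insurance for 10 years, but you declined my claim? Unbelievable!",
]

# fit_transform learns the vocabulary and returns a sparse (documents x words) matrix
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(responses)

# For each response, show the three words with the highest TF-IDF score
for row in tfidf_matrix.toarray():
    scores = dict(zip(vectorizer.get_feature_names_out(), row))
    print(sorted(scores, key=scores.get, reverse=True)[:3])
```

Note that TfidfVectorizer applies smoothing and L2 normalization by default, so the exact scores differ from the plain formula above.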

Drawbacks

Of course, life is not that simple, and TF-IDF isn't the answer to everything (that's 42, btw). For an insurance company, it is not that important that a client paid for their insurance for exactly 10 years: unique doesn't always mean important. TF-IDF also doesn't take word order into account. And pure TF-IDF alone is usually not enough; you often need to add something like LDA on top of it. But you get the idea, right?

Summary

And that's it, you've made it to the end! Now you know that TF-IDF is a basic technique that helps data scientists with information retrieval. It highlights important words that are unique and suppresses words with less information value. Because of this, it is useful in many use cases based on textual input. If you are curious and want to try it out, go ahead. It's just a couple of lines in PySpark or Python. 😊
