Humans in the Loop: Using AI to Get Big Tasks Done Fast

Published in T-Mobile Tech · Aug 5, 2020

Peter Baldridge, Data Scientist at AI@T-Mobile

Picture This:

You and your team just spent two weeks collecting survey data. You now have thousands of responses, including a bunch of free-form text answers. You look in your inbox and see an email marked “important.” It turns out executive management would like to see a report on the survey in three days. 😨

So what do you do?

Scenario A)

You hire a fleet of consultants to clean, label, and organize your free text responses. The process costs tens of thousands of dollars, but you get your results back in two days instead of three. Plus, they throw in an extremely polished PowerPoint deck.

Scenario B)

You call a data scientist. 😎

The Opportunity:

Every year, T-Mobile conducts an enterprise fraud risk assessment. Part of the assessment is to send a survey to several thousand employees. The survey contains multiple choice questions as well as free text questions. Processing the multiple choice questions is easy, but processing the free text responses is not. Each comment must be hand-labelled according to its risk category before a summary can be generated.

You could label these comments one-by-one, but that approach would be very time-consuming (not to mention boring). Or you could figure out a way to do it faster…🏃‍♀️

The Approach:

In principle, we want to bucket similar survey responses so that our fraud managers can label them in one fell swoop.

To do this, we need to create a similarity score for each pair of survey responses. If we do it the right way, phrases such as, “T-Mobile is great,” and “T-Mobile is awesome” will be considered similar.

TF-IDF and Cosine Similarity:

TF-IDF, which stands for term frequency-inverse document frequency, is a way to vectorize documents by highlighting unique keywords (iPhone, Galaxy, Pixel) while downplaying common words (the, of, from). It’s a way of bringing the most important information to the surface, almost like sifting for gold.
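Here’s a minimal sketch of that step, assuming scikit-learn’s TfidfVectorizer (the exact preprocessing in our pipeline may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy survey responses (hypothetical examples, not real survey data)
responses = [
    "T-Mobile is great",
    "T-Mobile is awesome",
    "I never received my Galaxy upgrade",
]

# TF-IDF up-weights distinctive words and down-weights common ones;
# stop_words="english" drops filler words like "the" and "of" entirely
vectorizer = TfidfVectorizer(stop_words="english")
tfidf_matrix = vectorizer.fit_transform(responses)  # sparse (n_docs, n_terms)

print(vectorizer.get_feature_names_out())
```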

Next, we calculate a similarity score for each pair of phrases by looking at the angle between their TF-IDF vectors; this is the cosine similarity. The key idea is that similar phrases point in similar directions, so they sit closer together in vector space.
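Continuing the sketch above, the full matrix of pairwise scores falls out of a single scikit-learn call:

```python
from sklearn.metrics.pairwise import cosine_similarity

# Cosine similarity is the cosine of the angle between two vectors:
# 1.0 means the same direction, 0.0 means no shared terms at all
similarity = cosine_similarity(tfidf_matrix)

# Responses 0 and 1 share a keyword, so similarity[0, 1] sits well
# above similarity[0, 2], which is 0 (no overlapping terms)
print(similarity.round(2))
```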

Hierarchical Clustering (with Maximum Linkage):

Finally, we bucket our responses. We use hierarchical clustering for this step and, more importantly, complete linkage (also known as maximum linkage) to ensure that there is some minimum degree of similarity between every pair of responses within each cluster.
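A hedged sketch of this step with SciPy, continuing from the similarity matrix above (the 0.5 distance cutoff is illustrative, not the threshold we actually used):

```python
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Turn similarities into distances and condense the square matrix
distance = 1 - similarity
condensed = squareform(distance, checks=False)

# Complete (maximum) linkage: a cluster is only as good as its *worst*
# pair, so a distance cutoff guarantees minimum within-bucket similarity
Z = linkage(condensed, method="complete")

# Every pair inside a bucket is guaranteed distance <= 0.5 here
labels = fcluster(Z, t=0.5, criterion="distance")
print(labels)  # bucket id for each survey response
```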

The Results:

By bucketing our survey responses, we were able to fit 2,649 survey responses into 110 buckets. That cuts the number of items to hand-label by more than 95 percent (110/2,649 leaves only about 4 percent of the original workload)! 😲

What the Risk Management Team Found:

The Risk Management team was able to make the labelling process even more efficient. They did it with a couple of ingenious tricks.

Clustering²:

In many cases, there were several clusters related to the same category. By sorting the buckets in different ways and adding some rudimentary Excel formulas for certain keywords, the team was able to label several clusters at once. They were also able to easily identify exceptions.

To put it another way: they clustered the clusters.

🤯

We know what you’re thinking
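If you’re curious what that keyword trick looks like outside of a spreadsheet, here’s a rough Python equivalent (the keywords and category names below are invented for illustration):

```python
import pandas as pd

# Hypothetical cluster summaries; in practice these came from the buckets
clusters = pd.DataFrame({
    "cluster_id": [1, 2, 3],
    "sample_text": [
        "concerns about password sharing",
        "shared logins on the sales floor",
        "billing adjustments without approval",
    ],
})

# Rudimentary keyword rules, much like an Excel IF/SEARCH formula,
# let a single rule label several clusters at once
keyword_to_category = {"password": "Access Control", "billing": "Billing Risk"}

def label(text: str) -> str:
    for keyword, category in keyword_to_category.items():
        if keyword in text.lower():
            return category
    return "Needs manual review"  # the easy-to-spot exceptions

clusters["category"] = clusters["sample_text"].apply(label)
print(clusters)
```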

Information-Rich vs. Information-Poor Responses:

The audit team found that there were a lot of survey responses but not a lot of meaningful ones. During previous years, responses had to be manually tagged one-by-one (often with errors). The Fraud Management team found an ingenious way to separate information-rich responses from information-poor responses. Here’s a quote from Jonathan Arras, T-Mobile’s Director of Fraud Strategy, about how he tackled the issue:

“I ran through a sample of about 50 records and personally evaluated what I believed the significance of each comment to be. Then I calculated the number of characters in each cell and without looking at my manual results drew some significance cuts based on word count. The results of each approach were almost identical and while it may seem crude to say ‘number of words = significance’ it really did prove out and automating the rest of the counting/labeling took just seconds after that.”

In effect, he created a classifier that identifies the information richness of a response based on its length alone.
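As a sketch, that rule reduces to a one-line threshold. The 40-character cutoff below is a placeholder; the real cut points came from calibrating against his hand-evaluated sample:

```python
def is_information_rich(response: str, min_chars: int = 40) -> bool:
    """Flag a response as information-rich purely by its length.

    The 40-character default is illustrative; the actual thresholds
    were calibrated against a hand-evaluated sample of ~50 records.
    """
    return len(response.strip()) >= min_chars

print(is_information_rich("n/a"))  # False: too short to carry meaning
print(is_information_rich("Managers override credit checks without review"))  # True
```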

TL;DR

The Fraud Management team could have simply taken our results and called it a day. But instead they applied their own understanding of both the domain and analytics to make the process even faster.

Data scientists: I think the lesson here is that there’s really no substitute for proper domain knowledge. Data science can only take a problem so far. But when you combine analytical techniques with domain expertise, that’s when you unlock a solution’s full potential!

Other Interesting Findings:

ELMo and BERT:

We love using new tools as much as the next team! We tried both ELMo and BERT in place of TF-IDF, but neither set of embeddings worked as well as our TF-IDF vectors.

We suspect that this is because we were using pre-trained ELMo and BERT embeddings. Had these embeddings been trained on a T-Mobile-specific vocabulary, they might have done much better. It just goes to show how powerful the classical tools are, especially in a pinch.
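We used pre-trained embeddings directly in our pipeline; for readers who want to run a similar comparison today, one convenient route (not the exact one we took) is the sentence-transformers library:

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# A general-purpose pre-trained model; note it has never seen
# T-Mobile-specific vocabulary, which is our suspected weak spot
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode([
    "T-Mobile is great",
    "T-Mobile is awesome",
    "I never received my Galaxy upgrade",
])

# Swap these embeddings into the clustering sketch above to compare
print(cosine_similarity(embeddings).round(2))
```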

Fat-Tailed Distribution:

Interestingly, bucket sizes followed a fat-tailed distribution: a few large buckets held a big share of the responses, while most buckets held only a handful. This is a common phenomenon in nature.
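One quick way to check this on your own clusters is to sort the bucket sizes and see how top-heavy they are (here, labels is the bucket assignment from the clustering sketch above):

```python
from collections import Counter

# Count how many responses landed in each bucket, largest first
sizes = sorted(Counter(labels).values(), reverse=True)

# In a fat-tailed distribution, a handful of big buckets hold a
# disproportionate share, while the long tail is full of tiny ones
total = sum(sizes)
top_share = sum(sizes[:10]) / total
print(f"Top 10 buckets hold {top_share:.0%} of all responses")
```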

