Gokhan Ciflikli
Attest Product & Technology
5 min read · Jan 28, 2021


How we built a Machine Learning ensemble to tackle fraud detection in survey data

Photo by Lorenzo Herrera on Unsplash

At Attest, we provide a platform for our clients to design and create surveys. Survey respondents are sourced using ‘panel aggregators’ who recruit for their panels based on their country of residence. Surveys target certain quotas, which are combinations of demographic features representative of the population of interest. Finally, we transform respondent answers into actionable consumer insights.

Providing accurate consumer insights is a complicated undertaking. Survey respondents sometimes act in ways that are not conducive to high data quality. Garbage in, garbage out (GIGO), the idea that nonsense inputs produce nonsense outputs, is a real risk in market research. Research shows that fewer than half of companies are confident in the quality of their data, and that one-third of projects fail due to bad data quality.

In our industry, there are numerous pitfalls that can lead to GIGO: sampling biases, improper survey methodology, inappropriate statistical assumptions, and plain bad data. In this post, I will give a brief introduction to how Attest approaches the problem of data quality algorithmically.

In market research, quality checks are typically implemented using a set of hard-coded rules. Such implementations cover a variety of cases, including but not limited to:

  • Comparing the language the survey is written in to that of the respondent’s locale
  • Detecting mismatched demographics — e.g. where a respondent places themselves in the ‘don’t have kids’ category while also supplying the age and gender of their children
  • Removing respondents who display extreme behaviours such as excessive skipping — i.e. people who skip questions to reach the end of the survey as quickly as possible.

However, even though these rules are individually effective in detecting bad quality responses, they are limited to binary outcomes (which leads to a loss of nuance) and there is no information exchange between them. Respondents who provide gibberish (‘sdkfhg’) answers are filtered out, as are those with mismatched demographics or people who skip the majority of the questions. But what about those who skip some questions, while also providing low quality open-text answers that are not straight gibberish? Such respondents will not trigger any individual rule, but in the aggregate, the answers they provide will be of low quality. Thus, our challenge was: how can we use combinations of indicators of low data quality to improve our overall data quality?

As the only data scientist in the new cross-functional squad formed to address audience quality, it was my responsibility to come up with a solution to the problem of bad data quality. I devised a two-pronged plan of action:

  • Replace hard-coded rules with flexible models, and
  • Combine the learnings of multiple models into a single prediction.

The First Step

I achieved the first step by developing several machine learning models in Python to replace the rule-based quality checks. Some of these models are described below.

The Speeding algorithm detects bots and exceptionally fast humans. For the latter, we wanted to set an expectation of how long a given survey would take to complete. Using historical data and bootstrapping, I simulate 1,000 surveys with the same number and types of questions as the actual survey. For example, a survey might consist of 1 grid, 3 single choice, and 5 multiple choice questions. After controlling for question characteristics such as question length, the simulation produces a distribution of expected durations, given historical precedent.
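
To make the idea concrete, here is a minimal sketch of the bootstrapping step, assuming we keep historical per-question completion times keyed by question type; the timings, the 5th-percentile cut-off, and the function names are illustrative assumptions rather than our production values.

```python
import random

# Hypothetical historical completion times (seconds) per question type,
# drawn from past surveys; the numbers here are made up for illustration.
historical_durations = {
    "grid": [25.0, 31.2, 40.5, 28.7, 35.1],
    "single_choice": [6.1, 8.4, 5.9, 7.3, 9.0],
    "multiple_choice": [12.3, 15.8, 11.1, 14.2, 16.9],
}

def simulate_survey_durations(question_types, n_sims=1000, seed=42):
    """Bootstrap a distribution of total completion time for a survey
    made up of the given question types."""
    rng = random.Random(seed)
    totals = []
    for _ in range(n_sims):
        totals.append(
            sum(rng.choice(historical_durations[q]) for q in question_types)
        )
    return sorted(totals)

# A survey with 1 grid, 3 single choice, and 5 multiple choice questions.
survey = ["grid"] + ["single_choice"] * 3 + ["multiple_choice"] * 5
durations = simulate_survey_durations(survey)

# Respondents finishing faster than, say, the 5th percentile get flagged as speeders.
speed_threshold = durations[int(0.05 * len(durations))]
print(f"Flag respondents finishing in under {speed_threshold:.1f} seconds")
```

In production the simulation also controls for question characteristics such as question length, but the core idea is the same: resample historical timings to get a distribution of expected durations and flag respondents who sit far below it.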

The Answer Positions algorithm can detect any autoregressive pattern of answers in which past observations are highly predictive of the next one, such as:

  • Flatlining (1,1,1,1,…)
  • Diagonals (1,2,3,4,… and 9,8,7,6,…)
  • Repetitions (1,2,1,2…)

and so on. Another position-related data quality concern is overclaiming: the tendency of some respondents to exaggerate their answers in multiple choice questions. For example, a question like ‘Which of the following brands have you heard of?’ incentivises respondents to select all the options, in case the survey creator only wants to pay for respondents who are already familiar with their brand. To address this, I built a model with a dynamic threshold for overclaiming. This gives us the flexibility to cover a wide range of overclaimers, from respondents who select every option in a single question to those who consistently select most of the options across many questions without ever selecting all of the answers in any one of them.
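
The production models are more flexible than hard-coded rules, but a toy sketch helps illustrate the kinds of position signals described above; the function names, the patterns checked, and the simple overclaiming ratio are simplified assumptions for illustration only.

```python
def is_flatlining(positions):
    """Every answer landed in the same position, e.g. 1,1,1,1."""
    return len(set(positions)) == 1

def is_diagonal(positions):
    """Answers move by a constant step of +1 or -1, e.g. 1,2,3,4 or 9,8,7,6."""
    steps = {b - a for a, b in zip(positions, positions[1:])}
    return steps == {1} or steps == {-1}

def is_repetition(positions, period=2):
    """Answers cycle with a short period, e.g. 1,2,1,2."""
    cycles = all(p == positions[i % period] for i, p in enumerate(positions))
    return cycles and len(set(positions[:period])) > 1

def position_flags(positions):
    """Which suspicious answer-position patterns does this respondent trigger?"""
    return {
        "flatlining": is_flatlining(positions),
        "diagonal": is_diagonal(positions),
        "repetition": is_repetition(positions),
    }

def overclaiming_rate(selections):
    """Share of available options ticked across multiple choice questions,
    given (n_selected, n_options) pairs; the real model applies a dynamic threshold."""
    selected = sum(n for n, _ in selections)
    offered = sum(m for _, m in selections)
    return selected / offered if offered else 0.0

print(position_flags([1, 1, 1, 1]))                  # flatlining
print(position_flags([9, 8, 7, 6]))                  # diagonal
print(position_flags([1, 2, 1, 2]))                  # repetition
print(overclaiming_rate([(6, 8), (7, 8), (8, 10)]))  # ~0.81
```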

For open-text answers, we developed several NLP components: gibberish detection, profanity removal, and a question-answer relevancy model that can handle emojis in answers.
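
Our open-text components rely on proper NLP models, but a character-level heuristic gives a flavour of what gibberish detection is trying to catch; the thresholds and the function name here are assumptions for illustration only.

```python
import re

VOWELS = set("aeiou")

def looks_like_gibberish(text, min_vowel_ratio=0.2, max_consonant_run=4):
    """Rough heuristic: flag answers whose words have too few vowels
    or implausibly long consonant runs (e.g. 'sdkfhg')."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return False  # empty or non-alphabetic answers are handled elsewhere
    for word in words:
        vowel_ratio = sum(c in VOWELS for c in word) / len(word)
        longest_run = max((len(r) for r in re.findall(r"[^aeiou]+", word)), default=0)
        if vowel_ratio < min_vowel_ratio or longest_run > max_consonant_run:
            return True
    return False

print(looks_like_gibberish("sdkfhg"))             # True
print(looks_like_gibberish("I love this brand"))  # False
```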

Finally, thanks to the work of our data engineers, we have infrastructure in place that allows us to evaluate any incoming data point for both response characteristics (speed, positions) and open-text data.

The Second Step

Now armed with a set of algorithms, the next step was to address the original issue: the algorithms proved to be an improvement over their rule-based counterparts, yet they were still operating individually. In my PhD dissertation, I used ensemble models to pool predictions and improve on the accuracy of individual models when forecasting armed conflict duration. From that research, I knew that the key to a successful ensemble is model diversity.

This makes sense: models can be thought of as our flawed abstractions of real data-generating processes, which are inherently complex. As a result, similar models make similar predictions; they think alike. By diversifying our models, we maximise the information content (higher entropy, if you are into information theory!).

In the end, we decided to weight the individual model predictions to produce a composite respondent score. In doing so, we are now fully equipped to detect low-quality respondents even when they don't trigger any individual data quality algorithm: the composite score will raise a flag as a function of all of a respondent's actions.
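
Conceptually, the composite score is a weighted combination of the individual model outputs. Here is a minimal sketch, assuming each model emits a score between 0 (worst) and 1 (best); the weights, per-model scores, and flagging threshold are hypothetical and not our production values.

```python
def composite_score(model_scores, weights):
    """Weighted average of individual data quality model scores."""
    total_weight = sum(weights[name] for name in model_scores)
    return sum(score * weights[name] for name, score in model_scores.items()) / total_weight

# Hypothetical weights and per-model scores for one respondent.
weights = {"speeding": 0.3, "answer_positions": 0.3, "open_text": 0.4}
respondent = {"speeding": 0.9, "answer_positions": 0.6, "open_text": 0.5}

score = composite_score(respondent, weights)

# No single model is alarming on its own, but the combined score can still
# fall below the cut-off and flag the respondent.
FLAG_THRESHOLD = 0.7
print(f"score={score:.2f}, flagged={score < FLAG_THRESHOLD}")
```

This is exactly the scenario the ensemble was built for: a respondent who skips a few questions and writes mediocre open text may pass every individual check, yet still end up flagged by the combined score.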

The respondent score approach made possible by the fraud detection ensemble also informs our new sampling framework, Intelligent Sampling (IS). IS is a novel framework that allows for panel optimisation based on respondent scores, giving us control over costs, survey time-to-fill, and, most importantly, data quality.

Stay tuned for the next Attest Data Science blog post for a deep dive on Intelligent Sampling!
