The Summer of ’19
One year before finishing my Master's, I decided to spend my summer learning and gaining experience in the fields I’m focusing on: data science and machine learning.
Now that summer is over, I realize this was an excellent decision. But that’s because I was lucky enough to end up in one of Feedzai’s summer internship programs. Three months have passed since all the interns, who were allocated to a great diversity of projects (not only in data science and machine learning), were cheerfully welcomed by everyone at the office.
Before getting all techy, I must thank Feedzai for providing me with such an amazing experience. I have learned at a remarkably fast pace, and I never felt like I was “just” an intern; instead, I was always treated and trusted like other members of the team. Without further ado, let's get to the point of this post.
One of the things Feedzai did that I admire the most was that they assigned interns to projects of actual relevance. In my case, I joined the Research Data Science team, where I was challenged to study and experiment with the application of active learning to fraud detection.
What is active learning?
Well, imagine you are an entity that processes payments (e.g., a bank): you likely have hundreds of thousands of clients making millions of transactions per day. You suspect that a very small portion of those transactions is fraudulent, somewhere between 0.1% and 3%, but you have no idea which ones (for some reason, fraudsters refuse to admit they are committing fraud).
Without a machine learning model, there are two main ways to identify which of those transactions are fraudulent:
- Chargebacks: these occur when clients complain that they didn’t make a transaction, and it is assumed that the account or card details have been stolen. Chargebacks would solve the problem of obtaining labels perfectly if it weren’t for a huge drawback: they can arrive days or weeks after the transaction occurred, or never arrive at all.
- Analysts’ feedback: this is received when transactions are sent to an expert who analyzes and labels them manually, one by one, as fraudulent or legitimate.
If you wish to build a machine learning model to classify transactions, ideally you would leverage both methods.
But which transactions would you send to the analysts? Every single one of them? Poor analysts… Especially since 90% of their time, at the very least, would be spent reviewing similar transactions. In effect, there would never be enough people for such a task. So, how can you pick which transactions analysts should be looking at? How can we go about creating, as we call it, an intelligent review queue?
That’s where active learning comes in. Active learning is the field that covers the following situation:
- You don’t have any labels at all for your data;
- You implement a policy that decides what data the model will be trained on;
- You try to find the “ideal” model with the smallest possible amount of data;
- The overall goal is to minimize the costs of sending examples to an oracle that labels them (in our case, the analysts).
It’s important to understand how transactions would flow within an active learning framework. Let’s analyze the components in the following diagram:
- The client’s transactions happen in real-time and are represented by the transaction stream;
- As transactions come in through the stream, they are stored in the unlabeled data pool, which holds every transaction whose label we don’t yet know (i.e., we don’t know whether it is fraudulent or legitimate). This pool grows constantly over time;
- This is the “heart” of the active learning framework. The gears on the left represent an active learning policy, which selects a batch of unlabeled transactions that it considers the most relevant to send to the analysts.
- After being labeled by the analysts, the transactions from the batch, together with their brand-new labels, are added to the labeled data pool. With this pool, we are now capable of training a machine learning model.
- Now we can start iterating. In every iteration, we can use the information available in the labeled pool, the unlabeled pool, and the current model to update the active learning policy, so it can decide which transactions would be the most relevant to label next, growing the labeled pool.
Disclaimer: In such a preliminary phase of this research, we can’t actually send transactions to the analysts because we simply can’t afford that much of their time for each experiment. Since we already have labels for all the data, we simulate the process as follows:
- At first, all transactions are treated as if they are still unlabeled.
- When the policy decides that a batch of transactions should be “sent to the analysts,” the batch is moved to the labeled data pool and the model also starts training on those transactions.
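The simulated flow described above can be sketched roughly as follows. This is a minimal illustration rather than the code we actually ran: `Pools`, `random_policy`, `query_batch`, and the `amount` feature are hypothetical names, and random selection is only a placeholder for a real active learning policy.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Pools:
    """Unlabeled and labeled pools keyed by transaction id (hypothetical structure)."""
    unlabeled: Dict[int, dict]
    labeled: Dict[int, Tuple[dict, int]] = field(default_factory=dict)


def random_policy(candidate_ids: List[int], batch_size: int, rng: random.Random) -> List[int]:
    """Placeholder policy: pick a batch uniformly at random."""
    return rng.sample(candidate_ids, min(batch_size, len(candidate_ids)))


def query_batch(
    pools: Pools,
    hidden_labels: Dict[int, int],
    policy: Callable[[List[int], int, random.Random], List[int]],
    batch_size: int,
    rng: random.Random,
) -> List[int]:
    """Simulate 'sending a batch to the analysts' by revealing labels we already hold."""
    batch = policy(sorted(pools.unlabeled), batch_size, rng)
    for tx_id in batch:
        features = pools.unlabeled.pop(tx_id)           # leaves the unlabeled pool
        pools.labeled[tx_id] = (features, hidden_labels[tx_id])  # joins the labeled pool
    return batch


# Toy data: 20 transactions, all treated as unlabeled at first.
transactions = {i: {"amount": 10.0 * i} for i in range(20)}
hidden_labels = {i: int(i % 7 == 0) for i in range(20)}  # 1 = fraud, 0 = legitimate

pools = Pools(unlabeled=dict(transactions))
batch = query_batch(pools, hidden_labels, random_policy, batch_size=5, rng=random.Random(0))
print(len(pools.unlabeled), len(pools.labeled))
```

Passing the policy in as a function keeps the pool bookkeeping identical no matter which selection strategy is being tested.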
Measuring the policy’s performance
We need to find some way to know how accurate the machine learning model is each time the labeled data pool grows (i.e., after new transactions are reviewed by the analysts).
In a real use case, the lack of labeled data keeps us from having a test set to evaluate model performance. Nevertheless, we are still in a very early stage of the project and we want to know how different active learning policies perform, so we split our dataset in half and, since the data is time-dependent, use the most recent half as the test set. Furthermore, in a real situation, we would tune our active learning policy on a dataset from a similar domain, so this analysis is still relevant.
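As a rough sketch of that split, assuming each transaction carries a `timestamp` field (a hypothetical name), the older half becomes the pool the policy queries from and the newer half becomes the test set:

```python
from typing import List, Tuple


def time_split(transactions: List[dict], time_key: str = "timestamp") -> Tuple[List[dict], List[dict]]:
    """Sort by time and cut in half: older half for querying, newer half for testing."""
    ordered = sorted(transactions, key=lambda tx: tx[time_key])
    mid = len(ordered) // 2
    return ordered[:mid], ordered[mid:]


# Toy transactions arriving out of order.
transactions = [{"timestamp": t, "amount": 5.0 * t} for t in (7, 2, 9, 4, 1, 8)]
pool, test_set = time_split(transactions)
print([tx["timestamp"] for tx in pool], [tx["timestamp"] for tx in test_set])
```

Splitting by time rather than at random means the model is never evaluated on transactions older than the ones it was trained on.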
Thus, our experiments are organized as follows:
- New transactions come “from the analysts” with labels;
- Our implementation retrains the model from scratch on the full labeled data pool;
- We use the aforementioned test set to obtain some performance metric (e.g., accuracy, recall, or area under the ROC curve);
- If the maximum querying budget we established (e.g., querying 10k transactions) is not exceeded, the process repeats.
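The steps above can be sketched as a simple loop. Everything here is a toy stand-in under stated assumptions: the "model" is a one-feature threshold, the metric is recall, the data is synthetic with an unrealistically high 10% fraud rate, and random selection takes the place of the policy. The names `train`, `recall`, and `run_experiment` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[float, int]  # (single feature, label); toy stand-in for a transaction


def train(labeled: Sequence[Example]) -> float:
    """Toy model: a threshold halfway between the class means.

    Assumes fraud tends to have larger feature values than legitimate traffic.
    """
    fraud = [x for x, y in labeled if y == 1]
    legit = [x for x, y in labeled if y == 0]
    if not fraud or not legit:
        return float("inf")  # classes not separable yet: flag nothing
    return (sum(fraud) / len(fraud) + sum(legit) / len(legit)) / 2


def recall(threshold: float, test_set: Sequence[Example]) -> float:
    """Fraction of fraudulent test examples that the model flags."""
    fraud = [x for x, y in test_set if y == 1]
    return sum(x > threshold for x in fraud) / len(fraud) if fraud else 0.0


def run_experiment(pool, test_set, batch_size: int, budget: int, seed: int) -> List[Tuple[int, float]]:
    rng = random.Random(seed)
    unlabeled, labeled, curve = list(pool), [], []
    while unlabeled and len(labeled) + batch_size <= budget:
        rng.shuffle(unlabeled)  # random selection stands in for the policy
        batch, unlabeled = unlabeled[:batch_size], unlabeled[batch_size:]
        labeled.extend(batch)   # labels "come back from the analysts"
        model = train(labeled)  # retrain from scratch on the full labeled pool
        curve.append((len(labeled), recall(model, test_set)))
    return curve


def make_examples(n: int, rng: random.Random) -> List[Example]:
    """Synthetic data: fraud is centred at 3.0, legitimate traffic at 0.0."""
    examples = []
    for _ in range(n):
        y = int(rng.random() < 0.1)
        examples.append((rng.gauss(3.0 if y else 0.0, 1.0), y))
    return examples


data_rng = random.Random(42)
pool, test_set = make_examples(500, data_rng), make_examples(500, data_rng)
curve = run_experiment(pool, test_set, batch_size=50, budget=300, seed=0)
for n_labels, score in curve:
    print(n_labels, round(score, 3))
```

The returned list of (number of labels, score) pairs is what gets drawn as a learning curve.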
How well is an active learning policy doing? To answer that question, we must have something to compare it with. So, we established two simple baselines:
- The pessimistic baseline: the active learning policy simply sends transactions to the analysts completely at random. We definitely don’t want to do worse than that, right?
- The optimistic baseline: given that our dataset actually has labels and we are just pretending it doesn’t (for the sake of experimentation), we can use all of the data to train the model and find what is, supposedly, the best model one can hope to get.
In this plot, you can see a run of our pessimistic baseline. The blue line represents the performance of the model throughout the querying process. The dash-dotted horizontal red line is the optimistic baseline. And it appears that the model stabilizes there once it has nearly 4k labels.
Not so bad, right? The model trained with 4k transactions is as good as the optimistic baseline, which was trained with 150k transactions. But don’t forget this was a random run, which means it could have been worse. It might just depend on how lucky you are.
So, how do we get a visualization of the margin between the worst- and best-case scenarios? Easy (but computationally expensive)! We run the same experiment you have seen above hundreds of times, with nothing different besides the random seed, and then we combine the resulting curves into a single plot that shows their distribution.
In the case of random sampling (a.k.a. the pessimistic baseline), the result is this:
What we can deduce from this plot is:
- The model usually works well once it has access to 4k transactions, as we saw in the plot in Figure 3.
- However, even around 5k transactions, there are still some runs where the model performs worse than others.
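To make the repeated-runs idea concrete, here is a minimal sketch of how many seeded runs can be summarized per label count. `run_once` is a hypothetical stand-in that fabricates a noisy learning curve purely for illustration; in practice it would be one full querying experiment, and the numbers it produces here are not our results.

```python
import random
import statistics
from typing import Dict, List

LABEL_COUNTS = [500, 1000, 2000, 4000]


def run_once(seed: int) -> List[float]:
    """Hypothetical stand-in for one experiment: one score per label count.

    The synthetic noise shrinks as the number of labels grows.
    """
    rng = random.Random(seed)
    return [
        min(1.0, max(0.0, 0.9 - 200.0 / n + rng.gauss(0.0, 50.0 / n)))
        for n in LABEL_COUNTS
    ]


def summarize(runs: List[List[float]]) -> Dict[int, Dict[str, float]]:
    """Worst, median, and best score across runs at each label count."""
    summary = {}
    for i, n in enumerate(LABEL_COUNTS):
        scores = sorted(run[i] for run in runs)
        summary[n] = {
            "min": scores[0],
            "median": statistics.median(scores),
            "max": scores[-1],
        }
    return summary


# Same experiment, hundreds of times, changing only the random seed.
runs = [run_once(seed) for seed in range(200)]
summary = summarize(runs)
for n, stats in summary.items():
    print(n, {name: round(value, 3) for name, value in stats.items()})
```

The gap between "min" and "max" at each label count is the margin between the worst and best case; a good policy should shrink it early.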
What we achieved so far
After going through the current state-of-the-art methods in active learning (if you are interested, I recommend this paper to get started), we implemented several active learning methods, some solely based on the current literature and some with a few tweaks of our own.
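To give a flavor of what a policy looks like, here is a sketch of uncertainty sampling, a classic textbook policy from the active learning literature. It is shown for illustration only and is not necessarily one of the policies behind the results in this post; `uncertainty_sampling`, `toy_score`, and the `amount` feature are hypothetical.

```python
from typing import Callable, Dict, List


def uncertainty_sampling(
    unlabeled: Dict[int, dict],
    predict_proba: Callable[[dict], float],
    batch_size: int,
) -> List[int]:
    """Pick the transactions whose predicted fraud probability is closest to 0.5."""
    ranked = sorted(unlabeled, key=lambda tx_id: abs(predict_proba(unlabeled[tx_id]) - 0.5))
    return ranked[:batch_size]


def toy_score(tx: dict) -> float:
    """Hypothetical scoring function standing in for the current model."""
    return min(tx["amount"] / 100.0, 1.0)


unlabeled = {
    1: {"amount": 5.0},   # score 0.05: confidently legitimate
    2: {"amount": 48.0},  # score 0.48: very uncertain
    3: {"amount": 95.0},  # score 0.95: confidently fraud
    4: {"amount": 60.0},  # score 0.60: fairly uncertain
}
batch = uncertainty_sampling(unlabeled, toy_score, batch_size=2)
print(batch)
```

The intuition is that labels for transactions the model is already sure about teach it little, so the analysts’ time is spent where the model is most undecided.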
After a lot of experiments, the best result we achieved is the following:
We managed to make the distributions much “thinner” and to make at least 83% of the runs stabilize at around 1k transactions.
Yet, there are still a few scenarios where the model obtains a lower score around 2k transactions, which is not yet optimal.
We achieved promising results that make us believe we are on the right path. We have many ideas related to active learning policies that still need to be tested, and we believe that we’ll find a robust one that always achieves stable top performance at 2k transactions (at least!).
I hope this post provides you with some insights on active learning and the methods we used to evaluate its policies.
This post is not about Feedzai. However, I cannot end things without congratulating its people for creating such an amazing culture. It’s crystal clear that the company cares about the well-being of all its employees, and despite the fact that everyone is free to manage their work in a way that best works for them, you can sense that Feedzaians share the same grit and desire to tackle fraud and push Feedzai to even greater heights.