Feature Investigation: Automatically Detect Drift in your Data Over Time
In the era of big data, financial institutions rely on increasingly complex real-time streaming systems that process thousands of events per second. Each event may contain hundreds or even thousands of features. In the financial crime domain, these features are used by Machine Learning (ML) and rule-based algorithms, under the assumption that future data flowing through the system will follow the same distribution as previously seen data. However, fraud patterns and customer behaviors evolve over time, resulting in data drift and potential performance degradation that impacts business and system users.
Suppose we have had an ML model in production identifying risky transactions for several weeks. During the last week we noticed a significant increase in the alert rate, i.e., the fraction of that week’s events identified as risky. Without further analysis, it is not immediately clear what is causing this increase. Often, it is due to one or more data features behaving differently than they did during the model training period. If the model was trained with hundreds of features, analyzing each individual feature to find the culprits can be very time-consuming for a data scientist and can delay the fixes needed to mitigate the effects of the drift.
To solve this problem, and aligned with our efforts on AI Observability, Feedzai developed Feature Investigation, an automated system that detects data drift by monitoring the distribution of data features. The system has both a low memory and a low computation footprint. Drift in a given multivariate target period is detected using a data-driven statistical test against a reference data period.
In this blog post, we’ll describe the technical details of the Feature Investigation system and how it allows Feedzai and its clients to accurately and quickly detect data drifts and issues in their risk-based systems. For more details, check out our paper from the AutoML Workshop in KDD’22. In a follow-up post, we will describe the innovative visualizations designed to help the user investigate each feature’s behavior.
In the figure below, we present a schematic overview of the Feature Investigation system. It is composed of three main components:
- Build Reference, in which the typical feature distributions are obtained from a reference period;
- Evaluate Target, in which the feature distributions in the target period are compared with the reference period;
- Investigate, in which the feature distribution differences can be easily analyzed using interactive visualizations.
We will describe the first two components in more detail in the following sections. But first, a quick detour.
What are Moving Histograms?
Before going into more details about the Build Reference and the Evaluate Target stages, it makes sense to first introduce what we call the Moving Histograms, the core building block of the Feature Investigation solution. To put it simply, the Moving Histograms are just a way to represent feature distributions over time. In the Feature Investigation system, we use histogram representations as the medium to compare feature distributions.
The goal of Moving Histograms is to efficiently collect the state of a feature using the most recent data points. Intuitively, the most direct way of achieving this would be to use a sliding window over our data and iteratively build a histogram of feature values for each new observed data point. This approach, however, can be very inefficient because we would need to store in memory all the feature values for the data points inside that window. We would also need to know, at every step, which data points are exiting the window period and how this affects the histogram’s shape.
The Moving Histograms circumvent this issue by not defining an exact sliding window and instead assuming that all seen data points are discounted with some constant configurable rate as illustrated in the following figure.
This is exactly the same idea behind exponential moving averages (EMA), but applied to all the histogram bins. The following figure contrasts a sliding window with an EMA tail in more detail, which we discuss next.
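As a concrete illustration, a Moving Histogram can be sketched in a few lines of Python. This is a minimal sketch under our own assumptions (the class name, the fixed bin edges, and the per-event half-life are illustrative simplifications, not Feature Investigation’s exact implementation):

```python
import numpy as np

class MovingHistogram:
    """Histogram of a numeric feature in which all existing bin counts
    are discounted by a constant factor before each new observation is
    added, so old data fades out with an EMA-style exponential tail."""

    def __init__(self, bin_edges, half_life):
        self.bin_edges = np.asarray(bin_edges, dtype=float)
        self.counts = np.zeros(len(bin_edges) - 1)
        # After `half_life` updates, an observation's weight has halved.
        self.decay = 0.5 ** (1.0 / half_life)

    def update(self, value):
        self.counts *= self.decay  # discount every bin, no window bookkeeping
        idx = np.searchsorted(self.bin_edges, value, side="right") - 1
        idx = min(max(idx, 0), len(self.counts) - 1)  # clamp outliers
        self.counts[idx] += 1.0

    def density(self):
        """Normalized histogram, usable for divergence comparisons."""
        total = self.counts.sum()
        return self.counts / total if total > 0 else self.counts
```

Note that `update` never stores individual data points: only the current bin counts live in memory, which is what gives the constant space complexity discussed below.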
A guide to half-lives
The discount rate 𝜏 is what defines an effective window size, equivalent to the traditional sliding window size. In Feature Investigation, we call this the half-life of the window. The term half-life refers to the time (or the number of discount steps) it takes for a data point to be discounted to half of its initial effect.
A higher discount rate will expire the seen data points faster, corresponding to a shorter half-life. This makes the most recent events have a bigger impact on the histogram shape.
Conversely, a lower 𝜏 makes the effect of each data point take longer to disappear from the histogram, making histograms less sensitive to newer data points. This corresponds to a longer half-life.
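The relationship between the per-step discount and the half-life can be written down directly (function names are illustrative; a sketch, not the system’s exact parameterization):

```python
def per_step_discount(half_life):
    """Discount factor applied at each step so that an observation's
    weight halves after `half_life` steps."""
    return 0.5 ** (1.0 / half_life)

def weight_after(steps, half_life):
    """Residual weight of a single observation after `steps` discount
    steps: 1.0 when it arrives, 0.5 after one half-life, 0.25 after
    two, and so on."""
    return per_step_discount(half_life) ** steps
```

For example, with a half-life of 7 steps an event still carries half its original weight a week later, whereas with a half-life of 2 it is already down to about 9% — a shorter half-life expires data faster, exactly as described above.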
Using these exponential discounts allows the histograms to be of constant space complexity because we only need to keep the current histogram state in memory. This means that the memory consumption will be independent of the effective window size set by the discount parameter 𝜏. This makes the Moving Histograms lightweight and good for a live monitoring system without adding a heavy additional burden to the underlying decision-making system.
We build Feature Investigation with efficiency as a major requirement because we need it to perform seamlessly in live systems and to process large-scale datasets when running Feature Investigation offline.
Build Reference
The purpose of the first component is to estimate each feature’s distribution during the Reference period. Typically, the Reference should comprise an extended period of several weeks or months of data, for instance, the training period of an ML model.
For each feature, an overall Reference histogram is built to characterize the data distribution during this period. Given that we typically want to evaluate the Target data in considerably shorter timescales than the Reference (e.g., in one-week periods after an ML model has been deployed instead of after several months), we compare each feature’s distribution in the Reference with its distribution in shorter time periods, at different time steps. To perform this comparison, we use a divergence measure (for example, Kolmogorov-Smirnov, Wasserstein, or Jensen-Shannon). By computing this value at various time steps, we then obtain a histogram of divergence values for each feature.
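A simplified sketch of this stage, using plain fixed-window histograms instead of Moving Histograms and the Jensen-Shannon divergence as the divergence measure (the function names and windowing scheme are our own assumptions for illustration, not the exact Feature Investigation implementation):

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base-2 logs, so bounded in [0, 1])
    between two histograms, normalized internally."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def build_reference(values, bin_edges, window, step):
    """Overall Reference histogram, plus the empirical distribution of
    divergences between the Reference and shorter sub-windows of it."""
    ref_hist, _ = np.histogram(values, bins=bin_edges)
    divergences = []
    for start in range(0, len(values) - window + 1, step):
        sub_hist, _ = np.histogram(values[start:start + window], bins=bin_edges)
        divergences.append(js_divergence(ref_hist, sub_hist))
    return ref_hist, np.array(divergences)
```

The returned array of divergences is the per-feature “histogram of divergence values” that the Evaluate Target stage later uses to judge how extreme a new divergence is.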
Evaluate Target
After the Reference has been built, we are ready to evaluate the features’ distributions in the Target data and compare them with the Reference data. In this stage, the system analyzes user-defined periods of data at a given frequency (for example, one-week periods of data monitored daily). It then computes the divergence values of each feature relative to the Reference. If the divergence is larger than a set threshold, an alarm can be triggered, and the features can be ranked on a severity scale to explain the alarm.
In more detail, for each feature, after its histogram has been updated at a new time step, it is compared with the Reference histogram to compute a divergence value. This value is then located in the histogram of divergence values obtained during the Build Reference stage, and a p-value is computed for the hypothesis that it falls within the expected distribution of divergences.
After all p-values have been calculated, a multivariate hypothesis test is applied under the null hypothesis that the divergence values observed for all the features in the Target data follow the same distributions as in the Reference data. This test requires setting a global significance level (also known as the family-wise error rate), which corresponds to the probability of rejecting the null hypothesis due to random fluctuations alone.
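These two steps can be sketched as follows, assuming an empirical p-value against the Reference divergences and a Bonferroni-style family-wise correction as the multivariate test (the paper’s exact test may differ, and all names are illustrative):

```python
import numpy as np

def empirical_p_value(observed_divergence, reference_divergences):
    """Fraction of Reference divergences at least as extreme as the
    observed one, with a +1 correction so it is never exactly zero."""
    ref = np.asarray(reference_divergences)
    return (np.sum(ref >= observed_divergence) + 1) / (len(ref) + 1)

def drift_alarm(p_values, alpha=0.01):
    """Bonferroni-style family-wise test: reject the joint null (no
    drift in any feature) if any per-feature p-value falls below
    alpha / n_features, and rank the flagged features by severity."""
    names = list(p_values.keys())
    p = np.asarray([p_values[name] for name in names])
    threshold = alpha / len(p)
    flagged = sorted(
        ((names[i], p[i]) for i in range(len(p)) if p[i] < threshold),
        key=lambda pair: pair[1],  # most severe (smallest p-value) first
    )
    return len(flagged) > 0, flagged
```

The sorted list of flagged features doubles as the severity ranking mentioned above: the smaller the p-value, the more anomalous the feature.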
The final step is to generate an explanation to pass to the user that may help quickly identify the issue’s root cause.
Feature Investigation in practice
To get a sense of how the Feature Investigation system works in practice, we examine the results of injecting artificial drift into a publicly available real-world dataset. For this analysis, we considered 26 features, consisting of the transaction amount and derived aggregations, such as the average amount per card in a certain period. The artificial drift consists of randomly transforming 10% of the transaction amounts from their values in dollars to cents during one month, simulating the result of either human or system error.
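Such a unit error can be simulated in a few lines. This sketch assumes that “dollars to cents” means the raw value is multiplied by 100 (e.g., $12.50 recorded as 1250); the function name and signature are illustrative:

```python
import numpy as np

def inject_unit_drift(amounts, fraction=0.1, seed=0):
    """Simulate a unit-handling error: a random fraction of amounts is
    recorded in cents instead of dollars (multiplied by 100).
    Returns the corrupted array and the mask of affected rows."""
    rng = np.random.default_rng(seed)
    corrupted = np.asarray(amounts, dtype=float).copy()
    mask = rng.random(len(corrupted)) < fraction
    corrupted[mask] *= 100.0
    return corrupted, mask
```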
In the following plot, we represent the distribution of the transaction amount using several quantile bands, computed daily and illustrated in different tones of blue.
As illustrated below, during one month, the transaction amount distribution changes significantly. The solid colored lines represent the p-values for various half-life (HL) + divergence measure (Div) configurations. The dashed lines show the p-values if there had not been any artificial drift. Finally, the horizontal red line represents the threshold used for an alarm to be triggered.
We observe that, as intended, the p-values start to decrease right after the artificial injection starts, and, conversely, they increase as soon as it ends. For the chosen threshold, two of the configurations (green and orange) would trigger an alarm, with the orange one triggering earlier and staying triggered for a longer period of time. Depending on the chosen configuration, the Feature Investigation system is able to detect the issue in a matter of hours or days.
For the green configuration, the image below shows two heatmaps representing the p-values of the various features over time for the cases without (left) and with (right) artificial drift. This illustrates how the Feature Investigation system is able to detect the problem not only in the original feature (the transaction amount) but also in the derived aggregations.
The example above showcases the ability of the Feature Investigation system to detect an artificially created data issue on an otherwise real dataset. But what about data with real-world drifts and issues?
One of Feedzai’s clients, a European banking institution, used Feature Investigation to compare the data used to train a fraud detection ML model with newer Production data. After running it for only a single day, they found that several features presented significant differences between the two periods, for two main reasons:
- Broken features: some features had a completely different set of values in Production or no values at all. This indicated problems with data ingestion that the bank was previously unaware of.
- Data drift: some features had different distributions in Production due to different behavioral patterns. Among them were the most important features used to train the ML model in Production at the bank. This was evidence that it was time to retrain the model with more current training data.
Using Feature Investigation, this bank was able to quickly detect issues and drifts in their Production data and identify two key action points to mitigate the impact on the fraud prevention system: fix the data ingestion issues from their side and retrain the ML model with the most recent data.
Feature Investigation is a new flexible and lightweight system developed by Feedzai that detects data drifts by monitoring the distribution of several features over time. It leverages exponential moving histograms to guarantee a low computational and memory footprint and provides alarms on significantly drifted features based on a multivariate data-driven statistical test. For more details, don’t forget to check out our paper from the AutoML Workshop in KDD’22.
Moreover, the system provides interactive visualizations which simplify data scientists’ workload of investigating the behavior of each feature. This will be the focus of the next blog post in this series.
Thanks to Beatriz Malveiro, João Palmeiro, João Torres, Javier Perez, João Ascensão and Pedro Bizarro.