Algorithmic Fairness in Fraud Prevention
A practical perspective on how to introduce algorithmic fairness into the development of automated counter-fraud decision-making systems
By Hao Yi Ong and Yaniv Goldenberg
A while ago, a social media company used a name scoring algorithm for spam detection. When the company launched in a new market with a different language and culture, its spam detection algorithm produced abnormally high false positive rates for good users in that country. This bias wasn’t due to any intent to discriminate, but to the fact that patterns in the original training data didn’t hold in the actual user population. It turned out that the training data didn’t contain enough examples of names common in that culture, and the model ended up treating them the same way it treated the gibberish names commonly associated with fraudulent accounts. The developers simply weren’t aware that some cultures use first names that are predominantly two characters long.
Aside from moral and reputational issues, unfairness has its own business cost. Denying service to many good users caused a negative user experience and slowed product growth. To be sure, the company cared about fair model development. But a system is only as good as the humans that generate its training data, and all humans have biases and can sometimes be ignorant, i.e., lack sufficient data. Despite the company’s good intentions, the system unfairly discriminated against good users based on their culture because developers failed to fully account for biases in what they deemed to be a fair development process — getting it right was harder than they realized.
Automated decision-making relies heavily on statistical and machine learning algorithms. From dynamic pricing to support ticket routing to fraud detection, these algorithms have an outsized impact on user experiences. In some of these cases, predictive models are trained to predict a target that was defined in a biased way with respect to sensitive user attributes. In the case above, bias resulted from the lack of representation of a hitherto unobserved segment of users, thereby skewing the distribution of training targets. Sensitive attributes include what would constitute legally or socially protected classes such as race, religion, gender, sexual orientation, etc. Given the importance of these algorithms to the user experience, we must be careful that they make fair decisions with respect to these attributes.
In this post, we focus on algorithmic fairness in the domain of fraud detection and countermeasures. While fairness is also a crucial factor in other domains, it’s especially important in fraud because we make drastic user-impacting decisions based on whether we think a user will cause financial harm. For instance, once a user incurs substantial chargebacks on transactions that look like they’re from stolen credit cards, we may deny further services to that account until the user can justify the disputed transactions. This sort of fraud countermeasure is similar to how banks deny credit or loans due to a low credit score, or how payments companies flag or block transactions that appear to be instances of money laundering.
Fraud countermeasures are meant to protect good users and the company from online and financial harm. But wrongly blocking users from services incurs high user friction and churn. Worse, fraud decisions that are both wrong and unfair cause loss of trust in a product. A fair process of algorithmic development ensures data scientists pay sufficient attention to fairness in the machine learning and feature development processes. Alone, however, a fair development process doesn’t always guarantee that the algorithms developed will be fair. In fraud, it’s also important to incorporate timely manual reviews that help reveal inconsistencies in decision-making with respect to sensitive user attributes. We believe that a focus on the fairness of algorithmic development, augmented by constant monitoring of algorithms via a proper review framework, is the right approach within the unique context of fraud.
Overview
- Defining fairness
- Making fairness operational
- Automating decisions
Most of us have an intuitive sense of what’s right and wrong that informs what we call fairness. To form the basis for discussion, we survey the existing literature surrounding formal definitions of fairness. We then delve into how they apply in the context of fraud in mobile apps. We show how the operational context, in turn, informs how we can design and develop counter-fraud services such that the algorithms and product experience together generate decisions that are as fair as practically possible to users.
Defining fairness
How can we formally define fairness in a way that’s consistent with intuition? In the context of fraud detection, is the fairness of outcome always achievable or even measurable? How does it factor into everyday decisions about the algorithms we build and the user experience we craft?
Here, we’ll briefly survey the three competing notions of algorithmic fairness: (1) process fairness, (2) algorithmic performance equality, and (3) statistical parity, which are also described in Johndrow and Lum’s article.
Process fairness
The first approach to algorithmic fairness asserts that disregarding protected attributes in the process of developing, say, a machine learning model will result in a fair model. The idea is simple but problematic. The first issue is that protected attributes can be redundantly encoded in other terms. As a toy example, suppose we were interested in modeling some behavior of pre-WWI American citizens with gender as a protected attribute. We decide to omit it from model development, only to find that the model gives strong weight to the number of times a person was eligible to vote. Why? Well, because women’s suffrage was only granted with the passing of the 19th Amendment in 1920, after WWI. In other words, the voting eligibility feature was almost perfectly correlated with the citizen’s gender. And even if we conducted a simple search-and-delete on highly correlated features, there may be higher-order interactions between features that encode essentially the same information as the protected variables. Because of their correlation with permitted variables, the impact of the protected attributes remains.
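One way to see how easily a “dropped” attribute survives in the remaining features is to try to predict it back from them. Below is a toy sketch of the voting-eligibility example above, with made-up data and column names; if the held-out accuracy is near perfect, the protected attribute is still redundantly encoded.

```python
# Toy sketch (hypothetical data): check whether a "dropped" protected
# attribute is still recoverable from the features we kept.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.DataFrame({
    "times_eligible_to_vote": [3, 0, 5, 0, 2, 0, 4, 0],
    "age":                    [41, 38, 62, 55, 35, 29, 50, 44],
    "gender":                 ["m", "f", "m", "f", "m", "f", "m", "f"],
})

X = df[["times_eligible_to_vote", "age"]]   # features we kept
y = df["gender"]                            # protected attribute we "removed"

# Accuracy near 1.0 means the attribute is redundantly encoded elsewhere.
leakage = cross_val_score(LogisticRegression(), X, y, cv=2).mean()
print(f"protected attribute recoverable with accuracy ~{leakage:.2f}")
```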
To achieve fairness, it’s thus important to acknowledge that it’s not merely a matter of ensuring the process is fair, but also a matter of evaluating the algorithm on metrics of fairness. This is the basis of the two notions of algorithmic fairness that follow.
Algorithmic performance equality
The second school of thought argues that fairness should be determined in terms of the similarity of model performance across protected categories. The literature can be a little daunting here as the differences between specific definitions are subtle. For example, “treatment equality” focuses on achieving similar ratios of false negatives to false positives across protected categories, whereas “conditional procedure accuracy equality” focuses on achieving similar false positive or false negative rates across different categories. Generally, we’re interested in having comparable model prediction accuracy with respect to protected attributes; i.e., we should make as many mistakes on one protected group as on any other. The downside here is that we’re often trading off model performance for performance equality, and it’s not always clear how that tradeoff should be determined: Should we accept 5% more user churn for a 2% improvement in performance equality? Or is 3% in fraud loss prevention worth a 1% decrease in performance equality?
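To make this concrete, here’s a minimal sketch of one such check, comparing false positive rates across two groups on a small, made-up audit sample. In practice the group labels would come from a carefully access-controlled review set, not from production features.

```python
# Compare false positive rates across protected groups (toy data).
import numpy as np

def false_positive_rate(y_true, y_pred):
    """FPR = fraction of good users (y_true == 0) we wrongly flag."""
    good = (y_true == 0)
    return (y_pred[good] == 1).mean()

y_true = np.array([0, 0, 1, 0, 1, 0, 0, 1])   # 1 = fraudster, 0 = good user
y_pred = np.array([0, 1, 1, 0, 1, 1, 0, 1])   # model decision
group  = np.array(["a", "a", "a", "b", "b", "b", "b", "b"])  # audited group label

for g in np.unique(group):
    mask = (group == g)
    print(g, round(false_positive_rate(y_true[mask], y_pred[mask]), 2))
# prints roughly 0.5 for group a and 0.33 for group b: a gap worth investigating
```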
Statistical parity
The third notion, statistical parity (also called “demographic parity”), is achieved when the marginal distribution of the predicted values is similar conditioned on the protected attributes. It stems from the legal principle of “disparate impact,” where a practice is considered illegally discriminatory if it has a disproportionately adverse effect on members of a protected group. Mathematically, a model is considered fair if the distributions of the model’s predictions conditioned on protected attributes are close enough as determined by some measure of similarity. For classification problems, a reasonable measure is the Kullback-Leibler divergence between the model’s inferred probabilities conditioned on the protected features. A natural criticism of this notion of fairness is whether we really want equal outcomes across groups with respect to some protected attribute. Suppose, for the sake of argument, fraudsters like to pose as men on an online clothing store. Are we okay with declining the orders of female customers who pose no fraud risk just so that the same proportions of men’s and women’s orders are flagged? There’s also the argument for affirmative action, where unequal outcomes favoring disadvantaged groups are preferred to bridge past and existing inequalities, as has been argued, for instance, for under-served Chicago neighborhoods. To give another counterargument, there are companies that provide microloans to poor entrepreneurs, whose business models often specifically target members of historically disadvantaged groups.
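As a sketch of how such a parity check might look in practice, the snippet below bins model scores for two groups and computes a symmetrised Kullback-Leibler divergence between the binned distributions. The data, the 20-bin histogram, and the symmetrisation are purely illustrative choices.

```python
# Statistical parity check: compare score distributions across two groups.
import numpy as np
from scipy.stats import entropy

def score_histogram(scores, bins):
    hist, _ = np.histogram(scores, bins=bins)
    hist = hist + 1e-9              # smooth to avoid zero bins
    return hist / hist.sum()

rng = np.random.default_rng(0)
scores_a = rng.beta(2, 8, size=1000)   # synthetic model scores for group A
scores_b = rng.beta(2, 6, size=1000)   # synthetic model scores for group B

bins = np.linspace(0, 1, 21)
p, q = score_histogram(scores_a, bins), score_histogram(scores_b, bins)

# entropy(p, q) is the KL divergence of p from q; average both directions.
kl = 0.5 * (entropy(p, q) + entropy(q, p))
print(f"symmetrised KL divergence between group score distributions: {kl:.4f}")
```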
There is no one-size-fits-all notion of fairness. Each definition has its advantages and drawbacks, and the best (or, perhaps, least bad) choice depends on the problem context. Practically, even if we did decide on one, we may not have enough information to compute the fairness metric. It thus becomes important for us to adopt a practical concept of fairness that works in this context. In the same vein, we need to constantly reevaluate whether the definition is still relevant and adapt it to changing circumstances.
Making fairness operational
Algorithms help scale decision-making, but they make mistakes. In fraud, we want high recall and a low false positive rate. Recall is the fraction of fraudsters we correctly detect. The false positive rate is the fraction of good users that we misclassify as fraudsters. These are usually competing objectives. Trivially, we can simply block everyone in the system to maximize recall, but that would mean a 100% false positive rate. On the other extreme, if we don’t block anyone from using the system, we’d get a zero false positive rate but also zero recall. Only by building an accurate fraudster detector can we get both high recall and a low false positive rate. As in any large automated system, false negatives and false positives can be minimized but never completely eliminated. In the case of fraud, we must consider these two competing objectives alongside a definition of fairness that’s operationally practical.
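For concreteness, here’s a small sketch of how recall and false positive rate fall out of the confusion matrix, including the two degenerate strategies described above:

```python
# Recall and false positive rate from a confusion matrix (toy labels).
import numpy as np

def recall_and_fpr(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    recall = tp / (tp + fn)   # fraction of fraudsters we catch
    fpr = fp / (fp + tn)      # fraction of good users we wrongly block
    return recall, fpr

y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1])   # 1 = fraudster

print(recall_and_fpr(y_true, np.ones_like(y_true)))    # block everyone: (1.0, 1.0)
print(recall_and_fpr(y_true, np.zeros_like(y_true)))   # block no one:  (0.0, 0.0)
```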
Fairness in fraud means that a good user’s protected attributes won’t affect how likely they are to be mistakenly classified as a fraudster. To state the obvious, we’re not interested in fairness for fraudulent users because (1) they harm the overall system when they, say, deprive good users of service or directly harm other legitimate users, and (2) it’s unlikely that we can obtain accurate user information from fraudsters to begin with. We’re thus only interested in fairness for good users. To avoid algorithmic bias, one robust approach is to begin with process fairness, and then additionally audit changes to algorithms on held-out samples for disparate impact. It’s also essential to provide easy recourse for users who are flagged as false positives by counter-fraud algorithms.
That said, measuring an algorithm’s fairness quickly and accurately is not always feasible, and we have to make certain compromises in the pursuit of algorithmic fairness.
Measurement challenges
To even begin talking about measuring system performance, be it for general model performance or fairness, we need labels for fraudsters and, conversely, good users. There are, unfortunately, two limitations to getting good labels: delays in knowing whether a user is fraudulent, and the difficulty of reliably knowing a user’s protected attributes.
In consumer finance, banks allow their credit card users to file disputes against transactions that their users believe to be wrongly charged to their accounts. Consumers might want to file a dispute with their bank when, say, their credit card was stolen and used to purchase a ride or delivery or other in-person service. Credit disputes, or chargebacks, can be filed months after a transaction has occurred to protect a bank’s customers. As a consequence for counter-fraud teams, unfortunately, the delay between the actual fraudulent transaction and the victim company’s receipt of the dispute can be up to half a year or longer.
Chargebacks are a fairly precise indicator of a fraudulent or compromised account, with the only exceptions being friendly fraud and false positives by the bank’s detection system. While we rely on many signals to determine whether an account is fraudulent, chargebacks remain one of the strongest indicators of fraudulent behavior within our basket of counter-fraud signals. It’s thus difficult to establish a reliable “good user” label and metrics for algorithmic fairness in the short term. This challenge is especially pronounced when we need to quickly iterate on models and product features to respond to emerging fraud trends, which can drastically affect the user distribution. Worse, we have no prior information when we launch new products or enter entirely new markets. Further complicating the lag problem is the speed at which fraudsters evolve. We’ve seen fraudsters adapt their tactics in a matter of days to make themselves near-indistinguishable from good users, rendering non-chargeback indicators of fraud less reliable.
Recall the ‘Algorithmic performance equality’ section above, where we assess detection performance within different protected groups. In order to protect users’ privacy, we should avoid asking for or trying to infer these attributes, which makes it difficult to tell anything about users’ demographics. And even if for some reason we tried to, there’s no guarantee of a user’s truthfulness. For instance, a user may sign up with a throwaway email address registered to a Kryptonian domain when he really lives in Kansas. Users also provide pseudonyms that mask their true identities. If nothing else, it’s for the same reason I say “Howie” when my barista asks me for my name. 🙄 With the exception of service providers who go through vetting processes (e.g., large merchants on eBay, hosts on Airbnb), it’s hard to believe that users provide useful indicators of their protected attributes. And even in cases where we do have some sensitive information, as may be the case with merchants on marketplace apps or other in-person service providers, we should severely restrict access to that data internally.
From a security perspective, we should enforce strict use policies and security around sensitive attributes. Being lax with such data is a disaster waiting to happen. To measure fairness, one might suggest discovering users’ protected attributes or even purchasing them from third parties. But this would open the door to worse problems. Callously introducing such data into an enterprise system with thousands of internal users with different domain knowledge and use cases risks having these attributes “leak” into future models. Developers joining the company in the future may unknowingly pull this data by mistake and cause bias. For this reason and more, it’s best not to store, or even obtain, sensitive attributes.
So measuring fairness with a metric is difficult in practice. But we obviously can’t resign ourselves to maintaining a fair process and simply assume it ensures a fair outcome. Some proxy measure of the fairness of outcome must be incorporated into algorithmic development.
Learning fairly and sampling measures of fairness
While straightforward to define, we’ve seen how the fairness of outcome can be hard to measure. But solely relying on the fairness of process is insufficient to guarantee fairness. In the social media company example presented in the introduction, we saw that it’s easy to unfairly discriminate despite having a seemingly fair process — developers built the models to be “language-unaware” and everything seemed fine in development until the market launch. The oversight could have been prevented and was eventually corrected after data scientists examined the data and noticed the algorithm’s disparate impact on good users from the target market. Arguably, it would not have happened had the process included proper research. But incrementally discovering and arresting edge cases in existing processes is never-ending. Lapses can and will happen in spite of apparent process fairness. Without reference checks, it’s hard to guarantee outcome fairness.
One solution is to adopt a best-effort approach that focuses on the fairness of process while testing models at critical junctures of algorithmic development and periodically after new launches. In such tests, strict training procedures ensure that the outcomes are not dependent on sensitive attributes or combinations of them. There are also careful manual reviews of representative sample sets to estimate how well the model performs. These reviews can happen while “shadowing” the model (running it and logging results but not taking action) to see how it would perform given the existing system, and when A/B testing before launch. Here, the metric is the similarity of false positive rates across protected groups.
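A minimal sketch of what shadowing might look like is below; all function and field names are hypothetical. The live decision comes from the current model only, while the candidate model’s scores are logged and a small sample is routed to trained reviewers.

```python
# Shadow-mode sketch: score and log what a candidate model *would* do,
# without acting on it, and sample some decisions for manual review.
import json, random, time

SHADOW_SAMPLE_RATE = 0.05   # fraction of shadow decisions sent to manual review

def handle_request(user_features, current_model, shadow_model, log, review_queue):
    # The live decision still comes from the current model only.
    action = current_model(user_features)

    # The candidate model runs in shadow: scored and logged, never enforced.
    record = {
        "ts": time.time(),
        "features": user_features,
        "live_action": action,
        "shadow_score": shadow_model(user_features),
    }
    log.append(json.dumps(record))

    # A slice of shadow decisions goes to trained reviewers, who later help
    # estimate false positive rates across the groups they can identify.
    if random.random() < SHADOW_SAMPLE_RATE:
        review_queue.append(record)

    return action

# Toy usage with stand-in models.
log, queue = [], []
handle_request({"amount": 12.5}, lambda f: "allow", lambda f: 0.87, log, queue)
```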
To be more specific, manual reviews should be conducted by experts such as risk ops agents trained to be sensitive to these differences. In a typical review, detailed reports of why a user was a false positive help us manually verify flag rates on attributes that we intuitively think are correlated with protected features. These attributes include names, emails, telephone country codes, service request locations, etc. A data scientist would work through these reports with the agents responsible and determine if there are biases on protected attributes. For instance, if models are flagging a lot of users with a set of IP addresses associated with a neighborhood whose residents are predominantly from a historically disadvantaged group, is it because there’s a big crime syndicate operating out of there, or is it because the model is being unfair?
An ethical scientist must think about fairness in every stage of the algorithmic development process: training data mix generation, feature engineering, and model training, selection, and validation. Coupled with constant reevaluation of the development process, they should rely on a battery of manual reviews designed to weed out edge cases of algorithmic bias and augment the fair process approach.
Automating decisions
Now that we’ve outlined a strategy for algorithmic fairness, we’ll dive into some of the fair process considerations when we design a counter-fraud service. The decision-making part of such systems generally comprises business rules and machine learning models that use features that help discern fraudulent behavior. Once the system flags the most suspicious accounts, they are denied services until they prove that they’re indeed legitimate users by passing counter-fraud “challenges.”
Designing business rules
Fundamentally, business rules and models are mappings of features to binary decisions on whether users are fraudulent. The only difference is that business rule mappings are handcrafted by fraud analysts, whereas model mappings are optimized using machine learning tools like linear regressors, decision trees, and neural networks. Machine learning models allow us to make sense of the hundreds or thousands of features available and generate precise decisions at scale. But because of the labeling delay, it’s hard to train models to capture shorter-term trends. Business rules complement machine learning models by providing a way for developers to generate temporary defenses that stem emerging threats until the chargeback signals surface in the financials and make their way into the models.
Due to the simplicity of their feature mappings and their reliance on analysts — the same analysts trained to do the manual reviews that help discover patterns of unfair bias in machine learning models — business rules are less susceptible to algorithmic unfairness. The primary focus is on educating analysts on how to use and develop features, with the added buffer of having the same people constantly pore over the rules’ output.
In feature usage, we must take care that the signals target a specific pattern of how fraudsters defraud us. For instance, even if it turns out that for the past week every account with an Indian phone number was fraudulent, we have to investigate these accounts to tease out more indicators of fraudulent behavior. Simply blocking all Indian phone numbers unfairly penalizes bona fide users who might be traveling into a region we operate in. In feature development, it’s vital to avoid using signals that are highly correlated with protected attributes. Signals like geographic latitude and longitude are susceptible to unfairly targeting specific neighborhoods, which can correlate highly with protected attributes. Consider a high-tech crime gang that just so happens to be operating out of an affluent neighborhood where most good users have their own personal chauffeurs. Because the majority of services requested in that neighborhood would be by these fraudsters, we’d think that the neighborhood is really fraudulent and block good users who may one day decide to give their chauffeurs a day off. In business rules, make sure to develop and use features that target the specific ways that fraudsters exploit the platform, rather than ones that naively target the symptoms of their activities.
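As an illustration, here’s a hypothetical rule sketch (all field names and thresholds invented) that targets a card-testing behavior pattern rather than a phone prefix:

```python
# Hypothetical business rule: require several behavioral fraud indicators
# to fire together instead of blanket-blocking a phone country code.
from dataclasses import dataclass

@dataclass
class Account:
    phone_country_code: str
    cards_added_last_24h: int
    failed_card_auths_last_24h: int
    account_age_days: int

def flag_for_review(acct: Account) -> bool:
    # Targets a pattern of behavior (card testing on fresh accounts),
    # not a nationality or phone prefix.
    return (
        acct.cards_added_last_24h >= 3
        and acct.failed_card_auths_last_24h >= 5
        and acct.account_age_days <= 2
    )

# A traveling good user with an Indian number and normal behavior is untouched.
print(flag_for_review(Account("+91", 1, 0, 400)))   # False
print(flag_for_review(Account("+1", 4, 8, 1)))      # True
```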
Building machine learning models
As in business rules, machine learning models can be trained fairly by following good development guidelines. Generally, these guidelines pertain to ensuring good data usage and understanding on the developer’s part.
At a minimum, we need to make sure we’re not carelessly using protected attributes, directly or indirectly. If we are, we want to make sure that the outcome is verified to be fair with rigorous checks. Direct usage may result from merely having this data in an enterprise environment, as discussed above. Indirect usage is harder to detect and requires analysis. If we’re able to predict a protected attribute from the training set with high precision, then there’s a danger of indirect usage of that attribute. For example, a feature set that includes high-resolution geolocation, time of day, and type of card may predict low-income neighborhoods and correlate with race or ethnicity. To minimize this effect, it’s a good idea to apply smoothing and regularization. Smoothing the geolocation data prevents hard neighborhood borders, and smoothing the time of day minimizes penalizing “riskier hours” when fraud activity occurs with higher frequency. Regularization, as well as the size and richness of the feature set, helps ensure that no one feature dominates the outcome. In any case, this effect should be verified using feature scoring algorithms to find potentially “dominating” features. If models trained on those features are prone to biases, they require re-engineering or greater regularization.
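Below is a rough sketch of two of these mitigations with synthetic data: coarsening geolocation so exact neighborhood borders blur, and scanning feature importances for a single dominating feature worth auditing as a possible proxy. The grid size and importance threshold are illustrative assumptions.

```python
# Geolocation coarsening and a "dominating feature" audit (synthetic data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def coarsen_latlng(lat, lng, cell_deg=0.05):
    """Snap coordinates to a coarse grid (~5 km) instead of an exact location."""
    return round(lat / cell_deg) * cell_deg, round(lng / cell_deg) * cell_deg

print(coarsen_latlng(37.77493, -122.41942))   # roughly (37.75, -122.40)

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))                                # hypothetical features
y = (X[:, 0] + 0.1 * rng.normal(size=500) > 0).astype(int)   # feature 0 dominates

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
importances = model.feature_importances_
print("feature importances:", np.round(importances, 2))

# If one feature carries most of the weight, check whether it proxies a
# protected attribute before shipping; re-engineer or regularize if it does.
if importances.max() > 0.5:
    print(f"feature {importances.argmax()} dominates; audit it for proxy risk")
```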
Another problem that leads to unfairness is poor data coverage. This issue arises when using user-generated content (UGC), that is, any information entered by the user on an app or website. In general, UGC carries cultural information even when it’s relatively short, as with names. Fair algorithms that use UGC as input should be trained on large and diverse data. The required diversity may not, however, be present when training only on historical data. This was the case in the example discussed earlier, where a spam detection algorithm was trained on a dataset that wasn’t diverse enough and, when exposed to a new market and culture, made too many errors. Additional steps should be taken to prevent such situations, such as deeper upfront research or simply weathering financial losses until enough training data is available.
An important principle to follow is that model developers must have complete knowledge of the features used. This requires feature and model developers to either be the same people or work in close collaboration. Companies should avoid separating feature engineering and model training into “departments” that meet weekly or communicate via the management chain. Treating features as an abstract set of variables like {f1, f2, …} creates pitfalls for developers who don’t realize the impact of features on algorithmic biases. A good team of fraud model developers should implement or review every feature down to the production code and have lengthy discussions with its authors. We often go through several iterations for every feature until we’re satisfied with its performance and fairness. Even after that, we take care to continuously review model performance to guard against distribution drift in features.
Another guideline that’s important in the enterprise environment is model independence. Companies often have many models developed by many people, some of whom are no longer with the company. As a result, knowledge gets fragmented and partially lost. It is thus important to learn only from objective historical results such as chargebacks or unresolved debt, and never use predictions by other models as training targets. We don’t want to inherit unfair outcomes from models we don’t completely understand.
Note that a prior model that did or did not act on a particular user may affect new model training even if we don’t use the decision from the old model. That’s because the system might have taken an action on the user, which introduces bias. For instance, we ping the payments provider for the validity of a user’s credit card if we suspect a user of being fraudulent. That, in turn, gives us signals about whether the user’s card is valid. If the provider has a high chance of failing because of network issues that we’re unaware of, that might obfuscate the fact that the user is actually good but just happens to have an account with a bad provider. Another infamous example is a risk rating algorithm used by some courts, the output of which is used in sentencing and parole hearings. The algorithm was trained on both prior objective results (actual crime) and prior decisions (convictions, sentencing, etc.). Bias present in prior decisions is thus inherited by the model. To make things worse, the algorithm is a black box risk-assessment tool that isn’t fully understood by the court using it to inform its decisions. One way to mitigate these issues is to use control groups for all actions and keep some data “untouched” by models. This mechanism provides the data we later train on, with the benefit that it ignores decisions by prior models and is thus unaffected by system actions. Not doing this can result in a vicious cycle that reinforces unfair decisions.
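One lightweight way to implement such a holdout is a stable hash of the user ID, sketched below; the 1% rate and the function names are illustrative assumptions, not a prescription.

```python
# Control-group sketch: keep a stable slice of users "untouched" by automated
# actions so the resulting labels aren't biased by prior model decisions.
import hashlib

CONTROL_RATE = 0.01   # fraction of users never auto-actioned

def in_control_group(user_id: str) -> bool:
    # Stable assignment: the same user is always in (or out of) the holdout.
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < CONTROL_RATE * 10_000

def decide(user_id: str, model_says_block: bool) -> str:
    if in_control_group(user_id):
        return "log_only"            # observe outcomes, never block
    return "block" if model_says_block else "allow"

# Later, train only on objective outcomes (e.g., chargebacks) from the control
# slice, which no prior model decision has touched.
print(decide("user_123", True))
```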
Providing recourse
Even after taking all these steps to ensure our algorithms are fair, we inevitably slip up. To alleviate the pain that good users go through when flagged as false positives, we can design the product to err on the side of giving users as much opportunity as possible to prove that they’re legitimate. Users flagged for suspicious behavior can be asked to pass what we call “challenges” to prove that they’re, say, the rightful card owner. This concept is similar to how some products are protected by two-factor authentication and CAPTCHA (not an endorsement of these features; just the general idea). Credit card ownership challenges can involve providing known information, like a date of birth, or adding hard-to-fake information we don’t already have, like a photo of a driver’s license. In the worst case — perhaps because the user accidentally input the wrong data too many times and got blocked due to rate-limiting — this may require an appeal to customer support agents. That’s one of the more painful user experiences, one we try to avoid, and we actively work on challenges that provide alternative forms of recourse.
As an added benefit, challenges help improve decisions about a user’s trustworthiness. Someone who successfully passes the challenge also strengthens the system’s confidence in her identity. Conversely, multiple failed challenges make it more likely that a user is fraudulent, even if they later pass a new challenge. This way, these challenges can generate a training set for future business rules and machine learning algorithms.
Belatedly, why this post?
Algorithmic fairness is an old topic that has only recently started entering the public consciousness. Sensitive decisions are increasingly being made by artificially intelligent systems, and that’s no different for most of the apps people use every day. Unfortunately, most of the existing literature focuses on principles and measurement methods rather than the practicalities of operationalizing fairness. Dealing with algorithmic unfairness at scale in our messy world is hardly a weekend problem set, especially with evolving norms that spotlight social and cultural biases that weren’t apparent before. This post was born out of the authors’ hope that everyone from budding data scientists fresh out of grad school to experienced data science managers can gain some insight here and create better products.
Disclaimer: The views and opinions expressed in this article are those of the authors and do not necessarily reflect the official policy or position of any employer that either author has worked with or is working with. Examples highlighted in this article are only examples; they should not be treated as factual and are based only on very limited and dated information. Assumptions made within the analysis are not reflective of the position of any of our current or prior employers.
