Managing Account Risk in Cash Product at Brex

Xing Xiong
Brex Tech Blog · Feb 12, 2022

Brex’s mission is to reimagine financial systems so every growing company can realize their full potential. Once we launched the Cash product for customers, one of our top priorities was to manage risk in a way that scales. Instead of a selective, zero-risk-tolerance approach, we wanted to ensure that any business that doesn’t pose a regulatory, compliance, or fraud risk can use our product.

In this article, we discuss how we manage account risk for the Cash product at Brex. Specifically, we focus on some key challenges that are universal across the fraud detection space, including:

  • Minimal labels: We don’t have a lot of fraud labels, so how can we acquire more labels and make the best use of existing labels?
  • Model interpretability: How can we improve the explainability of the fraud models instead of treating every decision as a black box?

What is customer account fraud?

While there are several broad categories of fraudulent activities, account fraud is one of the most common. Account fraud occurs when a fraudster uses a synthetic or impersonated identity to apply for an account with the intention of defrauding Brex.

Within Brex, detection of account fraud includes, but is not limited to, the following workstreams:

  • New accounts: How risky are these applicants based on the information we have available in the application?
  • Existing accounts: How risky are the customers we have onboarded to Brex based on the actions they’ve taken?

We will be focusing on account risk for new accounts here.

What are the key challenges?

Our idea was to build a machine learning model to mitigate customer account risk and help Brex scale. We also needed to provide a smooth onboarding experience for our legitimate customers, necessitating a model accurate enough to use in automated decisions. There were several key challenges:

A small sample size of fraud applications

Historically, new account onboarding was largely manual; almost all applications involved human review. Most risky applications were rejected during the manual check, though definitively confirming fraud was typically not possible. Due to this selective, zero-tolerance approach, the number of confirmed fraud cases we had seen was quite limited. While class imbalance is a common challenge in fraud modeling, the fraud class was especially rare in our case. The model couldn’t perform well when trained on only labeled data, because this data didn’t reflect the actual data the model would encounter in production.

Interpretability

Fraud models are generally quite complex, and the business side tends to lack visibility into how the model decides each application. Introducing explainability to the fraud models is crucial for understanding the model’s predictions and providing more context to the business.

How are we addressing the challenges?

1). Human annotation and active learning

To tackle the small sample size issue, we had to think about the best way to use existing labels. As previously mentioned, we had reviewed numerous applications, so we could create pseudo-labels by reviewing them again specifically for fraud. While we had plenty of previously reviewed applications, manual labeling is a time-consuming and expensive process. Therefore, we introduced active learning to help us annotate data and achieve better model performance more quickly and efficiently.

Active learning is a technique to prioritize the labeling of data that has the highest impact on model training. For binary classification, the most impactful data points to label are those near the decision boundary.

Once we decided to implement human annotation with active learning, we worked with the Operations team to manually review past applications (both rejected and approved) and add detailed labels. We included approved applications to track annotation accuracy. We collected labels for each application and as much information about the accounts as possible. Then we iterated by retraining the model and selecting the next round of unlabeled data for annotation.

Active Learning Process

For the first iteration, we built a simple binary classification model trained on all fraud data points and a downsampled set of non-fraud applications. We used the CatBoost algorithm throughout because many features were categorical strings and CatBoost doesn’t require pre-processing for them. We then used the model to score all rejected applications and, based on margin sampling, selected 80 of them for the annotation pool. The margin sampling criterion we used is sketched below: in short, we prioritized the data points the current model was least confident about between the fraud and non-fraud classes.
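In the binary case, the margin of an application x can be written as |P(fraud | x) − P(non-fraud | x)|, and we pick the applications with the smallest margins, i.e., the ones closest to the decision boundary. A minimal sketch of that selection step, assuming a trained CatBoostClassifier (the function name and default batch size are illustrative, not Brex’s actual code):

```python
import numpy as np
from catboost import CatBoostClassifier

def select_by_margin_sampling(model: CatBoostClassifier, X_unlabeled, n: int = 80):
    """Return the indices of the n applications the model is least confident about."""
    proba = model.predict_proba(X_unlabeled)      # shape: (n_samples, 2)
    margin = np.abs(proba[:, 1] - proba[:, 0])    # |P(fraud) - P(non-fraud)| per application
    return np.argsort(margin)[:n]                 # smallest margins sit nearest the boundary
```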

We also selected 20 approved applications for the pool. After manual review, we added the newly labeled data to the existing training set and retrained the model. We then applied this model to the remaining rejected applications and chose a new set of 80 for the next round of annotation using the same margin sampling criterion. We repeated this process eight times to collect 640 fraud labels in total.
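Putting the pieces together, the overall loop looks roughly like the sketch below. It omits the 20 approved applications added each round, and annotate_fn stands in for the Operations team’s manual review; this is a hypothetical sketch rather than Brex’s actual code:

```python
import numpy as np
import pandas as pd
from catboost import CatBoostClassifier

def run_annotation_rounds(labeled_X, labeled_y, rejected_X, cat_features, annotate_fn,
                          n_rounds=8, batch_size=80):
    """Alternate between retraining and sending the least-confident batch for review.

    annotate_fn represents the manual review step: it takes a batch of
    applications and returns their fraud / non-fraud labels.
    """
    remaining = np.arange(len(rejected_X))
    model = None
    for _ in range(n_rounds):
        model = CatBoostClassifier(iterations=300, verbose=False)
        model.fit(labeled_X, labeled_y, cat_features=cat_features)

        # Margin sampling: smallest |P(fraud) - P(non-fraud)| goes to the annotation pool.
        proba = model.predict_proba(rejected_X.iloc[remaining])
        margin = np.abs(proba[:, 1] - proba[:, 0])
        batch = remaining[np.argsort(margin)[:batch_size]]

        # Fold the newly reviewed applications into the training set.
        new_X = rejected_X.iloc[batch]
        new_y = np.asarray(annotate_fn(new_X))
        labeled_X = pd.concat([labeled_X, new_X])
        labeled_y = np.concatenate([np.asarray(labeled_y), new_y])
        remaining = np.setdiff1d(remaining, batch)
    return model
```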

Performance

While we were using active learning for annotation, the lack of labels meant we couldn’t yet evaluate its benefits. Now that we have more data and labels, we can backtest by simulating the annotation process. We found that, with active learning, the model achieves the same performance with fewer labeled accounts.
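One way to run such a backtest is to replay the selection process on data that is now fully labeled and compare margin sampling against random selection at the same labeling budget. A rough, hypothetical sketch of that idea:

```python
import numpy as np
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score

def simulate_annotation(X_pool, y_pool, X_test, y_test, cat_features,
                        strategy="margin", n_rounds=8, batch_size=80, seed_size=100):
    """Replay label collection with a given selection strategy; report test AUC per round."""
    rng = np.random.default_rng(0)
    labeled = rng.choice(len(X_pool), size=seed_size, replace=False).tolist()
    aucs = []
    for _ in range(n_rounds):
        model = CatBoostClassifier(iterations=300, verbose=False)
        model.fit(X_pool.iloc[labeled], y_pool[labeled], cat_features=cat_features)
        aucs.append(roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

        unlabeled = np.setdiff1d(np.arange(len(X_pool)), labeled)
        if strategy == "margin":
            proba = model.predict_proba(X_pool.iloc[unlabeled])
            batch = unlabeled[np.argsort(np.abs(proba[:, 1] - proba[:, 0]))[:batch_size]]
        else:  # random selection baseline
            batch = rng.choice(unlabeled, size=batch_size, replace=False)
        labeled.extend(batch.tolist())
    return aucs

# Comparing the two curves shows how many labels each strategy needs for a given AUC:
# auc_margin = simulate_annotation(..., strategy="margin")
# auc_random = simulate_annotation(..., strategy="random")
```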

2). Two-layer explainable classification model

During the annotation process described above, we asked the Operations team not only to label each application as fraudulent or not but also to evaluate four components of each application:

  • Applicant identity: whether the applicant is a natural person and is not using a stolen or synthetic identity
  • Applicant profile quality
  • Company profile quality
  • Applicant linkage: whether the applicant belongs to the company and is authorized to file the Brex Cash application on its behalf

For each of these four components, reviewers could choose from several fixed options and leave additional comments. Here is an example of evaluating applicant identity.

A screenshot of the annotation process

These options always map to a negative, neutral, or positive evaluation of that component, so we also obtained labels for each component. To build a more explainable model, we constructed a two-layer model system: in the first layer, we trained a multi-class classification model for each of the four components (applicant identity, applicant profile quality, company profile quality, and applicant linkage); in the second layer, we gathered the outputs of the four sub-models and built a binary classification model to predict fraudulent applications.

Two-layer model system

Details of the two-layer model

Each model in the first layer is a multi-class classification model. We also performed feature selection based on domain knowledge to ensure each feature actually informs its specific component. For example, we scan applicants’ IDs and receive confidence scores from a data vendor; those scores contribute to the applicant identity component but have limited correlation with the applicant linkage component.

In terms of how we constructed the second layer of the model system, we used the Super Learner framework [1] as a template. The original paper suggests that an ensemble of algorithms can perform asymptotically as well as or better than the best single algorithm in the ensemble. In our case, instead of combining different algorithms, we used the CatBoost model for every first-layer sub-model, each predicting a different class variable.

CatBoost model to predict applicant identity
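As a rough, hypothetical sketch of one such first-layer sub-model (feature names like id_vendor_confidence and the annotated_df DataFrame are illustrative assumptions, not Brex’s actual code), the applicant identity model could look like this:

```python
from catboost import CatBoostClassifier

# Hypothetical, domain-selected features for the applicant-identity sub-model;
# annotated_df is assumed to be a pandas DataFrame of reviewed applications.
IDENTITY_FEATURES = ["id_vendor_confidence", "id_document_type", "selfie_match_score"]

identity_model = CatBoostClassifier(iterations=300, loss_function="MultiClass", verbose=False)
identity_model.fit(
    annotated_df[IDENTITY_FEATURES],
    annotated_df["applicant_identity_label"],  # one of "negative" / "neutral" / "positive"
    cat_features=["id_document_type"],
)

# Per-class probabilities feed into the second-layer model as
# applicant_identity_negative / _neutral / _positive.
identity_probs = identity_model.predict_proba(annotated_df[IDENTITY_FEATURES])
```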

How the model becomes explainable

Before fitting the binary classification model in the second layer, consider the example below (not actual data) of the first-layer sub-models’ output for a single application.

Output for a single application from sub-models in the first layer

applicant_identity_negative, applicant_identity_neutral, and applicant_identity_positive are the outputs of the multi-class model for applicant identity. Because each set of three features sums to 1 and is therefore perfectly correlated, we removed all *_neutral features from the dataset and then fit a logistic regression on the rest. The final output is a predicted score for the fraud class.

The result of the logistic regression model is intuitive to interpret. We found that all coefficients of *_negative features were positive, and all coefficients of *_positive features were negative. Also, because all features are on the same (0, 1) scale, a coefficient with a larger magnitude indicates a more important feature. We found applicant profile quality and company profile quality to be the two most important components for evaluating applications. With this model structure, if an application gets rejected by the model, we roughly know how the model made the decision and can locate the specific sub-model to investigate the reason.
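A minimal sketch of that second layer, assuming the first-layer probabilities have been collected into a hypothetical stacked_df DataFrame (only the applicant_identity_* column names appear above; the rest follow the same pattern as assumptions):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# First-layer probabilities per application; the *_neutral columns are dropped
# because each component's three probabilities sum to 1.
SECOND_LAYER_FEATURES = [
    "applicant_identity_negative", "applicant_identity_positive",
    "applicant_profile_negative", "applicant_profile_positive",
    "company_profile_negative", "company_profile_positive",
    "applicant_linkage_negative", "applicant_linkage_positive",
]

second_layer = LogisticRegression(max_iter=1000)
second_layer.fit(stacked_df[SECOND_LAYER_FEATURES], stacked_df["is_fraud"])

# Because all features live on the same (0, 1) scale, coefficient magnitude is a
# rough proxy for component importance; signs should be positive for *_negative
# features and negative for *_positive features.
coefficients = pd.Series(second_layer.coef_[0], index=SECOND_LAYER_FEATURES)
print(coefficients.sort_values(key=abs, ascending=False))
```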

Future Directions

We hope our description of this version-zero account risk model helps others facing similar challenges. There are, however, several areas we can improve. First, we can cross-review each application with multiple annotators to improve annotation accuracy; manual reviews can’t recognize every type of account fraud, though, so we must still accept some false negatives. Second, to refresh the two-layer model on new data, we need to annotate the four components of some applications again. Stepping back from this minimal-label issue, an alternative may be unsupervised learning: given the large number of non-fraud samples we have, we can build an anomaly detection model to defend against fraudsters.
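As a generic illustration of that last idea (not Brex’s implementation), an unsupervised baseline could fit an isolation forest on the abundant non-fraud applications and route strong outliers to manual review; non_fraud_df, new_apps_df, and NUMERIC_FEATURES below are placeholders:

```python
from sklearn.ensemble import IsolationForest

# Train only on applications believed to be legitimate; categorical features
# would need encoding first since IsolationForest expects numeric input.
detector = IsolationForest(n_estimators=200, contamination="auto", random_state=0)
detector.fit(non_fraud_df[NUMERIC_FEATURES])

# Negative decision-function scores indicate likely anomalies; route them to review.
scores = detector.decision_function(new_apps_df[NUMERIC_FEATURES])
flag_for_review = scores < 0
```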

References

  1. van der Laan, Mark J., Eric C. Polley, and Alan E. Hubbard. “Super Learner.” Statistical Applications in Genetics and Molecular Biology, vol. 6, no. 1, 2007. https://doi.org/10.2202/1544-6115.1309

_______________________________________________________________

Building the annotation and modeling systems was a huge team effort. Special thanks to Tony Ren and Emily McIlvaine for heavily contributing to the annotation system, and to the Risk Operations team for carrying out the human annotation.
