How We Built a (Mostly) Automated System to Solve Credit Card Merchant Classification

Bryant Chen
Brex Tech Blog · Dec 21, 2021 · 10 min read

If you’ve ever looked at your personal credit card transactions, you may have noticed that a lot of merchants are listed with obscure names. An Amazon purchase might say, “AMZN Mktp CA,” or a purchase at Little Skillet in San Francisco might read, “TST* LITTLE SKIL”.

In some cases, the merchant may not even be recognizable. Category descriptions often aren’t any better. On my personal credit card, a purchase from a restaurant called “Berkeley Social Club” was categorized as a gym, and a refund from a hotel was categorized as a credit card payment by my financial aggregator.

While not having clear merchant descriptions for personal charges is an inconvenience, it becomes more problematic at a business level. A large company can have thousands of transactions per day, and not being able to easily identify and categorize spending can make it hard for businesses to understand their cash flow. It also makes it much more difficult for them to quickly and accurately close their books.

Brex’s mission is to reimagine financial services so that every growing business can realize its full potential. To create a leading corporate card, we knew that we needed to do better than previous cards by providing our customers with clear “doing business as” (DBA) names and websites, as well as accurate categorization that helps businesses track and understand their expenses. (See below for a screenshot showing how we enhanced the merchant descriptor from MasterCard, TST* LITTLE SKIL, with the correct name, category, and website.)

Brex transactions dashboard

To achieve this when we first launched, our amazing and diligent support team would actually manually Google every single new merchant that showed up in Brex transactions to identify its DBA name and website and provide an accurate category. Of course, this wasn’t scalable, and once we started seeing thousands of transactions daily, we needed to find a better solution. As one of our early projects, the Data Science team at Brex automated this task of identifying and categorizing merchants. We did this using a mixture of Google Places, machine learning, and Amazon Mechanical Turk. Since our support team referred to this task as “Merchant Cat,” we called our system “Auto Merchant Cat” and plastered our presentations, documentation, and tooling with images of cats selling things.

Here’s what Auto Merchant Cat looks like, and how we created it as a scalable solution.

This is Mina. She steals pens from Ian Finneran of the Brex Data Science team and sells them back in exchange for cat treats.

Auto Merchant Cat’s Architecture

Auto Merchant Cat has two key steps: identification and categorization. Identification ensures that we have the DBA name and merchant website. Categorization maps this merchant to one of 48 Brex categories that are more amenable to business accounting. Only merchants that have been successfully identified can move on to categorization.

Automated identification uses two components, Google Places and Mechanical Turk, with the more expensive Mechanical Turk component only used if the former fails to identify the DBA name. Each component has acceptance criteria that must pass before the results are used. If neither component passes acceptance criteria, then the merchant enters the support queue for manual identification and classification.

If identified by Google Places, the merchant is submitted to the merchant cat classifier, which takes the MCC (Merchant Category Code) and a set of tags (place types) provided by Google Places as input and outputs a prediction for the category. If the categorization passes acceptance criteria, then both tasks have been completed and the results are sent to a production service that stores and maintains the data.

If Google Places does not identify the merchant, it is sent to the Mechanical Turk component where workers registered with Mechanical Turk are used to identify the merchant. Once identified, a separate classifier that takes the DBA name and MCC code as input attempts to classify it. If the classifier fails to provide a result that passes acceptance criteria, then the identified merchant is also sent to Mechanical Turk for classification. As with identification, if Mechanical Turk results fail to pass acceptance criteria, then the merchant is sent to Support for manual classification.

Lastly, a small random sample of merchants that are both identified and classified by Auto Merchant Cat are still sent to Support for quality assurance.
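The routing described above can be sketched in plain Python. Everything here is a hypothetical stand-in for Brex’s actual services: the callables are placeholders, the QA sampling rate is invented, and the single `classify` function glosses over the fact that the two production classifiers take different inputs depending on which component identified the merchant.

```python
import random

QA_SAMPLE_RATE = 0.02  # hypothetical fraction of automated results sent to QA

def process_merchant(descriptor, identify_with_places, identify_with_turk,
                     classify, classify_with_turk, send_to_support):
    """Route one merchant through identification, then categorization."""
    # Identification: Google Places first, Mechanical Turk as fallback.
    merchant = identify_with_places(descriptor)
    if merchant is None:
        merchant = identify_with_turk(descriptor)
    if merchant is None:
        # Neither component passed acceptance criteria.
        return send_to_support(descriptor)

    # Categorization: in-house classifier first, Mechanical Turk as fallback.
    category = classify(merchant)
    if category is None:
        category = classify_with_turk(merchant)
    if category is None:
        return send_to_support(merchant)

    # A small random sample of fully automated results still goes to QA.
    if random.random() < QA_SAMPLE_RATE:
        send_to_support((merchant, category))
    return merchant, category
```

Each fallback only fires when the cheaper step before it fails, which is what keeps the expensive Mechanical Turk and Support stages small.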

The entire architecture is orchestrated with a series of Airflow tasks that ingest, process, and reroute incoming transaction data every twelve hours.

Research & Evaluation Process

While only one component of the final Auto Merchant Cat system technically uses machine learning, training, validation, and testing principles of ML were applicable throughout the process. For example, we used “training” / validation sets to iterate on and/or determine the acceptance criteria to use for each component in Auto Merchant Cat. Similarly, “training” sets were used to determine how to construct and improve search strings for Google Places and how to obtain the best performance from Mechanical Turk workers. Lastly, a separate test set was used to estimate performance for the final system and ongoing QA to ensure that the production system continues to perform at the expected level. These training and test sets were obtained from the merchants that Support manually categorized.

Seaweed is the reason why Xing Xiong of the Brex Data Science team can’t find his remote. He’ll give it to anyone who lets him sit on their lap.

Using Google Places to Identify Merchants

Google Places is essentially an API for a Google Maps search. Auto Merchant Cat first cleanses the merchant descriptor provided by the network (e.g. MasterCard) by removing references to card processors (e.g. TST*, SQ*, WPY*, etc.). It then queries Google Places with the cleansed merchant descriptor concatenated with the city provided by the network. Google Places returns a list of places ordered by relevance according to Google, and Auto Merchant Cat eliminates any place from that list whose name is not similar enough to the merchant descriptor or whose city differs from the network-provided city. If no places remain, then the merchant is sent to Mechanical Turk for identification. Otherwise, the most relevant remaining place is selected as a match and the place type tags are used by the Merchant Cat classifier.
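A rough sketch of this cleanse-query-filter flow, assuming a relevance-ordered list of candidate places has already been fetched from the API. The prefix list, the similarity measure, and the threshold are all illustrative, not Brex’s actual acceptance criteria.

```python
from difflib import SequenceMatcher

# Hypothetical card-processor prefixes; the real list is larger.
PROCESSOR_PREFIXES = ["TST*", "SQ*", "WPY*"]

def cleanse_descriptor(descriptor):
    """Strip card-processor references from a network merchant descriptor."""
    for prefix in PROCESSOR_PREFIXES:
        if descriptor.upper().startswith(prefix):
            descriptor = descriptor[len(prefix):]
    return descriptor.strip()

def name_similarity(a, b):
    """Rough string similarity in [0, 1]; a stand-in for the real criterion."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_place(descriptor, city, places, min_similarity=0.5):
    """Pick the most relevant place passing the name and city criteria.

    `places` is a relevance-ordered list of dicts, as a Places query might
    return. Returns None when nothing passes, i.e. fall back to Turk.
    """
    query = cleanse_descriptor(descriptor)
    for place in places:  # already ordered by relevance
        if place["city"].lower() != city.lower():
            continue
        if name_similarity(query, place["name"]) >= min_similarity:
            return place
    return None
```

Because the list is ordered by relevance, the first candidate that survives both filters is taken as the match.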

Using Machine Learning to Classify Merchants

As shown in Figure 1, we have two merchant cat classifiers in place. Classifier 1 classifies merchants identified by Google Places, while classifier 2 takes input from Mechanical Turk. Both classifiers take these two features:

  • Identified merchant name: either by Google Places or Mechanical Turk
  • MCC: Merchant Category Code, a four-digit number provided by the network used to classify a business (often unreliable)

Classifier 1 additionally takes an input returned by Google Places: the list of categories (place types) in the place details.

Feature Preprocessing

Google Place Types

Since Google Places returns place types as a list of categories, we use a multi-label binarizer to encode the feature.

Merchant Names

We tested two ways to encode merchant names:

  1. Count vectorizer (token counts): a representation of text that describes the occurrence of words within a document
  2. Transfer learning (sentence embedding): we used InferSent (paper: GloVe/fastText embedding + Bi-LSTM + max-pooling), a pre-trained sentence embedding model developed by Facebook. The potential benefits of pre-trained sentence embeddings over word counts are: 1) they capture the meaning of synonyms that are rarely or never seen in the training data; 2) as a sequence-model-based sentence encoder, InferSent also retains word order in a sentence or phrase (e.g., the focus in a name like “Civil Eats Daily” should be on “Daily,” which indicates a newspaper or magazine, rather than “Eats,” which more likely indicates a restaurant).

MCC

As the MCC is a categorical feature, it’s processed with one-hot encoding.
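Taken together, the three preprocessing steps can be sketched in plain Python, mimicking scikit-learn’s MultiLabelBinarizer, CountVectorizer, and OneHotEncoder on illustrative data (the winning production model uses InferSent embeddings rather than token counts for names).

```python
from collections import Counter

def binarize(label_lists):
    """Multi-label binarizer: lists of tags -> 0/1 indicator rows."""
    vocab = sorted({l for labels in label_lists for l in labels})
    return vocab, [[int(v in labels) for v in vocab] for labels in label_lists]

def count_vectorize(texts):
    """Count vectorizer: texts -> token-count rows."""
    tokens = [t.lower().split() for t in texts]
    vocab = sorted({w for ts in tokens for w in ts})
    return vocab, [[Counter(ts)[v] for v in vocab] for ts in tokens]

def one_hot(values):
    """One-hot encoder for a single categorical feature such as the MCC."""
    vocab = sorted(set(values))
    return vocab, [[int(v == val) for v in vocab] for val in values]

# Assemble one feature row per merchant (illustrative data, not Brex's):
place_types = [["restaurant", "food"], ["gym", "health"]]
names = ["Little Skillet", "Berkeley Social Club"]
mccs = ["5812", "7997"]

_, x_types = binarize(place_types)
_, x_names = count_vectorize(names)
_, x_mcc = one_hot(mccs)
features = [t + n + m for t, n, m in zip(x_types, x_names, x_mcc)]
```

The concatenated rows are what a downstream classifier would consume.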

Model Training, Selection, and Acceptance Criteria

Considering the multi-class classification task at hand, we use accuracy and weighted average precision as the performance metrics for model selection.

We trained two types of models: random forest and multi-class logistic regression with an L2 penalty. Across all the proposed feature preprocessing approaches and model structures, the winner was the multi-class logistic regression with an L2 penalty, using InferSent to encode the identified merchant names. Both classifiers achieved decent overall accuracy: above 75% for classifier 1 and 70% for classifier 2.
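A minimal sketch of this model comparison, assuming scikit-learn and synthetic data in place of the real features and labels (the L2 penalty is scikit-learn’s default for LogisticRegression):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic multi-class data standing in for the real merchant features.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

models = {
    "logreg_l2": LogisticRegression(penalty="l2", max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
}
# Fit each candidate and score validation accuracy for model selection.
scores = {name: m.fit(X_tr, y_tr).score(X_val, y_val)
          for name, m in models.items()}
```

In practice, weighted average precision would be computed alongside accuracy before picking a winner.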

Lilac thinks she can trick Peter Gross, head of Credit Science at Brex, into paying for a teddy bear that’s “on sale” for $104.78.

To ensure the accuracy of the categories predicted by our classifiers, we set probability thresholds at both the overall and the category level and only accept predictions whose probability exceeds the threshold. For example, for each predicted category, we find the probability threshold such that precision among all accepted predictions (i.e., those with predicted probability >= the threshold) exceeds the desired precision.
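A sketch of that per-category threshold search over a validation set; the record format and the 0.50–0.99 candidate grid are assumptions, not the production criteria.

```python
def category_threshold(records, category, desired_precision, grid=None):
    """Find the lowest probability threshold for one predicted category
    such that precision among accepted predictions meets the target.

    `records` are (predicted_category, predicted_probability, true_category)
    tuples from a validation set. Returns None if no threshold works.
    """
    if grid is None:
        grid = [i / 100 for i in range(50, 100)]  # candidate thresholds
    for threshold in grid:
        accepted = [(p, t) for c, p, t in records
                    if c == category and p >= threshold]
        if not accepted:
            continue
        precision = sum(t == category for _, t in accepted) / len(accepted)
        if precision >= desired_precision:
            return threshold  # lowest threshold meeting the target
    return None
```

Choosing the lowest passing threshold keeps coverage as high as possible while still meeting the precision target.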

To account for data shift, models are automatically retrained on a monthly basis with the latest data, and all inferences use the most recently trained model. This recurring training is also triggered by an Airflow DAG.

Scalably Using Real People When Models Fail

For anyone who may be unfamiliar, Mechanical Turk is a crowdsourcing marketplace managed by AWS that makes it easier for individuals and businesses to outsource their processes and jobs to a distributed workforce who can perform these tasks virtually.

In our automated Merchant Cat system, we use Mechanical Turk for both merchant identification and categorization when Google Places or classifiers fail.

To ensure the quality of Mechanical Turk responses, we have a few controls in place. First, we provide detailed instructions on how to perform the tasks, including how to handle certain edge cases, and we also enforce specific tests and qualifications. Workers must first pass identification and/or classification tests that we designed in order to work on the respective tasks.

Over time, we observed performance of individual workers based on match rates and QA and issued a custom Brex qualification to the best workers. This qualification is required to continue working on the task, thereby limiting the workforce to our best performing and most trusted workers.

Last but not least, each task is assigned to at least two workers, and a bonus is offered to workers whose response matches another worker’s, to encourage high-quality work. If the first two responses do not agree with each other, an additional assignment is sent. We accept answers with at least two out of three votes as correct and send failed tasks to our support team.
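The two-out-of-three adjudication can be sketched as follows; the function name and lazy-iterator interface are illustrative, the point being that a third assignment is only “sent” when the first two responses disagree.

```python
from collections import Counter

def adjudicate(responses, max_workers=3):
    """Accept an answer once two workers agree; otherwise request another
    assignment, and fail the task after `max_workers` responses.

    `responses` is an iterator of worker answers, consumed lazily so a
    third assignment is only requested when the first two disagree.
    """
    seen = Counter()
    for i, answer in enumerate(responses, start=1):
        seen[answer] += 1
        if seen[answer] >= 2:
            return answer  # two matching votes: accepted
        if i >= max_workers:
            return None  # no agreement: send to Support
    return None
```

A `None` result is what routes the task to the support queue in the overall flow.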

The Mechanical Turk component of our system has since been used to quickly deploy a large, scalable, and affordable workforce to tackle a variety of tasks at Brex. For example, when lockdowns due to COVID-19 began, we used Mechanical Turk workers to determine how some of our e-commerce customers might be affected.

While Mechanical Turk has proven to be incredibly valuable at Brex, it took multiple iterations to develop the above policy and ensure quality. Task design, instructions, incentivizing good work, and identifying strong workers are all key to ensuring that the results are good.

Mechanical Turk Results

On average, our Auto Merchant Cat system processes ~2K net new merchants every day. About 10–20% are processed by hardcoded rules. Among the rest, Google Places is able to successfully identify about half of the merchants, while the other half is sent to Mechanical Turk. Our in-house classifiers are able to categorize over 50% of the volume (~60% of the merchants identified by Google Places, and slightly below 50% of merchants identified by Mechanical Turk). Still, 10–20% of merchants fail at either the identification or the categorization step and hence are sent to our support team for a manual update. The overall error rate of the Auto Merchant Cat system has been maintained well below 5%, mostly coming from certain ambiguous categories.

In June 2020, we also performed a third-party vendor evaluation for both merchant identification and categorization. On the 5K-merchant evaluation dataset we provided, the vendor covered less than 10%, compared to our 80% coverage rate at that time. The performance gap, together with some other considerations, stopped us from moving forward with them.

Conclusion

At Brex, we focus on shipping “Minimum Lovable Products” (MLPs) quickly so that we can collect the necessary data to learn, iterate, and perfect our work. Accurate merchant classification was an important driver in making our first corporate card for startups a loved product, but it wasn’t perfect on day one. Auto Merchant Cat is an example of how Data Science at Brex often takes early products and makes them scalable, robust, and accurate. As we continue to reimagine financial services through experimentation and iteration, Data Science plays an important role in taking learnings from early MLPs and shaping them into a more perfect end state.

Bryant’s cat, Valkryie, pretends to sell him toys before yelling PSYCH and taking them all for herself.

This article was co-authored by:

Bryant Chen is a staff data scientist at Brex. Prior to Brex, he was a member of the research staff at IBM. He holds a Ph.D. in computer science from UCLA and a B.A. in math and economics from the University of Chicago.

Daisy Qian is a senior data scientist at Brex. Her previous experience includes data science and quantitative research at American Express, FINRA, and Stevens Capital Management. She holds a M.S. in financial engineering from Cornell University and a B.A. in math and statistics from Peking University.

Swetha Revanur is a founding engineer of Hebbia.AI. Prior to that she was a machine learning engineer at Brex. She holds a B.S. in computer science from Stanford University.
