Using Real-time Log Data for Fraud Detection

Vamsi Chitters · Published in Brex Tech Blog · Apr 19, 2021 · 10 min read

TL;DR

Brex started with manual fraud detection processes that were accurate but very operationally heavy. As we better understood our customers’ needs and the business continued to scale, we knew we needed to build automation into these processes. This article does not purport to share a novel approach to managing log data, but aims to provide a helpful resource for other engineers tackling similar problems, as well as insight into the engineering culture here at Brex. Brex celebrates curious minds — we work with world-class tools and infrastructure (such as AWS, Snowflake, and Kubernetes) and are motivated by the goal of building an all-in-one financial solution that is truly empathetic to the needs of our customers.

Context

As a payments company, one of the challenges we face is transaction fraud, a kind of fraud where a bad actor sends an unauthorized payment from a customer’s account. This may happen if the bad actor manages to steal a customer’s private payment details, or worse, finds a way to gain control of their account.

Brex has made significant investments into transaction fraud prevention. From the very beginning, it has been a top priority to keep customer data and assets secure. As with most problems, we started with a manual workflow to review customer activity, which worked effectively, but as we scaled, we saw an opportunity to proactively optimize this process and further reduce operational inefficiencies. We went from manually reviewing all outgoing ACH and wire transactions — a process called “affirmative release” — to automatically detecting fraudulent transactions in real time.

This transformation was made possible by the development of a new fraud-prevention system that generates risk scores for ACH and wire transactions using a model trained on streaming log data. The end result is better fraud protection for our customers and a lower operational burden for Brex, allowing us to support even more growing businesses on our platform.

Our Journey

It all started with logs.

Logs are an extremely rich data type. But it can sometimes be hard to unlock their value, especially at scale. This is for two reasons: first, they are unstructured, and second, they require a fast, reliable transport mechanism.

At Brex, our first step towards extracting value from our log data was structuring it. We sought to impose order on our loglines, which were inconsistently formatted and sometimes contained dynamic values. This is what led us to implement schema-based structured logging across our codebase.

With schema-based structured logging in place, we could make more sense of our logs at a high-level. But our work was far from over. As we know, logs come in various forms, from system-level logs to application-level logs. Our next challenge was figuring out how to extract actionable insights from this massive stream of data.

The solution was to introduce a new construct called an Activity. Activities are a special kind of log (whose structure is also defined in Protobuf) that are used to represent significant events on the Brex platform, such as login attempts, card transactions, and password changes. In essence, they are a higher-level data type for representing user-space events that is easy to perform analytics on.
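
To make this concrete, here is a minimal sketch of what an Activity-style event might look like when emitted as a structured logline. The real Activities are defined in Protobuf; the field names below (activity_type, actor_id, and so on) are illustrative assumptions rather than Brex’s actual schema.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class Activity:
    """Illustrative stand-in for a Protobuf-defined Activity event."""
    activity_type: str   # e.g., "LOGIN_ATTEMPT", "CARD_TRANSACTION", "PASSWORD_CHANGE"
    actor_id: str        # the user or account that performed the action
    timestamp: float     # when the event occurred (epoch seconds)
    metadata: dict       # event-specific details (amount, IP address, ...)

def log_activity(activity: Activity) -> None:
    # One JSON object per line keeps the log machine-parseable, which is what
    # makes downstream analytics (and fraud scoring) tractable.
    print(json.dumps(asdict(activity)))

log_activity(Activity(
    activity_type="LOGIN_ATTEMPT",
    actor_id="user_123",
    timestamp=time.time(),
    metadata={"ip": "203.0.113.7", "success": True},
))
```

Because every Activity shares a predictable shape, downstream consumers can filter and aggregate on fields like activity_type without parsing free-form text.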

With the right structure in place, we turned our attention to the transport mechanism. How do we get this data to consumers reliably while meeting strict latency requirements? This is what led us to build the Log Streaming Pipeline.

Log Streaming Pipeline

Our main goal was to make log data accessible to any team at Brex via robust, reliable infrastructure that supports real-time streaming and, in theory, a wide range of backend sinks (Snowflake, S3, ElasticSearch, etc.).

More explicitly, the requirements were as follows:

  • Deliver 99.9% of logs (low lossiness) to consumers within seconds (low latency);
  • Support pluggable consumers/sinks (with the primary use case being Snowflake);
  • Store data entirely within our own cloud (consumers are responsible for the data they fetch out);
  • Cater to our current infrastructure setup where all workloads are run in Kubernetes clusters (facilitated by Amazon EKS); and
  • Archive data for long-term storage, restrict access based on the type of log, and audit all logs.

The high-level design of the final architecture can be seen here:

Aggregator

The aggregator is responsible for collecting application-level logs at scale and making them available to the transport (pipeline). There were numerous options we explored, as described below.

  1. Send logs directly from an application to the transport (Kafka). We discarded this option because it tightly couples applications to the logging backend, and logs would not be stored and consumed independently of the container lifecycle.
  2. Use a mutating admission controller to inject logging sidecars into a Pod that collect logs and forward them to the Kafka topic. Although valid, we discarded this option because a logging agent in a sidecar container can lead to higher resource consumption, and we didn’t need that level of configurability.
  3. Spin up a “Log Processor” application-level microservice that reads stdout/stderr logs (persisted for durability), ensures the logs are in the appropriate format, and publishes them to the appropriate Kafka topics. We discarded this option because of its complicated setup (we would have to make sure both pieces are highly available) without a significant benefit in return.
  4. Use a different custom logging driver for Docker (e.g., GELF). We discarded this option because Kubernetes does not support custom container-engine logging drivers.
  5. Leverage off-the-shelf technology like Grafana Loki or Splunk for our use case. These options were also discarded for a myriad of reasons, ranging from cost, to the lack of support for “pluggable backends”, to opinionated user interfaces.

In the end, we decided to deploy a node-level logging agent as a DaemonSet that runs on every node and is responsible for forwarding the logs. The advantages are that this requires no changes to the applications running on the node, centralizes logs from multiple applications behind a single logging agent deployment, and provides disk-level durability while isolating applications from the logging infra.

After prototyping both of the popular open-source options, Filebeat/Logstash and Fluentd (without going too deeply into the differences), we elected to go with the former, given our familiarity with these tools from a different part of our logging stack and the sufficient performance characteristics they provided. That being said, we plan to re-evaluate Fluentd as our throughput requirements and SLOs evolve over time.
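
For illustration only, here is roughly the shape of such a node-level agent deployment, sketched with the Kubernetes Python client (the image tag, names, and paths are assumptions; in practice this would live in a YAML manifest or Helm chart). The key ideas are that the agent runs on every node and mounts the host’s container log directory read-only.

```python
from kubernetes import client

def log_agent_daemonset() -> client.V1DaemonSet:
    """Sketch of a node-level logging agent DaemonSet (names and paths are illustrative)."""
    labels = {"app": "log-agent"}
    container = client.V1Container(
        name="filebeat",
        image="docker.elastic.co/beats/filebeat:7.10.0",   # assumed image/version
        volume_mounts=[client.V1VolumeMount(
            name="varlibdockercontainers",
            mount_path="/var/lib/docker/containers",
            read_only=True,                                 # the agent only reads log files
        )],
    )
    pod_spec = client.V1PodSpec(
        containers=[container],
        volumes=[client.V1Volume(
            name="varlibdockercontainers",
            host_path=client.V1HostPathVolumeSource(path="/var/lib/docker/containers"),
        )],
        tolerations=[client.V1Toleration(operator="Exists")],  # schedule onto every node
    )
    return client.V1DaemonSet(
        api_version="apps/v1",
        kind="DaemonSet",
        metadata=client.V1ObjectMeta(name="log-agent", labels=labels),
        spec=client.V1DaemonSetSpec(
            selector=client.V1LabelSelector(match_labels=labels),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels=labels),
                spec=pod_spec,
            ),
        ),
    )

# To apply: client.AppsV1Api().create_namespaced_daemon_set("logging", log_agent_daemonset())
```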

How does it work? Filebeat reads the Docker container log files on our cluster hosts and forwards the messages to Logstash (the log aggregator). Logstash then performs a series of transformations on each message: it removes irrelevant fields, scrubs sensitive customer data (if present), and adds metadata such as the service name (which comes from Kubernetes annotations). After the message is cleaned up, Logstash sends it to the transport. We needed some custom configuration to cater to our needs, but this setup has been working well in production.
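
The real transformations live in our Logstash pipeline configuration; the Python sketch below only illustrates the kind of per-message work described above (the field names, annotation key, and scrubbing rule are assumptions).

```python
import re

# Fields we never want to forward downstream (illustrative).
IRRELEVANT_FIELDS = {"stream", "docker", "beat"}
# Very rough pattern for card-number-like sequences (illustrative, not exhaustive).
PAN_PATTERN = re.compile(r"\b\d{13,16}\b")

def transform(message: dict, k8s_annotations: dict) -> dict:
    """Clean up one log message before it is handed to the transport."""
    # 1. Drop irrelevant fields.
    cleaned = {k: v for k, v in message.items() if k not in IRRELEVANT_FIELDS}
    # 2. Scrub sensitive customer data, if present.
    if isinstance(cleaned.get("message"), str):
        cleaned["message"] = PAN_PATTERN.sub("[REDACTED]", cleaned["message"])
    # 3. Add metadata derived from Kubernetes annotations, e.g. the service name.
    cleaned["service"] = k8s_annotations.get("brex.com/service-name", "unknown")
    return cleaned

print(transform(
    {"message": "payment failed for card 4111111111111111", "stream": "stdout"},
    {"brex.com/service-name": "payments"},
))
```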

Transport

The Transport component is responsible for reliably propagating the data it receives from the Aggregator to downstream consumers (such as Snowflake, ElasticSearch, S3, etc.). It must be capable of satisfying our latency and durability guarantees, while also scaling at a reasonable cost.

One of our main criteria was maintaining “at least once” delivery guarantees, precluding any data loss for consumers. We evaluated the following options across solution characteristics, operational characteristics, developer support features, and cost (based on our projected future throughput):

  1. Kinesis
  2. In-house Kafka cluster
  3. Amazon MSK

Although all of these options are valid and come with their own tradeoffs, we elected to go with Amazon MSK for our pipeline. Kafka/MSK was appealing due to: (1) more configuration options, (2) control over the retention period, and (3) high throughput guarantees. Although the maintenance cost was higher (still very reasonable at our volume), MSK offloaded a lot of the operational work. Moreover, with guidance from our Amazon MSK customer representative, we were able to fine-tune the MSK configuration parameters to meet our needs for the near future.
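
As a sketch of what “at least once” looks like in practice with Kafka (shown here with the kafka-python client; broker, topic, and group names are placeholders): the producer waits for acknowledgement from all in-sync replicas, and the consumer commits offsets only after processing a batch, so a crash results in redelivery rather than loss.

```python
from kafka import KafkaConsumer, KafkaProducer

BROKERS = ["msk-broker-1:9092"]    # placeholder MSK bootstrap servers
TOPIC = "activity-logs"            # placeholder topic name

# Producer side: acks="all" means a send succeeds only once every in-sync
# replica has the record, trading a little latency for durability.
producer = KafkaProducer(bootstrap_servers=BROKERS, acks="all", retries=5)
producer.send(TOPIC, value=b'{"activity_type": "LOGIN_ATTEMPT"}')
producer.flush()

def handle(value: bytes) -> None:
    """Stand-in for real processing (e.g., writing to a sink or scoring an event)."""
    print(value)

# Consumer side: auto-commit is disabled and offsets are committed only after a
# batch has been processed, so a crash causes redelivery rather than data loss.
consumer = KafkaConsumer(
    TOPIC,
    bootstrap_servers=BROKERS,
    group_id="fraud-model-engine",   # placeholder consumer group
    enable_auto_commit=False,
    auto_offset_reset="earliest",
)
while True:
    batch = consumer.poll(timeout_ms=1000)   # {TopicPartition: [records]}
    for records in batch.values():
        for record in records:
            handle(record.value)
    if batch:
        consumer.commit()                    # commit only after processing
```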

Connectors

We’ll keep this section brief. We relied on Kafka Connect to propagate data from our MSK Kafka topic to the Snowflake cluster. There were some fun hops along this journey, including configuring the connectors (buffer sizes, converters, etc.), exporting Prometheus metrics to define SLOs on the end-to-end pipeline, debugging SSL certificate issues, and more!
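
One lightweight check that helps when debugging this kind of setup is polling the Kafka Connect REST API for connector and task state. A sketch (the Connect worker URL and connector name are placeholders, not our actual configuration):

```python
import requests

CONNECT_URL = "http://kafka-connect:8083"   # placeholder Connect worker URL
CONNECTOR = "snowflake-activity-sink"       # placeholder connector name

def connector_health(connect_url: str, name: str) -> bool:
    """Return True if the connector and all of its tasks report RUNNING."""
    status = requests.get(f"{connect_url}/connectors/{name}/status", timeout=5).json()
    connector_ok = status["connector"]["state"] == "RUNNING"
    tasks_ok = all(task["state"] == "RUNNING" for task in status.get("tasks", []))
    return connector_ok and tasks_ok

if __name__ == "__main__":
    print("healthy" if connector_health(CONNECT_URL, CONNECTOR) else "degraded")
```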

Stress Testing

Now the system is set up, but does it work? We needed to know that the Pipeline could handle bursty data patterns without dropping messages. This was especially important given our initial use case — fraud detection — which involves making yes/no decisions based on the presence or absence of individual Activity logs. For this, we turned to stress testing our infrastructure to determine the scaling limits of the pipeline before the system behaved unexpectedly (e.g., component failures, exhausted disk space, or a spike in data lossiness).

We’ll begin by describing the setup:

Log Producer: a Kubernetes deployment we spun up to generate a deterministic, synthetic load of loglines, in order to measure what percentage of them reach the Logging Pipeline’s downstream consumer and how long they take to arrive.
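
The exact format isn’t reproduced here, but as a minimal sketch (field names like exp_id, total, and emitted_at are assumptions, chosen to line up with the lossiness and latency metrics described below), the producer might emit loglines like this:

```python
import json
import time
import uuid

def make_logline(exp_id: str, seq: int, total: int) -> str:
    """Build one synthetic, structured logline for a stress-test run.

    The consumer uses exp_id and total to compute lossiness, and emitted_at
    to compute end-to-end latency. All field names are illustrative.
    """
    return json.dumps({
        "exp_id": exp_id,              # identifies this stress-test run
        "seq": seq,                    # position of this event in the run
        "total": total,                # how many events the run will emit
        "emitted_at": time.time(),     # producer-side timestamp (epoch seconds)
        "payload": uuid.uuid4().hex,   # filler to control message size
    })

if __name__ == "__main__":
    exp_id = uuid.uuid4().hex[:8]
    total = 1000
    for seq in range(total):
        print(make_logline(exp_id, seq, total))  # stdout is picked up by the node agent
        time.sleep(0.001)                        # ~1 ms delay between events
```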

Log Consumer: a simple consumer service that used a Kafka client library to poll for new events published to the MSK topic by the logging pipeline. We captured the following metrics on the consumer end (a code sketch follows the list):

  • Lossiness: the consumer emitted a Gauge for {# of events received so far per exp_id / max count expected}.
  • Latency: the consumer kept track of average latency (based on message published timestamp) via a Histogram metric.
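
Here is a minimal sketch of those two metrics using the Prometheus Python client (metric names, labels, and event fields are illustrative and mirror the producer sketch above):

```python
import json
import time

from prometheus_client import Gauge, Histogram, start_http_server

# Fraction of expected events seen so far, per stress-test run (lossiness).
RECEIVED_RATIO = Gauge(
    "logtest_received_ratio", "events received / events expected", ["exp_id"]
)
# End-to-end latency from producer emit time to consumer receive time.
LATENCY = Histogram("logtest_latency_seconds", "end-to-end delivery latency")

received_per_exp = {}  # exp_id -> count of events seen so far

def record_event(raw: bytes) -> None:
    """Update the lossiness and latency metrics for one consumed event."""
    event = json.loads(raw)
    exp_id, total = event["exp_id"], event["total"]
    received_per_exp[exp_id] = received_per_exp.get(exp_id, 0) + 1
    RECEIVED_RATIO.labels(exp_id=exp_id).set(received_per_exp[exp_id] / total)
    LATENCY.observe(time.time() - event["emitted_at"])

start_http_server(8000)  # expose /metrics for Prometheus to scrape
```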

We conducted various flavors of load and chaos tests:

  • The first set of tests involved only a single producer pod and “batches” of size 1, i.e., logs emitted serially;
  • The second set of tests involved only a single producer pod and “batches” of size N, i.e., parallel emission of logs;
  • The third set of tests involved multiple producer pods (1 -> 64 pods) and “batches” of size N, i.e., parallel emission of logs; and
  • The final set of tests involved chaos testing some of the components (e.g., terminating the single Filebeat pod associated with the log producer, terminating 1 Logstash pod, terminating N-1 Logstash pods, etc.).

Nearly all the tests above yielded a 100% success rate, but we did notice the pipeline was lossy under certain conditions when Filebeat pods were restarted (e.g., a test where we pumped in 600K events with 1 ms delay, killing Filebeat periodically). Why was this happening?

As part of debugging this issue, we looked at the Filebeat logs around that time frame and noticed a message, logged just before the new Filebeat pod started up, about a log file that was no longer present.

The suspicion was that this file was likely the one corresponding to the Filebeat instance that was terminated (log files are removed after a deletion). Continuing the investigation, we looked into the log rotation behavior (Kubernetes uses the JSON file log driver from Docker) to see if it was causing any issues. We identified that Docker container logs are rotated every 10 MB and up to 10 files are kept; when a file is rotated, it is renamed with a numeric suffix (e.g., *.log.1, *.log.2, and so on).

We found the bug! The Filebeat log path pattern only matched files ending in “.log”, thereby ignoring all of the rotated files described above. Once we fixed the configuration to pick up rotated files as well, we observed no lossiness when we reran the test.

This is only one example of how stress testing helped us achieve our durability goals. We won’t go into the details of the other levers we played around with (for example, enabling persistent queues in Logstash) to determine our final configuration.

Conclusion

The Log Streaming Pipeline has transformed fraud detection at Brex. As Brex started to onboard more and more customers, we quickly recognized that the recurring ETL process the Fraud team had been relying on was not scalable. As a result, we invested resources early to optimize this entire process and its underlying infrastructure to better protect our customers.

We ultimately succeeded in speeding up these alerts by rebuilding them on top of Activity data and streaming this data to Snowflake, our data warehouse, using the Log Streaming Pipeline. The difference was dramatic — it reduced our alerting latency to seconds. Today, we have taken this a step further by integrating the Log Streaming Pipeline with other consumers, such as our Fraud Model Engine, which generates risk scores for Activity events in real time. These scores are used by microservices to mitigate fraud as it happens.

We envision many more use cases for the Log Streaming Pipeline, including tracking business metrics and funnel conversions, or even deriving cross-service SLOs based on logs emitted by different services. We hope you enjoyed this blog post on the internals of how this system works!

Special thanks to the Observability team (in particular Sherwood Callaway, who helped co-author this post, and Thomas Cesare-Herriau), as well as the Cloud Infrastructure, Cash, Data, Fraud, and Trust teams for collaborating with us on this effort.

___________________________________________________________________

©2021 Brex Inc. “Brex” and the Brex logo are registered trademarks.

The Brex Mastercard® Corporate Credit Card is issued by Emigrant Bank, Member FDIC. Terms and conditions apply. See the Brex Platform Agreement for details.

Brex Treasury LLC is an affiliated, SEC-registered broker-dealer and member of FINRA and SIPC that provides Brex Cash, a program that allows customers to elect to sweep uninvested cash balances into certain money market mutual funds or FDIC-insured bank accounts at program banks. Investing in securities products involves risk, including possible loss of principal. Brex Treasury is not a bank and your Brex Cash account is not a bank account. Please see brex.com/cash for important legal disclosures.
