Optimising Real-Time Fraud Detection with Debezium, Redis, Apache Spark and Apache Kafka

Rishav Sarkar
Published in MeghGen · 10 min read · Jul 17, 2024

What’s This About?

In today’s digital world, spotting fraudulent activities in real-time is super important for businesses. With millions of transactions and user interactions happening every second, old-school rule-based systems just don’t cut it anymore. This blog post dives into how we built a cutting-edge fraud detection system using Debezium, Redis, Apache Spark, and Apache Kafka, and how we tackled the key challenges encountered along the way.

A top survey and market research platform was facing big challenges with data integrity due to sophisticated fraud tactics. Fraudsters were using all sorts of tricks to mess with surveys, like creating multiple fake accounts, using bots to automate responses, and hiding their identities with proxy servers.

The platform needed a system that could:

  • spot fraudulent users in real-time
  • handle large amounts of data efficiently
  • quickly adapt to new fraud patterns

The Challenges

1. Handling Huge Data Volumes

The fraud detection platform must manage a massive MySQL table, user_x, containing around 5 billion user activity records, with about 1,000 new records added every second. We tried improving performance by adding indexes, partitioning the tables, and optimising query logic. These efforts boosted performance by 40%, cutting query times from 10–15 seconds down to 5–10 seconds. Despite this improvement, it still fell short of our real-time needs, which require response times of under 100 milliseconds for effective fraud detection.

Real-time fraud detection also demands extremely low latency on data lookups from tables in MySQL. The user_y table contains approximately 1 billion records and the user_z table approximately 6 billion, and records from both are needed for the fraud detection logic to work. We tried indexing strategies and optimised queries for low-latency access. This reduced lookup times by 30%, but complex queries still averaged around 10 seconds, which was not fast enough.

2. Integrating Complex Data

The solution requires integrating data from multiple sources — MySQL (user_x table), MySQL (user_y and user_z tables), and PostgreSQL (good_u and good_u_prob model outputs).

We developed stored procedures in MySQL to handle the integration of user_x, user_y, and user_z within the MySQL ecosystem. These stored procedures were designed to perform complex queries and data transformations to facilitate fraud detection. However, as data volume and query complexity increased, performance degraded significantly.

At 1 billion records, query times increased to approximately 15 seconds, and at 2 billion records to approximately 40 seconds. Stored procedures struggled with the large dataset sizes (e.g., 5 billion records in user_activities), leading to query times exceeding several seconds. Additionally, stored procedures did not provide a straightforward way to integrate data from PostgreSQL, which was essential for incorporating the good_u and good_u_prob models into the fraud detection logic.

3. Scalability and Performance + Costs

As the volume of user activities grows, the system needs to scale efficiently without compromising performance. Initially, databases and processing nodes were running on 8-core CPUs with 64GB RAM. We upgraded to 16-core CPUs with 128GB RAM and added additional nodes, increasing the total count from 4 to 8 nodes for processing tasks.

Query processing times improved by approximately 25%, but monthly operational costs increased from $10,000 to $25,000 due to higher hardware and maintenance costs, which was not acceptable.

Architectural Solutions Explored

1. Going All-In with Kafka — We explored a fully Kafka-based solution using Kafka Streams for all data processing tasks. Kafka Streams provided a promising approach due to its real-time stream processing capabilities, allowing data to be processed as it arrives. This setup aimed to simplify the architecture by eliminating the need for additional processing frameworks like Apache Spark. Initial tests showed that basic transformations and filtering operations could be handled efficiently, with average processing times under 100ms. Scalability was also a key advantage, as Kafka’s distributed nature allowed the system to handle increased data volumes and ingestion rates effectively.

Despite its strengths, the fully Kafka-based approach faced limitations in:

  • Handling complex analytics required for sophisticated fraud detection algorithms.
  • Advanced computations where Kafka Streams showed insufficient support compared to Apache Spark.

2. Trying NoSQL Databases — In our effort to simplify architecture and improve scalability, we explored adopting NoSQL databases such as Cassandra and MongoDB for all data storage. We migrated our existing relational database data to the NoSQL environment, keeping the schema of the tables as it was, aiming to leverage the renowned horizontal scaling capabilities of these systems for handling large volumes of distributed data. However, this approach of maintaining the relational schema in a NoSQL context had significant implications, as we failed to optimise our data model for NoSQL-specific features and access patterns.

As a result, we encountered substantial performance challenges, particularly with fraud detection operations. Queries requiring joins across multiple tables and specific data lookups experienced significant degradation, with query times exceeding several seconds. Moreover, the complexity of our fraud detection algorithms, which demand multi-step processing and real-time updates, proved difficult to implement efficiently in the NoSQL environment. This experience highlighted the importance of carefully evaluating database choices and data modelling strategies to align with specific use case requirements, especially for complex operations like fraud detection.

3. Sticking with a Single Database — We considered consolidating all data into a single, high-performance database (e.g., PostgreSQL, MySQL) to simplify data management and improve query performance.

While a single database solution offered advantages in simplified architecture and potentially improved performance for transactional writes and fast reads, it faced limitations in handling diverse workloads. Combining high-volume transactional writes with complex analytics queries could strain the database’s resources and scalability. For example, performing complex joins and aggregations on a single database with user_activities (5 billion records), user_y (1 billion records), and user_z (6 billion records) led to significant performance bottlenecks, with query times often exceeding 5–10 seconds.

Scaling a single database horizontally to manage large datasets and concurrent user access was challenging and costly, requiring an estimated 50% increase in infrastructure and hardware investments. The need to balance between high-volume writes and fast read operations for fraud detection proved difficult, and a single database solution could not efficiently handle the diverse workload without compromising on performance and scalability.

What Really Worked 👍

Our real-time fraud detection system leverages a combination of technologies to efficiently process and analyse large volumes of user data. The core of the system is built around Apache Kafka, Apache Spark, Debezium and Redis. User activity data from the user_x table is continuously streamed into Kafka using Debezium. Spark Streaming then picks up this data in real time, applying the necessary transformations and combining it with additional information.
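To make this concrete, here is a minimal sketch of how Spark Structured Streaming can consume the Debezium change events for user_x from Kafka. The broker address, topic name, and payload columns are illustrative assumptions rather than the production values.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = (SparkSession.builder
         .appName("fraud-detection-stream")
         .getOrCreate())

# Debezium wraps each row change in an envelope; only the "after" image is needed here.
# The columns below are assumed, not the real user_x schema.
user_x_schema = StructType([
    StructField("user_id", LongType()),
    StructField("activity_type", StringType()),
    StructField("email", StringType()),
    StructField("created_at", StringType()),
])
envelope_schema = StructType([StructField("after", user_x_schema)])

raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
       .option("subscribe", "mysql.frauddb.user_x")      # assumed Debezium topic name
       .option("startingOffsets", "latest")
       .load())

# Parse the JSON value (assumes the connector uses the JSON converter without schemas).
user_activity = (raw
                 .select(from_json(col("value").cast("string"), envelope_schema).alias("event"))
                 .select("event.after.*"))

query = (user_activity.writeStream
         .format("console")  # placeholder sink; the real pipeline applies fraud logic downstream
         .start())
query.awaitTermination()
```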

To optimise lookup times, user_y and user_z are streamed into Redis using Debezium, providing low-latency access to this frequently needed data; the historical data was loaded into Redis with a Spark job. The system also integrates machine learning model outputs stored in PostgreSQL, such as good_u and good_u_prob, which are crucial for advanced fraud detection.

The fraud detection logic implemented in Spark analyses various factors, including user activity patterns, email domains, geographical information, and duplicate review IDs. It applies complex filters and joins to identify potentially fraudulent or suspicious users. The system checks for recently active users, filters based on user status and lifetime points, and identifies users with duplicate review IDs or other suspicious patterns.

The results of these analyses are stored in a cloud bucket. To make the results easily accessible and queryable, we use Trino (formerly known as Presto), a distributed SQL query engine, to view and analyse the data stored in the cloud bucket. Trino allows us to run complex queries on the stored results efficiently, providing insights and enabling further investigation into potential fraud cases.
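For completeness, this is a small, hedged sketch of querying the stored results through Trino’s Python client; the catalog, schema, and table name (suspicious_users) are assumptions for illustration.

```python
import trino

# Connect to the Trino coordinator; the catalog/schema over the cloud bucket are assumed.
conn = trino.dbapi.connect(
    host="trino-coordinator", port=8080, user="analyst",
    catalog="hive", schema="fraud",
)
cur = conn.cursor()
cur.execute("""
    SELECT user_id, reason, detected_at
    FROM suspicious_users                 -- assumed results table name
    WHERE detected_at >= current_date - INTERVAL '1' DAY
    ORDER BY detected_at DESC
    LIMIT 100
""")
for row in cur.fetchall():
    print(row)
```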

Key Components Involved

Debezium

Debezium is a powerful tool that monitors MySQL binary logs for data changes. When it detects a change, it immediately captures and processes it. This ensures that the data in Redis is always up-to-date with the latest changes from MySQL.

Apache Kafka

Kafka is an open-source distributed streaming system used for stream processing, real-time data pipelines, and data integration at scale.

Kafka, utilising Debezium for change data capture (CDC), streams approximately 1,000 records per second from the MySQL user activities table, with the process completing within 50 ms. This setup serves as the backbone for real-time data ingestion, ensuring that data is processed as it arrives. The Debezium MySQL connector reads the MySQL binlog to capture changes, allowing for efficient and low-latency data streaming.
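As an illustration, a Debezium MySQL source connector can be registered against the Kafka Connect REST API roughly as follows. Hostnames, credentials, and table names are placeholders, and some property names differ between Debezium 1.x and 2.x (e.g. database.server.name vs topic.prefix), so treat this as a sketch rather than our exact configuration.

```python
import json
import requests

# Connector name, hosts, credentials, and table list are placeholders.
connector = {
    "name": "user-activity-cdc",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "tasks.max": "1",
        "database.hostname": "mysql-primary",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "<secret>",
        "database.server.id": "5400",
        "topic.prefix": "mysql",  # called database.server.name on Debezium 1.x
        "table.include.list": "frauddb.user_x,frauddb.user_y,frauddb.user_z",
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.frauddb",
    },
}

# Register the connector with the Kafka Connect REST API (assumed endpoint).
resp = requests.post(
    "http://kafka-connect:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```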

Kafka’s distributed nature, combined with Debezium’s CDC capabilities, allowed the system to handle increased data volumes and ingestion rates effectively. This approach enables real-time data pipelines, facilitating immediate data availability for downstream applications and analytics. The use of Debezium also ensures that the original database schema is preserved, maintaining data integrity and consistency throughout the streaming process.

Redis

Redis is an in-memory data store used for low-latency data lookups, providing quick access to frequently needed data, which is crucial for real-time fraud detection. We chose Redis for its blazing-fast performance.

But here’s where it gets interesting — we had to decide between two Redis configurations: Cluster Mode and a Single Node with Replica.

Our Choice — For our real-time fraud detection system, we opted for the Single Node with Replica configuration. This decision was based on our current data volume and throughput needs. The Single Node with Replica setup provided the right balance between performance and cost-effectiveness, making it the ideal choice for our use case. It allowed us to maintain high performance with low latency data lookups, while also keeping the system simple and cost-efficient.
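A minimal sketch of the access pattern this gives us: writes go to the primary while low-latency lookups can be served from the replica. The host names and key layout (user_y:<id>) are assumptions for illustration.

```python
import redis

# Writes go to the primary; reads are served by the replica (hosts are assumed).
primary = redis.Redis(host="redis-primary", port=6379, decode_responses=True)
replica = redis.Redis(host="redis-replica", port=6379, decode_responses=True)

def cache_user_y(user_id: int, record: dict) -> None:
    # Each user_y row is stored as a hash keyed by user id (assumed key layout).
    primary.hset(f"user_y:{user_id}", mapping=record)

def lookup_user_y(user_id: int) -> dict:
    # Reads hit the replica to keep lookup latency off the write path.
    return replica.hgetall(f"user_y:{user_id}")

cache_user_y(42, {"country": "IN", "status": "active", "lifetime_points": "1200"})
print(lookup_user_y(42))
```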

Apache Spark

Apache Spark is a unified analytics engine for large-scale data processing. It is used for real-time stream processing and complex computations, making it ideal for implementing sophisticated fraud detection logic.

Spark Streaming processes data in real-time from Kafka, applying necessary transformations and combining it with additional information from PostgreSQL, which stores machine learning model outputs such as good_u and good_u_prob.

As described earlier, the fraud detection logic implemented in Spark applies complex filters and joins across user activity patterns, email domains, geographical information, and duplicate review IDs to flag potentially fraudulent or suspicious users, checking for recently active users and filtering on user status and lifetime points.
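To give a feel for this logic, here is a hedged, batch-style PySpark sketch of the kind of filters and joins involved. The DataFrame names, column names, and thresholds are illustrative assumptions; the real streaming job enriches the Kafka stream with Redis lookups and the PostgreSQL model outputs.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("fraud-rules").getOrCreate()

# Assumed inputs: user activity (from Kafka), user profile data (from Redis/MySQL),
# and model scores (good_u / good_u_prob) read from PostgreSQL over JDBC.
user_activity = spark.table("user_activity")
user_profile = spark.table("user_profile")
model_scores = (spark.read
                .format("jdbc")
                .option("url", "jdbc:postgresql://pg:5432/models")  # assumed connection
                .option("dbtable", "good_u_prob")
                .option("user", "spark").option("password", "<secret>")
                .load())

# Enrich activity with profile attributes and model outputs.
enriched = (user_activity
            .join(user_profile, "user_id", "left")
            .join(model_scores, "user_id", "left"))

# Review IDs shared by more than one user are treated as suspicious.
duplicate_reviews = (enriched
                     .groupBy("review_id")
                     .agg(F.countDistinct("user_id").alias("distinct_users"))
                     .filter(F.col("distinct_users") > 1))

# Assumed thresholds for status, lifetime points, and model probability.
suspicious = (enriched
              .filter(F.col("status") == "active")
              .filter(F.col("lifetime_points") < 50)
              .filter(F.col("good_u_prob") < 0.2)
              .join(duplicate_reviews.select("review_id"), "review_id", "left_semi"))
```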

Spark’s processing capabilities enabled the system to handle complex computations required for fraud detection. The system achieved sub-100ms response times for real-time fraud detection logic.

We have also used Spark to load historical data from the user_y and user_z tables in MySQL into Redis for low-latency lookups.

The Implementation 🧑‍💼

Given that there was no existing Debezium sink for Redis, we developed our own custom solution. The solution uses Debezium to interpret MySQL data changes and write them to Redis. We can define various properties, such as connection details for MySQL and Redis, Debezium-related properties, and so on. Depending on whether Redis Cluster Mode is enabled or disabled, we employ different connection strategies: Cluster Mode Enabled and Cluster Mode Disabled.

This approach allows for flexible deployment configurations, accommodating both clustered and non-clustered Redis setups, while maintaining efficient data synchronisation between MySQL and Redis. Every time MySQL records a change, our code captures and updates Redis almost instantaneously, ensuring that the data is always current.
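Below is a simplified sketch of this sync loop, assuming the Debezium change events are consumed from Kafka topics and applied to Redis with either a cluster or a single-node client; the topic names, key layout, and configuration flag are illustrative assumptions, not our exact implementation.

```python
import json

import redis
from kafka import KafkaConsumer           # kafka-python
from redis.cluster import RedisCluster

CLUSTER_MODE = False                       # assumed configuration flag

def make_redis_client():
    # Pick the client implementation based on whether Cluster Mode is enabled.
    if CLUSTER_MODE:
        return RedisCluster(host="redis-cluster", port=6379, decode_responses=True)
    return redis.Redis(host="redis-primary", port=6379, decode_responses=True)

r = make_redis_client()
consumer = KafkaConsumer(
    "mysql.frauddb.user_y", "mysql.frauddb.user_z",   # assumed Debezium topics
    bootstrap_servers="kafka:9092",
    group_id="redis-sink",
    value_deserializer=lambda v: json.loads(v) if v else None,
)

for message in consumer:
    event = message.value
    if event is None:                      # tombstone record, nothing to apply
        continue
    payload = event.get("payload", event)  # envelope shape depends on converter settings
    op, before, after = payload.get("op"), payload.get("before"), payload.get("after")
    if op in ("c", "u", "r") and after:    # create, update, snapshot read
        fields = {k: v for k, v in after.items() if v is not None}
        r.hset(f"user:{after['user_id']}", mapping=fields)
    elif op == "d" and before:             # delete
        r.delete(f"user:{before['user_id']}")
```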

Note: While Debezium is great for real-time changes, it processes records one at a time. Spark can process data in large batches, which is much more efficient for initial loads, and it can utilise multiple nodes in a cluster to process different partitions of data simultaneously. To load the existing user_y and user_z data, roughly 7 billion records combined, into Redis, we developed a Spark job (a sketch follows below).
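A hedged sketch of such a one-time historical load: read each table from MySQL over JDBC in parallel partitions and push every partition into Redis with pipelined writes. The connection details, partition column, bounds, and key layout are assumptions.

```python
import redis
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("historical-redis-load").getOrCreate()

def load_table_to_redis(table: str) -> None:
    # Read the table from MySQL in parallel JDBC partitions (bounds are assumed).
    df = (spark.read
          .format("jdbc")
          .option("url", "jdbc:mysql://mysql-primary:3306/frauddb")
          .option("dbtable", table)
          .option("user", "spark").option("password", "<secret>")
          .option("partitionColumn", "user_id")
          .option("lowerBound", "1")
          .option("upperBound", "7000000000")
          .option("numPartitions", "200")
          .load())

    def write_partition(rows):
        # One Redis connection per partition; pipelined writes cut round trips.
        client = redis.Redis(host="redis-primary", port=6379)
        pipe = client.pipeline(transaction=False)
        for row in rows:
            fields = {k: v for k, v in row.asDict().items() if v is not None}
            pipe.hset(f"{table}:{fields['user_id']}", mapping=fields)
        pipe.execute()

    df.foreachPartition(write_partition)

for t in ("user_y", "user_z"):
    load_table_to_redis(t)
```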

Summary of Performance Metrics

The integrated system streams roughly 1,000 records per second from MySQL into Kafka with CDC latency of around 50 ms, serves frequently needed lookups from Redis in memory, and completes the fraud detection logic in Spark in under 100 ms. Together, these metrics demonstrate the effectiveness of the integrated system in achieving real-time fraud detection with low latency and high scalability.

There we have it, folks! 🙂

Our real-time fraud detection system has achieved significant milestones. By leveraging:

  • Apache Kafka for data ingestion
  • Apache Spark for the initial historical load and real-time fraud detection
  • Debezium to bring MySQL changes to Redis and Kafka
  • Redis for low-latency lookups

We have created a system that not only meets but exceeds the performance requirements. The system processes approximately 1,000 records per second, achieves sub-100ms response times for fraud detection logic, and maintains data consistency with real-time updates.

We believe this solution can be beneficial for several other business use cases:

  • Quick lookups can improve product recommendations and personalised offers in real-time as customers browse.
  • Reduced latency in data lookups can be crucial for high-frequency trading and market analysis.
  • Real-time inventory and logistics data can be processed more quickly for better decision-making.
  • Financial institutions can use real-time processing for transaction monitoring, risk assessment, and compliance checks.

Key Takeaways 📚

Debezium is an excellent, battle-tested tool for real-time data synchronisation.

While Debezium is great for real-time changes, it processes records one at a time. Spark can process data in large batches, which is much more efficient for initial loads. Spark can utilise multiple nodes in a cluster, processing different partitions of data simultaneously.

Real-time data processing is crucial for effective fraud detection. Delays in data processing can result in missed fraudulent activities.

Balancing scalability and performance is essential. While adding more resources can improve performance, it also increases costs. Efficient use of resources is key.

Integrating data from multiple sources requires careful planning to avoid inconsistencies and latency issues.

Selecting the appropriate tools and configurations, such as Redis Cluster Mode vs. Single Node with Replica, can significantly impact the system’s performance and cost.

So there you have it — our journey of taming the real-time data beast with Debezium, Redis, Kafka and Apache Spark.
