Architecting Real-Time Blockchain Intelligence at TRM Labs Part 1: Evaluating Stream Processing Architectures and Engines

Vijay Shekhawat
TRM Engineering
7 min read · Jan 10, 2024


Why Real-Time Stream Processing

At TRM Labs, a leader in monitoring, detecting, and investigating crypto fraud and financial crime, we routinely ingest and analyze petabytes of blockchain data daily. To support real-time blockchain intelligence at that scale, we built an in-house real-time data processing pipeline that handles this data reliably and cost-effectively.

Exploring Architectures

We began by exploring various architectures, notably the Lambda and Kappa models.

Lambda Architecture

Lambda architecture is a dual data processing model, comprising a Batch Layer and a Stream Layer (also known as the Speed Layer), that can deal with huge amounts of data efficiently.

  • Batch Layer: New data continuously arrives as a feed into the data system. The batch layer computes and processes this data on top of a Data Lake at a predefined cadence.
  • Stream Layer: New data is computed and processed in real-time using a stream processing engine.
  • Serving Layer: The outputs of the batch layer (batch views) and the stream layer (near-real-time views) are forwarded to the serving layer, which merges the results from both to provide a comprehensive view of the data (a minimal sketch of this merge follows the list).
  • Pros: Robust, fault-tolerant, and suitable for complex processing tasks.
  • Cons: Complexity in maintaining two separate systems.
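
To make the serving-layer merge concrete, here is a minimal sketch in Python (illustrative only, not TRM's implementation; the wallet totals, event shape, and in-memory views are hypothetical stand-ins for real batch and speed-layer stores):

```python
from collections import defaultdict

# Hypothetical views; in a real system these would live in separate stores.
batch_view = {"wallet_a": 120.0, "wallet_b": 45.5}  # totals up to the last batch run
realtime_view = defaultdict(float)                  # totals since the last batch run

def on_stream_event(event: dict) -> None:
    """Speed layer: incrementally update the near-real-time view."""
    realtime_view[event["wallet"]] += event["amount"]

def query_total(wallet: str) -> float:
    """Serving layer: merge the batch view with the real-time view."""
    return batch_view.get(wallet, 0.0) + realtime_view.get(wallet, 0.0)

# A new transfer arrives on the stream; a query then merges both views.
on_stream_event({"wallet": "wallet_a", "amount": 10.0})
print(query_total("wallet_a"))  # 130.0
```

The useful property is that a failure in the speed layer only affects the freshest slice of data; the batch view remains intact and can be rebuilt or extended independently.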

Kappa Architecture

Kappa Architecture is a streamlined approach where a single processing layer handles both real-time and historical data.

  • Single Processing Layer: Unlike Lambda Architecture, which uses both batch and stream processing, Kappa Architecture relies solely on stream processing. This simplification means that all data, both historical and real-time, is treated as a stream (see the sketch after this list).
  • Serving Layer: After processing, the data is typically pushed to a serving layer, which might consist of databases or other storage systems that enable quick data retrieval and querying.
  • Pros: Simplified architecture, easier to maintain.
  • Cons: Less efficient for complex, multi-step processing.
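
The single-code-path idea can be illustrated with a toy sketch (again hypothetical, with an in-memory list standing in for a durable log such as Kafka): live processing and historical reprocessing run the exact same code, and a backfill is simply a replay of the log from the beginning.

```python
# Toy append-only log standing in for a durable stream (e.g. Kafka).
event_log = [
    {"wallet": "wallet_a", "amount": 10.0},
    {"wallet": "wallet_b", "amount": 3.0},
]

totals: dict[str, float] = {}

def process(event: dict) -> None:
    """The single processing path used for both live and historical data."""
    totals[event["wallet"]] = totals.get(event["wallet"], 0.0) + event["amount"]

def consume(from_offset: int = 0) -> None:
    """Live consumption and backfill differ only in the starting offset."""
    for event in event_log[from_offset:]:
        process(event)

consume(from_offset=0)  # a "backfill" is just a replay from offset 0
print(totals)           # {'wallet_a': 10.0, 'wallet_b': 3.0}
```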

We opted for the Lambda architecture due to its dual-layered processing framework. This dual approach offers a significant benefit in terms of fault tolerance. In scenarios where the stream processing pipeline encounters issues or failures, we can seamlessly fall back on the batch processing pipeline. This redundancy ensures that our data processing operations can continue without significant interruptions, maintaining the integrity and continuity of our data workflows.

Moreover, the batch layer is particularly effective for tasks like data backfills: it can efficiently process and load petabytes of data using bulk-load operators such as COPY.
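
As an illustration, a Postgres backfill driven from the batch layer might stream an exported file through COPY using psycopg2's copy_expert (a hedged sketch; the connection string, table, and file path are hypothetical):

```python
import psycopg2

# Placeholder connection details, table, and export file.
conn = psycopg2.connect("dbname=analytics user=loader")
with conn, conn.cursor() as cur, open("/tmp/transfers_backfill.csv") as f:
    # COPY streams the whole file into the table, far faster than row-by-row INSERTs.
    cur.copy_expert(
        "COPY transfers (block_number, wallet, amount) FROM STDIN WITH (FORMAT csv)",
        f,
    )
```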

Choosing the Right Stream Processing Engine

The next step in our journey to building the Real-Time Blockchain Stream Processing Architecture at TRM Labs was choosing the right processing engine. In our case, the decision revolved around finding the right balance between Performance, Scalability, Maintenance Overhead, Compatibility with our existing systems, and Cost. We also considered factors like the skill set of our team and the specific requirements of our blockchain data processing needs. This comprehensive evaluation guided us to an engine that not only met our current requirements but also aligned with our vision for scalable and efficient real-time data processing in the domain of crypto asset compliance and risk management.

We compared various engines, including Spark Structured Streaming, Apache Beam, Apache Flink, and Kafka Streams, across several critical dimensions:

Performance

Performance in stream processing refers to the system's ability to process high volumes of data with minimal latency. This was of prime importance given the nature of the data and the stakes involved at TRM Labs: we needed a stream-processing engine that could process each newly mined block with sub-second latency.

Key Aspects:

  • Throughput: The amount of data the system can process within a given time frame.
  • Latency: The time taken for data to be processed and for results to be available after ingestion.
  • Efficiency: How well the system utilizes resources like CPU, memory, and network.
  • Impact: High performance ensures timely data processing, which is critical for real-time applications like fraud detection or live monitoring systems.

Scalability

Scalability is the ability of the system to handle increasing data volumes, a growing number of data pipelines, or more complex processing demands without a significant drop in performance. This was a key requirement, as TRM supports over 30 blockchains, including high-TPS blockchains such as Solana, Tron, and Binance Smart Chain.

Key Aspects:

  • Horizontal vs. Vertical Scaling: Whether the system can scale by adding more machines (horizontal) or by adding resources to a single machine (vertical).
  • Elasticity: The ease with which the system can scale up or down in response to changing demands.
  • Impact: Scalable systems can accommodate growth in data volume and complexity, ensuring longevity and adaptability of the investment.

Maintenance Overhead

This refers to the effort and resources required to keep the system running smoothly. We wanted to keep this overhead as low as possible so we could spend less time maintaining the system and more time building new software.

Key Aspects:

  • Operational Complexity: The effort needed to monitor and remediate issues; we wanted this to be handled with a simple runbook by any engineer in our follow-the-sun on-call rotation.
  • Online Migration: The ease with which the system can be updated or upgraded without significant downtime.
  • Reliability: The frequency and impact of system failures or bugs.
  • Impact: Systems with lower maintenance overhead reduce the total cost of ownership and free up resources for other tasks.

Compatibility

Compatibility concerns how well the new system integrates with existing technologies and infrastructure. We use the GCP stack and needed this new processing engine to fit well with our batch-processing stack of BigQuery/Postgres/Redis/Airflow.

Key Aspects:

  • Integration with Existing Tools: Compatibility with current data sources, storage solutions, and other tools.
  • Data Formats and Protocols: Ability to handle various data formats and communication protocols.
  • Impact: High compatibility reduces integration challenges and leverages existing investments in technology.

Cost

Cost involves the total expenditure associated with implementing and operating the system.

Key Aspects:

  • Initial Setup Costs: Expenses related to acquiring the technology, including licensing fees (if applicable), infrastructure setup, and development and testing costs.
  • Operational Expenses: Ongoing costs such as hosting, maintenance, and monitoring.
  • Scalability-Related Costs: Costs associated with scaling the system in response to increased demands.
  • Impact: Understanding and optimizing cost is crucial for ensuring that the solution is financially viable and delivers a good return on investment.

Evaluating Stream Processing Engines

[Figure: comparison of the candidate streaming engines, evaluated in the context of the TRM Labs stack]

Decision: Apache Beam with GCP Dataflow

Explanation:

  • Performance: Exceptional, especially for data processing tasks within the GCP ecosystem; Dataflow optimizes and manages the execution environment, leading to enhanced performance.
  • Scalability: Excellent. As a managed service, Dataflow seamlessly handles resource allocation and scaling, making it highly efficient for varying data loads.
  • Maintenance Overhead: Low, thanks to the managed nature of GCP Dataflow, which abstracts much of the operational complexity of running large-scale data processing jobs.
  • Compatibility: Superior compatibility with GCP services, offering seamless integration with other cloud services and tools such as GCP monitoring; an ideal choice for teams heavily invested in the GCP stack.
  • Cost: Generally cost-effective, especially for GCP users, though costs vary with the scale of data processing and the specific resources used. The managed nature of Dataflow can lead to higher costs at scale; Dataflow Prime mitigates that to a great extent, and the managed service also offers significant savings in operational overhead and time (an illustrative launch configuration follows this list).
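
For reference, here is a sketch of the kind of options such a Dataflow launch might use (the project, region, and bucket are placeholders; check current GCP documentation for the exact flags, including the service option that enables Dataflow Prime):

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Illustrative Dataflow launch configuration (values are placeholders).
options = PipelineOptions(
    runner="DataflowRunner",
    project="my-gcp-project",
    region="us-central1",
    temp_location="gs://my-bucket/tmp",
    streaming=True,
    # Dataflow Prime is enabled via a service option (verify the flag for your SDK version).
    dataflow_service_options=["enable_prime"],
)
```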

Additional Considerations:

  • Flexible Execution and Portable Code: Apache Beam supports a variety of execution engines, including Apache Spark, Google Cloud Dataflow, and Apache Flink. This gives us the flexibility to choose the execution engine that best meets our needs, or to switch to a different one in the future with few or no code changes.
  • Beam Unified Model: Apache Beam provides a unified model for batch and streaming data processing, so the same Beam code can process data that is arriving as a stream or that has already been collected into a batch (a minimal sketch follows this list).
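
As a minimal sketch of what the unified model looks like in practice (not our production pipeline; the flagging logic and sources are hypothetical), the same transform chain can be applied to a bounded in-memory source for batch runs or to an unbounded source such as Pub/Sub for streaming runs:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class FlagLargeTransfer(beam.DoFn):
    """Hypothetical transform: flag transfers above a threshold."""
    def process(self, element):
        yield {**element, "flagged": element["amount"] > 10_000}

def apply_transforms(events):
    # The same transform chain serves both batch and streaming inputs.
    return events | "FlagLarge" >> beam.ParDo(FlagLargeTransfer())

if __name__ == "__main__":
    # Batch mode: a bounded in-memory source stands in for a historical extract.
    with beam.Pipeline(options=PipelineOptions()) as p:
        batch = p | "CreateBatch" >> beam.Create([{"amount": 12_000}, {"amount": 5}])
        apply_transforms(batch) | "Print" >> beam.Map(print)

    # Streaming mode (sketch): the identical apply_transforms would be applied to an
    # unbounded source, e.g. beam.io.ReadFromPubSub(subscription=...), with
    # --streaming and --runner=DataflowRunner passed as pipeline options.
```

Because only the source and the runner options change between the two modes, the processing logic stays in one place for both the batch and stream layers.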

Up Next: Architecting Real-Time Blockchain Intelligence at TRM Labs Part 2: Navigating Stream Processing Challenges

There is a lot more to unpack as we continue our journey of building real-time processing at TRM Labs and discuss how we navigated stream processing challenges. Stay tuned for Part 2!

We are hiring – Join us in building the future of finance!
