Shipping 150Bn+ events every month

Trilok Jain
Hevo Data Engineering
Jul 12, 2021 · 4 min read

In the past couple of months, Hevo has seen a multi-fold increase in the number of customers using the product. Consequently, the volume of data shipped from the transactional end of their data ecosystems to the analytical end has grown in step.

In this post, we discuss how the components responsible for ingestion and processing (consumption) respond to the increase in data volume for pipelines that ship events to Data Warehouses.

Ingestion: The process of identifying and reading the updated data from a source system.

Processing (Consumption): The process of validating, transforming, mapping, and making data ready to be written to the destinations.

Keeping the two systems decoupled helps us deal with asymmetric read and write rates, handle failures better, and achieve the efficient batching that warehouses favor. It also lets us scale the read and write components independently of each other. We chose Kafka as the asynchronous buffer between them.
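As a rough illustration of this decoupling (not Hevo’s actual code), the sketch below shows an ingestion-side producer that only hands events to Kafka and lets the consumption side read them later at its own pace. The topic name, serializers, and batching settings are assumptions.

```java
// Minimal sketch: the ingestion side only produces raw events to Kafka;
// the consumption side reads them asynchronously through its consumer groups.
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class IngestionBuffer {
    private final KafkaProducer<String, String> producer;

    public IngestionBuffer(String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        // Producer-side batching keeps the write path efficient even when the
        // downstream warehouse writers are temporarily slower than ingestion.
        props.put("linger.ms", "50");
        props.put("batch.size", "262144");
        this.producer = new KafkaProducer<>(props);
    }

    /** Ship one ingested event; a consumer group picks it up asynchronously. */
    public void publish(String teamTopic, String eventKey, String eventJson) {
        producer.send(new ProducerRecord<>(teamTopic, eventKey, eventJson));
    }
}
```

Because the broker durably holds the buffered events, the read and write rates on either side can diverge without data loss, which is what makes independent scaling of the two layers practical.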

The Infrastructure Layer

Hevo is built on AWS. There are a couple of geo-aware regional clusters that are responsible for moving data in specific geographies. Each regional cluster consists of a self-managed Kafka cluster and an application cluster.

For the application cluster, the Infra Coordinator collects metrics from the horizontally scalable application components and responds by adding or removing EC2 instances of the relevant capacities. The Infra Coordinator itself uses Consul to maintain the necessary configurations and thresholds. Each node added to the application cluster in response to scaling needs comes with a pre-configured working capacity, defined in units understandable to the relevant system.
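A hedged sketch of what one such coordination step could look like: reading thresholds from Consul’s KV store over its HTTP API (GET /v1/kv/&lt;key&gt;?raw) and translating pending work units into a node count. The key names and the scaling rule are illustrative assumptions, not Hevo’s actual configuration.

```java
// Sketch of an infra-coordination step: compare observed load against
// thresholds stored in Consul and decide how many nodes to request.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class InfraCoordinatorSketch {
    private final HttpClient http = HttpClient.newHttpClient();
    private final String consulAddr; // e.g. "http://127.0.0.1:8500"

    public InfraCoordinatorSketch(String consulAddr) {
        this.consulAddr = consulAddr;
    }

    /** Reads a raw value from Consul's KV store (GET /v1/kv/<key>?raw). */
    private double readThreshold(String key) throws Exception {
        HttpRequest req = HttpRequest.newBuilder()
                .uri(URI.create(consulAddr + "/v1/kv/" + key + "?raw"))
                .build();
        return Double.parseDouble(
                http.send(req, HttpResponse.BodyHandlers.ofString()).body().trim());
    }

    /** Decide how many extra nodes of a given capacity are needed. */
    public int extraNodesNeeded(double pendingWorkUnits) throws Exception {
        double unitsPerNode = readThreshold("infra/units-per-node");      // hypothetical key
        double scaleOutAt   = readThreshold("infra/scale-out-threshold"); // hypothetical key
        if (pendingWorkUnits < scaleOutAt) return 0;
        return (int) Math.ceil(pendingWorkUnits / unitsPerNode);
    }
}
```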

The Ingestion Layer

Hevo’s job manager, Handyman, executes the ingestion jobs submitted to it by the Connectors. Handyman uses worker pools made available on EC2 nodes, each of which comes with a pre-configured number of workers depending on its instance type.

Connector: A module that defines how and when the data from a given source must be fetched.
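For illustration, a per-node worker pool in this spirit might look like the minimal sketch below; the instance-type-to-worker mapping and job type are hypothetical.

```java
// Minimal sketch of a worker pool whose size is pre-configured per instance type.
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPoolNode {
    // Hypothetical mapping of EC2 instance type -> pre-configured worker count.
    private static final Map<String, Integer> WORKERS_BY_INSTANCE_TYPE = Map.of(
            "c5.xlarge", 16,
            "c5.2xlarge", 32
    );

    private final ExecutorService pool;

    public WorkerPoolNode(String instanceType) {
        int workers = WORKERS_BY_INSTANCE_TYPE.getOrDefault(instanceType, 8);
        this.pool = Executors.newFixedThreadPool(workers);
    }

    /** Submit an ingestion job handed over by a connector. */
    public void submit(Runnable ingestionJob) {
        pool.submit(ingestionJob);
    }
}
```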

Connector jobs can differ widely in their CPU and I/O needs, run time, and so on. Because Hevo is a multi-tenant system, these requirements also vary from customer to customer and follow non-homogeneous seasonal patterns. The workers are scaled in the following two ways:

  • Local capacity enhancement: A Local Capacity Enhancer on each node tracks CPU utilization and dynamically adds or removes workers beyond the node’s pre-configured capacity (a code sketch follows the figures below).
  • Horizontal scaling: A Delay Monitor computes the grouped wait times of tasks that are eligible to run; the Infra Coordinator reads these wait times before requesting the cluster’s EC2 capacity to expand or shrink.
Figure: Task execution delays, before and after local capacity enhancement
Figure: Horizontal scaling needs, beyond local capacity enhancement
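Here is a minimal sketch of the local capacity enhancement rule referenced above; the CPU thresholds, step size, and cap on extra workers are illustrative assumptions.

```java
// Sketch: grow or shrink the worker count around the node's baseline,
// based on observed CPU utilization supplied by a metrics agent on the node.
public class LocalCapacityEnhancer {
    private final int baselineWorkers;
    private final int maxExtraWorkers;
    private int currentWorkers;

    public LocalCapacityEnhancer(int baselineWorkers, int maxExtraWorkers) {
        this.baselineWorkers = baselineWorkers;
        this.maxExtraWorkers = maxExtraWorkers;
        this.currentWorkers = baselineWorkers;
    }

    /**
     * @param cpuUtilization observed node CPU utilization in [0.0, 1.0]
     * @return the new target worker count for the pool
     */
    public int adjust(double cpuUtilization) {
        if (cpuUtilization < 0.60 && currentWorkers < baselineWorkers + maxExtraWorkers) {
            currentWorkers++;   // headroom available: add a worker beyond baseline
        } else if (cpuUtilization > 0.85 && currentWorkers > baselineWorkers) {
            currentWorkers--;   // node is running hot: drop back toward baseline
        }
        return currentWorkers;
    }
}
```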

AWS Lambda was considered and rejected as a candidate for executing the ingestion jobs because of its limitations on how jobs can run and its high cost.

The Processing (Consumption) Layer

Hevo follows a multi-topic, multi-partition strategy for Kafka topics. The topics are a function of the team (tenant) configuration and the destination configuration. Each consumer group is responsible for processing events from a set of team topics; at Hevo, each such grouping of teams is known as a virtual cluster. The goals of this grouping strategy are listed below, followed by a sketch of the topic-to-consumer-group mapping:

  • Keep event processing latencies minimal for each customer.
  • Scale consumer threads quickly and independently in response to growing event loads, staying as near real-time as possible.
  • Provide some degree of isolation between teams.
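The sketch below illustrates how a consumer group could be bound to the team topics of one virtual cluster; the topic-name format and group-id scheme are assumed conventions, not Hevo’s actual ones.

```java
// Sketch: one Kafka consumer group per virtual cluster, subscribed to the
// topics derived from its teams' and destination's configuration.
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.util.List;
import java.util.Properties;
import java.util.stream.Collectors;

public class VirtualClusterConsumer {
    /** Topic name as a function of team (tenant) and destination configuration. */
    static String topicFor(long teamId, String destinationType) {
        return "events." + destinationType + "." + teamId;   // assumed naming format
    }

    static KafkaConsumer<String, String> consumerFor(String virtualClusterId,
                                                     List<Long> teamIds,
                                                     String destinationType,
                                                     String bootstrapServers) {
        Properties props = new Properties();
        props.put("bootstrap.servers", bootstrapServers);
        props.put("group.id", "vc-" + virtualClusterId);      // one group per virtual cluster
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(teamIds.stream()
                .map(id -> topicFor(id, destinationType))
                .collect(Collectors.toList()));
        return consumer;
    }
}
```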

These virtual clusters automatically adjust to changes in team dynamics. The Coordinator in the processing layer uses several metrics, such as topic lags, processing speeds, and hot-spotting, to compute the capacity requirements across all of the virtual clusters. The Infra Coordinator responds by adding nodes to the physical application cluster. The increase in consumer capacity is then broadcast to the cluster, and the application responds by starting or stopping consumers for the relevant virtual cluster if they are healthy and able to do so.
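As a simplified illustration of that capacity calculation, the sketch below estimates how many consumer threads each virtual cluster would need to drain its current lag within a target window; the metric inputs and the window are assumptions.

```java
// Sketch: translate per-virtual-cluster lag and processing-rate metrics
// into a consumer-thread requirement.
import java.util.Map;

public class ProcessingCapacityPlanner {
    private static final long TARGET_DRAIN_SECONDS = 300; // assumed catch-up window

    /** lagByVirtualCluster: pending events; perConsumerRate: events/sec one consumer processes. */
    public static Map<String, Integer> consumersNeeded(Map<String, Long> lagByVirtualCluster,
                                                       Map<String, Double> perConsumerRate) {
        return lagByVirtualCluster.entrySet().stream().collect(
                java.util.stream.Collectors.toMap(
                        Map.Entry::getKey,
                        e -> {
                            double rate = perConsumerRate.getOrDefault(e.getKey(), 1.0);
                            return (int) Math.ceil(e.getValue() / (rate * TARGET_DRAIN_SECONDS));
                        }));
    }
}
```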

Figure: A test run to check the processing capacity at ~0.45B events/hour per physical cluster

Additionally, there are tools and SOPs defined for the on-call teams to handle a few exceptional scenarios that the automatic scaling system does not yet address.

Thank you for reading the post to the end. In the days to come, we will discuss how we scale and achieve near real-time data writes to databases as destinations, some of the challenges specific to database writes, and the associated cost considerations.

Please write to us at dev@hevodata.com with your comments and suggestions. If you’d like to work on some of these problems, please do check the careers page at Hevo.
