How Agoda manages 1.5 Trillion Events per day on Kafka

Shaun Sit
Agoda Engineering & Design
12 min read · Jul 13, 2021

At Agoda, we send around 1.5 Trillion events per day through Apache Kafka. When we first started using Kafka back in 2015, its usage in the company was purely focused on analytical data; we used Kafka to feed data into our data lake and into our near real-time monitoring and alerting solutions.

Over the years, Kafka usage in the company has grown tremendously (2x growth YoY on average), culminating in 1.5 Trillion events per day in 2021. Today, more and more of our engineering teams are leveraging Kafka in their applications; our use cases for Kafka have expanded beyond being a pipeline for analytical data to include serving as an asynchronous API, replicating data across our on-premise data centers, and becoming an integral part of feeding and serving data to and from our Machine Learning pipelines.

With this scale and the rapid growth in Kafka usage, particular challenges arose that had to be tackled. In this blog post, we'll attempt to outline our journey over the years and explain some of the key approaches we took for our ever-growing Kafka infrastructure.

1. Back to Basics: Simplifying how Developers send data to Kafka

Back in 2015, Agoda decided to employ an architecture of logging data events to disk and separately forwarding the data into Kafka. The idea was simple: a client library (used by the development teams) would write events to disk, and a separate daemon process (called the Forwarder) would read the events from disk and "forward" them to Kafka.

The client library was responsible for

  1. handling file rotation
  2. managing write file locations

The Forwarder, on the other hand, was responsible for the following (a simplified sketch of its core loop appears below)

  1. reading the files and retrieving the events + metadata (e.g., Kafka topic, the payload type, etc.)
  2. sending the events to Kafka
  3. tracking file offsets of what was sent
  4. managing the deletion of completed files
[Figure: 2-Step Logging Architecture]
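
For illustration, here is a heavily simplified sketch of what such a Forwarder loop might look like, in Java. The file path, envelope format, endpoint, and offset-persistence helpers are assumptions made for this example; the real Forwarder also handles batching, retries, file rotation, and many files concurrently.

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical, heavily simplified Forwarder loop: read locally logged events,
// send them to Kafka, and track how far we have read so completed files can be
// deleted. Error handling, batching, and multi-file handling are omitted.
public class ForwarderSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-cluster:9092");   // assumed endpoint
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.ByteArraySerializer");

        try (KafkaProducer<String, byte[]> producer = new KafkaProducer<>(props);
             RandomAccessFile log = new RandomAccessFile("/var/log/events/app.log", "r")) {

            long offset = loadLastSentOffset();   // hypothetical: resume from persisted offset
            log.seek(offset);

            String line;
            while ((line = log.readLine()) != null) {
                // Assumed envelope written by the client library: "<topic>\t<payload>"
                String[] parts = line.split("\t", 2);
                String topic = parts[0];
                byte[] payload = parts[1].getBytes(StandardCharsets.UTF_8);

                producer.send(new ProducerRecord<>(topic, payload));
                offset = log.getFilePointer();
            }
            producer.flush();
            saveLastSentOffset(offset);           // hypothetical: persist progress for restarts
        }
    }

    private static long loadLastSentOffset() { return 0L; }   // placeholder
    private static void saveLastSentOffset(long offset) { }   // placeholder
}
```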

Logging to disk and then forwarding to Kafka (let's call it the 2-Step logging architecture) provided several benefits, the first of which was separating operational concerns from dev teams. The lightweight Forwarder would be deployed (as a daemon service) across every node in our infrastructure and managed directly by the Kafka team.

The Kafka team could perform operational tasks such as

  • Dynamically configure the Forwarder remotely to route events/topics to a different Kafka Cluster, or to block them, without impacting applications or requiring changes from dev teams
  • Configure optimizations (e.g., Kafka batching, compression, and linger settings) without dev team intervention
  • Roll out upgrades to Forwarders independently of dev teams, e.g., rolling out Kafka client fixes without needing to chase teams to upgrade library dependencies
  • Upgrade Kafka without needing to chase the bulk of Kafka producers (e.g., if the Kafka client protocol changed), simply by upgrading the Forwarder
[Figure: Separation of operational concerns]

Other benefits of the architecture include

  • A simplified API for producers, who can send events without requiring any Kafka knowledge or dependency (the client lib is reduced to a simple disk logger; see the sketch after this list)
  • Enforcing serialization standards by embedding serialization logic into our client libs, e.g., our client libs can enforce AVRO serialization of ALL events by working in tandem with a SchemaRegistry
  • Writing to disk adds a layer of resiliency, for example, in case of connectivity or Kafka issues
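
For completeness, here is an equally simplified sketch of the client-library side: a plain disk logger with no Kafka dependency. The envelope format matches the hypothetical one assumed in the Forwarder sketch above; the real client library additionally handles file rotation, write locations, and serialization (e.g., AVRO).

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch of the client-library side: a plain disk logger with no Kafka
// dependency. It appends one "<topic>\t<payload>" line per event, matching the
// envelope assumed in the Forwarder sketch above. File rotation, write-location
// management, and schema enforcement are omitted.
public class DiskEventLogger {
    private final Path logFile;

    public DiskEventLogger(Path logFile) {
        this.logFile = logFile;
    }

    /** Record an event for the Forwarder to pick up; the caller never touches Kafka. */
    public synchronized void log(String topic, String payload) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(
                logFile, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            writer.write(topic + "\t" + payload);
            writer.newLine();
        }
    }
}
```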

Tradeoffs Taken

With our approach, the client library has a hard dependency on the Forwarder process to send data to Kafka. This means that the Kafka team has to ensure that every node in Agoda has a Forwarder process running healthily, and the team is tasked with its deployment, upgrades, monitoring, and SLAs. Although this results in a more complex architecture, the operational complexity can easily be mitigated with proper monitoring and automated tooling.

At its core, the main tradeoff is increased latency in exchange for better resiliency and flexibility, which is why our 2-Step logging architecture is ONLY used for sending analytics data. Using this approach, we achieve a 99th-percentile latency of 10s, measured end to end from the application client writing to disk, through the Forwarder sending to Kafka, to the data being ready for consumption. We think the 10s latency is acceptable and sufficient for the bulk of analytics workloads. Depending on the topic/cluster use case, we can further tune the Forwarders to achieve a lower 99th-percentile latency of 1s (i.e., utilizing smaller batches at the cost of lower overall throughput).
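
For context, the knobs involved in this latency/throughput tradeoff are standard Kafka producer settings such as batch.size, linger.ms, and compression.type. The values in the sketch below are purely illustrative and not Agoda's actual Forwarder configuration.

```java
import java.util.Properties;

// Illustrative Kafka producer settings of the kind tuned centrally on the Forwarder.
// The values below are examples only, not Agoda's actual configuration.
public class ForwarderTuningSketch {

    /** Favour throughput: bigger batches, longer linger, heavier compression. */
    public static Properties throughputTuned() {
        Properties props = new Properties();
        props.put("batch.size", "1048576");      // ~1 MiB batches
        props.put("linger.ms", "500");           // wait up to 500 ms to fill a batch
        props.put("compression.type", "zstd");   // trade CPU for network/disk savings
        return props;
    }

    /** Favour latency: small batches sent almost immediately, at lower throughput. */
    public static Properties latencyTuned() {
        Properties props = new Properties();
        props.put("batch.size", "16384");        // Kafka's default batch size
        props.put("linger.ms", "5");             // send almost immediately
        props.put("compression.type", "lz4");
        return props;
    }
}
```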

What if we need lower latency than 1s?

But what about critical and time-sensitive use cases that require sub-second latency? Such applications still write to Kafka directly from their application code via the Kafka client libs, bypassing the 2-Step logging architecture.

We believe that this gives a nice balance of

  • the simplicity of the 2-Step logging architecture for the bulk (85%) of Agoda's developers, who only want to pass on some analytical data to Kafka and do not have sub-second latency requirements (delays of up to 10s are acceptable)
  • the flexibility to still produce to Kafka directly for specific but less common use cases that require sub-second latency

2. Thinking about Kafka Cluster Layouts: Multiple Smaller Kafka Clusters instead of 1 Large Kafka Cluster per Data Center

Early on in our Kafka journey, we also decided to split our Kafka Clusters based on use cases instead of having a single Kafka Cluster per data center. There were two main reasons for this:

  • To isolate any potential issues, limiting their impact to a single Kafka Cluster
  • Flexibility to set up and configure clusters differently, e.g., clusters could have different hardware specs and configurations depending on their use case
[Figure: Smaller Kafka Clusters based on use cases are easier to manage]

This strategy, coupled with the 2-Step logging architecture (explained in the previous section), enabled us to route Kafka events to a different Kafka Cluster transparently via the Forwarder, without any change from the producer application. Though existing consumer teams would still need to make changes, such routing changes do not occur often. The per-topic nature of such routing configs also usually means that only a handful of teams are impacted by any given change.
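
Conceptually, such routing boils down to a per-topic lookup that the Forwarder consults before producing. The sketch below is only an illustration of the idea; the topic names, cluster endpoints, and config format are hypothetical, not the Forwarder's actual configuration.

```java
import java.util.Map;

// Conceptual illustration of per-topic routing on the Forwarder: pick a target
// cluster per topic, with a default fallback. Topic names, cluster endpoints, and
// the config format are hypothetical.
public class TopicRoutingSketch {
    private static final Map<String, String> TOPIC_TO_CLUSTER = Map.of(
            "search.impressions", "analytics-cluster:9092",
            "booking.events",     "critical-cluster:9092",
            "ml.features",        "ml-cluster:9092"
    );

    /** Resolve which Kafka Cluster a topic's events should be produced to. */
    public static String clusterFor(String topic) {
        return TOPIC_TO_CLUSTER.getOrDefault(topic, "default-cluster:9092");
    }
}
```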

2a. Don’t cheap out on Zookeepers

For Zookeeper, we provisioned dedicated physical nodes (not co-located with our Kafka nodes), with dedicated SSD disks for the ZK logging and snapshot data directories, respectively. Given that Zookeeper is critical to healthy Kafka Cluster operation, this setup isolates the Kafka and Zookeeper nodes; we wanted to preemptively mitigate any potential issues between Kafka and Zookeeper that might arise down the line. This setup has worked out well thus far; in the future, the plan is to migrate to newer Kafka versions that remove Zookeeper as a dependency entirely.

3. Building Confidence: Kafka Monitoring and Auditing

“getting Kafka Broker stats wasn’t enough, we needed developers and stakeholders to trust that the data they produce/consume is complete, reliable, accurate, and timely”

With the Kafka layout and primary data delivery mechanism dealt with, the next thing we turned our attention to was auditing and monitoring of the Kafka Pipeline.

For monitoring, we gathered stats via JMXTrans, piped them into Graphite as a backend time-series data store, and visualized the metrics using Grafana. However, we also understood that just getting the Kafka Broker stats wasn't enough; we needed developers and stakeholders to trust that the data they produce/consume is complete, reliable, accurate, and timely. Inspired by KAFKA-260, we went about adding Kafka Auditing to our pipeline. A separate thread runs in the background in our client libs (from the 2-Step logging architecture), asynchronously aggregating message counts across time buckets to generate audits.
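
A minimal sketch of this kind of background aggregation is shown below: count events per (topic, time bucket) and periodically flush the counts as audit records. The class and method names, bucket size, and flush target are our own assumptions, not Agoda's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

// Minimal sketch of background audit aggregation: count events per (topic, minute
// bucket) and periodically flush the counts as audit records. Names and the flush
// target are hypothetical; in the real pipeline, audits are shipped through the
// 2-Step logging architecture to a dedicated audit Kafka cluster.
public class AuditCounter {
    private static final long BUCKET_MILLIS = 60_000;   // one-minute time buckets

    private final Map<String, Map<Long, LongAdder>> counts = new ConcurrentHashMap<>();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public AuditCounter() {
        // Flush aggregated counts once a minute on a background thread.
        scheduler.scheduleAtFixedRate(this::flush, 1, 1, TimeUnit.MINUTES);
    }

    /** Called by the client library for every event it logs. */
    public void record(String topic, long eventTimeMillis) {
        long bucket = eventTimeMillis / BUCKET_MILLIS * BUCKET_MILLIS;
        counts.computeIfAbsent(topic, t -> new ConcurrentHashMap<>())
              .computeIfAbsent(bucket, b -> new LongAdder())
              .increment();
    }

    /** Emit one audit record per (topic, bucket) and reset the counters. */
    private void flush() {
        counts.forEach((topic, buckets) -> buckets.forEach((bucket, count) ->
                emitAudit(topic, bucket, count.sumThenReset())));
    }

    private void emitAudit(String topic, long bucket, long count) {
        // Placeholder: in the real pipeline this would be written to disk and
        // forwarded to the dedicated audit Kafka cluster.
        System.out.printf("audit topic=%s bucket=%d count=%d%n", topic, bucket, count);
    }
}
```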

[Figure: Getting Audit Counts]
[Figure: Generating Audits throughout the Pipeline]

These generated audits were then piped into a different Kafka Cluster spawned solely to store audit information across our many Kafka Clusters. We re-used the 2-Step logging architecture (explained in the first section) as the delivery mechanism by adding a different Kafka endpoint for audit events. The audit info ultimately ends up in ElasticSearch, allowing the Kafka team to monitor and set up alerts on the pipeline's health. Having the data in ElasticSearch also made it much easier for the Kafka team to drill down and pinpoint problems whenever there were issues.

[Figure: Grafana Dashboards for High-Level Pipeline Monitoring based on Audits]
[Figure: Drill down to Source Host Granularity to Identify Issues in Pipeline]

At any point in time, we can be sure of how many events were sent across the different pipeline stages, giving the Kafka team a high-level overview of the accuracy and latency of the Kafka Pipeline. Together with our Kafka Broker stats (Graphite/Grafana), the audits provide us with a comprehensive view of the overall health of our Kafka Pipeline and Clusters.

4. Keeping up with growth: Knowing when to Scale Kafka Clusters

Our next challenge was to figure out when ANY specific Kafka Cluster required expansion, i.e., adding more nodes. Naively, we started by using the percentage of available disk as the key metric to determine if a cluster required expansion. We quickly found out that this was a bad idea: since Kafka retention can be dynamically configured per topic, disk used/capacity metrics did not accurately reflect Kafka Cluster capacity.

Many factors can contribute to the overall available capacity of a Kafka Cluster, such as CPU, network, disk, and total partition count. So in our second iteration, we wanted a more holistic measure of Cluster Capacity that took multiple resource metrics into account. Each resource metric we measured would be compared against an upper limit, defined as the maximum level we are comfortable having the cluster run at; this is recorded as current resource metric / upper limit and represented as a percentage. We then take the maximum of all these percentages as the overall capacity of the cluster. This simplifies capacity down to a single number that signifies the dominant resource constraint on available capacity at that point in time (a sketch of this calculation follows the list of metrics below).

[Figure: Capacity is calculated based on the dominant cluster resource/limit at that time]

Key capacity metrics we monitor in our Kafka Clusters, per cluster:

  • Disk Storage Used
  • Network Utilization
  • CPU Utilization
  • Total Partitions per cluster
  • Request Handler Average Idle Percentage
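
As a sketch of the calculation, overall capacity used is simply the maximum of each metric's utilization against its agreed upper limit. The metric values and limits below are illustrative only (the request handler metric is shown here as a derived "busy" percentage, i.e., 100 minus the average idle percentage).

```java
import java.util.Map;

// Sketch of the capacity calculation: each resource metric is expressed as a fraction
// of its agreed upper limit, and overall cluster capacity used is the maximum of those
// fractions, i.e., the dominant constraint. All values and limits are illustrative.
public class ClusterCapacity {

    /** Overall capacity used (%), given metric name -> {current value, upper limit}. */
    public static double overallCapacityPercent(Map<String, double[]> metrics) {
        return metrics.values().stream()
                .mapToDouble(m -> m[0] / m[1] * 100.0)
                .max()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        Map<String, double[]> metrics = Map.of(
                "diskStorageUsedTb",         new double[]{60, 100},
                "networkUtilizationGbps",    new double[]{12, 20},
                "cpuUtilizationPercent",     new double[]{35, 70},
                "totalPartitions",           new double[]{45_000, 60_000},
                "requestHandlerBusyPercent", new double[]{30, 60}   // 100 - average idle %
        );
        // Dominant constraint here is partition count: 45,000 / 60,000 = 75% capacity used.
        System.out.printf("Cluster capacity used: %.0f%%%n", overallCapacityPercent(metrics));
    }
}
```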

By setting alert thresholds based on the derived overall capacity percentage per cluster, the team can look into the capacity constraints and decide whether there are issues that warrant further investigation or whether it's time to physically expand the cluster. After adopting this way of calculating cluster capacity, we also found the capacity numbers useful when scaling down clusters.

5. Ensuring Healthy Data Growth: Attributing cost back to teams

As data continued to grow over the years, we reached a point, a little over a year ago, where we started asking ourselves how much of the data being sent was actually needed. The perception was beginning to grow that the Data Platform (including the Kafka Pipeline and our Data Lake) had turned into a dumping ground and that much of the data being sent/stored wasn't generating value.

To tackle this situation, we took a page out of the Cloud Providers' playbook: put a monetary dollar value on usage and attribute it back to the users. The core concept was simple: have dev teams be responsible for what they use (in our case, Kafka usage) and have them manage and be accountable for the costs they incur.

For Kafka, generating a monetary value for teams' usage meant picking a suitable metric to measure and allocate cost against. For simplicity, we ended up picking bytes sent. We deliberately chose to ignore consumer usage, as we didn't want to penalize teams for consuming data; the whole point of democratizing data is to allow people to use it. Instead, we think the onus is on teams producing data to justify sending/storing it.

At Agoda, every Kafka topic has an owner team assigned as part of the topic creation process, making the entire cost attribution process much easier to implement. To determine a cost per byte, we took the total cost of our Kafka Clusters together with the derived Cluster Capacity numbers (from the Knowing when to Scale Kafka Clusters section above) to extrapolate the estimated 100% capacity and what that would entail in terms of total bytes per day. From there, we derive an estimated cost per byte for use in our cost attribution models.
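
A simplified, made-up example of that derivation: extrapolate bytes per day at 100% capacity from the current capacity number, divide the cluster's cost by that throughput to get a cost per byte, and multiply by each team's bytes sent. All figures below are invented for illustration.

```java
// Simplified sketch of deriving a cost per byte and attributing cost back to a team.
// Every number here is invented for illustration; this is not Agoda's actual pricing.
public class KafkaCostAttribution {
    public static void main(String[] args) {
        double monthlyClusterCostUsd = 50_000;     // total monthly cost of one Kafka Cluster
        double currentCapacityUsed   = 0.60;       // from the capacity model (60% used)
        double currentBytesPerDay    = 9e12;       // ~9 TB/day currently produced to the cluster

        // Extrapolate what the cluster could carry at 100% capacity...
        double bytesPerDayAtFullCapacity   = currentBytesPerDay / currentCapacityUsed;  // ~15 TB/day
        double bytesPerMonthAtFullCapacity = bytesPerDayAtFullCapacity * 30;            // ~450 TB/month

        // ...and derive a cost per byte from that full-capacity throughput.
        double costPerByteUsd = monthlyClusterCostUsd / bytesPerMonthAtFullCapacity;    // ~$111 per TB

        // A team's monthly bill is simply the bytes it produced times the cost per byte.
        double teamBytesThisMonth = 2e13;          // 20 TB produced by the team this month
        System.out.printf("Cost per TB:         $%.2f%n", costPerByteUsd * 1e12);
        System.out.printf("Team's monthly cost: $%.2f%n", costPerByteUsd * teamBytesThisMonth);
    }
}
```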

[Figure: Attributing Costs back to Teams based on a derived cost per byte]

Once rolled out, we found the ability to attribute a monetary value to the data teams send to be extremely powerful in incentivizing teams to evaluate what they do and do not need. Overall, it transformed teams' mindsets about cost management and encouraged them to proactively manage their costs and be good stewards of company resources. Eventually, this evolved into a larger Agoda-wide effort to attribute the cost of any system back to dev teams, but that is perhaps a blog topic for another time.

6. Rite of Passage: Authentication and ACLs

There comes a time for any evolving production Kafka Cluster when you will need to add Kafka Authentication and ACLs (assuming you started without them).

At Agoda, our clusters were initially used solely for application telemetry data, and Authentication and ACLs on Kafka weren't required at that time. However, as our Kafka usage boomed over time, we became concerned about our inability to identify and deal with users abusing Kafka and negatively impacting Cluster performance. To clarify, our 2-Step logging architecture (see the first section above) provided some safeguards, since we could block events at the Forwarder if there were a sudden flood. Still, as Kafka usage grew, there were more and more use cases where teams needed to connect to Kafka directly (bypassing the Forwarder). Additionally, new requirements were emerging where we needed to restrict user access to certain topics.

Thankfully, implementing Authentication and ACLs in Kafka is well documented, and it's just a matter of rolling it out. For us, this meant coordinating with the many dev teams in Agoda on a migration and rollout plan; not technically challenging, but it requires time, effort, and patience.

7. Not forgetting Operational Scalability: Automated Tooling

Lastly, I wanted to briefly touch on the topic of automated tooling. As we came to manage more and more Kafka Clusters, we knew we needed better tools to aid us and to keep our operational processes scalable.

Currently, at Agoda, we use a mix of open-source Kafka tools like Cruise Control and CMAK, along with in-house tools for binary deployments, broker config propagation, Kafka Broker rolling restarts, and UIs for managing ACLs, to name a few. I don't think we will ever stop being on the lookout for new and better tools to add to our arsenal; it's part and parcel of managing such systems at scale.

Conclusion: So what’s the deal?

Apache Kafka is a highly scalable, robust piece of software that is, by and large, relatively easy to manage. We find that the main challenges at scale have less to do with the Kafka software per se and more to do with other aspects like:

  • separating operational concerns away from dev teams
  • thinking about your Kafka Clusters layout (e.g., how clusters are segregated and how data is sent)
  • knowing when to scale up/down a Kafka Cluster
  • adding monitoring and auditing as well as giving dev teams visibility into their usage and cost
  • having safeguards to identify and isolate possible bad actors in the pipeline quickly

What’s Next? Improving on the Forwarder

One criticism of our 2-Step logging architecture was that, at run time, the client could not get an acknowledgment that Kafka had received the message (because all the client did was write to disk). As of this writing, we are revisiting the Forwarder architecture with plans to incorporate an endpoint that will give the client lib a response after Kafka has acknowledged the message.

In conclusion, we’ve learned a lot about expanding our Kafka Clusters over the years. Our Kafka Journey will continually evolve as we support Agoda’s ever-growing needs.

If you are interested in coming along this journey with us, check out Agoda’s Careers Page.

DISCLAIMER: The approaches listed in this blog post ARE tailored to Agoda’s specific requirements, and we are definitely NOT saying that is what you should do; like many other things in Big Data, do not follow blindly. Experiment and do what makes sense for your needs.
