How Agoda manages 1.8 trillion Events per day on Kafka

Agoda Engineering & Design · Sep 18, 2023

by Napong Thammanochit

At Agoda, we send around 1.8 trillion events per day through Apache Kafka. We first started using Kafka in 2015 to manage analytical data, feed data into our data lake, and enable near real-time monitoring and alerting solutions.

Over the years, Kafka usage at Agoda has grown tremendously (2x growth YOY on average), reaching 1.8 trillion events per day in 2023. More of our engineering teams now use Kafka in their applications, and our use cases have broadened beyond serving as an analytical data pipeline: Kafka now functions as an asynchronous API, facilitates data replication across our on-premises data centers, and plays a crucial role in feeding and serving data to and from our Machine Learning pipelines.

With this scale and rapid growth in Kafka usage came particular challenges that had to be tackled. In this blog post, we outline our journey over the years and explain some of the key approaches we took for our ever-growing Kafka infrastructure.

1. Back to Basics: Simplifying how Developers send data to Kafka

In 2015, we decided to employ an architecture of logging data events to disk and separately forwarding the data into Kafka. The idea was simple: a client library (used by the development teams) would write events to disk, and a separate daemon process (called the Forwarder) would read the events from disk and "forward" them to Kafka.

The client library was responsible for

  1. handling file rotation.
  2. managing write file locations.

The Forwarder, on the other hand, was responsible for

  1. reading the files and retrieving the events + metadata (e.g., Kafka topic, the payload type, etc.)
  2. sending the events to Kafka
  3. tracking file offsets of what was sent
  4. managing the deletion of completed files.
Figure: 2-Step Logging Architecture
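
To make the split of responsibilities concrete, here is a minimal sketch of the two halves in Python. It is illustrative only: the class names, file format, and use of confluent-kafka are our assumptions for this post, not Agoda's internal implementation (which, for example, writes AVRO rather than JSON and handles rotation, locking, and error handling).

```python
import json
import time

from confluent_kafka import Producer  # assumed client; any Kafka producer works here


class DiskEventLogger:
    """Client-side half: applications only append events to a local log file."""

    def __init__(self, log_path: str):
        self.log_path = log_path

    def log(self, topic: str, payload: dict) -> None:
        # Each line carries the payload plus routing metadata for the Forwarder.
        record = {"topic": topic, "ts": time.time(), "payload": payload}
        with open(self.log_path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
        # (The real library also handles file rotation and write locations.)


class Forwarder:
    """Daemon half: tails the log file, sends events to Kafka, tracks its file offset."""

    def __init__(self, log_path: str, offset_path: str, bootstrap: str):
        self.log_path = log_path
        self.offset_path = offset_path
        self.producer = Producer({"bootstrap.servers": bootstrap})

    def _load_offset(self) -> int:
        try:
            with open(self.offset_path, encoding="utf-8") as f:
                return int(f.read().strip() or 0)
        except FileNotFoundError:
            return 0

    def run_once(self) -> None:
        with open(self.log_path, encoding="utf-8") as f:
            f.seek(self._load_offset())
            while True:
                line = f.readline()
                if not line:
                    break
                record = json.loads(line)
                self.producer.produce(record["topic"], json.dumps(record["payload"]))
                self.producer.poll(0)  # serve delivery callbacks
            new_offset = f.tell()
        self.producer.flush()  # only commit the file offset once Kafka has the events
        with open(self.offset_path, "w", encoding="utf-8") as f:
            f.write(str(new_offset))
```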

Logging to disk and then forwarding to Kafka (let's call it the 2-Step Logging architecture) offered several benefits, the first of which was separating operational concerns away from dev teams. The lightweight Forwarder would be deployed (as a daemon service) across every node in our infrastructure and directly managed by the Kafka team.

The Kafka team could perform operational tasks such as

  • Dynamically configure the Forwarder remotely to route events/topics to a different Kafka Cluster, or block them, without impacting applications or requiring changes from dev teams.
  • Configure optimizations, e.g., Kafka batching, compression, and linger settings, without dev teams' intervention.
  • Roll out upgrades to Forwarders independently of dev teams, e.g., roll out Kafka Client fixes without needing to chase teams to upgrade lib dependencies.
  • Upgrade Kafka without needing to chase the bulk of Kafka Producers (e.g., if Kafka client protocol changed) by upgrading the Forwarder.
Figure: Separation of operational concerns

Other benefits of the architecture include

  • Simplified API for producers who send events without requiring any Kafka knowledge or dependency (the client lib is reduced to a simple disk logger).
  • Enforcing serialization standards by embedding serialization logic into our client libs, e.g., our client libs can enforce AVRO serialization of ALL events by working in tandem with a Schema Registry (see the sketch after this list).
  • Writing to disks adds a layer of resiliency, for example, in case of connectivity or Kafka issues.
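
As an illustration of the serialization point above, the sketch below shows how a client library can refuse anything that is not valid AVRO by serializing against a Schema Registry before the event ever reaches the disk logger. It uses confluent-kafka's Schema Registry helpers; the schema, field names, and registry URL are made up for this example.

```python
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import MessageField, SerializationContext

# Illustrative schema; real schemas are owned by teams and live in the registry.
EXAMPLE_SCHEMA = """
{
  "type": "record",
  "name": "ExampleEvent",
  "fields": [
    {"name": "event_id", "type": "string"},
    {"name": "amount", "type": "double"}
  ]
}
"""


def build_serializer(registry_url: str) -> AvroSerializer:
    registry = SchemaRegistryClient({"url": registry_url})
    return AvroSerializer(registry, EXAMPLE_SCHEMA)


def serialize_event(serializer: AvroSerializer, topic: str, event: dict) -> bytes:
    # Raises if the event does not conform to the registered schema, so the
    # serialization standard is enforced before anything is written to disk.
    return serializer(event, SerializationContext(topic, MessageField.VALUE))
```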

Tradeoffs Taken

With our approach, the client library has a hard dependency on the Forwarder process to send data to Kafka. This means that the Kafka team has to ensure that every node in Agoda has a Forwarder process running healthily, and they are tasked with taking care of its deployment, upgrading, monitoring, and SLAs. Although this results in a more complex architecture, the operational complexity can be easily mitigated with proper monitoring and automated tooling.

At its core, the main tradeoff is increased latency in exchange for better resiliency and flexibility, which is why our 2-Step logging architecture is ONLY used for sending analytics data. Using our approach, we achieve a 99th-percentile latency of 10s, measured end to end from the application client writing to disk, through the Forwarder sending to Kafka, to the data being ready for consumption. We consider this 10s latency acceptable and sufficient for most analytics workloads. Depending on the topic/cluster use case, we can further tune the Forwarders to achieve a lower 99th-percentile latency of 1s (i.e., utilizing smaller batches at the cost of lower overall throughput).
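
The latency/throughput trade-off above ultimately comes down to a handful of producer settings. Below is a hedged illustration of the two Forwarder profiles using standard Kafka producer configs via confluent-kafka; the broker address and all values are placeholders, not Agoda's actual tuning.

```python
from confluent_kafka import Producer

# Throughput-oriented profile: bigger batches, longer linger, heavier compression.
# This is the kind of setting behind the default ~10s p99 analytics path.
throughput_producer = Producer({
    "bootstrap.servers": "kafka-analytics:9092",  # placeholder address
    "linger.ms": 500,            # wait longer so batches fill up
    "batch.size": 1_048_576,     # 1 MiB batches
    "compression.type": "zstd",
    "acks": "all",
})

# Latency-oriented profile: small batches and minimal linger, at the cost of
# overall throughput; closer to the ~1s p99 configuration mentioned above.
low_latency_producer = Producer({
    "bootstrap.servers": "kafka-analytics:9092",
    "linger.ms": 5,
    "batch.size": 16_384,
    "compression.type": "lz4",
    "acks": "all",
})
```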

What if we need lower latency than 1s?

But what about critical and time-sensitive use cases that require sub-second latency? Such applications still write to Kafka directly via the Kafka Client Libs, bypassing the 2-Step logging architecture.

We believe that this gives a nice balance of

  • The simplicity of the 2-Step logging architecture for the bulk (85%) of Agoda’s developers who only want to pass on some analytical data to Kafka and do not have sub-second latency requirements (delays of up to 10s are acceptable).
  • The flexibility to produce to Kafka directly for specific but less common use cases requiring sub-second latency.

2. Thinking about Kafka Cluster Layouts: Multiple Smaller Kafka Clusters instead of 1 Large Kafka Cluster per Data Center

Early in our Kafka journey, we also decided to split our Kafka Clusters based on use cases instead of having a single Kafka Cluster per data center. There were two main reasons for this:

  • To isolate any potential issues and their impact to a single Kafka Cluster
  • Flexibility to set up and configure clusters differently, e.g., clusters could have different hardware specs and configurations depending on their use case.
Figure: Smaller Kafka Clusters based on use cases are easier to manage.

This strategy, combined with the 2-Step logging architecture (explained in the previous section), enabled us to route Kafka events to a different Kafka Cluster transparently via the Forwarder, without any change from the producer application. Though changes would still be required from any existing consumer teams, such routing changes do not occur often. The per-topic nature of such routing configs also usually means that only a handful of teams are impacted by such changes.
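
To give a feel for what such routing looks like, here is a purely hypothetical sketch of how a Forwarder might resolve a topic's destination cluster from a centrally pushed config. The config shape, topic names, and addresses are invented for this post and are not the Forwarder's actual configuration format.

```python
# Hypothetical per-topic routing table the Kafka team could push to every Forwarder.
ROUTING_CONFIG = {
    "default_cluster": "analytics-cluster:9092",
    "overrides": {
        "ml.feature.events": "ml-cluster:9092",  # route ML traffic to its own cluster
        "noisy.debug.events": None,              # None = blocked, dropped at the Forwarder
    },
}


def resolve_destination(topic: str, config: dict):
    """Return the bootstrap servers for a topic, or None if the topic is blocked."""
    if topic in config["overrides"]:
        return config["overrides"][topic]
    return config["default_cluster"]
```

Because the Forwarder owns this lookup, producers keep writing to disk exactly as before while topics move between clusters.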

2a. Don’t cheap out on Zookeepers

For Zookeeper, we provisioned dedicated physical nodes (not co-located with our Kafka nodes), with dedicated SSD disks for the ZK logging and snapshot data directories, respectively. Given that Zookeeper is critical to healthy Kafka Cluster operation, this setup was done to isolate the Kafka and Zookeeper nodes and preemptively mitigate any potential issues between Kafka and Zookeeper that might arise down the line. This setup has worked out well thus far; in the future, the plan is to migrate to newer Kafka versions that remove the Zookeeper dependency entirely.

3. Building Confidence: Kafka Monitoring and Auditing

“Getting Kafka Broker stats wasn’t enough. We needed developers and stakeholders to trust that the data they produce/consume is complete, reliable, accurate, and timely.”

With the Kafka layout and primary data delivery mechanism dealt with, the next thing we turned our attention to was auditing and monitoring the Kafka Pipeline.

For monitoring, we gathered stats via JMXTrans, piped them into Graphite as a backend time-series data store, and visualized the metrics using Grafana. However, we also understood that just getting the Kafka Broker stats wasn't enough. We needed developers and stakeholders to trust that the data they produce/consume is complete, reliable, accurate, and timely. Inspired by KAFKA-260, we went about adding Kafka Auditing to our pipeline. A separate thread runs in the background in our client libs (from the 2-Step Logging Architecture), asynchronously aggregating message counts across time buckets to generate audits.

Figure: Getting Audit Counts
Figure: Generating Audits throughout the Pipeline
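
Conceptually, the audit thread just counts events per topic and time bucket, then periodically emits those counts as audit events of their own. A simplified sketch (the class, field names, and one-minute bucket size are illustrative, not the internal implementation):

```python
import time
from collections import defaultdict
from threading import Lock

BUCKET_SECONDS = 60  # e.g., one audit bucket per minute


class AuditAggregator:
    """Counts events per (topic, time bucket); a background thread calls flush()."""

    def __init__(self, host: str):
        self.host = host
        self.lock = Lock()
        self.counts = defaultdict(int)

    def record(self, topic: str, event_ts: float) -> None:
        bucket = int(event_ts // BUCKET_SECONDS) * BUCKET_SECONDS
        with self.lock:
            self.counts[(topic, bucket)] += 1

    def flush(self) -> list:
        """Emit audit records, to be sent on to the audit Kafka Cluster."""
        with self.lock:
            audits = [
                {"host": self.host, "topic": t, "bucket": b, "count": c}
                for (t, b), c in self.counts.items()
            ]
            self.counts.clear()
        return audits
```

Comparing counts for the same topic and bucket across stages (client, Forwarder, Kafka, data lake) is what lets us reason about completeness and end-to-end latency.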

These generated audits were then piped into a different Kafka Cluster spawned solely to store audit information across our many Kafka Clusters. We re-used the 2-Step Logging Architecture (explained in the first section) as the delivery mechanism by adding a different Kafka endpoint for audit events.

The audit info ultimately ends up in Whitefalcon (our internal near real-time analytics platform) and Hadoop, allowing the Kafka team to monitor and set up alerts on the pipeline's health using Grafana and SuperSet. Having the data in Whitefalcon also made it much easier for the Kafka team to drill down and pinpoint problems whenever there were issues.

Figure: Grafana Dashboards for High-Level Pipeline Monitoring based on Audits
Figure: SuperSet dashboard showing Kafka SLO based on Audits

At any point in time, we can be sure of how many events were sent across different pipeline stages, giving the Kafka team a high-level overview of the Accuracy and Latency of the Kafka Pipeline. Together with our Kafka Broker stats (Graphite/Grafana), the audits provide us with a comprehensive view of the overall health of our Kafka Pipeline/Clusters.

4. Keeping up with growth: Knowing when to Scale Kafka Clusters

Our next challenge was to figure out when ANY specific Kafka Cluster required expansion, i.e., adding more nodes. Naively, we started by using the percentage of available disk as the key metric to determine if a cluster required expansion. We quickly found out that this was a bad idea, since Kafka retention can be dynamically configured per topic, and disk used/capacity metrics did not accurately reflect Kafka Cluster capacity.

Many factors can contribute to the overall available capacity of a Kafka Cluster, such as CPU, network, disk, and total partition count. So, in our second iteration, we wanted a more holistic measure of Cluster Capacity that took multiple resource metrics into account.

Each resource metric we wanted to measure is compared against an upper limit, defined as the maximum level we are comfortable having the cluster run at. The ratio of the current metric to its upper limit is recorded as a percentage, and we take the maximum of all these percentages as the overall capacity of the cluster. This simplifies capacity down to a single number that signifies the dominant resource constraint on available capacity at that point in time (see the sketch after the metric list below).

Figure: Capacity is calculated based on the dominant cluster resource/limit at that time.

Key capacity metrics we monitor in our Kafka Clusters:

Per Cluster:
  • Disk Storage Used
  • Network Utilization
  • CPU Utilization
  • Total Partitions per cluster
  • Request Handler Average Idle Percentage
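
As mentioned above, the overall capacity number is simply the worst-case ratio across these resources. A minimal sketch of the calculation; the limit values are placeholders, not our production thresholds:

```python
# Upper limits are the levels we are comfortable running a cluster at,
# not the hardware's absolute maximum. All values below are placeholders.
UPPER_LIMITS = {
    "disk_used_fraction": 0.70,
    "network_util_fraction": 0.60,
    "cpu_util_fraction": 0.60,
    "total_partitions": 20_000,
    "request_handler_busy_fraction": 0.65,  # 1 - request handler average idle
}


def cluster_capacity(current: dict):
    """Return (overall capacity as a fraction, the dominant constraint)."""
    ratios = {name: current[name] / limit for name, limit in UPPER_LIMITS.items()}
    dominant = max(ratios, key=ratios.get)
    return ratios[dominant], dominant


# Example: a cluster at 50% of its disk limit and 55% of its partition limit,
# but 90% of its comfortable network limit, is reported as 90% full,
# constrained by network.
```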

By setting alert thresholds based on the derived overall capacity percentage per cluster, the team can look into the capacity constraints and decide whether there are issues that warrant further investigation or whether it is time to physically expand the cluster. After adopting this way of calculating cluster capacity, we also found the capacity numbers useful when scaling down clusters.

5. Ensuring Healthy Data Growth: Attributing cost back to teams

As data continued to grow over the years, we reached a point, a little over a year ago, where we started asking ourselves how much of the data being sent was actually needed. The perception was beginning to grow that the Data Platform (including the Kafka Pipeline and our Data Lake) had turned into a dumping ground, with much of the data being sent/stored not generating value.

To tackle this situation, we took a page out of the Cloud Providers' playbook: put a monetary value on usage and attribute it back to the users. The core concept was simple: have dev teams be responsible for what they use (in our case, Kafka usage) and manage and be accountable for the cost incurred.

For Kafka, generating a monetary value for teams' usage meant picking a suitable metric to measure and allocate cost against. For simplicity, we ended up picking bytes sent. We deliberately chose to ignore consumer usage, as we didn't want to penalize teams for consuming data; the whole point of democratizing data is to allow people to use it. Instead, the onus is on teams producing data to justify sending/storing it.

At Agoda, every Kafka topic has an owner team assigned as part of the topic creation process, making the entire cost attribution process much easier to implement. To determine a cost per byte, we took the total cost of our Kafka Clusters together with the derived Cluster Capacity numbers (from the Knowing When to Scale Kafka Clusters section above) to extrapolate the estimated 100% capacity and what that would entail in terms of total bytes per day. From there, we derive an estimated cost per byte for use in our cost attribution models (a worked example follows below).

Figure: Attributing Costs back to Teams based on a derived cost per byte.
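
To make the arithmetic concrete, here is a worked example with entirely made-up numbers; none of the dollar figures or volumes below are Agoda's actual costs.

```python
# Placeholder inputs for one cluster.
monthly_cluster_cost = 30_000.0      # total monthly cost of the cluster, in dollars
current_capacity_fraction = 0.60     # from the capacity model in the previous section
bytes_per_day_now = 200 * 10**12     # 200 TB/day currently flowing through the cluster

# Extrapolate what 100% capacity would carry, then derive a cost per byte.
bytes_per_day_at_full_capacity = bytes_per_day_now / current_capacity_fraction
cost_per_byte = monthly_cluster_cost / (bytes_per_day_at_full_capacity * 30)


def monthly_cost_for_team(team_bytes_per_day: float) -> float:
    """Attribute a share of the cluster cost to a team from the bytes it produces."""
    return team_bytes_per_day * 30 * cost_per_byte
```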

When rolled out, we found the ability to attribute a monetary value to data sent by teams to be extremely powerful in incentivizing teams to evaluate what they do and do not need. Overall, it transformed teams' mindsets about cost management and encouraged them to be proactive in managing their costs and to be good stewards of company resources. Eventually, this evolved into a more significant Agoda-wide effort to attribute the costs of any system back to the dev teams using it, but that is perhaps a blog topic for another time.

6. Rite of Passage: Authentication and ACLs

There comes a time for any evolving Kafka Cluster in production when Authentication and ACLs need to be added (assuming you started without them).

At Agoda, when we first started, our clusters were used solely for application telemetry data, and Authentication and ACLs on Kafka weren't required. However, as our Kafka usage boomed over time, we became concerned about our inability to identify and deal with users abusing and negatively impacting Kafka Cluster performance.

To clarify, our 2-Step logging architecture (see the first section above) provided us with some safeguards, since we could block events at the Forwarder if there were a sudden flood. Still, as Kafka usage grew, there were more and more use cases where teams needed to connect to Kafka directly (bypassing the Forwarder). Additionally, new requirements were emerging where we needed to restrict user access to certain topics.

Thankfully, implementing Authentication and ACLs in Kafka is well documented. Combining that documentation with our company-specific requirements, we completed and released our Kafka Authentication & Authorization system in 2021. On top of that, we added extra components for generating and assigning credentials, letting teams request Kafka credentials and ACLs in a self-service manner.

Figure: Additional components for Authentication & Authorization

With this Authentication and Authorization system,

  • The Kafka team can now manage users, set expiry times for credentials, and block or allow user access to specific topics. It also becomes simpler for the Kafka team to prevent sensitive data leaks by using ACLs to block access to secured topics (an illustrative sketch follows this list).
  • Users can ensure their topics are secured and can be read and written only by specific teams.
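
As an illustration of the kind of ACL the self-service flow ends up creating, here is a sketch using confluent-kafka's AdminClient. The principal, topic, and broker address are placeholders, and real usage would also supply the SASL/SSL credentials; our internal tooling wraps this differently.

```python
from confluent_kafka.admin import (
    AclBinding, AclOperation, AclPermissionType, AdminClient,
    ResourcePatternType, ResourceType,
)

admin = AdminClient({"bootstrap.servers": "kafka-secure:9093"})  # placeholder address

# Allow the owning team's credential to read and write its own topic, and nothing else.
bindings = [
    AclBinding(ResourceType.TOPIC, "team-a.bookings", ResourcePatternType.LITERAL,
               "User:team-a", "*", AclOperation.READ, AclPermissionType.ALLOW),
    AclBinding(ResourceType.TOPIC, "team-a.bookings", ResourcePatternType.LITERAL,
               "User:team-a", "*", AclOperation.WRITE, AclPermissionType.ALLOW),
]

for binding, future in admin.create_acls(bindings).items():
    future.result()  # raises if the ACL could not be created
```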

Next, we are testing certificate-based authentication and SSL encryption of data in transit to make our Kafka deployment more secure. This might reduce overall Kafka performance, but we think it is worth trying.

7. Not forgetting Operational Scalability: Automated Tooling

Lastly, we want to briefly touch on automated tooling. As we managed more and more Kafka Clusters, we knew we needed better tools to aid us and ensure that the operational processes remained scalable.

Currently, at Agoda, we use a mix of open-source Kafka tools like Cruise Control and Kafka UI, alongside in-house tools for binary deployments, broker config propagation, Kafka Broker rolling restarts, and UIs for managing ACLs, to name a few. We will never stop being on the lookout for new and better tools to add to our arsenal; it's part and parcel of managing such systems at scale.

Conclusion: So what’s the deal?

Apache Kafka is a highly scalable, robust piece of software that is, by and large, relatively easy to manage. We find that the main challenges at scale have less to do with the Kafka software per se and more with other aspects, like:

  • separating operational concerns away from dev teams.
  • thinking about your Kafka Clusters layout (e.g., how clusters are segregated and how data is sent).
  • knowing when to scale up/down a Kafka Cluster.
  • adding monitoring and auditing, as well as giving dev teams visibility into their usage and cost.
  • having safeguards to identify and isolate possible bad actors in the pipeline quickly.

What’s Next? Improving on the Forwarder

One criticism of our 2-Step Logging Architecture is that, at run time, the client cannot get an acknowledgment that Kafka has received the message (because all the client does is write to disk). As of this writing, we are revisiting the Forwarder architecture with plans to incorporate an endpoint that gives the client lib a response after Kafka has acknowledged the message.

In conclusion, we’ve learned a lot about expanding our Kafka Clusters over the years. Our Kafka Journey will continually evolve as we support Agoda’s ever-growing needs.

DISCLAIMER: The approaches listed in this blog post ARE tailored to Agoda’s specific requirements, and we are definitely NOT saying that is what you should do; like many other things in Big Data, do not follow blindly. Experiment and do what makes sense for your needs.

This blog was originally published in July 2021 and has since been updated to include new information and insights.
