Optimizing a DNS request logging pipeline with Apache Kafka
At Jamf, we maintain several global DNS gateways tailored to various customers’ use cases. Our DNS gateways are distributed across cloud data centers around the globe to give customers the best experience and avoid adding unnecessary latency to DNS requests from customers’ devices.
One of the features these global DNS gateways provide is logging request metadata to a centralized Data Warehouse (this is used to drive threat discovery workflows and security reports, for instance for customers who have deployed Jamf Threat Defense to corporate-liable end-user devices). Most of our analytics and reports run on top of a Vertica database hosted in a separate cluster. As you might imagine, getting the data from the globally distributed DNS gateways to the Vertica cluster in a timely manner is quite a challenge. This was indeed a part of the infrastructure that deserved some love.
Request logging pipeline before the redesign
Our DNS gateways are containerized Go applications running in Kubernetes clusters. Every DNS gateway instance ran as a Kubernetes Pod, which also contained a FluentD sidecar container.
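As a rough sketch, the old sidecar layout looked something like this (container names, image tags, and ports are illustrative, not our actual manifests):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dns-gateway              # hypothetical name
spec:
  containers:
    - name: dns-gateway          # main Go DNS gateway container
      image: example/dns-gateway:latest
      ports:
        - containerPort: 53
          protocol: UDP
    - name: fluentd              # FluentD sidecar receiving request logs
      image: fluent/fluentd:latest
      ports:
        - containerPort: 24224   # default FluentD forward input port
```

The sidecar receives log events over the local FluentD forward protocol, so every Pod pays the cost of running its own FluentD instance.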
After processing a DNS request, the DNS gateway container asynchronously pushed request data, encoded as JSON, to the FluentD container. The FluentD container then flushed its buffer into an AWS S3 bucket using the FluentD S3 output plugin. The AWS S3 bucket was continuously read by our modified version of AWS Lambda Vertica Loader, which was responsible for taking files containing request data and copying them into our Vertica database. It was also responsible for keeping track of the data that had already been loaded into Vertica. As we started handling more and more DNS requests, the Vertica loader component quickly became a bottleneck. As part of the redesign, we wanted to get rid of this component completely and replace it with something less error-prone and with higher throughput.
For sending data into FluentD, we used the Go client for FluentD. One of the interesting things we noticed while profiling our DNS gateways is that communication with FluentD was a significant CPU hotspot, caused primarily by JSON marshalling, as can be seen in the following CPU profile screenshots. This was yet another reason to redesign this pipeline completely and get rid of this issue once and for all.
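To illustrate why JSON marshalling shows up so prominently in a CPU profile, here is a minimal, self-contained sketch (the record fields are hypothetical, not our actual schema) comparing `encoding/json` with a compact hand-rolled binary layout of the same record, similar in spirit to what an Avro binary encoding achieves:

```go
package main

import (
	"bytes"
	"encoding/binary"
	"encoding/json"
	"fmt"
)

// RequestLog is a hypothetical DNS request record, not our real schema.
type RequestLog struct {
	Timestamp int64
	ClientID  uint32
	Domain    string
}

// encodeJSON marshals the record as JSON, as the old pipeline did.
// Marshalling relies on reflection and emits field names as text,
// which costs both CPU and bytes.
func encodeJSON(r RequestLog) ([]byte, error) {
	return json.Marshal(r)
}

// encodeBinary writes the same fields in a compact binary layout:
// fixed-width integers followed by a length-prefixed string.
func encodeBinary(r RequestLog) []byte {
	var buf bytes.Buffer
	binary.Write(&buf, binary.BigEndian, r.Timestamp)
	binary.Write(&buf, binary.BigEndian, r.ClientID)
	binary.Write(&buf, binary.BigEndian, uint16(len(r.Domain)))
	buf.WriteString(r.Domain)
	return buf.Bytes()
}

func main() {
	r := RequestLog{Timestamp: 1700000000, ClientID: 42, Domain: "example.com"}
	j, _ := encodeJSON(r)
	b := encodeBinary(r)
	fmt.Printf("json: %d bytes, binary: %d bytes\n", len(j), len(b))
}
```

The binary form skips field names and text formatting entirely, which is the same class of saving we got by switching the gateways to a binary wire format.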
Request logging pipeline after the redesign
After a thorough investigation of possible approaches, we decided to build our new request logging pipeline on top of three key components: Apache Kafka, Kafka Connect, and the Vertica Microbatch Scheduler. With this new approach, the DNS gateway encodes request data in Avro format and sends it asynchronously to a dedicated Kafka topic. The Kafka topic is read by the Vertica Microbatch Scheduler and a Kafka Connect instance. The Kafka Connect instance backs up request data into an AWS S3 bucket in a more efficient format, Apache Parquet. The Vertica Microbatch Scheduler reads the data from the Kafka topic and writes it into Vertica.
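For reference, the S3 backup leg can be configured roughly like this using the Confluent S3 sink connector's Parquet format class (topic, bucket, registry URL, and sizing values here are illustrative, not our production settings):

```json
{
  "name": "dns-request-s3-backup",
  "config": {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "topics": "dns-requests",
    "s3.bucket.name": "dns-request-backup",
    "s3.region": "us-east-1",
    "storage.class": "io.confluent.connect.s3.storage.S3Storage",
    "format.class": "io.confluent.connect.s3.format.parquet.ParquetFormat",
    "parquet.codec": "snappy",
    "flush.size": "10000",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081"
  }
}
```

Because the records arrive in Avro with a registered schema, the connector can translate them into columnar, compressed Parquet files without any custom code in the pipeline.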
The redesign is done — what have we achieved?
Using this approach, we managed to get rid of bottlenecks on multiple layers:
- We got rid of the FluentD sidecar container; DNS gateways now send DNS request data directly to Kafka using an efficient binary format instead of performing costly JSON serialization
- The new solution is more extensible. Request data now flows through a messaging system that allows additional consumers to be added for other use cases
- The Vertica Microbatch Scheduler can process request data in a stream processing fashion, so data gets into Vertica with much higher throughput
- In case of a fatal loss of data in the Vertica cluster, data is now backed up in AWS S3 in an efficient binary format
- By removing FluentD sidecar containers from all of our DNS gateways, we saved resources in our data centers across the world. These FluentD sidecars were often far more resource-demanding than the main DNS gateway containers. The images below show how much CPU and memory we saved for one type of DNS gateway thanks to this redesign: CPU usage of the DNS gateway Pods dropped by 50% and memory usage dropped by ~80%
Thanks to a careful analysis of our previous request logging pipeline using profiling tools and monitoring systems, we were able to accurately pinpoint the bottlenecks of our previous infrastructure. During the design phase of the new solution, we had a clear picture of what we wanted to optimize and why. As a result, our DNS gateways can now scale horizontally better than ever, and we have a reliable request logging pipeline.