Our path toward resilient DNS infrastructure

Ondřej Benkovský
Published in Jamf Engineering
4 min read · Jan 12, 2023

At Jamf, we maintain multiple global DNS gateways that handle the DNS traffic of millions of devices using Jamf Security products. Among other things, these gateways apply customer-defined policies to the DNS requests of designated devices. To minimize DNS response latency, the Jamf Security DNS gateways are distributed across many cloud data centers around the world. In peak hours, we handle hundreds of thousands of DNS requests per minute across all our data centers.

Acting as a middleman for so many devices’ DNS resolutions is a huge task and a big responsibility. For the customer, the impact of a DNS request handling error is direct and unpleasant: without successful DNS resolution, the device cannot access the desired resource (be it a website, a file share server, etc.). This is why keeping our DNS highly available and resilient has been our main focus. To achieve this, we had to introduce multiple failover mechanisms to overcome potential issues during DNS resolution.

Problems with public DNS resolvers

One of the early issues we had to fix was the approach to DNS resolution on our DNS gateways. In the beginning, our DNS gateways were designed as DNS forwarding servers: they forwarded every DNS request to one of the configured public DNS resolvers, such as Google Public DNS (8.8.8.8) and Cloudflare (1.1.1.1). Each DNS gateway had multiple public DNS resolvers configured, and for each DNS request it processed, the resolver was chosen randomly. We found that the performance of these public DNS resolvers was consistent neither across regions nor over time. We observed increased latencies in certain regions. We even saw a public DNS resolver be very fast one week and become one of the slowest the next. Sometimes these public DNS resolvers actively rate-limited us.

DNS resolution flow before the redesign

These issues forced us to regularly reconfigure our DNS resolvers based on the current performance of public DNS resolvers in each location. Because this was a time-consuming process, we had to come up with a more scalable approach to DNS resolution itself.

Example of worsening DNS request latencies on Jamf Security DNS gateways due to a bad public DNS resolver

Designing resilient DNS resolution infrastructure

We wanted a solution for DNS resolution that would not depend so heavily on third-party public DNS resolvers like Google and Cloudflare. After considering several ideas, performing full recursive DNS resolution ourselves instead of relying on third-party resolvers seemed to address our issues perfectly.

Jamf Security DNS gateways are built on top of CoreDNS, a highly customizable DNS server written in Go. CoreDNS can forward DNS queries, but it has no built-in recursive DNS resolution support apart from the unbound plugin, which, after some testing, we did not consider production-ready.

All of our Jamf Security DNS gateways run as microservices in Kubernetes clusters around the world. We wondered whether we could create a new component, a microservice in our cluster responsible for performing recursive DNS resolution on behalf of our DNS gateways. To select the right software for this DNS resolver component, we started with initial research and some proofs of concept (PoCs). We considered BIND 9, PowerDNS, and Knot Resolver. Based on the results of our investigation, we chose Knot Resolver, a high-performance caching full DNS resolver developed by CZ.NIC. It is highly configurable, quite lightweight, and well documented.

DNS resolution flow after the redesign

After the redesign, our DNS gateways forward DNS resolutions to one of our Knot Resolver instances running in the same cluster. The Knot Resolver instance performs the actual recursive DNS resolution and returns the response to the DNS gateway. If the response from Knot Resolver is not received in time, we fall back to a public DNS resolver (such as Google Public DNS). Knot Resolver also runs as a standard microservice in our Kubernetes cluster, with all of the Kubernetes perks like rolling updates, configuration management, easy scaling, monitoring, and service discovery mechanisms.

Did the redesign help?

With our DNS resolution infrastructure redesign completed, our Jamf Security DNS gateways are no longer tightly coupled to third-party DNS resolvers. The public DNS resolvers are now used only as a backup, and our new local DNS resolver has proven to be a very performant and stable piece of our DNS infrastructure. As a bonus, we save a lot of time that was previously spent evaluating and reconfiguring the public DNS resolvers we used. And what about the connection and timeout errors propagated from public DNS resolvers? The failure rate has dropped to less than 0.001% of the traffic, which is a big improvement compared to the previous solution.

Number of DNS timeout/connection errors on DNS gateways — one week
