Making our lives easier by rewriting a global DNS gateway in Go

Jan Skrabal
Published in Jamf Engineering
Aug 18, 2022 · 5 min read

At Jamf we maintain several DNS gateways, which our customers’ devices use for different use cases. The oldest of them, called DNS+, was written a few years back, when using Golang for our micro-services was still a dream of the future. All of our services were written in Java at the time, so the choice was obvious. Over the years, however, several issues started to surface, and we knew it was time to give Java the boot and switch to Go, now our second most popular backend language.

Motivation

There are several reasons why developers might choose Golang over Java for their services, such as performance, efficiency, or the much simpler (but not necessarily less powerful) concurrency model. Since a DNS gateway sits in our edge clusters and is used directly by end users’ mobile phones and laptops, the promise of better performance sounded nice. However, it was not our main goal.

We were having several stability issues with our old Java-based DNS+. The service consumed too many resources and took a very long time to start. If one of our Kubernetes pods went down, the remaining pods had to absorb its load. The higher load in turn made those instances crash, and they ended up in a downward spiral of restarts where the only remedy was manual intervention. Solving these problems would have been a non-trivial effort, and after considering all of our difficulties and plans for the service, a complete rewrite seemed like the best option.

Additionally, all of our other DNS gateways are already written in Golang and use the same DNS framework (CoreDNS). That means we can extract the code they share into a common library. Keeping the common code separate makes maintenance easier and keeps the code base of each gateway smaller.

So what is DNS+?

DNS+ is not a pure DNS server; rather, it resembles a DNS-over-HTTPS solution. The service accepts HTTP requests that contain the actual DNS request along with some additional information. The DNS request is then processed by a chain of handlers that implement business logic and provide logging, reporting, etc. At the end of the handler chain, the request is sent to multiple recursive DNS servers, which perform the actual resolution to IP addresses.
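The handler chain described above can be sketched in a few lines of Go. This is a minimal illustration using only the standard library, not the actual DNS+ code: the `Query` and `Answer` types, the `DeviceID` field, the blocklist rule, and the static upstream are all hypothetical stand-ins for the real request format, business logic, and recursive resolvers.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Query is a hypothetical, simplified stand-in for a DNS request plus the
// additional information that arrives with it.
type Query struct {
	Name     string // domain name to resolve, e.g. "example.com."
	DeviceID string // hypothetical metadata field
}

// Answer is a simplified response: an IP address or a "blocked" marker.
type Answer struct {
	IP      string
	Blocked bool
}

// Handler is one link in the chain.
type Handler interface {
	Handle(q Query) (Answer, error)
}

// HandlerFunc adapts a plain function to the Handler interface.
type HandlerFunc func(q Query) (Answer, error)

func (f HandlerFunc) Handle(q Query) (Answer, error) { return f(q) }

// Middleware wraps a handler with extra behaviour.
type Middleware func(next Handler) Handler

// Chain wraps final so that the first middleware runs first and final runs last.
func Chain(final Handler, mws ...Middleware) Handler {
	h := final
	for i := len(mws) - 1; i >= 0; i-- {
		h = mws[i](h)
	}
	return h
}

// Logging appends one entry per query to entries after the rest of the chain ran.
func Logging(entries *[]string) Middleware {
	return func(next Handler) Handler {
		return HandlerFunc(func(q Query) (Answer, error) {
			a, err := next.Handle(q)
			*entries = append(*entries,
				fmt.Sprintf("device=%s name=%s blocked=%t", q.DeviceID, q.Name, a.Blocked))
			return a, err
		})
	}
}

// Blocklist short-circuits the chain for names ending in a blocked suffix.
func Blocklist(suffixes ...string) Middleware {
	return func(next Handler) Handler {
		return HandlerFunc(func(q Query) (Answer, error) {
			for _, s := range suffixes {
				if strings.HasSuffix(q.Name, s) {
					return Answer{Blocked: true}, nil
				}
			}
			return next.Handle(q)
		})
	}
}

// StaticUpstream stands in for the recursive DNS servers at the end of the chain.
func StaticUpstream(records map[string]string) Handler {
	return HandlerFunc(func(q Query) (Answer, error) {
		ip, ok := records[q.Name]
		if !ok {
			return Answer{}, errors.New("no such name")
		}
		return Answer{IP: ip}, nil
	})
}

func main() {
	var entries []string
	h := Chain(
		StaticUpstream(map[string]string{"example.com.": "192.0.2.10"}),
		Logging(&entries),
		Blocklist("blocked.test."),
	)
	fmt.Println(h.Handle(Query{Name: "example.com.", DeviceID: "device-1"}))
	fmt.Println(h.Handle(Query{Name: "ads.blocked.test.", DeviceID: "device-1"}))
	fmt.Println(entries)
}
```

Because every step has the same shape, dropping an obsolete feature amounts to removing one middleware from the `Chain` call, which is the property that made the handler-based design easy to port.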

High-level overview of DNS+

Because most of the logic was separated into relatively small, self-contained chunks of code, the rewrite to Golang’s CoreDNS framework would be fairly simple, as the framework uses a similar approach (CoreDNS handlers are called plugins). It also allowed us to easily remove obsolete functionality we no longer wanted to support. The only issue we needed to resolve was that CoreDNS did not support handling HTTP requests in the format we wanted. CoreDNS does support DNS-over-HTTPS; however, the payload and URL are fixed and cannot be changed. Therefore, we needed to create an extra handler that starts an HTTP server and forwards the requests to the other handlers, which provide the actual functionality.
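To make the idea of an HTTP entry point in front of the handler chain more concrete, here is a rough sketch using only Go's standard library. It does not use the CoreDNS plugin API, and the JSON payload, the `/resolve` path, and the `Resolver` function type are assumptions made up for illustration; the article does not describe the real DNS+ wire format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// request is a hypothetical payload: a DNS name plus extra metadata.
type request struct {
	Name     string `json:"name"`
	DeviceID string `json:"deviceId"`
}

// response is a hypothetical reply carrying the resolved address.
type response struct {
	IP string `json:"ip"`
}

// Resolver stands in for the rest of the handler chain.
type Resolver func(name, deviceID string) (string, error)

// NewGateway returns an http.Handler that decodes the custom payload and
// forwards the embedded query to the next handler.
func NewGateway(next Resolver) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		var req request
		if err := json.NewDecoder(r.Body).Decode(&req); err != nil || req.Name == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		ip, err := next(req.Name, req.DeviceID)
		if err != nil {
			http.Error(w, "resolution failed", http.StatusBadGateway)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(response{IP: ip})
	})
}

// post sends one in-memory POST to h and returns the status code and body.
func post(h http.Handler, body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/resolve", strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code, strings.TrimSpace(rec.Body.String())
}

func main() {
	gw := NewGateway(func(name, deviceID string) (string, error) {
		if name == "example.com." {
			return "192.0.2.10", nil
		}
		return "", fmt.Errorf("no record for %s", name)
	})
	fmt.Println(post(gw, `{"name":"example.com.","deviceId":"device-1"}`))
	fmt.Println(post(gw, `{"name":"unknown.test.","deviceId":"device-1"}`))
	fmt.Println(post(gw, `not json`))
}
```

In a real service the handler would be passed to `http.ListenAndServe` (or an `http.Server`) rather than exercised in memory; the in-memory `post` helper is only there to keep the sketch self-contained.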

Performance testing

Now we get to the good stuff. To see how the two implementations differ in practice, we prepared a simple performance test scenario in the Gatling framework and ran it against both versions of the service in our development environment. Below you can see a comparison of the response times (in milliseconds) of the two services, each handling 250 requests per second.
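The actual test was a Gatling scenario, which is not reproduced here. Purely as an illustration of what such a test measures, the Go sketch below paces requests at a fixed rate (250 requests per second means one request every 4 ms) and summarises the recorded latencies with nearest-rank percentiles. The local test server, the request count, and the choice of percentiles are assumptions, and the loop sends requests sequentially, so it cannot sustain the target rate if responses take longer than the interval.

```go
package main

import (
	"fmt"
	"math"
	"net/http"
	"net/http/httptest"
	"sort"
	"time"
)

// interval returns the pause between requests for a constant request rate.
func interval(rps int) time.Duration {
	return time.Second / time.Duration(rps)
}

// percentile returns the p-th percentile (0 < p <= 100) of the samples using
// the nearest-rank method. The input slice is not modified.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p / 100 * float64(len(sorted))))
	if rank < 1 {
		rank = 1
	}
	if rank > len(sorted) {
		rank = len(sorted)
	}
	return sorted[rank-1]
}

// run sends n GET requests to url, one per tick, and returns the latency of
// each request in milliseconds.
func run(url string, rps, n int) ([]float64, error) {
	ticker := time.NewTicker(interval(rps))
	defer ticker.Stop()
	latencies := make([]float64, 0, n)
	for i := 0; i < n; i++ {
		<-ticker.C
		start := time.Now()
		resp, err := http.Get(url)
		if err != nil {
			return latencies, err
		}
		resp.Body.Close()
		latencies = append(latencies, float64(time.Since(start))/float64(time.Millisecond))
	}
	return latencies, nil
}

func main() {
	// A local stand-in for the service under test.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	}))
	defer srv.Close()

	latencies, err := run(srv.URL, 250, 20)
	if err != nil {
		fmt.Println("load test failed:", err)
		return
	}
	fmt.Printf("p50=%.2fms p95=%.2fms max=%.2fms\n",
		percentile(latencies, 50), percentile(latencies, 95), percentile(latencies, 100))
}
```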

Comparison of response times (in milliseconds)

Okay, that looks very nice, but how does the usage look in the real environment? First, we needed to roll out the new service to the production clusters and do it with minimal impact on the end users.

Pack it and ship it

To make the rollout as safe as possible, we first deployed the new service in a smaller data center, where we could monitor and test the changes.

The initial step was to have both services running side by side in the production environment, with all of the traffic still going to the Java service.

After that, we instructed our API Gateway to route the traffic to the Go implementation instead of the Java one.

The last step was to scale down the old Java implementation and delete all of its Kubernetes objects.
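The three steps above can be mimicked with a toy router in Go. This is only a sketch of the idea (both backends live, traffic flipped in one step, old backend removed afterwards); it is not how the API Gateway or the Kubernetes deployment was actually configured, and the `Router` type and the in-process test backends are hypothetical.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
	"sync/atomic"
)

// Router forwards every request to whichever backend is currently active.
type Router struct {
	active atomic.Value // always holds a *url.URL
	proxy  *httputil.ReverseProxy
}

// NewRouter creates a router that initially sends all traffic to initial.
func NewRouter(initial *url.URL) *Router {
	r := &Router{}
	r.active.Store(initial)
	r.proxy = &httputil.ReverseProxy{
		// Director rewrites each outgoing request to point at the active backend.
		Director: func(req *http.Request) {
			target := r.active.Load().(*url.URL)
			req.URL.Scheme = target.Scheme
			req.URL.Host = target.Host
			req.Host = target.Host
		},
	}
	return r
}

// SwitchTo atomically redirects all subsequent requests to target.
func (r *Router) SwitchTo(target *url.URL) { r.active.Store(target) }

func (r *Router) ServeHTTP(w http.ResponseWriter, req *http.Request) {
	r.proxy.ServeHTTP(w, req)
}

// newBackend starts a local server that answers with its own name and returns
// its URL together with a function that shuts it down.
func newBackend(name string) (*url.URL, func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		io.WriteString(w, name)
	}))
	u, err := url.Parse(srv.URL)
	if err != nil {
		panic(err)
	}
	return u, srv.Close
}

// fetch sends one in-memory request through h and returns the response body.
func fetch(h http.Handler) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/", nil))
	return rec.Body.String()
}

func main() {
	// Step 1: both implementations are running; traffic goes to the old one.
	javaURL, stopJava := newBackend("java")
	goURL, stopGo := newBackend("go")
	defer stopGo()
	router := NewRouter(javaURL)
	fmt.Println("before switch:", fetch(router))

	// Step 2: route traffic to the new implementation.
	router.SwitchTo(goURL)
	fmt.Println("after switch:", fetch(router))

	// Step 3: remove the old implementation; traffic is unaffected.
	stopJava()
	fmt.Println("after removal:", fetch(router))
}
```

Keeping the old backend running until after the switch is what makes the change cheap to revert: calling `SwitchTo` with the old URL restores the previous routing.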

Rollout of the new DNS+ implementation to one of the data centers

After all of this, we had the new service up and running, and we could monitor it for issues and fix any outstanding bugs. When we were reasonably sure that there were no other problems, we repeated the process for the remaining data centers.

After the rollout, we had actual metrics collected for both the old and the new service, which provide a much better comparison than a performance test can. In the charts below, the yellow line represents the new Golang service, compared to the Java service data from the week before.

CPU usage
Memory usage
Average request latency (in milliseconds)

As you can see, memory consumption dropped dramatically, to about 12% of what the Java service used. However, the CPU usage, which we thought would drop as well, shows only a negligible difference. Likewise, the average request latency is very similar and does not show the noticeable difference we saw in the performance test results.

Conclusion

Our main goal for the rewrite was to make DNS+ stable and prepared for future updates in functionality. We got rid of the problem of crash loops when one of the pods was down, cleaned up obsolete functionality, and facilitated code reuse with our common library. As a cherry on top, we lowered the memory usage dramatically, even though other metrics have not changed.

As usually happens, we did not fully realise at the time what we had achieved. A few months later, we had to scale our gateway to handle 10 times more traffic. That would not have been possible with our Java gateway at all. The decision to switch from Java to Go paid off big time.

In memoriam

The original author of this article was Andrej Halaj, a beloved colleague and friend who is no longer with us.

Andrej Halaj (1993 – 2022)
