Keeping Private DNS Simple

Mikena Wood
Oct 6, 2015

Migrating to Route 53 in under 100 lines of code

Why use private DNS?

DNS is one of those services where the less we have to think about it, the happier we are at Optimizely. It’s auxiliary to our application code, but our entire stack depends heavily on it. While it’s not something our customers think about every time they use our applications, even the smallest failure could propagate and cause outages, so it’s important to have a resilient system in place. On the Optimizely backend, we use DNS for a few reasons:

Design

The biggest requirement is to run a DNS setup that no one on our team has to worry about (at least not often). More or less, this means a reliable way to:

That’s not a long list. In fact, it’s really a small key/value store that speaks the DNS protocol and offloads everything else to a public recursor. However, there’s a lot going on underneath those lookups, and prior to Route 53 private DNS this meant either running our own DNS server inside of EC2 or sacrificing locality and calling out to a hosted service. We chose not to give up locality. While this worked and gave us a significant upgrade over using internal hostnames, raw IPs, or hand-maintained hosts files, there were a number of pitfalls in running our own solution.
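A minimal sketch of that "key/value store plus recursor" idea, in Python. The record names, addresses, and the `upstream` callable are hypothetical stand-ins chosen for illustration, not an actual zone or resolver; the only point is the lookup order, private records first and everything else handed off.

```python
from typing import Callable, Dict, List

# Hypothetical private records; names and addresses are illustrative only.
# Keys are assumed to be stored lowercase and without a trailing dot.
PRIVATE_ZONE: Dict[str, List[str]] = {
    "db.internal.example": ["10.0.1.15"],
    "cache.internal.example": ["10.0.2.20", "10.0.2.21"],
}


def resolve(name: str,
            upstream: Callable[[str], List[str]],
            zone: Dict[str, List[str]] = PRIVATE_ZONE) -> List[str]:
    """Answer from the private zone if the name is there, else ask upstream."""
    if name in zone:
        return zone[name]
    # Anything we do not own is offloaded to the (public) recursor.
    return upstream(name)
```

Here `upstream` could be any function that takes a name and returns addresses; in a test it can be a stub, which is what makes the fallthrough easy to check in isolation.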

For one, running a highly available DNS setup isn’t trivial. In our case, we ran a set of three PowerDNS servers and relied on client-side round-robin to find one that was up. Relying on the client resulted in outages that weren’t immediately apparent when launching a new service, since not all applications are developed with this kind of setup in mind.
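To show what "relying on the client" asks of every application, here is a rough sketch of the retry logic a client needs when several DNS servers are configured. It uses simple ordered failover rather than strict round-robin, and the `query` callable and `AllServersFailed` exception are hypothetical names introduced for the example.

```python
from typing import Callable, Iterable, List


class AllServersFailed(Exception):
    """Raised when no configured DNS server returned an answer."""


def query_with_failover(name: str,
                        servers: Iterable[str],
                        query: Callable[[str, str], List[str]]) -> List[str]:
    """Try each server in turn and return the first successful answer.

    `query(server, name)` is assumed to perform one lookup against one
    server and to raise OSError (which includes timeouts) on failure.
    """
    errors = []
    for server in servers:
        try:
            return query(server, name)
        except OSError as exc:
            # Remember the failure and move on to the next server.
            errors.append((server, exc))
    raise AllServersFailed(f"{name}: all servers failed: {errors}")
```

A client that skips this loop and only ever asks the first server works fine until that server goes down, which is the kind of outage that does not show up at launch.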

Also, the unknowns that come with not having a dedicated network engineer ate into a lot of our time. One example was tracking down a case-sensitive record table, when DNS is expected to be case-insensitive. The final nail in the coffin came when all recursed lookups started returning ‘servfail’, which is the DNS equivalent of “something went wrong” or “good luck”. Both of these cost many engineer hours to debug and fix, not to mention a multitude of pages that got in the way of engineers’ ability to get work done and, more importantly, keep a healthy sleep schedule.
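The case-sensitivity problem is easy to picture with a toy record table. The sketch below normalizes names on both write and read so that lookups behave case-insensitively; `RecordTable` and `normalize` are hypothetical names, and lowercasing the whole string is a simplification that assumes plain ASCII hostnames.

```python
from typing import Dict, List, Optional


def normalize(name: str) -> str:
    """Canonical form for comparison: lowercase, no trailing root dot."""
    return name.rstrip(".").lower()


class RecordTable:
    """Toy record table that stores and looks up names case-insensitively."""

    def __init__(self) -> None:
        self._records: Dict[str, List[str]] = {}

    def add(self, name: str, address: str) -> None:
        # Normalize on write so differently cased inserts share one key.
        self._records.setdefault(normalize(name), []).append(address)

    def lookup(self, name: str) -> Optional[List[str]]:
        # Normalize on read so the caller's casing does not matter.
        return self._records.get(normalize(name))
```

If only one side normalizes, a record added as `API.Internal.Example` is invisible to a client asking for `api.internal.example`, which matches the symptom described above.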

At this point, we started looking into other options. We could migrate to BIND, which is widely used, but the time and complexity of migrating seemed cumbersome. We took a look at Consul, which offers a lot more than DNS, including a key/value store, service discovery, and the ability to run Vault on top of it for secret management. In this case, Consul just did far more than we needed, and while it was fairly seamless to set up, we found a few bugs that made us lean towards a more mature service. We’re optimistic about what Consul could provide for us in the future, and the community seems to be moving in the right direction (with quick community bug fixes too!).

This left us with Route 53 to look at, and there were a lot of pros:

Outcome

Rosie loves a quiet pager

Since migrating to Route 53 three months ago, we’ve had a pretty seamless transition. Our pages have been cut down to zero and we’ve experienced no downtime due to our DNS. There are a few caveats that we’ve had to work around, however:

Key Takeaways:

Engineers @ Optimizely

Stories from Optimizely's Engineering Team
