Keeping Private DNS Simple

Migrating to Route 53 in under 100 lines of code


Why use private DNS?

DNS is one of those services where the less we have to think about it, the happier we are at Optimizely. It’s auxiliary to our application code, but it imposes a strong dependency across our entire stack. It’s not something our customers think about every time they use our applications, but the smallest failure could propagate and cause outages, so it’s important to build a resilient system. On the Optimizely backend, we use DNS for a few reasons:

  • Lightweight service discovery: We’re not at a scale that justifies a full-blown service discovery layer, and in the theme of “simplicity,” one would add unnecessary complexity. CNAMEs, however, have worked out great for finding a fairly permanent service, like a metrics endpoint.
  • Human readable names (also comic relief): Descriptive hostnames provide better identification than AWS’s provided names. We started using a custom naming scheme and have iterated on it over the years (ex. “monitoring-us-east-1b-117807-venemouspolarbear”). Now I can actually communicate an instance to someone sitting next to me. In a world of hashes, UIDs, and other chaos, this has been an invaluable change.
  • Parallel execution by like hostnames (here be dragons): Human readable names paired with Salt provide us with the ability to execute a command on a set of hosts sharing a similar name (ex. remove a file from all hosts in cluster1; see the one-liner after this list).
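
As a hedged illustration (the hostname pattern and file path here are made up), that boils down to a one-liner using Salt’s glob targeting:

    # Run a command on every minion whose hostname starts with "cluster1"
    salt 'cluster1*' cmd.run 'rm -f /tmp/stale-deploy.lock'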

Design

The biggest requirement is to run a DNS service that no one on our team has to worry about (at least not often). More or less, this means a reliable way to:

  • Update and create A, CNAME, and PTR records
  • Query private DNS records
  • Recurse to public DNS for records outside our private domain
  • Propagate records quickly to infrastructure living in our AWS VPC

That’s not a long list. In fact, what we need is really a small key/value store speaking the DNS protocol that offloads everything else to a public recursor. However, there’s a lot going on underneath those lookups, and prior to Route 53 private DNS, satisfying it meant running our own DNS server inside of EC2 or sacrificing locality and calling out to a hosted service. We chose not to give up locality, and while running our own server worked and was a significant upgrade over using internal hostnames, raw IPs, or hosts files, it came with a number of pitfalls.

For one, running a highly available DNS setup isn’t trivial. In our case, we ran a set of three PowerDNS servers and relied on client-side round-robin to find one that was up. Relying on the client resulted in outages that weren’t immediately apparent when launching a new service, since not every application is written with that failure mode in mind.

The unknowns that come with not having a dedicated network engineer also ate into a lot of our time: tracing a case-sensitive record table when DNS is expected to be case-insensitive, and, the final nail in the coffin, a stretch where every recursed lookup returned SERVFAIL, the DNS equivalent of “something went wrong” or “good luck.” Both cost many engineer hours to debug and fix, not to mention a multitude of pages that got in the way of engineers’ ability to get work done and, more importantly, to keep a healthy sleep schedule.

At this point, we started looking into other options. We could migrate to BIND, which is widely used, but the time cost of migrating seemed high. We took a look at Consul, which offers a lot more than DNS, including a key/value store, service discovery, and the ability to run Vault on top of it for secret management. In our case, Consul simply did far more than we needed, and while it was fairly seamless to set up, we found a few bugs that made us lean towards a more mature service. We’re optimistic about what Consul could provide for us in the future, and the community seems to be moving in the right direction (quick community bug fixes too!).

This left us with Route 53 to look at, and there were a lot of pros:

  • Route 53 is priced per request, and an internal network doesn’t come close to the query volume of a public-facing record set.
  • Their Private DNS allowed us to seamlessly change our nameserver to an internal address, and Amazon handles the rest for us.
  • The biggest positive of Route 53 is that we don’t need to run any DNS infrastructure ourselves. We already manage a lot of infrastructure with Chef and Ruby, and modifying the public cookbook gives us an easy way to update records.
  • Migration was simple. We were able to use the AWS Ruby SDK, pull our EC2 tags, and create all the records in a reverse and a forward zone. Then we just had to flip the DHCP option set on our VPC and voila. Here’s the code, but I promise you it’s ugly and we only used it once: https://gist.github.com/mi-wood/0c53a13e4b7a2c78e2a9 (a rough sketch of the idea follows below).
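
The gist above is the real one-shot script; for flavor, here’s a minimal sketch of the approach, assuming the aws-sdk v2 Ruby gem. The zone IDs, domain, region, and TTL below are placeholders, and pagination is omitted for brevity:

    require 'aws-sdk'  # aws-sdk v2 Ruby gem

    # Placeholder values for illustration
    FORWARD_ZONE = 'Z_FORWARD_ZONE_ID'  # private hosted zone (A records)
    REVERSE_ZONE = 'Z_REVERSE_ZONE_ID'  # in-addr.arpa zone (PTR records)
    DOMAIN       = 'example.internal'

    ec2 = Aws::EC2::Client.new(region: 'us-east-1')
    r53 = Aws::Route53::Client.new(region: 'us-east-1')

    ec2.describe_instances.reservations.flat_map(&:instances).each do |instance|
      name_tag = instance.tags.find { |t| t.key == 'Name' }
      next unless name_tag && instance.private_ip_address
      fqdn = "#{name_tag.value}.#{DOMAIN}"

      # Forward A record: hostname -> private IP
      r53.change_resource_record_sets(
        hosted_zone_id: FORWARD_ZONE,
        change_batch: { changes: [{
          action: 'UPSERT',
          resource_record_set: {
            name: fqdn, type: 'A', ttl: 300,
            resource_records: [{ value: instance.private_ip_address }]
          }
        }] })

      # Reverse PTR record: 10.1.2.3 becomes 3.2.1.10.in-addr.arpa
      ptr_name = instance.private_ip_address.split('.').reverse.join('.') +
                 '.in-addr.arpa'
      r53.change_resource_record_sets(
        hosted_zone_id: REVERSE_ZONE,
        change_batch: { changes: [{
          action: 'UPSERT',
          resource_record_set: {
            name: ptr_name, type: 'PTR', ttl: 300,
            resource_records: [{ value: fqdn }]
          }
        }] })
    end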

Outcome

Rosie loves a quiet pager

Since migrating to Route 53 three months ago, the transition has been pretty seamless. Our pages have been cut down to zero, and we’ve experienced no downtime due to our DNS. There are a few caveats that we’ve had to work around, however:

  • No Security Groups: Route 53 doesn’t provide security groups for the IP it gives you. This means it is only accessible from your VPC IP space, and a VPN connection won’t be able to query it. We worked around this by bringing up a BIND server in the VPC that forwards all queries to the Route 53 IP. This server has a security group restricted to our VPN DNS server only (see the config sketch after this list).
  • Resource Contention: Since hosts register their own DNS entries, bringing up multiple hosts at once can result in an error updating record sets if a prior request hasn’t finished. A workaround is to add a retry mechanism that sleeps a little while until the other requests resolve (see the retry sketch below).
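
The forwarder from the first caveat is only a few lines of BIND configuration. A minimal sketch, with made-up addresses:

    // named.conf sketch -- all addresses are placeholders
    options {
        directory "/var/named";

        // Forward everything to the VPC-provided resolver,
        // which answers for the Route 53 private zone
        forward only;
        forwarders { 10.0.0.2; };

        // Only our VPN's DNS server should be asking; the instance's
        // security group enforces the same restriction at the network layer
        allow-query { 10.250.0.5; };
    };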
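
For the second caveat, the retry is equally small. A sketch in Ruby, assuming the aws-sdk client and rescuing Route 53’s PriorRequestNotComplete error (the helper name and backoff numbers are ours, not from the migration script):

    require 'aws-sdk'

    # Retry a record change until Route 53 has finished the prior request.
    def upsert_with_retry(r53, params, attempts: 5)
      r53.change_resource_record_sets(params)
    rescue Aws::Route53::Errors::PriorRequestNotComplete
      raise if (attempts -= 1) <= 0
      sleep(rand(1..5))  # jittered sleep so hosts booting together don't retry in lockstep
      retry
    end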

Key Takeaways:

  • Simplicity has continued to be a huge winner in our ops engineering
  • This blog post took longer to write than the actual migration
  • RIP venomous polar bear