Lessons From Migrating DNS Systems
On the Network and Cloud Infrastructure team at Quantcast, we use Amazon Web Services (AWS) as part of our cloud-majority hybrid infrastructure. Our platform runs mainly on Amazon VPCs (Virtual Private Clouds), which process and route more than 20 petabytes of data per day. Until recently, these VPCs used our old, in-house DNS system: Active Directory backed by Linux DNS resolvers. This article is about our move to Amazon’s DNS service, Route 53, and the surprising gotchas we hit along the way.
It seemed like an easy transition: we just had to get the zones in Route 53 to look exactly like the zones in Active Directory. In a DNS system, a zone is a container that holds information, called resource records, about how you want to route traffic for a domain and its subdomains within the network.
We wrote a script to do a zone transfer (the standard way to share zone information in DNS) from Active Directory to a service we spun up for this purpose, since Route 53 doesn’t like talking directly to AD. This intermediary service then loaded the records into Route 53, filling in the zones. Once that was done, the transition itself was literally a one-line change in a config file for all our VPCs, pointing them at the Route 53 Resolver instead of our Linux resolvers. We didn’t even expect anyone at Quantcast to notice the change (besides maybe appreciating a performance bump).
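Our actual script isn’t shown here, but as a rough sketch of the idea, here’s what the intermediary step can look like using the `dnspython` and `boto3` libraries. The server address, zone name, and hosted-zone ID are placeholders, and a real script would handle more record types, batching, and error cases.

```python
# Sketch only: pull a zone from a legacy DNS server via AXFR and push the
# records into a Route 53 hosted zone. The IP, zone name, and hosted-zone ID
# below are placeholders, not our real values.
import boto3
import dns.query
import dns.rdatatype
import dns.zone

LEGACY_DNS_IP = "10.0.0.2"             # placeholder: AD DNS / intermediary server
ZONE_NAME = "quantcast.internal."      # placeholder zone
HOSTED_ZONE_ID = "Z0000000000EXAMPLE"  # placeholder Route 53 hosted zone

def transfer_zone_to_route53():
    zone = dns.zone.from_xfr(dns.query.xfr(LEGACY_DNS_IP, ZONE_NAME))
    route53 = boto3.client("route53")
    changes = []
    for name, node in zone.nodes.items():
        for rdataset in node.rdatasets:
            rtype = dns.rdatatype.to_text(rdataset.rdtype)
            # Route 53 manages SOA and NS records for the hosted zone, so skip them here.
            if rtype in ("SOA", "NS"):
                continue
            fqdn = f"{name.to_text()}.{ZONE_NAME}".lstrip("@.")
            changes.append({
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": fqdn,
                    "Type": rtype,
                    "TTL": rdataset.ttl,
                    "ResourceRecords": [{"Value": rdata.to_text()} for rdata in rdataset],
                },
            })
    # Route 53 caps the size of a change batch; a real script would chunk the changes.
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={"Comment": "Mirror legacy zone", "Changes": changes},
    )

if __name__ == "__main__":
    transfer_zone_to_route53()
```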
What Could Go Wrong?
After the transition, to our dismay, we saw some automated jobs failing. When we investigated, it looked like the network wasn’t able to locate an internal resource: something was missing from the zones in Route 53. Our tests hadn’t caught this problem, so we decided to roll back the change. The rollback was, again, a simple config change to point the network back at our old DNS system.
As the rollback went into effect in the evening, we were shocked to see a huge slowdown across our platform. Some of our servers were running up a rapidly growing backlog, and as pages started going down, the whole platform team jumped on Slack. After a late night of debugging, when it seemed like we had tried everything, we decided to try pointing one of the affected servers back at Route 53 DNS — and its performance skyrocketed. We got our system back online by pointing all the affected servers back at Route 53. It took us a while to untangle what had happened (read on for our theory) and to finish up the transition.
Lessons Learned
1) Know the limits of both your DNS systems before trying to replicate between them
Eventually, we discovered that a single record had caused those jobs to fail, kicking off the rollback and the outage. The culprit was the sole record in an AD DNS zone that hadn’t been pulled into Route 53: Route 53 doesn’t allow a private subzone such as `test.quantcast.internal` if `quantcast.internal` already exists as a Route 53 zone. AD DNS, on the other hand, supports this. Had we done an initial accounting of each system’s restrictions, we might have realized where the rules differed. (Better testing to confirm that Route 53 was a perfect mirror of AD DNS could also have caught this problem.)
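One simple pre-flight check we could have run is to compare the zone names slated for migration and flag any zone that is a subdomain of another, since that overlap is exactly where the two systems’ rules diverged for us. A minimal sketch (the zone list is hypothetical):

```python
# Sketch: flag zones that are subdomains of other zones in the migration set,
# since overlapping private zones were where AD DNS and Route 53 behaved
# differently for us. The zone list below is hypothetical.

def find_overlapping_zones(zone_names):
    normalized = {z.rstrip(".").lower() for z in zone_names}
    overlaps = []
    for child in normalized:
        for parent in normalized:
            if child != parent and child.endswith("." + parent):
                overlaps.append((child, parent))
    return overlaps

if __name__ == "__main__":
    zones = ["quantcast.internal.", "test.quantcast.internal.", "corp.example."]
    for child, parent in find_overlapping_zones(zones):
        print(f"WARNING: {child} is a subzone of {parent}; verify it maps cleanly to Route 53")
```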
2) Clean up tech debt before a migration
It’s a good idea to treat a DNS migration as a chance to clean up unused or redundant records and internal resources. The zone that caused our problem held a single CNAME record that simply pointed to another domain name. It was still around because most of the jobs used the old name, and no one had refactored the many internal references to it. (Fun fact: Quebecois websites are typically hosted under .qc.ca, so we moved away from our .qc naming structure internally, since we were worried about split-horizon DNS issues in the event of independence. Perhaps we can rest easier knowing .quebec now exists.)
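As an example of the kind of audit that can surface stale aliases like this one, here is a short sketch that lists every CNAME in a zone via zone transfer so each alias can be reviewed, and ideally retired, before migrating. The server IP and zone name are placeholders.

```python
# Sketch: list every CNAME in a zone (via AXFR) so aliases can be reviewed,
# and ideally retired, before a migration. Server IP and zone are placeholders.
import dns.query
import dns.rdatatype
import dns.zone

LEGACY_DNS_IP = "10.0.0.2"         # placeholder
ZONE_NAME = "quantcast.internal."  # placeholder

def list_cnames():
    zone = dns.zone.from_xfr(dns.query.xfr(LEGACY_DNS_IP, ZONE_NAME))
    for name, ttl, rdata in zone.iterate_rdatas(dns.rdatatype.CNAME):
        print(f"{name.derelativize(zone.origin)} -> {rdata.target} (TTL {ttl})")

if __name__ == "__main__":
    list_cnames()
```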
3) Rollbacks can’t always be counted on to return a system to its previous state
Although the missing resource record was the real bug here, we ran into much bigger problems when we rolled back. One thing that can complicate rollbacks is that your clients’ automatic control mechanisms may have changed in response to the release. It’s a particularly pernicious form of hidden state.
In this case, our theory of what happened is that when our VPC hosts started hitting Route 53 DNS, it was answering much faster than the Legacy DNS they were used to. In turn, the hosts ramped up their query rates to take advantage of what Route 53 could handle. But when Route 53 was suddenly swapped back out for Legacy DNS after the rollback, the hosts didn’t ramp their behavior back down. They kept acting as if they were talking to the much higher-performing Route 53 Resolver, and Legacy DNS couldn’t handle the volume of queries.
Amazon doesn’t expose much here, and there weren’t any major red flags in our server logs to go on. But our best working understanding is this: client behavior is version-dependent, and as a result a rollback can expose you to problems the previous system never had, because the clients’ behavior has changed in the meantime.
4) When migrating to a new DNS system, use forwarding from your current nameservers as an initial test
With the outage behind us, we went back to test our migration to Route 53 with a new plan: we would keep the clients talking to Legacy DNS, and turn the Legacy DNS servers into forwarders that would point to Route 53. The general idea is to maintain control over client behavior by shielding the clients behind an extra layer of abstraction. With traffic still flowing through the Legacy DNS servers, we’d keep much more control over the system, and rolling back this kind of change would be much safer.
When you set up a forwarding test, make sure to clear the cache on the current nameservers to catch any DNS resource records that don’t exist in the new system. Otherwise, your legacy servers may keep serving cached answers for those records, often for as long as several days.
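As a rough illustration of how a forwarding test can be validated, here is a sketch that resolves a sample of internal names against both the legacy resolvers (now forwarding) and the Route 53 Resolver directly, and reports any disagreements. The resolver IPs and hostnames are placeholders; in a real VPC the Route 53 Resolver sits at the VPC base address plus two, which depends on your CIDR.

```python
# Sketch: resolve a sample of internal names against both the legacy resolvers
# (now forwarding to Route 53) and the Route 53 Resolver directly, and report
# disagreements. Resolver IPs and hostnames below are placeholders.
import dns.exception
import dns.resolver

LEGACY_RESOLVER = "10.0.0.2"     # placeholder: legacy DNS server, now a forwarder
ROUTE53_RESOLVER = "10.0.0.253"  # placeholder: VPC resolver address varies by CIDR
NAMES_TO_CHECK = ["jobs.quantcast.internal", "test.quantcast.internal"]  # placeholders

def lookup(server, name):
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [server]
    try:
        answer = resolver.resolve(name, "A")
        return sorted(rdata.to_text() for rdata in answer)
    except dns.resolver.NXDOMAIN:
        return ["NXDOMAIN"]
    except dns.exception.DNSException as exc:
        return [f"ERROR: {exc}"]

if __name__ == "__main__":
    for name in NAMES_TO_CHECK:
        legacy = lookup(LEGACY_RESOLVER, name)
        route53 = lookup(ROUTE53_RESOLVER, name)
        status = "OK" if legacy == route53 else "MISMATCH"
        print(f"{status} {name}: legacy={legacy} route53={route53}")
```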
Conclusion
The DNS migration seemed simple and straightforward, but if we were to do it all over again, we’d do it a little differently. Even a supposedly easy route can be full of surprises when it comes to network engineering, and we hope you can learn from us to proceed with a bit more caution and awareness of the potholes. But one of the best things we learned? Don’t panic. A sense of being in it together and a positive outlook helped us get through this outage with some takeaways to share.