You’re probably doing DNS wrong, like we were
What the Canopy.co team learned when a DDoS attack took out our DNS provider in 2014.
DNS isn’t a sexy topic. Unless you’re a network or infrastructure engineer, you probably think as little as possible about it — until something goes wrong. If you run a site and haven’t thought much about DNS, now’s a good time to take steps that can keep your site/services up when your peers go down.
On Cyber Monday (December 1, 2014) something went very wrong. A massive distributed denial-of-service (DDoS) attack overwhelmed servers at DNSimple with a huge surge of traffic. Hundreds — if not thousands — of sites went down, including Exposure, Dribbble, RVM, Canopy, and Pinterest.
It’s easy to criticize DNSimple, and lots of people have. Whenever there’s a major DNS outage, managed DNS providers (companies that run nameservers for you) absorb plenty of criticism from their customers.
But it’s our responsibility, as developers, to make sure what we build is resilient and fault tolerant — even (and perhaps especially) if we use third-party services.
Without realizing it, many of us are building a single point of failure into the most easily and often targeted layer of our stack. In this case, a single customer of DNSimple was targeted, but all its customers were affected.
The good news is that there are a couple of steps we can take to make our sites significantly more resilient:
1. Change to longer TTLs
Additional complexity: trivial
Developers like things to be fast. But when dealing with TTL, faster is not better.
TTL means time to live, as in, how long a record will stay living in a cache before it’s cleared and needs to be looked up again.
It doesn’t mean time to live, as in, how long this will take to go live. (Although the two are related, it’s an important distinction.)
Because developers mainly interact with DNS records when we’re making changes, it’s tempting to optimize to get changes to take effect as quickly as possible. But shorter TTLs make you more vulnerable to outages because they require your nameservers to respond to far more queries.
With a 60 second TTL, you’re telling DNS resolution services (which sit between clients and your nameservers and have big caches) to purge your records from their caches after only a minute. Then they’ll have to query your nameservers again. This means that if your nameservers go down, every cached copy of your records expires within 60 seconds, and from then on your site is unreachable for anyone who needs a fresh lookup.
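To make the caching behavior concrete, here is a toy model of a caching resolver, written as a rough sketch rather than a description of how any real resolver is implemented. The class name, the hostnames, and the two stand-in nameserver functions are all hypothetical, and time is passed in by hand so the example stays deterministic.

```python
from typing import Callable, Dict, Tuple


class ResolverCache:
    """Toy caching resolver: an answer is reused until its TTL expires."""

    def __init__(self) -> None:
        # name -> (answer, absolute expiry time in seconds)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, name: str, now: float,
               query_nameserver: Callable[[str], Tuple[str, int]]) -> str:
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            return entry[0]  # cache hit: the nameserver is never contacted
        # Cache miss or expired record: the nameserver has to answer right now.
        answer, ttl = query_nameserver(name)
        self._entries[name] = (answer, now + ttl)
        return answer


def healthy_nameserver(name: str) -> Tuple[str, int]:
    # Hypothetical answer: a CNAME target served with a 60 second TTL.
    return ("example.herokuapp.com.", 60)


def unreachable_nameserver(name: str) -> Tuple[str, int]:
    # Stand-in for a nameserver that is down.
    raise TimeoutError("nameserver unreachable")


cache = ResolverCache()
first = cache.lookup("www.example.com", 0, healthy_nameserver)

# 30 seconds into an outage the cached record is still valid.
during_outage = cache.lookup("www.example.com", 30, unreachable_nameserver)

# At 61 seconds the record has expired and the lookup fails.
try:
    cache.lookup("www.example.com", 61, unreachable_nameserver)
    failed_after_expiry = False
except TimeoutError:
    failed_after_expiry = True
```

In this model the only thing standing between an outage and a failed lookup is the time left on the cached record, which is why the TTL matters so much.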
The longer your DNS records stay cached (i.e., the longer your TTLs), the more likely your site will stay up even when your DNS provider has an outage. If your server IPs aren’t changing constantly, you’ll be better off with TTLs as long as a week.
Canopy is hosted on Heroku, so we use a CNAME that points to a Heroku domain. It never changes, so a 60 second TTL is completely unnecessary. Every 60 seconds, we were making thousands of resolvers at DNS resolution services ping our nameservers to get the same domain they already had.
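A quick back-of-the-envelope calculation shows how much load the TTL alone controls. This is only a sketch: the figure of 1,000 resolvers is an assumption chosen for illustration, not a measured number, and it assumes every resolver re-fetches the record the moment its cached copy expires.

```python
SECONDS_PER_DAY = 86_400
SECONDS_PER_WEEK = 7 * SECONDS_PER_DAY


def refresh_queries_per_day(resolver_count: int, ttl_seconds: int) -> float:
    """Upper bound on daily re-fetches of one record across all resolvers."""
    return resolver_count * SECONDS_PER_DAY / ttl_seconds


# Hypothetical population of 1,000 resolvers.
short_ttl_load = refresh_queries_per_day(1_000, 60)
week_ttl_load = refresh_queries_per_day(1_000, SECONDS_PER_WEEK)

print(f"60 second TTL: {short_ttl_load:,.0f} queries/day")
print(f"1 week TTL:    {week_ttl_load:,.0f} queries/day")
```

Under these assumptions the same record costs about 1.44 million queries a day at a 60 second TTL and roughly 143 at a one-week TTL.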
If our TTLs had been a full week on Monday, the majority of our users would have seen no downtime whatsoever thanks to upstream caching, despite the DNSimple outage.
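A rough estimate of how much of that caching would survive an outage can be sketched as follows. It assumes every resolver had a warm cache when the outage began and that cache ages were spread evenly across the TTL window. The six-hour outage is a made-up duration for illustration, not the length of the actual incident.

```python
def fraction_still_cached(ttl_seconds: float, outage_seconds: float) -> float:
    """Share of warm resolver caches that have not yet expired.

    Assumes cache ages are uniformly distributed over the TTL when the
    outage starts, so expirations arrive at a steady rate.
    """
    return max(0.0, 1.0 - outage_seconds / ttl_seconds)


HOUR = 3_600
WEEK = 7 * 24 * HOUR

week_ttl_share = fraction_still_cached(WEEK, 6 * HOUR)
short_ttl_share = fraction_still_cached(60, 6 * HOUR)

print(f"1 week TTL:    {week_ttl_share:.1%} of caches still valid")
print(f"60 second TTL: {short_ttl_share:.1%} of caches still valid")
```

With these assumptions, about 96% of caches still hold a one-week record six hours into an outage, while none hold a 60 second record.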
A tip when dealing with longer TTLs:
If you switch to longer TTLs and then you need to make breaking changes to your records, you can always preemptively lower your TTLs before you make any changes. Then just wait the length of your longest TTL so the caches clear, make your changes, and switch back to longer TTLs once everything is stable.
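The waiting step is simple date arithmetic, sketched below. The function name, the TTL values, and the date are hypothetical. The calculation assumes resolvers honor the TTLs they were given, which is not guaranteed for every resolver.

```python
from datetime import datetime, timedelta
from typing import Iterable


def earliest_safe_change(lowered_at: datetime,
                         old_ttls_seconds: Iterable[int]) -> datetime:
    """Time by which every copy cached under the old TTLs has expired.

    A resolver may have fetched a record just before the TTLs were
    lowered, so the wait is the longest of the old TTLs.
    """
    return lowered_at + timedelta(seconds=max(old_ttls_seconds))


# Hypothetical records with TTLs of one hour, one day, and one week.
old_ttls = [3_600, 86_400, 604_800]
lowered_at = datetime(2014, 12, 8, 9, 0)

safe_at = earliest_safe_change(lowered_at, old_ttls)
print(f"Lowered TTLs at: {lowered_at}")
print(f"Safe to change:  {safe_at}")
```

Here the one-week record sets the pace: TTLs lowered on December 8 mean the breaking change should wait until December 15.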
2. Use nameservers from different DNS providers
Additional complexity: medium, requires team coordination
If you’re using a single managed DNS provider, you have a single point of failure — even if you have multiple nameservers.
That’s fine for lots of sites, but if your goal is 100% uptime, you need more protection. Using multiple DNS providers is one of the best defenses against this kind of outage.
DNS providers allow and encourage you to use 4–6 redundant nameservers. This is great: if one fails, requests will still be resolved by the others. But if all your nameservers are from a single company, you’re putting a lot of faith that they are going to have 100% uptime. Even though DNSimple was providing 4 nameservers for Canopy, they all went down with the DDoS attack, so we went down too.
You’re much more likely to stay online through a DDoS attack if you have not just redundant nameservers but redundant DNS providers.
Unfortunately, actually using nameservers from multiple companies is more complicated than just adding an external nameserver into the form pictured above. This nuance isn’t very well known, but it’s an important one:
When a DNS resolver looks up a domain, the nameserver that answers also returns NS records, which list the nameservers that are authoritative for the domain. The NS records at every provider need to match the full set of nameservers you use. (If they don’t match, some people may not be able to reach your domain.)
So if you want to use 3 nameservers from Amazon’s Route 53 and 3 from PointDNS, then your NS records on Route 53 and PointDNS must include all six nameservers.
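The matching rule can be expressed as a small consistency check. This is a sketch over hand-entered data, not a tool that queries real nameservers. The function name and the hostnames under the reserved .example domain are hypothetical stand-ins for the nameservers each provider would assign.

```python
from typing import Dict, Iterable, List


def missing_ns_records(all_nameservers: Iterable[str],
                       ns_by_provider: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return, per provider, the nameservers absent from its NS records."""
    expected = set(all_nameservers)
    gaps: Dict[str, List[str]] = {}
    for provider, records in ns_by_provider.items():
        missing = sorted(expected - set(records))
        if missing:
            gaps[provider] = missing
    return gaps


# Hypothetical nameserver hostnames for the two providers.
route53 = ["ns1.aws.example.", "ns2.aws.example.", "ns3.aws.example."]
pointdns = ["ns1.point.example.", "ns2.point.example.", "ns3.point.example."]
all_six = route53 + pointdns

# Route 53 lists all six, but PointDNS only lists its own three.
gaps = missing_ns_records(all_six, {"route53": all_six, "pointdns": pointdns})
print(gaps)
```

An empty result means every provider serves the full set. In this example the check reports that the PointDNS zone is missing the three Route 53 nameservers.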
Unfortunately, not all DNS providers allow you to edit NS records. DNSimple is among those that don’t, but after the attack they realized that it’s a priority and are working to add support in the coming months.
UPDATE: Secondary DNS is now supported at DNSimple.
If you want to have nameservers from two different companies, they both need to have editable NS records.
It’s definitely more work to use nameservers from multiple companies, so it probably isn’t worth the effort for all sites. For example, I’m sticking with a single DNS provider (with multiple nameservers, of course) for my personal site. If it goes offline for a couple of hours, it’s just not that big of a deal.
But for an app like Canopy, going offline is unacceptable. It’s confusing and inconvenient for users, and it hurts our brand. I’m willing to deal with syncing records at two providers if it can reduce the risk of an outage.
While managed DNS providers have a responsibility to minimize downtime, it’s our job as developers to use tools like DNS effectively. Many sites — even big ones — are vulnerable to a single point of failure. If you can’t afford downtime, try splitting your nameservers across two DNS providers. And leverage the caches at DNS resolvers by increasing your TTLs.
Regardless of who you choose to manage your DNS, these changes will significantly improve your site’s resilience.
Updated in 2016: Moved TTLs above multiple DNS providers because it’s low hanging fruit. Added notes about secondary DNS at DNSimple.
Do you have additional insights, suggestions, or corrections?
Reach out @brianarmstrong.