That Map of the Internet Failing You Saw on Friday Didn’t Tell the Story at All (and Here’s What Really Did Happen)
This article fits in with two others I’ve written since Friday: IoT Makers Could Fix Things, But They Won’t and Why the Internet broke and you couldn’t do anything about it.
It was a convenient picture, and one that I found compelling, too: a heatmap showing outages across the Internet due to an Internet of Things (IoT) botnet attack that was crippling a private Internet infrastructure company’s ability to respond to requests. The map apparently showed Level 3’s network; Level 3 is one of the largest network providers, transiting data among networks large and small. Congestion or an outage there would degrade everyone’s ability to reach certain networks.
Except the map we all shared, including me, didn’t show the status of Level 3’s network at all—its network and others were not under attack. Sites weren’t unreachable because the Internet was overloaded. I’ll explain below what actually happened on Friday.
The map was from Downdetector, which continues today (Sunday, October 23) to show the same pattern of outages for Level 3.
Downdetector doesn’t probe routes and check for connectivity at network interchanges, as other Internet health maps do, like Internet Traffic Report, Keynote’s Internet Health Report, and Akamai’s Real-Time Web Monitor. Rather, it compiles reports of outages and plots them on a map.
Downdetector collects status reports from a series of sources. Through a realtime analysis of this data, our system is able to automatically determine outages and service interruptions at a very early stage. One of the sources that we analyse are reports on Twitter.
The number of reports is tiny. Flip to the chart view instead of the map view, and you see that dozens of reports result in a map that looks like major parts of the U.S. Internet are unreachable.
Some appearances of this chart went so far as to attribute the map to Level 3, despite Downdetector’s disclaimer:
Downdetector and its parent company Serinus42 are not associated with any service, corporation or organisation that we monitor.
What appeared to confuse many reporters and editors working on this story into using this map, and even attributing it to Level 3, is that the Downdetector result was shared early by those trying to figure out what was going on; it appears at the top of Google results for “Level 3 outages”; and the labeling of the map, which uses Level 3’s corporate mission statement and logo, makes it appear official. One clue it wasn’t? The site shows Level 3 (space between Level and 3 in all its official text uses) as “Level3” without a space. (I’ll be surprised if Downdetector doesn’t get a demand from Level 3 and others to display more prominently a disclaimer about its unofficial status.)
Level 3 doesn’t offer an outage map, so it doesn’t appear in Google results; and the map confirmed people’s expectations of how the Internet was behaving.
Level 3 went so far as to host a Periscope session with its chief security officer to go through the details, because the map was being used so widely.
What the map showed is that people across the U.S. were having trouble reaching popular sites, some of which rely on Level 3. But what really happened had nothing to do with “routing”—getting data packets from one point to another on the Internet. Rather, it was about phone directories.
DNS Makes the World Go Round
So what actually happened on Friday? You can read expert accounts of it everywhere; read Brian Krebs’s account for deep detail. (Brian uses the Downdetector map and originally attributed it to Level 3, later changing the caption. Even experts were confused.)
But I can summarize why it’s so confusing to people without Internet infrastructure expertise—it involves the domain name system (DNS), which most people don’t have to interact with at the technical level. DNS connects human-readable domain names, like www.glennf.com, with the underlying numeric addresses that are used for actual connections, as well as other aspects of domain plumbing, like where to deliver email.
The attack on Friday came from an IoT botnet: a bunch of poorly secured hardware with Internet connectivity that attackers have remotely modified to run additional software under their control. Tens of millions of IoT devices have been subverted—from IP cameras to toaster ovens to DVRs—and attackers can activate hundreds of thousands to millions at a time to flood a spot on the Internet with traffic, a so-called Distributed Denial of Service (DDoS) attack.
Friday’s attack flooded Dyn, which among other services handles part or all of DNS for many companies with large numbers of users, including Twitter, Netflix, Amazon, Github, and others.
DNS acts as the Internet’s phone book. Nobody would want to memorize the current phone numbers of a hundred friends, family members, and businesses, but it’s easy to remember a hundred domain names. And people’s phone numbers change; their names typically remain the same. Even further, imagine each of your 100 closest contacts had several different phone numbers—impossible for any person to keep track of.
DNS abstracts the human-readable name from the underlying numeric and technical data. It hides a lot of complexity, including the fact that the seemingly monolithic Web domains you’re familiar with, like Amazon.com and Google.com, represent hundreds to hundreds of thousands of servers distributed around the world.
DNS lets your computer or mobile ask a simple question—“How can I connect to this domain?”—and receive a simple, up-to-date answer in reply. This is what greases the wheels of the Web in particular and the Internet in general.
But it relies on a decentralized system to work. Every device that can connect to the Internet has to know how to look up domain names and get the associated numbers and other data. You prime this pump by entering DNS server addresses, whether from your ISP or a public DNS provider like OpenDNS. Your device queries that DNS server, which then asks a chain of questions, in hierarchical order, until it reaches the end of the chain.
Domains are hierarchical from right to left, separated by dots: the broadest is the invisible “.” (dot) that’s at the far right of every domain name, and implied as the root. Next is the top-level domain name, like .com, .gov, .aero, or .uk. Each piece further to the left is more and more specific, until you reach the server that has the actual goods: the specific Internet Protocol (IP) addresses, mail server names, and other DNS values that a domain’s owner has set for their domain.
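You can see that right-to-left hierarchy by splitting a name on its dots. A minimal sketch in Python (the empty string stands in for the invisible root):

```python
def hierarchy(domain):
    """Return a domain's labels from broadest (the implied root)
    to most specific, reading the name right to left."""
    labels = domain.rstrip(".").split(".")
    return [""] + labels[::-1]  # "" represents the invisible root "."

print(hierarchy("glog.glennf.com"))
# ['', 'com', 'glennf', 'glog']
```

Each element in that list is one level of delegation a resolver has to walk through.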
For, let’s say, glog.glennf.com, these questions get asked by an iPhone, a refrigerator, or a Windows PC:
- Who is authoritative for “.”? Many systems around the world act as root-level authorities. This root information is baked into a DNS server. Without it, a DNS server can’t discover any domain information at all.
- Hey, root-level server, who is authoritative (has been delegated to have the answer) for .com? VeriSign has the contract to run .com’s DNS infrastructure, though thousands of companies, called registrars, can sell .com domains. The ICANN not-for-profit group provides governance and authority for contracting DNS infrastructure and licensing domain sales.
- Ok, VeriSign, where can I find glennf.com’s information? I use Dynadot as my DNS host (it runs nameservers that include entries for my domains), so VeriSign tells the querying device where to find Dynadot’s nameservers. (A registrar, who handles your domain ownership, and a DNS host, which provides DNS responses to queries, can be the same entity, but it’s not required.)
- Listen, Dynadot, what’s the IP number for glog.glennf.com? One of several Dynadot servers replies, 188.8.131.52.
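The chain of questions above can be sketched as a toy walk down the delegation hierarchy. All the data here is made up for illustration; real resolvers speak the DNS wire protocol to actual nameservers:

```python
# A toy model of DNS delegation: each zone knows only who is
# authoritative for the next label down. The answers are
# illustrative strings, not real nameserver records.
DELEGATIONS = {
    "": {"com": "root -> VeriSign's .com servers"},
    "com": {"glennf": "VeriSign -> Dynadot's nameservers"},
    "glennf": {"glog": "Dynadot -> A record 188.8.131.52"},
}

def resolve(domain):
    """Walk the hierarchy from the root, one delegation per label."""
    labels = domain.split(".")[::-1]       # root-ward first: com, glennf, glog
    zone, answer = "", None
    for label in labels:
        answer = DELEGATIONS[zone][label]  # ask the current zone's servers
        zone = label
    return answer

print(resolve("glog.glennf.com"))
# Dynadot -> A record 188.8.131.52
```

The final hop is the one that matters: only the last nameserver in the chain holds the authoritative answer, which is why knocking it out breaks lookups for every domain it serves.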
The last part of that chain, a nameserver with the details for a given domain, is what broke during Friday’s attacks. The Internet wasn’t generally congested and paths weren’t unavailable. Rather, the phone book had been shredded.
Some very large number of devices were sending what looked like legitimate DNS queries to Dyn, and its network and systems couldn’t keep up. Meanwhile, truly legitimate users, like you and me, were trying to load Web pages, use email, and stream Netflix, to no avail.
Our devices would say, “hey, what’s Netflix.com’s IP address(es)?” and the DNS server that received the request would follow the hierarchy noted above, then wait while its query to Dyn timed out. (This should only need to happen once, even if you have multiple DNS servers listed for your device or on your router: it’s the intermediate server asking Dyn whose request timed out, and that server can relay the failure to every device that asks it.)
To most people that looks like “the site is down” or “the Internet is slow.” Many people, including yours truly, would then hit the refresh button on a browser or try to restart a connection, and send the query again. That just increased load on Dyn.
A related factor is that when you own a domain name, you can configure it to provide an answer that says “this answer is good for X seconds.” (This is the TTL or time-to-live value.) The DNS server your device consults may also provide answers for thousands or millions of other bits of hardware. This caching time lets it ask once and then retain the answer for the period of time the domain owner states.
In the olden days, the default was often a day (86,400 seconds). But the vagaries of the Internet led people to turn that value down, because otherwise you couldn’t change out a server or perform other operations and have the cached values around the Internet time out quickly enough. Visitors would get the “old” value from their DNS server’s cache. While an hour is now common, Amazon sets its TTL to just 60 seconds.
That means any DNS failure very, very quickly results in a global failure. Some DNS servers override the TTL to reduce queries, but usually not on the order of hours or days.
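A resolver’s TTL cache can be sketched like this. It’s a toy with an injectable clock so the behavior is deterministic, not how production resolvers are built; the 60-second TTL mirrors the Amazon figure above:

```python
import time

class TTLCache:
    """Cache DNS answers for the TTL the domain owner set."""
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # name -> (answer, expiry time)

    def put(self, name, answer, ttl):
        self._store[name] = (answer, self._clock() + ttl)

    def get(self, name):
        entry = self._store.get(name)
        if entry is None:
            return None  # never cached: must query upstream
        answer, expires = entry
        if self._clock() >= expires:
            del self._store[name]
            return None  # TTL elapsed: must re-query upstream
        return answer

# With a 60-second TTL, an answer cached at t=0 is gone by t=61,
# forcing a fresh query to the authoritative nameserver.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("amazon.com", "hypothetical-address", ttl=60)
print(cache.get("amazon.com"))  # hypothetical-address
now[0] = 61.0
print(cache.get("amazon.com"))  # None
```

That second `None` is the whole story: once the cached answer expires, every downstream device depends on the authoritative server answering again, and during the attack it couldn’t.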
Dyn can probably answer tens of millions of simultaneous DNS queries—maybe more. And this attack outstripped the defenses that Dyn, like all Internet firms (infrastructure or not), constantly deploys.
Why Couldn’t the Attack Be Easily Deflected?
The IoT-based attack flooded Dyn’s infrastructure so thoroughly that the company’s nameservers, the ones that would provide the definitive information at the end of that chain, didn’t have the computational capacity or bandwidth to answer.
Dyn, like all global infrastructure companies, doesn’t have all its eggs in one basket. While it’s headquartered in New Hampshire, it has servers in data centers around the world to distribute load and provide information on behalf of its customers at the topologically closest point to the query—that is, it tries to put the fewest Internet hops between a device making a query and a server with the answer.
An IoT attack looks mostly like regular user traffic, and Dyn says tens of millions of IP addresses were involved. Thus, some of the normal mitigations that would let it block attacks and distribute the load aren’t as effective. We don’t know yet how Dyn deflected two major attacks on Friday; a third wave was apparently beaten back more easily.
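One reason the usual defenses fall short can be sketched with a toy per-source rate limiter. This is a hypothetical mitigation for illustration, not Dyn’s actual defense: when millions of bots each send only a handful of queries, every one of them stays under any reasonable per-IP threshold, and the limiter passes nearly everything.

```python
from collections import Counter

def admitted(queries, per_ip_limit):
    """Admit each query unless its source has already hit the limit."""
    seen = Counter()
    passed = 0
    for src in queries:
        if seen[src] < per_ip_limit:
            seen[src] += 1
            passed += 1
    return passed

# 10,000 bots (a scaled-down stand-in for tens of millions) sending
# 5 queries each look nothing like one noisy source: a limit of 100
# queries per IP blocks none of them.
bots, per_bot = 10_000, 5
queries = [f"bot-{i}" for i in range(bots) for _ in range(per_bot)]
print(admitted(queries, per_ip_limit=100))  # 50000 -- all get through
```

A single flooding source hitting the same limiter would be cut off at 100 queries; spreading the same volume across a botnet defeats the per-source view entirely.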
DNS is a weak link in the Internet chain: it’s required for the system to work, and while it’s decentralized to reduce any single point of attack, each link in the chain remains vulnerable. That so many major Internet firms rely, at least in part, on Dyn for DNS hosting revealed another weakness.
Glenn Fleishman is a veteran technology writer who contributes regularly to the Economist, Macworld, Fast Company, TidBITS, and other publications.