When it comes to network resilience, more is not always better
The Internet is a paradox. Engineered to withstand nuclear attacks by determined adversaries, it is nevertheless routinely brought to its knees by small lapses on the part of the people responsible for keeping it running smoothly. I have spent over a decade across academia and industry studying this paradox and developing technologies that make the Internet resilient to all types of vulnerabilities. That is why I was glad to see this topic recently get attention at the highest levels of the US government, in a Senate hearing on “Building Resilient Networks.” But watching the hearing, I was disappointed that the discussion completely overlooked a key point about network resilience: more infrastructure does not necessarily mean resilient infrastructure.
In his opening remarks, Senator Ben Ray Luján motivated the hearing with an outage in the CenturyLink (now Lumen) network. This outage started on the morning of December 27, 2018 and lasted 37 hours. It impacted as many as 22 million customers across 39 states, approximately 17 million of whom were left without reliable 911 access. He rightly asked how such outages could be prevented. The discussion that followed, however, centered mostly on redundancy, rural broadband access, backup power, the regulatory burden of using public lands to build new networks, spectrum for public safety, etc.
These are all important issues, but they are about building more infrastructure, not about building resilient infrastructure. Study after study (here is a recent one) has shown that most network outages are caused not by a lack of infrastructure but by human errors in configuring and managing the infrastructure. Network configuration is the set of commands that engineers provide to dictate the behavior of equipment, and wrong commands can bring down the network or create security vulnerabilities.
A network configuration error was behind the December 2018 CenturyLink outage, and the June 2020 T-Mobile outage, and the July 2020 Cloudflare outage, and the August 2020 CenturyLink outage, and … you get the idea. More infrastructure would not have prevented these outages. By analogy, the solution to the problem of collapsing buildings is not creating more buildings, creating taller buildings, or creating affordable buildings. We instead need better technology, better engineering, and stricter building codes.
Speaking of engineering, the Internet is more robust to nuclear attacks than to “fat fingers” that enter wrong commands because network engineering has thus far tended to focus on defending against bombs and natural disasters, and fat fingers need different defenses. When network equipment blacks out completely, redundant paths for traffic can help. But no amount of redundancy can help if the equipment is incorrectly configured to drop all traffic or clog the paths with garbage traffic.
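The difference between a blackout and a misconfiguration can be made concrete with a toy model. The sketch below is only an illustration under my own simplifying assumptions: the `Device` class, its `powered` and `forwards` flags, and the `deliver` function are hypothetical names, and the model assumes that a misconfigured device keeps announcing itself as usable, so traffic is never steered to the backup path.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    powered: bool = True    # False models a complete blackout
    forwards: bool = True   # False models a configuration that drops all traffic

    def advertises(self) -> bool:
        # Assumption: a powered device still looks usable to its neighbors,
        # even when its configuration silently drops traffic.
        return self.powered


def deliver(paths: list[list[Device]]) -> bool:
    """Use the first path whose devices all advertise, then try to forward on it."""
    for path in paths:
        if all(device.advertises() for device in path):
            return all(device.forwards for device in path)
    return False  # no usable path at all


def scenario(powered: bool, forwards: bool) -> bool:
    """Primary path has the given state; the redundant backup path is healthy."""
    primary = Device("primary", powered=powered, forwards=forwards)
    backup = Device("backup")
    return deliver([[primary], [backup]])


blackout_survived = scenario(powered=False, forwards=True)
misconfig_survived = scenario(powered=True, forwards=False)
print("blackout on primary, traffic delivered:", blackout_survived)
print("misconfigured primary, traffic delivered:", misconfig_survived)
```

In this model the backup path rescues traffic when the primary device loses power, but it sits idle when the primary is misconfigured, because nothing signals that the primary path should be avoided.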
I do not mean to suggest that redundancy does not help with network resilience — there certainly are failure scenarios where it can save the day — but by focusing almost exclusively on more infrastructure, the committee failed to discuss what specifically could be done to make networks resilient.
One promising approach that the committee could have discussed, for instance, is automating activities that are manual today. To eliminate human errors, network management activities such as device maintenance and configuration updates must be automated. In addition, validating that network configuration changes meet critical security and reliability requirements must also be automated. Automating configuration updates without automating validation is of no use, and even dangerous, if the configuration change is wrong. Automated validation has been key to making hardware and software more resilient, and it can do the same for networks. Technologies that help automate network management and validation (e.g., Ansible, Batfish, NetBox) have made rapid strides in the last few years, and organizations for whom network resilience is mission critical are increasingly using them.
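To give a flavor of what automated validation means, here is a minimal sketch of a deployment step that refuses any configuration change that would block a critical service. It is a toy under stated assumptions, not how Ansible, Batfish, or NetBox work: the `Rule` type, the first-match filtering model, the service names, and the `validate` and `apply_change` functions are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    action: str  # "permit" or "deny"
    dest: str    # destination service name, or "any"


def first_match(rules: list[Rule], dest: str) -> str:
    """Return the action of the first rule that matches dest (default: deny)."""
    for rule in rules:
        if rule.dest in (dest, "any"):
            return rule.action
    return "deny"


def validate(rules: list[Rule], must_reach: list[str]) -> list[str]:
    """List the required services that this configuration would block."""
    return [dest for dest in must_reach if first_match(rules, dest) != "permit"]


def apply_change(current, candidate, must_reach):
    """Deploy the candidate only if it passes validation; otherwise keep current."""
    violations = validate(candidate, must_reach)
    if violations:
        return current, violations
    return candidate, []


# Hypothetical requirement: these services must always be reachable.
MUST_REACH = ["emergency-911", "dns"]

running = [Rule("permit", "emergency-911"), Rule("permit", "dns"), Rule("deny", "any")]
# A fat-finger change: the deny-all rule ends up first and shadows everything.
bad = [Rule("deny", "any")] + running[:2]
# A correct change: one new permit rule added ahead of the final deny-all.
good = running[:2] + [Rule("permit", "web"), Rule("deny", "any")]

deployed_bad, problems_bad = apply_change(running, bad, MUST_REACH)
deployed_good, problems_good = apply_change(running, good, MUST_REACH)
print("bad change blocked, violations:", problems_bad)
print("good change deployed:", deployed_good == good)
```

The point of the design is that the check runs before deployment and is applied to every change, so a wrong command is caught by a machine rather than discovered by customers.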
Automation is but one element of building resilient networks. The committee could have also discussed other technical solutions such as avoiding shared points of failure across multiple networks (which amplified the impact of the CenturyLink outage) and fast root-cause analysis and mitigation; or policy solutions such as setting reliability targets for service providers (like broadband speed targets but focused on resilience) and letting users migrate to other providers when a provider fails.
The failure to discuss solutions that can directly boost network resilience was a lost opportunity. I hope that they will be included in future discussions on network resilience and what the US government can do to move the needle.