What Happens When I Press This Button?

Published in

Shifted

10 min read · Jun 19, 2018

With Assurance, Cisco Takes Aim at Two of the Most Common Networking Issues: Troubleshooting and Human Error

In 2013, Tom Edsall, the CTO of Cisco’s Insieme engineering team, gave a talk titled “The Changing Data Center.” During this discussion, Edsall gave his perspective on the ongoing evolution within data centers toward becoming more software managed and application-centric.

While by then the topic was already considered one of the year’s most discussed and debated tech trends, Edsall also shared an industry insider’s point of view: troubleshooting a network was still a frustratingly manual and archaic process, using tools that were not up to the task.

“What do you think the two most important tools are when debugging [troubleshooting] a network? Ping and traceroute. That’s pretty lame, right? And by the way, those don’t work very well in [multipath] networks. But today, that’s the bread and butter for debugging a network.”

He went on to say that network owners need more sophisticated tools to troubleshoot and gather data about the network. Command line, ping and traceroute just wouldn’t cut it in a network run by software and tailored to deliver application performance.

Edsall’s insight into the need for a more modern networking toolset may have been more forward-thinking than he realized: a few years later, he would take a senior vice president role overseeing the group that would develop one.

What Is Assurance?

Though it wasn’t a widely used term at the time, Edsall was referring to the concept of assurance — a confidence that your network is operating as you intended. Assurance would give the ability to verify that performance confidently and to quickly find issues and troubleshoot them simply when they occur.

Intent spans literally everything in the network, from the layer one and two routes, VLANs, subnets, BGP, and switching, all the way to the application-level policies for security, QoS, VM configurations, access groups, and even higher-level intent, such as business and regulatory compliance.

Network assurance is the capability that ensures all the network’s functions are in place and working as they should. Assurance can even answer the age-old question, “what happens when I press this button?” before actually pressing it.

Developing a platform with the ability to answer that deceptively complex question represents one of the largest advancements in networking history. While bold, it is nonetheless a true statement.

And the team that built it tackled the challenge in just two years.

Every Company Is a Tech Company. And Every Tech Company Needs a Network.

By 2014 it was clear, as Cisco wrote in an earnings report, that every company would, in one way or another, become a technology company. This evolution would drive a profound shift in requirements for the network.

“It is clearer than ever that every company is becoming a technology company, with the common element being the network at the center, driven by applications and enabling the rapid introduction of new business models…Every company is increasingly dependent on the network, not just for communications but also for how the company runs, analyzes, and grows its business.”

At that time, software-defined networking (SDN) was touted as the big answer. Industry experts and analysts hailed it as the saving grace networks needed to manage the growing complexities coming from mobile, IoT, cloud and virtualization.

However, some industry analysts also saw an emerging issue. Rather than “the solution” the computing world had been waiting for, SDN could actually compound network complexity.

Confidence — The One Big Problem

“The one big problem we found was that people didn’t have confidence in their own operations of an SDN network,” said Sundar Iyer, a Cisco distinguished engineer and former co-founder and head engineer of Candid Systems.

Candid was started as a Cisco Alpha project in 2015 to develop a software platform that would provide network assurance for data centers. Formed with the help of top-level engineering minds from Stanford, the University of Pittsburgh, Purdue, the Indian Institutes of Technology and other key universities, the project was codenamed Candid because it provided instant and candid feedback about the state of a network. Now under the Cisco banner, the Candid platform today is known as Network Assurance Engine.

While not a stranger to Cisco, having founded previous startups acquired by the company since 2005, Iyer re-joined in 2014 interested in finding a new, challenging problem to tackle. He wanted to take on a problem that was complex, would have a long-term impact, and could be solved mathematically. Based on his past experiences, he thought data center networking was a perfect place to start.

“Pretty much every aspect of networking is reactive, and it’s been that way for the last 20 to 30 years,” Iyer said in an interview. “And with SDN-based networks, while we have provided vast innovations in automation and agility, we’ve amplified one problem: Today you can make a mistake and push it into a controller that programs a thousand devices in literally minutes, and you can bring a whole data center down with the click of a button.”

Predicting Outages with Math

Iyer wanted to find the top common reasons data centers experience outages, and then create a way to mathematically predict and even prevent those events before they happen.

To have software code that predicts outcomes when a new route is injected to a network, or a new spine created, or a new switch added, the software would need to use models of every possible state the network could take. Every single state possible.

Iyer and the Candid team worked exhaustively to calculate all the states a single packet could go through and found there were on the order of 2 to the 144th possibilities.

That’s more states than there are stars in the known universe.
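To put that number in perspective, here is a rough back-of-the-envelope sketch. Only the 2 to the 144th figure comes from the team; the star count and the checking rate below are assumptions chosen purely for illustration.

```python
# Rough scale comparison. Only STATE_COUNT comes from the article;
# the other constants are illustrative assumptions.
STATE_COUNT = 2 ** 144            # figure quoted by the Candid team
STARS_ESTIMATE = 10 ** 24         # assumed upper-end estimate of stars
CHECKS_PER_SECOND = 10 ** 9       # hypothetical brute-force checker speed
SECONDS_PER_YEAR = 365 * 24 * 3600


def years_to_enumerate(states: int, rate: int) -> float:
    """Years needed to visit every state once at the given rate."""
    return states / rate / SECONDS_PER_YEAR


print(f"states:          {STATE_COUNT:.3e}")
print(f"states per star: {STATE_COUNT / STARS_ESTIMATE:.3e}")
print(f"years to check:  {years_to_enumerate(STATE_COUNT, CHECKS_PER_SECOND):.3e}")
```

Under these assumptions, 2 to the 144th is roughly 2.2 × 10^43, so testing states one at a time is not an option at any realistic speed.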

To do these calculations, Iyer and his team started by mining Cisco’s 30 years of networking history to find the most common issues in data centers, their causes, and their resolutions.

In February 2015, Iyer and his team mapped out a proof-of-concept for Candid and pitched it to Cisco, which resulted in a round of funding.

From there, the team began writing algorithms that would calculate the massive number of models they would need. The idea was to use the team’s advanced math and programming skills to predict every possible outcome in a data center, much the way NASA would for a mission like landing the Mars rover.

“Let’s take a single problem like network security, and look at one particular aspect of it, like ‘Are your security policies currently programmed?’” Iyer explained. “And let’s look at one network switch and see if we can mathematically say something nice about that switch and its configuration.”

Iyer said his team originally attempted to do this modeling on a single switch using open-source formal mathematical tools. But when they applied it to 60 security policies as a test, the tool took six hours to verify.

“Sixty policies is nothing when you manage millions of groups,” he said. It just would not scale.

To help speed things to a more realistic time frame, Iyer and his core Candid team paired with academic groups and PhDs from the University of Pittsburgh, Stanford and Purdue to build formal mathematical models catered to networking.
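The article does not describe those models in detail, but a small sketch can show why reasoning over sets scales where enumeration does not. The example below is not the Candid team's method; it is a minimal illustration that uses Python's standard `ipaddress` module and a hypothetical `Rule` type to find access rules that can never match, without testing a single address.

```python
from ipaddress import ip_network
from typing import NamedTuple


class Rule(NamedTuple):
    action: str   # "permit" or "deny"
    dst: str      # destination prefix in CIDR notation


def shadowed_rules(rules: list[Rule]) -> list[Rule]:
    """Return rules that can never match under first-match semantics
    because a single earlier rule already covers their whole prefix."""
    dead = []
    for i, rule in enumerate(rules):
        net = ip_network(rule.dst)
        if any(net.subnet_of(ip_network(earlier.dst)) for earlier in rules[:i]):
            dead.append(rule)
    return dead


acl = [
    Rule("permit", "10.0.0.0/8"),
    Rule("deny", "10.1.2.0/24"),      # meant to block a subnet, never reached
    Rule("deny", "192.168.0.0/16"),
]
print(shadowed_rules(acl))
```

One containment check here settles the question for every address in the /24 at once. The sketch only detects shadowing by a single earlier rule; a fuller model would also have to handle unions of rules and more header fields.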

They also worked with Cisco Advanced Services and the Technical Assistance Center to pull historical outage data to identify the top data center issues along with their likely causes. Pairing with those internal Cisco groups gave Iyer and his team access to 30 years’ worth of data around data center outages, reported human errors, hardware problems, and software programming issues, as well as complex multi-vendor issues.

Prediction without Traffic

As anyone who has worked in one knows, making changes in a data center often breaks things. Make a change in routes. Add a new switch. Apply a new policy that says ‘server A can’t talk to server B,’ and the network administrator soon gets an angry call from a developer saying her storage cluster is no longer accessible.

With a mathematical model that can represent any possible state in a network, Iyer says it’s now possible to predict these kinds of problems.

Network Assurance Engine gives a granular view of a data center’s state, down to single routes, each clickable to provide context around what individual policies affect it.

“Just to check if your security policy on your controller and your switch is correct requires on the order of [2 to the 144th] state combinations to be tested, because the state space is so large,” he said. “Assurance does not just tell you that for 10,000 connections or flows in your network, things are proper because that’s monitoring. Monitoring can tell you if something is currently good or not. We want to make a claim on every conceivable flow [your network] will ever see. That’s the only way to be proactive.”

So unlike monitoring, Network Assurance Engine can warn an engineer that an action may cause, for example, an outage or a policy violation, before any traffic passes back or forth.
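As a toy illustration of checking a change before any traffic flows, the sketch below compares a proposed policy against a list of flows the business requires. The `Flow` type, the group names, and the `newly_broken` function are all hypothetical; this is a minimal sketch of the idea, not how Network Assurance Engine is implemented.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str   # source group (hypothetical name)
    dst: str   # destination group (hypothetical name)


def newly_broken(required: set[Flow],
                 denied_now: set[Flow],
                 proposed_denies: set[Flow]) -> set[Flow]:
    """Required flows that work today but would be blocked after the change."""
    denied_after = denied_now | proposed_denies
    return (required & denied_after) - (required & denied_now)


required = {Flow("web", "app"), Flow("app", "storage")}
denied_now = {Flow("web", "storage")}
proposed = {Flow("app", "storage")}   # "server A can't talk to server B"

print(newly_broken(required, denied_now, proposed))
```

The check runs entirely against the model, so the warning about the storage cluster arrives before the policy is pushed rather than after the developer calls.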

How It’s Possible — Intent. Transference. Data.

Iyer said that the concept doesn’t reinvent the wheel. Formal models are not novel. In fact, academic researchers have looked at this technique for almost a decade for its potential applications to networking. But, Iyer says, there are three main reasons why this level of comprehensive modeling and predictive networking is practical now and was not in the past.

The first factor is intent, a term that is getting a lot of attention lately for its projected role in the future of networking technology. For the first time in the history of networking, it is possible to manage a network by expressing an intent through software and applying it as a policy onto hardware.

Iyer says Network Assurance Engine ingests the network owner’s intent through ACI policies, and then scans the network’s state roughly every 15 minutes to make sure those policies are in place, and that the network is configured and operating the way it was intended.

Done manually, this would be practically impossible.
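The comparison step of such a scan can be pictured as a set difference between what was intended and what each device reports. The sketch below is an assumption-laden illustration: the switch names, the policy strings, and `find_drift` are invented here, and a real scan would cover far more than a list of policy names.

```python
def find_drift(intended: dict[str, set[str]],
               observed: dict[str, set[str]]) -> dict[str, dict[str, set[str]]]:
    """Report, per switch, policies that are missing or unexpected."""
    report = {}
    for switch in sorted(intended.keys() | observed.keys()):
        want = intended.get(switch, set())
        have = observed.get(switch, set())
        missing, unexpected = want - have, have - want
        if missing or unexpected:
            report[switch] = {"missing": missing, "unexpected": unexpected}
    return report


# Hypothetical intent and device state, for illustration only.
intended = {
    "leaf-101": {"permit web->app", "permit app->db"},
    "leaf-102": {"permit app->db"},
}
observed = {
    "leaf-101": {"permit web->app"},
    "leaf-102": {"permit app->db", "permit any->any"},
}

for switch, drift in find_drift(intended, observed).items():
    print(switch, drift)
```

Run on a schedule, a comparison like this flags both policies that never reached a switch and rules that nobody intended to be there.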

“If you look back 10 years, you would have had to look at 50 different devices to read their configurations and state, and you wouldn’t fully know what the top-level intent was,” Iyer said. With intent-based operations, the network itself helps engineers predict issues and answer questions about their own network, and it can even prove the configuration and dynamic state are correct with just a few clicks.

The second factor, Iyer said, is the quality of information transference. Info transference is the ability to pull data or information from a device into a report, chart, or other form a network engineer can analyze. Iyer says the quality of this process has changed dramatically in the past few years. Using APIs, a network engineer can now query anything about a device and, in return, receive a readable, hardware-independent format. APIs thus replace outdated processes like using SSH to connect to a switch, running a low-level command, and then writing multiple custom parsers to understand the data.
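The difference is easy to see side by side. In the sketch below, both the command-line text and the JSON payload are invented for illustration and do not reproduce any real device's output; the point is only that the first needs a custom parser tied to the exact wording, while the second is read with a standard library call.

```python
import json
import re

# Hypothetical CLI text, as an engineer might capture over SSH.
CLI_OUTPUT = """\
Eth1/1 is up
Eth1/2 is down
"""

# Hypothetical structured response carrying the same information.
API_RESPONSE = (
    '{"interfaces": ['
    '{"name": "Eth1/1", "state": "up"}, '
    '{"name": "Eth1/2", "state": "down"}]}'
)


def parse_cli(text: str) -> dict[str, str]:
    """Custom parser: breaks as soon as the wording of the output changes."""
    pattern = re.compile(r"^(\S+) is (up|down)$", re.MULTILINE)
    return dict(pattern.findall(text))


def parse_api(payload: str) -> dict[str, str]:
    """Structured data: field names, not screen layout, carry the meaning."""
    return {item["name"]: item["state"] for item in json.loads(payload)["interfaces"]}


print(parse_cli(CLI_OUTPUT))
print(parse_api(API_RESPONSE))
```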

Third, Iyer notes the quality of data has vastly improved. It’s common now that devices provide data in a hierarchical manner. “When you query a device, you not only get information about that device specifically, but you get its context and hierarchy around it,” he said.

The result is a lot more context and insight about the network as a whole, rather than disparate information about single devices alone that must be stitched together.
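A hierarchical response might look something like the nested structure below, where each object carries its place in the tree. The layout, names, and attributes are assumptions made for this sketch, not a real device schema.

```python
def flatten(node: dict, path: tuple[str, ...] = ()) -> list[tuple[str, dict]]:
    """Walk a nested object tree and return (full path, attributes) pairs."""
    here = path + (node["name"],)
    rows = [(" > ".join(here), node.get("attrs", {}))]
    for child in node.get("children", []):
        rows.extend(flatten(child, here))
    return rows


# Hypothetical hierarchy: a fabric containing a leaf containing a port.
tree = {
    "name": "fabric",
    "children": [
        {
            "name": "leaf-101",
            "attrs": {"role": "leaf"},
            "children": [{"name": "eth1-1", "attrs": {"state": "up"}}],
        }
    ],
}

for full_path, attrs in flatten(tree):
    print(full_path, attrs)
```

Because every row keeps its full path, a port's state arrives already tied to the switch and fabric it belongs to, with no stitching required.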

“For the first time in networking you have all three of these things coming together, and so you get a rich amount of data that you could mine, understand and build things around.”

Preventing Major Outages

Iyer says that Cisco customers have so far responded positively to the assurance platform, which began deploying in January this year. His team shared data showing that the platform has even prevented some major outages at companies that operate data centers in critical networking environments like manufacturing.

An example was a heavy equipment manufacturer. This particular company measures outage losses in the thousands of dollars per minute. Essentially, any production outage is a huge issue.

In this case, an innocent human error left a WAN interface with no contract to a disaster recovery data center. Normally, when a problem with the mainframe disrupted traffic, the WAN interface would fail over to a secondary.

But in this case, should a disaster hit in which the mainframe went down, the WAN traffic headed to or from the failover mainframe would be dropped, meaning there effectively would be no failover at all, which would have halted production entirely.

Network Assurance Engine proactively scanned the network and found that small, lurking error, which had propagated through tens of thousands of lines of configuration code.

By finding the issue before an event, the company prevented what may have been an inevitable million-dollar outage that would only have shown itself once the mainframe failed.
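The shape of that check can be sketched in a few lines: list every path the business depends on, including failover paths, and look for any pair of endpoints with no contract between them. The endpoint names and the `uncovered_paths` function are hypothetical, and treating a contract as covering both directions is a simplifying assumption.

```python
def uncovered_paths(required_paths: list[tuple[str, str]],
                    contracts: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return required paths that have no contract allowing them.
    Contracts are treated as bidirectional, which is an assumption."""
    allowed = {frozenset(pair) for pair in contracts}
    return [path for path in required_paths if frozenset(path) not in allowed]


# Hypothetical endpoints modeled on the manufacturer's scenario.
required_paths = [
    ("wan", "primary-mainframe"),
    ("wan", "dr-mainframe"),        # only used after a failover
]
contracts = [
    ("wan", "primary-mainframe"),
]

print(uncovered_paths(required_paths, contracts))
```

Monitoring would never see this gap, because no traffic takes the failover path until the day it is needed.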

Considering that unplanned data center outages are most commonly caused by human error or implemented changes that were not evaluated properly, assurance is an appealing insurance policy.

And the best part, Iyer said, is that installation and operation are very lightweight.

“The Network Assurance Engine modeled a data center with over 100 leaves and ran on just three VMs. It installs in less than 30 minutes. Once you put credentials into ACI, it goes, discovers, and models the whole network in less than 15 minutes.”

To learn more about Network Assurance Engine, go here.

Thanks for reading. Hit follow to keep up with all the insights from Shifted. You can also follow me on Twitter: @owen_lystrup.

Written by Owen Lystrup