Member-only story
Closing the Loop on Testing Network Changes
“..the best way to guard against error is to design systems with layered and overlapping defenses…like slices of Swiss cheese being layered on top of one another until there were no holes you could see through” — from The Premonition, Michael Lewis
This story is written by Dinesh Dutt and Ratul Mahajan
Network changes, such as adding a rack, adding a VLAN or a BGP peer or upgrading the OS, can easily cause an outage and materially impact your business. Rigorous testing is key to minimizing the chances of change-induced outages. A central tenet of such testing is test automation — a program should do the testing, not (error-prone) humans. Test automation should target all stages of the change process. Prior to deployment, it should test that the change is correct and that the network is ready for it. Post deployment, it should test that change was correctly deployed and had the intended impact. This “closed-loop test automation” makes the change process highly resilient and catches problems as early as possible.
But writing the code to automate network testing can be quite complicated. For instance, if you were adding a new leaf, prior to deploying your change, you may want to test that the IP addresses on the new leaf do not overlap with existing ones. So, you may write a script that mines addresses from configurations and then checks for uniqueness. Similarly, post deployment, you may want to test that all spines have the new prefix. So, you may write a script to fetch and process…