Doing things in a manner that has never been optimal
In a previous article I wrote about where operations management tools come up short. Their fundamental analytical mistake is that they fail to evaluate positive as well as negative inputs. By this I mean that we need to understand normal operations, the positive, as well as abnormal operations, the negative. A metric is pretty much meaningless without context.
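To make the point about context concrete, here is a minimal sketch of judging a single reading against samples taken during normal operations. The function name, the latency figures and the three-standard-deviation threshold are all assumptions chosen for illustration, not something taken from any particular product.

```python
from statistics import mean, stdev

def describe_in_context(value, baseline_samples, threshold=3.0):
    """Judge one reading against samples collected during normal operations.

    The threshold of 3 standard deviations is an illustrative assumption.
    """
    mu = mean(baseline_samples)
    sigma = stdev(baseline_samples)
    if sigma == 0:
        # A perfectly flat baseline: any difference at all is a deviation.
        deviations = 0.0 if value == mu else float("inf")
    else:
        deviations = abs(value - mu) / sigma
    return {
        "baseline_mean": mu,
        "deviations": deviations,
        "abnormal": deviations > threshold,
    }

# Hypothetical latency samples (ms) observed while the network was healthy.
normal_latency_ms = [20, 22, 19, 21, 23, 20, 21]

print(describe_in_context(45, normal_latency_ms))  # far outside the baseline
print(describe_in_context(22, normal_latency_ms))  # well within the baseline
```

The same reading of 45 ms could be perfectly healthy on a different link. It is the baseline of normal operations that gives the number its meaning.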
Many moons ago, TCP/IP was developed. In 1988, the year I graduated from Maties and more than a decade after TCP/IP, the Simple Network Management Protocol (SNMP) was developed. Ever since, it has been the primary method by which networking infrastructure has been mined for metrics, and it is the basis on which a multitude of Stupid Network Management Products have been flogged to the ever-suffering network community. Some of these have been unashamedly expensive, with some of those being nothing more than RAGs.
Fundamentally, a network error, outage or failure is never an event. A tool that is an event logging or “event management” system can also be added to our growing list of horror stories. The reason is that we need to view errors, outages and failures not as singular events but as processes consisting of multiple correlations that form a life cycle. The source of investigative knowledge with which we manage this life cycle consists of more than just network metrics or logged events. There are also attributes of people and processes as well as technology. The last of these is typically the only one that is ever instrumented.
Thus let us start at the beginning and analyze how we deal with network incidents. We need to learn from how pilots do it: by using checklists. Although Charles Lindbergh, who flew non-stop across the Atlantic in the Spirit of St Louis, did not use checklists, they came into extensive use by pilots during WWII. Pilots have been doing what the network community should be doing, and they have been doing it for the better part of a century. Nearly a decade ago, I developed a network troubleshooting checklist. It takes a person through a number of checks to validate network operations, just like a pilot would do with his plane. The pilot would execute the checklist using the visual validation of his cockpit dashboard. In flying a network we would do the same, but with a network dashboard. Unfortunately, due to the functional design of most network dashboard instrumentation, there is some crucial instrumentation missing, as previously mentioned. The methodologies encouraged by the use of checklists dramatically reduce the time to troubleshoot a network and direct attention to the required metrics.
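As a rough illustration of the idea, a checklist can be thought of as an ordered list of named checks, each answered with a pass or a fail. The sketch below is not my actual checklist. The check names are hypothetical and the probes are stubs that simply return a fixed answer. In practice each probe would read from whatever instrumentation the dashboard exposes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Check:
    name: str
    probe: Callable[[], bool]  # returns True when the item passes

def run_checklist(checks: List[Check]) -> List[Tuple[str, bool]]:
    """Run every check in order and record a pass/fail result for each."""
    results = []
    for check in checks:
        try:
            passed = bool(check.probe())
        except Exception:
            # A probe that cannot be executed is treated as a failed item.
            passed = False
        results.append((check.name, passed))
    return results

def first_failure(results: List[Tuple[str, bool]]) -> Optional[str]:
    """Return the name of the first failed item, or None if all passed."""
    for name, passed in results:
        if not passed:
            return name
    return None

# Hypothetical checklist items with stubbed probes, for illustration only.
checklist = [
    Check("Physical link is up", lambda: True),
    Check("Default gateway responds", lambda: True),
    Check("DNS resolves a known name", lambda: False),
    Check("Application port is reachable", lambda: True),
]

results = run_checklist(checklist)
for name, passed in results:
    print(f"[{'OK' if passed else 'FAIL'}] {name}")
print("Start investigating at:", first_failure(results))
```

Running the items in a fixed order is the point: the first failed item tells the operator where to direct attention, rather than leaving them to hunt through unrelated metrics.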
Even after ten years, there still isn’t a Stupid Network Management Product that can automatically and seamlessly provide instrumentation for each of the checks in my checklist. The aviation industry is a benchmark, and network operation centres (NOCs) the world over should be assimilating how we control the sky. In the NOC, we can start with simple aggregated views!