Control and Command in Provider Networks (Part I)

The following is a repost of the first of a 2-part series of posts by our CEO, Taras Matselyukh.

The ability to control vast infrastructures has always been a matter of critical importance. From ancient civilizations to modern computer games, remaining in control of key assets and resources has always been the key to winning. Lack of such control has led to the demise of empires and to battles lost to more agile rivals. This theme has been illustrated countless times in history books and movies, culminating in my personal favorite (fiction) movie, “The Matrix”, where Neo struggles to differentiate between the real world and a virtual one controlled by extremely advanced machines.

The Weak Link

In the early Internet days, networks and servers were controlled by human operators, either manually or via rudimentary network management systems like CiscoWorks and HP OpenView. Neither method produced good results, and large parts of the network and server infrastructure were left in darkness and disarray. These were the dark ages of networking, when it took ‘ages’ to roll out any configuration change or to notice critical omissions and costly disturbances. But most importantly, manual operator control poses the highest risk of all: human error. Statistically, the human factor contributes to 80–90% of all incidents, and this share is consistent across different industries. In the aviation industry, where regulations demand rigorous investigation of every incident, the human factor’s contribution remains at this steady high mark despite enormous efforts to reduce risk and error. This is because we humans are the weak link in complex automation chains and always will be, unless some dramatic, revolutionary improvement is made.

Crippled Networks

In modern-day networks, automation has evolved from the simple scripts and rigid provisioning systems of the early days into powerful and flexible frameworks that allow operators to control thousands of devices and choreograph virtual servers in complex mass performances. Surprisingly, though, the weakest element that plagued early networks still haunts critical infrastructures today, and at unprecedented levels. A lack of knowledge, discipline or coordination, which in the early days could impact only a single device or system, can now affect very large portions of the infrastructure in an instant. I have personally dealt with the aftermath of wrong template deployments, rushed network-wide software updates and the proliferation of corrupt configurations, which crippled networks for days and, in extreme cases, for many months.

Many times it has occurred to me how convenient it would be to see the real picture behind the countless lines of logs and telemetry in real time, much like Neo, who was able to see the Matrix through the streaming debug output on the rebels’ monitors.

Keep this thought till the next instalment of this blog series!

— Taras Matselyukh, CEO/CTO OptOSS