Control and Command in provider networks (Part 2)

The following is a repost of a piece by our CEO, Taras Matselyukh, on TerabitSystems.

Hi, it’s me again. I hope your journeys through the deep oceans of data systems were met with warm and inviting flows of telemetry data. But hold on a second: how can we actually envision these data flows? Above, you can see a short movie clip that I captured in our lab. It shows three data streams produced by ‘tailing’ the syslog files of three separate servers. Somehow, this picture reminds me of a scene from my favorite movie, The Matrix…

This is only a tiny speckle of what might be happening in a real production network, where you will have thousands, if not millions, of such streams converging into rivers of data and eventually falling into data seas and oceans. Believe it or not, this peaceful scene, reminiscent of the Matrix-style screensavers of older operating systems, contains all the critical details that will either make or break your revenue-generating systems and services. And human operators can find and spot these important details in the logs… or so we hope.

Breathtaking speed

The average person can read through a page of text in about 1–3 minutes. Experienced operators can scan through 10–30 pages of logs per log stream per minute. The best of the best can even process a breathtaking 100+ pages of logs per minute, but at that pace their ability to spot and process the tiniest details drops rapidly. I suppose Neo was the perfect example of a superhuman who could “comprehend” the logs of the Matrix; however, he remains at a god-like level still to be matched by mere humans… and still to be born, I think. Now let’s do a quick math dance and spin these numbers around.

1 human operator can process 10 pages of 80 chars × 30 rows of syslog information per minute. Their accuracy, i.e. the ability to detect critical indications and developments in the data, is 99% for each hour of monitoring.
1 web server can generate 100 MB of logs per day.
1 network can generate 1 GB of logs per day.
Question: how many human operators would a service provider need to process all of these log lines?
Let’s assume that a single character on the screen is represented by 1 byte (ASCII), so one page of 80 chars × 30 rows is 2,400 bytes.
An operator scanning 10 pages per minute on each of, say, 8 parallel log streams processes 10 × 80 × 30 × 8 = 192,000 bytes per minute. Let’s call it 192 KB/minute, which is a stunning speed for a human.
100 MB / (192 KB × 60 min × 8 h) ≈ 1.11 operator-days per web server.

As we can see, our best human expert can barely keep up with reading the logs from a single web server and monitoring them for critical developments, and only if he or she is never interrupted or distracted for a second. There would be no time for breaks, for reflecting on the meaning of the data, or for any necessary troubleshooting. Above all, this human must never get tired. Not feasible, is it? And even then, our operator would still miss some critical clues. How many? Compounding 99% accuracy over an 8-hour shift gives 0.99⁸ ≈ 0.92, so about 8% of critical clues will be missed.

1 GB / (192 KB × 60 min × 24 h) ≈ 3.79 operator-days.
0.99²⁴ ≈ 0.79, so about 21% of the critical clues will be lost.
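For those who like to check the math, here is a quick Python sketch of the arithmetic above (the function names are mine; it assumes binary units, 1 MB = 1,024 KB, and that the 99% accuracy applies independently for each hour of a shift):

```python
# Back-of-envelope check of the figures above.
KB_PER_MIN = 192  # assumed operator reading speed, KB of logs per minute

def operator_days(log_kb: float, hours: int) -> float:
    """Operator-shifts needed to read log_kb of logs in a shift of `hours` hours."""
    capacity_kb = KB_PER_MIN * 60 * hours  # KB one operator covers per shift
    return log_kb / capacity_kb

def missed_fraction(accuracy: float, hours: int) -> float:
    """Fraction of critical clues missed when per-hour accuracy compounds."""
    return 1 - accuracy ** hours

print(f"{operator_days(100 * 1024, hours=8):.2f}")    # web server, 100 MB over 8 h -> 1.11
print(f"{operator_days(1024 * 1024, hours=24):.2f}")  # network, 1 GB over 24 h     -> 3.79
print(f"{missed_fraction(0.99, 8):.0%}")              # clues missed over 8 h       -> 8%
print(f"{missed_fraction(0.99, 24):.0%}")             # clues missed over 24 h      -> 21%
```

Note that the 1.11 and 3.79 figures only come out exactly when MB and GB are taken as binary multiples; with decimal units the numbers shift slightly, but the conclusion stands.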

Summary: a typical network will require at least four very capable human experts working tirelessly in three shifts, around the clock, to process the log volume of a modern (small) provider network. Even then, they will fail to notice roughly a fifth of the critical system issues brewing.

Crippling network or system outages

In real life, however, humans get tired, distracted and bored. Eventually, the operators will start ignoring the data streams completely and miss most of the clues. What would the impact on the system and the network be? Well, we have statistics illustrating that more than 80% of polled companies in the US, UK and Canada suffer a crippling network or system outage each year. In the next blog post I will look at this example from a practical point of view, revealing some industry-average statistics and attempting to analyze why.

Meanwhile, I wish you safe sailing in your data seas!

Taras Matselyukh, CEO (OPT/NET)