The Need to Challenge Traditional IT Performance Monitoring

Arpit Jain
ArchSaber
Aug 23, 2018

Traditionally, the two most critical pillars on which a business rested were:

a) Development of the applications, services and solutions

b) Marketing and distribution of the product or service

In today’s cut-throat competitive environment, a third ‘pillar’ has started to gain much-needed prominence: uptime and quality of service (QoS). This pillar has always existed, but it used to be overshadowed by the focus on the quality of the core product, distribution and sales.

Over the past decade or so, thanks to inexpensive and scalable distribution (for instance, digital marketing) and the widespread availability of technologies that aid product development (along with the contributions of the open source community), incumbents in various verticals have been losing their dominance to disruptors. It is getting easier to build a fantastic product and to market it through various means. This is where the ‘third pillar’ can be a game changer for differentiation and for keeping end customers satisfied. Ensuring a high-quality end customer experience (quantified by the uptime and performance of your product or service, i.e. your SLA) has become more important than ever. Today this falls to devops and SREs, who monitor the uptime and performance of the application or service in an automated manner and, whenever the SLA is breached, react by identifying the possible culprits and fixing the problem in minimal time.

There is a massive force at work that is making conventional methods of troubleshooting irrelevant: the rapid evolution of the core technology that powers business applications and solutions. Yesteryear’s monolithic architectures have evolved into microservices architectures due to

a) the sheer scale the business requires them to handle (think of going from millions of users in the early 2000s to billions in the Facebook era)

b) the continuous evolution of the product itself, owing to the competitive environment, changing customer preferences and new features (think agile)

Traditional monitoring works by gathering key performance metrics from the infrastructure and applications and raising an alarm whenever an alert condition is violated. These alert conditions have to be configured manually by the user, which is part of why these tools are so disliked. Adding to the pain, there is no smart layer that batches related alerts together. In a microservices world, an IT incident typically impacts multiple components, and these tools raise an alarm for each one of them (or, in fact, for each configured metric). Imagine getting a flood of a hundred email alerts (or worse, a phone call every minute) when you already know there is a problem and are working on it.
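To make the mechanics concrete, here is a minimal sketch of what per-metric threshold alerting looks like. The metric names and thresholds are purely illustrative assumptions, not any particular tool’s configuration; the point is that every metric crossing its hand-tuned limit fires its own alert, with nothing correlating them into a single incident.

```python
# Illustrative sketch of per-metric threshold alerting (not any specific tool's API).
# Each metric carries a hand-tuned threshold; every breach fires its own alert,
# so one incident touching many services produces a flood of notifications.

THRESHOLDS = {                      # manually configured by the operator
    "api.latency_p99_ms":    500,
    "api.error_rate_pct":      1,
    "db.cpu_util_pct":        90,
    "cache.evictions_per_s": 1000,
}

def evaluate(samples: dict[str, float]) -> list[str]:
    """Return one alert per breached metric -- no grouping into incidents."""
    alerts = []
    for metric, value in samples.items():
        limit = THRESHOLDS.get(metric)
        if limit is not None and value > limit:
            alerts.append(f"ALERT: {metric} = {value} exceeds threshold {limit}")
    return alerts

# One outage that impairs several services triggers a separate page per metric:
print(evaluate({
    "api.latency_p99_ms": 2300,
    "api.error_rate_pct": 7.5,
    "db.cpu_util_pct": 97,
    "cache.evictions_per_s": 4200,
}))
```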

Faced with an incident, you realise that monitoring tools are good at telling you the symptoms of an ongoing problem, but not much else. They overload you with a hundred metric graphs and many more logs, and you have to form a mental model of what could have transpired. Imagine having thirty different services running, with twenty of them in an impaired state. You are trying to figure out where the bottleneck in your infrastructure is, what changed in the last rollout, and so on. That is not a trivial task by any stretch of the imagination. All the while, your customers are being affected and senior management is bugging you to fix the problem ASAP. Total chaos, no? No wonder diagnosing and resolving an incident takes so much time and such a mental toll.

Living in an era of AI, shouldn’t a smart monitoring tool take care of the thresholding and baselining by itself? Moreover, why should one get individual alerts for each metric when they actually belong to the same incident?
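As a rough illustration of what “taking care of the baselining by itself” could mean, here is a toy sketch that flags a sample as anomalous when it deviates from a rolling baseline rather than from a hand-picked static threshold. The window size and sensitivity factor are arbitrary assumptions for the example, not ArchSaber’s actual method.

```python
# Toy baselining sketch: flag a sample as anomalous if it deviates from a
# rolling mean by more than k standard deviations. Window size and k are
# arbitrary illustrative choices.
from collections import deque
from statistics import mean, stdev

class RollingBaseline:
    def __init__(self, window: int = 60, k: float = 3.0):
        self.history = deque(maxlen=window)
        self.k = k

    def is_anomalous(self, value: float) -> bool:
        if len(self.history) >= 2:
            mu, sigma = mean(self.history), stdev(self.history)
            anomalous = abs(value - mu) > self.k * max(sigma, 1e-9)
        else:
            anomalous = False          # not enough data to judge yet
        self.history.append(value)
        return anomalous

baseline = RollingBaseline()
for v in [100, 102, 99, 101, 98, 103, 100, 340]:   # last point is a spike
    print(v, baseline.is_anomalous(v))
```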

What if there were a tool that, instead of simply throwing the symptoms at you, did the diagnosis for you, in real time? A tool that told you which line of code in the new deployment is the root cause, or which resource got throttled. And not just the root cause (a term that has seemingly become overused): it would walk you through the exact chain of events that led to the incident, replaying how the problem originated in one part of the infrastructure and spread from one service to another, creating bottlenecks or errors along its path. With the diagnosis in hand, you would just have to focus on the resolution. A real time saver indeed! And oh wait, how about proactive alerts? Wouldn’t a negative MTTR be a dream scenario? And of course, there would be the add-ons of smart thresholding/baselining and relief from alert fatigue.

This is exactly what we at ArchSaber are building. Having worked as devops & SREs ourselves, and having gone through the pain of waking up in the middle of the night just to get our heads around what is really happening in the infrastructure (let alone resolving it), we know the shortcomings of current monitoring solutions. And this is what we are trying to achieve: reduced MTTR (from hours to minutes) through actionable diagnosis and better use of a devops engineer’s precious time.

Our fundamental philosophy is that an IT infrastructure can be mathematically modelled, and that the relationships between its various components, applications, services and metrics (however dynamic they may be) can be gradually learnt by an intelligent system. This model then lets us lay out the complete chain of events that explains an incident or an anomalous behaviour.
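A highly simplified way to picture the idea (the topology and the traversal below are illustrative assumptions, not ArchSaber’s actual model): if the dependencies between services are known or learnt, an anomaly on a user-facing service can be explained by walking the dependency graph back to the upstream component where the trouble began.

```python
# Simplified illustration: given a (learnt) dependency graph and the set of
# components currently behaving anomalously, walk upstream from the impacted
# service to reconstruct a plausible chain of events. Purely a sketch.

# service -> list of services it depends on (assumed/learnt topology)
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["db"],
    "cart":     ["cache"],
    "cache":    [],
    "db":       [],
}

def chain_of_events(impacted: str, anomalous: set[str]) -> list[str]:
    """Trace from the impacted service down to the deepest anomalous dependency."""
    chain = [impacted]
    current = impacted
    while True:
        next_hop = next((d for d in DEPENDS_ON.get(current, []) if d in anomalous), None)
        if next_hop is None:
            break
        chain.append(next_hop)
        current = next_hop
    return list(reversed(chain))      # root cause first, impacted service last

# "db" saturates -> "payments" slows down -> "checkout" starts erroring out
print(" -> ".join(chain_of_events("checkout", {"checkout", "payments", "db"})))
```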

(smarter) diagnosis vs vanilla monitoring

Feel free to drop a note (arpit@archsaber.com) if you are curious about how it works or have any other feedback. We will be happy to have a discussion. Do check us out here.

Cheers!
