Achieving 360° Monitoring

JP Emelie Marcos
SignifAI

I was once in a start-up environment where we were literally “flying blind” for the better part of 6 months. We were in hyper growth, but had a limited budget and were constantly begging for breaks from all our vendors. We could barely afford the gear needed to keep up with our growth. Needless to say, commercial monitoring tools did not make the cut, so we scraped by on a few open source tools that we weren’t maintaining very well, and thus had very little data. We tried to fix downtime by guessing and experimenting, rolling back code or restarting components until things were running again. It was painful. After 6 months, we raised funds and someone in our tech ops team (of two) made the argument that we should at least implement a tool that would let us search and analyze our logs. The moment the tool was deployed, there was an instant impact; it felt like getting an IV of enriched hydration after a prolonged period of fasting. The team cut its troubleshooting cycles down dramatically, and our availability went up immediately. This really engraved in our minds the fact that monitoring = performance.

DevOps teams and professionals agree that they should have access to 100% of the data pertaining to the environment they manage in real time, all the time. However, few companies feel that they are at that stage. Through a survey that we referenced in a previous post, we found that roughly 50% of companies believe that they don’t currently have enough monitoring coverage.

The logical process that most companies adopt to achieve 360° coverage consists of: conducting an inventory of all the components of the system, categorizing components based on some subjective measure of criticality, and implementing a set of monitoring tools that covers the lot. In doing so, most end up with far more tools (and costs) than they need and a very noisy environment, which becomes an impediment to ensuring high availability, thereby cancelling part of the value the tools are supposed to provide in the first place. Through the previously mentioned survey, we also found that companies implement an average of eight commercial monitoring tools, while over 60% of respondents cited implementing 10 tools or more. There are several factors that can explain this. First, most monitoring tools are fairly specialized in scope, so a company needs a number of them to cover the entire system. Second, tool selection and adoption is not always centralized; many people across the company have a say in the selection of monitoring solutions for the portion of the system that they manage. This in turn results in redundant and overlapping tools being implemented organically over time across the company.

Whatever the reason, one could argue that more tools could be a good thing, but does it translate into more data and better visibility? Not necessarily. When the data is spread out across siloed tools, it is a lot harder to piece together the right answer to a question: you have to jump frantically back and forth between different tools, and it takes more people to do so. On average, we found that three to four people are required to solve any given problem. It takes longer as well, as the mean time to remediation is above five hours on average.

So, not enough or too many monitoring tools are both sub-optimal situations. How does one strike the right balance? What tools are absolutely required? We think that the toolkit in place needs to cover, at the very least, the following 5 categories:

  • Logs
  • Infrastructure
  • Application layer
  • Security
  • Business transactions

Let’s dive into each category.

Logs monitoring

Log analysis tools store and map all the different log sources into a uniform, normalized data set that becomes indexed and searchable, so that reports and statistics can be derived from a previously heterogeneous environment. This is also required by auditors for companies of a certain size. While some of the features and functionality are similar to those of the tools below, their applicability, while very broad, does not extend to application management or error monitoring. Some of the specialized companies in this space include Splunk, Sumologic, Logz and Elastic Stack.
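As a minimal sketch of what that normalization step looks like in practice, the snippet below maps two hypothetical raw formats (an nginx-style access line and an application log line) into one uniform, indexable schema. The regex patterns and field names are illustrative assumptions, not any vendor’s actual pipeline.

```python
import json
import re
from datetime import datetime

# Two hypothetical raw formats from different services: an nginx-style access
# line and an app log line. The patterns below are illustrative, not a spec.
ACCESS_RE = re.compile(r'(?P<ip>\S+) - - \[(?P<ts>[^\]]+)\] "(?P<request>[^"]*)" (?P<status>\d{3})')
APP_RE = re.compile(r'(?P<ts>\S+) (?P<level>[A-Z]+) (?P<message>.*)')

def normalize(line, source):
    """Map a raw log line into one uniform, indexable schema."""
    m = ACCESS_RE.match(line)
    if m:
        return {
            "timestamp": datetime.strptime(m["ts"], "%d/%b/%Y:%H:%M:%S %z").isoformat(),
            "source": source,
            "level": "INFO" if int(m["status"]) < 500 else "ERROR",
            "message": m["request"],
            "fields": {"client_ip": m["ip"], "status": int(m["status"])},
        }
    m = APP_RE.match(line)
    if m:
        return {
            "timestamp": m["ts"],
            "source": source,
            "level": m["level"],
            "message": m["message"],
            "fields": {},
        }
    return None  # a real pipeline would keep unparsed lines as raw text

if __name__ == "__main__":
    raw = [
        ('198.51.100.7 - - [26/Jan/2017:10:15:32 +0000] "GET /checkout HTTP/1.1" 502', "nginx"),
        ("2017-01-26T10:15:33 ERROR payment gateway timeout", "billing-app"),
    ]
    for line, source in raw:
        print(json.dumps(normalize(line, source)))
```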

Infrastructure monitoring

Infrastructure (whether directly managed or rented) refers to all the hardware, software, network resources and other services required for the existence, operation and management of an enterprise IT environment. It’s necessary to monitor these components as they need to be up all the time. The monitoring tools also include additional functionality, such as the ability to plan for upgrades or collaborate on outage management. Cloud providers offer these capabilities for your specific footprint on their cloud as well. It’s interesting to note that in a serverless computing environment, infrastructure is totally abstracted away; significant adoption of this model would certainly affect the need for infrastructure monitoring tools. Current examples of companies and solutions in that space include Datadog, Zabbix and SignalFX.
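To make the data flow concrete, here is a minimal sketch of host-level metric collection, assuming a local StatsD-compatible agent (the kind statsd or the Datadog agent expose) listening on UDP port 8125. The metric name and sampling interval are arbitrary choices for illustration.

```python
import os
import socket
import time

# Address of the assumed local StatsD-compatible agent.
STATSD_HOST, STATSD_PORT = "127.0.0.1", 8125
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def send_gauge(name, value):
    # StatsD gauge wire format: "<metric>:<value>|g"
    sock.sendto(f"{name}:{value}|g".encode(), (STATSD_HOST, STATSD_PORT))

if __name__ == "__main__":
    while True:
        # 1-minute load average from the kernel (Linux/macOS only)
        send_gauge("host.load.1m", os.getloadavg()[0])
        time.sleep(10)  # sample every 10 seconds
```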

Application performance management

Application performance management (APM) is the monitoring and management of performance and availability of software applications. APM tools detect and diagnose complex application performance problems to help maintain an expected level of service. The key metrics monitored broadly include end user experience as well as the computational resources used by the application. Application monitoring tools provide administrators with diagnostic features that enable them to quickly discover, isolate and solve problems that negatively impact an application’s performance. Such tools can be specific to a particular application or monitor multiple applications on the same network, also collecting data about client CPU utilization, memory demands, data throughput and bandwidth. APM tools do not analyze log files and are not appropriate for security monitoring. The leaders in this space are New Relic, Dynatrace and AppDynamics.
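As a rough illustration of the two kinds of signals an APM agent captures automatically, here is a toy instrumentation decorator that records per-call latency and peak memory allocated during the call. The function and metric names are hypothetical, and real APM agents do this transparently and with far less overhead.

```python
import functools
import time
import tracemalloc

def traced(metrics):
    """Wrap a function so each call records latency and peak memory use."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            tracemalloc.start()
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                _, peak = tracemalloc.get_traced_memory()
                tracemalloc.stop()
                metrics.append({"span": fn.__name__,
                                "latency_ms": round(elapsed_ms, 2),
                                "peak_bytes": peak})
        return wrapper
    return decorator

metrics = []

@traced(metrics)
def render_checkout_page():
    # stand-in for real application work
    return sum(i * i for i in range(100_000))

if __name__ == "__main__":
    render_checkout_page()
    print(metrics)  # e.g. [{'span': 'render_checkout_page', 'latency_ms': ..., 'peak_bytes': ...}]
```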

Security monitoring

We recommend adopting a security information and event management (SIEM) tool to get real-time analysis of security alerts generated by network hardware and applications. Such tools aggregate and analyze log files and security events for real-time trend analysis, and they help a company’s personnel detect attacks and implement defensive actions faster. These tools are expensive to deploy and maintain and are also required by auditors and regulators for companies beyond a certain size; they are mostly relevant for larger SMBs and enterprises. There is some overlap with the log analysis vendors: Splunk, for instance, is a major player in the SIEM space, but other companies include IBM, Intel, HP, LogRhythm and EMC. A large number of other companies focus on this space as well.
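For a flavor of the kind of correlation rule a SIEM runs over aggregated events, here is a simplified sketch that flags any source IP with too many failed logins inside a sliding window. The event field names and thresholds are assumptions for illustration, and the events are assumed to be already normalized and sorted by time.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)   # sliding window for counting failures
THRESHOLD = 10                  # failures per IP before alerting

def detect_bruteforce(events):
    """Yield alerts for source IPs exceeding the failed-login threshold."""
    recent = defaultdict(deque)  # src_ip -> timestamps of recent failures
    for ev in events:            # events assumed sorted by timestamp
        if ev["outcome"] != "failure":
            continue
        ts = datetime.fromisoformat(ev["timestamp"])
        q = recent[ev["src_ip"]]
        q.append(ts)
        # drop failures that fell out of the window
        while q and ts - q[0] > WINDOW:
            q.popleft()
        if len(q) > THRESHOLD:
            yield {"rule": "ssh-bruteforce", "src_ip": ev["src_ip"],
                   "failures": len(q), "last_seen": ev["timestamp"]}
```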

Business transactions

Business transactions are an often overlooked data category among DevOps professionals. Monitoring tools that track metrics such as hourly revenue flows, pipeline activity, DAU/MAU, or app rankings do not typically come to mind for technical operations teams. We argue that this is a mistake, and that teams should demand access to key business metrics and track them as part of their daily routine and analysis. Should something affect, say, the hourly revenue, executives will immediately want to know whether it is driven by internal factors. It is not uncommon for DevOps teams to be solely focused on technical metrics and somewhat dissociated from what the management team looks at. They run the risk of being perceived as misaligned on priorities, or of being caught on the wrong foot if changes in the business KPIs precede changes in the monitored data. Right now, it is extremely rare to have end-to-end monitoring where variations in key business metrics can be tied to changes in the infrastructure. Yet I am sure that most CEOs or management teams would find that illogical. We think that this is the area with the most room for improvement in general.
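As a sketch of what “treating business KPIs like any other metric” could look like, the snippet below emits order count and revenue into the same hypothetical StatsD-style pipeline used for the infrastructure metrics above, so a revenue dip can be lined up against deploys or infrastructure changes on the same timeline. The order payload and metric names are assumptions.

```python
import socket

STATSD = ("127.0.0.1", 8125)  # assumed local StatsD-compatible agent
_sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def record_order(order):
    """Emit one business transaction as ordinary monitoring counters."""
    cents = int(order["amount_usd"] * 100)
    # counter for order volume, and a second counter accumulating revenue
    _sock.sendto(b"business.orders:1|c", STATSD)
    _sock.sendto(f"business.revenue_cents:{cents}|c".encode(), STATSD)

if __name__ == "__main__":
    record_order({"id": "o-1017", "amount_usd": 49.99})
```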

Then you need to centralize the information

Since none of these tools are meant for full system monitoring, you will always miss some key piece of information if you rely on only one of them to solve an issue. Even if you have all these tools in place, it can be pretty chaotic to react to a large number of alerts coming from disparate parts of the system. Companies tend to realize pretty quickly that there is a need for a centralized mechanism, both for the data and for the process of dealing with the issue.

Some companies do it themselves. One company might, for instance, centralize everything in real time into a rewritten version of Graphite appropriate for their scale. They would also use a formatted version of Google Docs with basic templates (headings, naming conventions) to act as an information-sharing conduit for a specific incident, with sections on who’s working on what, facts known, and so on. This, alongside a real-time chat-ops “virtual war room”, is a fairly manual process, but it eventually does the trick of finding the root cause and fixing the issue. This is just one example; there are many other self-built approaches and heavily manual processes that companies have come up with.

Other companies use incident management tools like PagerDuty or VictorOps to act as a central hub that integrates with the rest of the tools. They enable the user to filter events by importance, and their highest-value features are flexible notification options and the ability to automate coordination and task-allocation processes.
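To illustrate the “central hub” idea, here is a small sketch that maps webhook-style alerts from different tools into one common schema, deduplicates them, and pages a human only on critical severity. The payload shapes and field names are assumptions, not the actual PagerDuty or VictorOps formats.

```python
SEVERITY_ORDER = {"critical": 3, "warning": 2, "info": 1}

def normalize_alert(tool, payload):
    """Map a tool-specific payload into one common alert schema."""
    if tool == "apm":
        return {"source": "apm", "severity": payload["priority"],
                "dedup_key": f"apm:{payload['app']}:{payload['condition']}",
                "summary": payload["message"]}
    if tool == "infra":
        return {"source": "infra", "severity": payload["level"],
                "dedup_key": f"infra:{payload['host']}:{payload['check']}",
                "summary": f"{payload['check']} failing on {payload['host']}"}
    raise ValueError(f"unknown tool: {tool}")

def route(alert, open_incidents, notify):
    """Open one incident per dedup_key; page only on critical alerts."""
    if alert["dedup_key"] in open_incidents:
        open_incidents[alert["dedup_key"]]["count"] += 1
        return
    open_incidents[alert["dedup_key"]] = {"alert": alert, "count": 1}
    if SEVERITY_ORDER.get(alert["severity"], 0) >= SEVERITY_ORDER["critical"]:
        notify(alert["summary"])

if __name__ == "__main__":
    incidents = {}
    a = normalize_alert("infra", {"host": "web-3", "check": "disk_space", "level": "critical"})
    route(a, incidents, notify=lambda msg: print("PAGE:", msg))
```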

Final comments around one observed trend in the monitoring space

Each set of tools mentioned above presents unique features that are somewhat overlapping, but mostly complementary to one another. Together, they typically provide 360° monitoring, with a level of noise that is high but unavoidable.

Many of the larger vendors are expanding across the spectrum to become the only solution a company will ever need. This is a response to the high level of effort it takes to implement and manage each tool, the extra tools needed to monitor those tools, and the overload of data that often results.

Larger vendors feel that it would be simpler to deal with one broad tool than with multiple point solutions. Last year, we saw Datadog, a traditional infrastructure monitoring tool, enter the APM space. Similarly, New Relic, traditionally an APM-focused vendor, acquired Opsmatic and used it as the foundation for its infrastructure monitoring offering. AppDynamics and Dynatrace, along with a slew of smaller vendors, have been combining both sets of features for longer. These efforts have some merit, as they simplify monitoring to a certain extent, although no vendor has best-of-breed features in every single segment by any stretch.

We think that there is going to be some consolidation, but we don’t see companies deciding to go with only one broad tool any time soon. It will take years for the larger vendors to develop a feature set equivalent to the full range of focused solutions that are typically deployed. The other main reason is that companies are still predominantly going for “best-of-breed”, and what “best-of-breed” means is subjective, and therefore different from one company to the next. So, we expect every company to continue to have a different environment, with a different set of monitoring tools from one another, and that heterogeneity is not going away any time soon.

Thanks again for reading.

Cheers,

JP

Originally published at blog.signifai.io on January 26, 2017.
