Infrastructure Monitoring | Defense against surprise downtime

@bhishek Tamrakar
5 min read · May 24, 2018


Example of a monitoring dashboard. Pic courtesy: Google

Infrastructure monitoring is an integral and important part of infrastructure management. It is, in other words, an IT manager’s first line of defense against surprise downtime. Not all issues cause downtime, but some can inject considerable downtime into live infrastructure, sometimes causing heavy losses of money and material.

Monitoring is meant to collect time-series data from your infrastructure and analyze it, in order to foresee an upcoming issue with the infrastructure and its underlying components before the issue actually hits, giving an IT manager or the support staff enough time to prepare a resolution plan and apply it.

It is the same as knowing we are going to fall sick: we would rather see a doctor and get medication early, to save ourselves some pain.

A good monitoring system in place provides:

  1. A measurement of performance of the infrastructure over time.
  2. Node level analysis and alerts.
  3. Network level analysis and alerts.
  4. Downtime analysis and alerts.
  5. An answer to the 5 W’s of incident management and RCA (root cause analysis):
  • What was the actual issue?
  • When did it happen?
  • Why did it happen?
  • What was the downtime?
  • What needs to be done to avoid it in the future?

How to build a strong monitoring system?
There are a great many tools available in the market that can help build a viable and strong monitoring system. The only decision is which one to use, and the answer lies in what we want to achieve with the monitoring in place.

Honestly, of the variety of tools available, a few are paid enterprise monitoring tools, managed and backed by an enterprise, while the others are open-source, unmanaged or community-managed software. The decision on which one to use depends on various financial and business factors.

I will, however, keep my focus mostly on open-source tools and how to create a strong monitoring architecture using them. To start with, we have to consider the factors responsible for resolving most issues in IT operations.

Log Collection and analysis
If I say logs are helpful, that would be an understatement. Logs not only help in debugging an issue; they also provide a lot of information for proactively predicting an upcoming issue.

Logs are the first door one would open in case of any issue with a software component.

Fluentd or Logstash can be used for log collection. The only reason I would use Fluentd over Logstash is its independence from a Java process: it is written in C and Ruby. Regardless, in the end it is log collection we want to achieve.
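As a sketch of what log collection with Fluentd can look like, here is a minimal configuration that tails an application log and ships it to Elasticsearch. The file paths, tag, and Elasticsearch host below are illustrative assumptions, and the `elasticsearch` output requires the fluent-plugin-elasticsearch plugin:

```
# Tail a (hypothetical) JSON-formatted application log
<source>
  @type tail
  path /var/log/app/app.log
  pos_file /var/lib/fluentd/app.log.pos
  tag app.logs
  <parse>
    @type json
  </parse>
</source>

# Ship everything tagged app.* to Elasticsearch
<match app.**>
  @type elasticsearch          # needs fluent-plugin-elasticsearch
  host elasticsearch.internal  # assumed in-subnet Elasticsearch host
  port 9200
  logstash_format true         # index as logstash-YYYY.MM.DD, Kibana-friendly
</match>
```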

The method of analyzing log data over time and producing real-time logging metrics is log analytics; Elasticsearch is one powerful tool that can do just that. All we need after this is a tool that can collect those logging metrics and let us visualize log trends in the form of charts and graphs. Kibana is our best-voted answer to that question.
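To make “logging metrics” concrete, here is a sketch of an Elasticsearch aggregation that counts error-level log lines per hour. The index pattern and field names are assumptions about how the logs were indexed; `calendar_interval` is the parameter name from Elasticsearch 7.x onward:

```json
GET /logstash-*/_search
{
  "size": 0,
  "query": { "match": { "level": "ERROR" } },
  "aggs": {
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "1h"
      }
    }
  }
}
```

A sudden rise in the hourly bucket counts is exactly the kind of trend Kibana can chart and that helps predict an issue before it hits.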


Logs can hold sensitive information; a few points to remember:

  1. Always transport logs over a secure connection.
  2. The logging/monitoring infrastructure should be implemented inside a restricted subnet.
  3. Monitoring user interfaces such as Kibana and Grafana should allow only restricted or authenticated access for stakeholders.
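As one illustrative way to enforce the last point, Kibana can be placed behind a reverse proxy that requires authentication. A minimal nginx sketch, where the server name, certificate paths, and htpasswd file are assumptions:

```
server {
    listen 443 ssl;
    server_name kibana.example.internal;           # assumed internal name

    ssl_certificate     /etc/nginx/tls/kibana.crt;
    ssl_certificate_key /etc/nginx/tls/kibana.key;

    location / {
        auth_basic           "Monitoring";          # restrict to stakeholders
        auth_basic_user_file /etc/nginx/.htpasswd;  # pre-created credentials
        proxy_pass http://127.0.0.1:5601;           # Kibana's default port
    }
}
```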

Collect Node Level Metrics

Not everything is logged! Yes, you heard that right: logging is intended for a software or a process, not for all components in the infrastructure.

OS disks, externally mounted data disks/EBS storage, CPU, IO, network packets, inbound and outbound connections, physical memory, virtual memory, buffer space, and queues are some of the major components that do not often show up in logs unless something fails.

So, how can that data be collected? Prometheus gives an answer: all we need to do is install node exporters on the VM nodes and configure Prometheus to collect time-based data for these unattended components. Grafana, on the other hand, can use the data collected by Prometheus to give us a live visual representation of our nodes’ current status.
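A sketch of the Prometheus side of this setup: node_exporter listens on port 9100 by default, and a scrape job in prometheus.yml tells Prometheus to pull metrics from it. The target addresses below are placeholders:

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: 'node'
    scrape_interval: 15s       # how often to pull node metrics
    static_configs:
      - targets:
          - '10.0.0.11:9100'   # VM nodes running node_exporter
          - '10.0.0.12:9100'   # (placeholder addresses)
```

Grafana then points at Prometheus as a data source and charts queries over these metrics.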

Monitoring. Pic courtesy: Google

Alerts and Notifications

It is a big world! Stakeholders for our infrastructure could be in any part of the world, or spread across several. We cannot complete monitoring without its last aspect: alerts and notifications. It is important to send a notification to the stakeholders in case of issues, so they can be fixed and analyzed to avoid them in the future.

Prometheus, with predefined alerting rules and its in-house Alertmanager, and Grafana too, can send alerts, serving as open-source alternatives to paid alerting tools.
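As an example of such a predefined rule, here is a sketch of a Prometheus alerting rule that fires when a root filesystem drops below 10% free space. The metric names match node_exporter v0.16+; the threshold, duration, and labels are illustrative:

```yaml
# alert-rules.yml (fragment), loaded via rule_files in prometheus.yml
groups:
  - name: node-alerts
    rules:
      - alert: RootDiskAlmostFull
        expr: >
          node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"} < 0.10
        for: 10m                # condition must hold 10 minutes before firing
        labels:
          severity: warning
        annotations:
          summary: "Root disk on {{ $labels.instance }} is over 90% full"
```

Alertmanager then routes the fired alert to email, chat, or a paging service according to its own routing configuration.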

There are many tools that offer paid alerting and notification services. I would prefer OpsGenie, for it can send alerts based on the on-call and work-hour schedules of the engineers and stakeholders, ensuring the alert reaches someone ready to jump on the issue and fix it.

Combine all the tools we discussed, and we will see an architecture similar to the one below.

Monitoring workflow

Conclusion

We discussed creating a strong and stable monitoring system. I hope I was able to paint a picture of what a good monitoring architecture should include.

In the end, it is one’s choice to use a tool based on their needs and their infrastructure. Almost all the tools in this article (except OpsGenie) are open-source and are used by various organizations for their monitoring purposes.

While I tried to cover all the aspects needed to build a viable monitoring solution for IT infrastructures, it is also possible that I overlooked something; in that case, leave your suggestions in the comments below to help improve it.



DevOps Lead | Cloud | Container and Open Source Enthusiast | Learn and Help Others Learn