Going open-source in monitoring, part 0: Intro

Sergey Nuzhdin
May 12, 2017 · 3 min read

This post is part of a series about monitoring infrastructure and services. Other posts in the series:

0. Intro (this article)

1. Deploying Prometheus and Grafana to Kubernetes

2. Creating the first dashboard in Grafana

3. 10 most useful Grafana dashboards to monitor Kubernetes and services

4. Configuring alerts in Prometheus and Grafana

5. Collecting errors from production using Sentry

6. Making sense of logs with ELK stack

7. Replacing commercial APM monitoring

8. SLA, SLO, SLI and other useful abstractions

Monitoring the infrastructure is an essential part of any product. Yet it's not uncommon for companies to postpone monitoring until later, keeping it in the "nice-to-have" bucket. That's one of the reasons they spend so much time reacting to problems after a service disruption. The uptime of the infrastructure is as important to the product as the product itself. Monitoring is especially crucial in modern cloud-based systems, where containers and even whole nodes can die at any minute, leaving you unable to analyze their logs afterwards. The role of monitoring as the base for any service is perfectly described in Dickerson's Hierarchy of Reliability.

Monitoring is hard. Collecting the metrics is only part of the problem, and not the hardest part. Interpreting the data and actually using the monitoring system is much more difficult. Collecting every possible metric is of little use if you do not have proper dashboards and alerting.

For a long time, I used systems like NewRelic to monitor my infrastructure and applications. One of the main reasons was the ease of installation and the ready-to-use dashboards.

Recently, I decided to stop using commercial monitoring systems in favor of open-source ones. It's not my first attempt, though. I had already tried Prometheus and the ELK stack (Elasticsearch, Logstash, Kibana), but I never made a full switch. Even with those systems in place, I did not use them very often. So, sooner or later, the disk space or other resources were needed elsewhere, and since NewRelic was still there, these systems were deleted.

The main reason for these failures, I believe, was the question of which system is the default place to go. You can't have several systems doing almost the same thing. One of them will always be the monitoring system: the one you open when something is wrong. In my case it was NewRelic.

Another reason is not spending enough time on configuration. Since you're trying to replace an existing system that works, you're not committed. You will always have an excuse like "it didn't work for me" or "I didn't have enough time to configure it".

This time I decided to stop using NewRelic and make Grafana my default place to go.

In this series of posts, I'm planning to describe my migration path. The end goal is complete monitoring of a Kubernetes-based infrastructure, with an alerting system in place and with logs from the infrastructure and applications collected and visualized.

The final and hardest step is to replace the commercial APM.

Stay tuned.

Originally published at blog.lwolf.org on May 12, 2017.
