Alert and Monitoring with Grafana

Hakan Eröztekin
Dec 24, 2020 · 14 min read

Trendyol is one of the most popular e-commerce websites in Turkey. At any given time, there are a vast number of customers surfing on the website and apps. This means that hundreds of thousands of requests are generated for our services every second.

To ensure high availability and a seamless customer experience, we need to monitor our services closely and intervene when there’s a problem. This is where alerting and monitoring tools come into play. Thanks to these tools, we can analyze the status of our services and get notified in case of a failure.

Today we’ll take a glance at some of the most popular monitoring tools, their features and our use cases, and then discuss Grafana in depth. By the end of the article, you will have been introduced to several monitoring tools and will have seen, with examples, what Grafana can and can’t do for you. This should help you decide whether or not to use Grafana.

Topics
1. Alert and Monitoring tools
1.1 Kibana
1.2 New Relic
1.3 PRTG
1.4 Prometheus
1.5 Delivery Alerts
1.6 Grafana
2. Grafana
2.1 Monitoring with Grafana
2.2 Alerting with Grafana
2.3 Additional features
2.4 Lack of features
3. Conclusion

1. Alert & Monitoring tools

When you place an order at Trendyol, we assign it to a cargo company such as Aras Kargo or Yurtiçi Kargo, and then inform that company’s services about your order. Until your order is delivered, you can track it and see whether it has been shipped or has arrived at the cargo office.

This process requires the integration of multiple microservices. Technically, we have a custom microservice for each cargo company, and each of them listens to multiple RabbitMQ queues. So for this process we need to monitor the queues’ statuses, the RabbitMQ server status, and the microservices’ pod statuses and resource utilization.

For this and other similar needs, we use Kibana, Prometheus, New Relic, PRTG, and Grafana. Each one serves a different purpose in our team. We’ll briefly discuss what these tools are, what their pros and cons are, and how we use them.

1.1 Kibana

Kibana is a great UI tool for Elasticsearch. Elasticsearch is an excellent database if your main use case is searching for data.

  • Using Kibana, you can create tables, graphs and dashboards for your data.
  • Kibana is the best monitoring tool for Elasticsearch, yet it is oriented only towards Elasticsearch. You can’t integrate it with other data sources such as MSSQL, Couchbase and RabbitMQ as you can with Grafana.

How we use

Each of your orders is shipped to you as one or several packages (we call them “shipments”, or “gönderi” in Turkish). We use Kibana to visualize how many shipments we have had in recent weeks, both in total and per cargo company. Since there is a vast number of shipments, Elasticsearch enables us to query the data quickly, and Kibana is the best tool to visualize Elasticsearch data.

1.2 New Relic

New Relic is a pioneering monitoring and alerting tool that gives you access to a wide range of data about your services.

  • You can monitor the response time, throughput, errors, and CPU and RAM usage of each Kubernetes pod, and a lot more. You can analyze your endpoints’ performance using the Transactions feature alone.
  • You can add alerts to your services and integrate them with other services. For example, when the error rate stays above 5% for 3 minutes, you can get a message on Slack with the error graph.

How we use

Error percentage alert from New Relic on Slack

We use New Relic for almost all of our services, and we make a point of adding alerts to them. On a regular day, New Relic notifies us via Slack whenever one of our services has a high error rate or a slow response time.

1.3 PRTG

PRTG is a network-oriented monitoring tool.

  • You can monitor a wide variety of metrics with PRTG. Some examples are uptime, load, CPU/RAM and disk usage, database requests and endpoint status.

How we use

Free memory alert from PRTG on Slack

We use PRTG to monitor cargo company services (to ensure you can track your shipment on the app), and resource usage.

1.4 Prometheus

Prometheus is an open-source monitoring system that collects metrics from other systems (such as RabbitMQ and Kubernetes), stores them as time series and lets you query them.

  • When you integrate Prometheus with other services, you get ready-to-use metrics and query functions. For example, you can see your total message count on RabbitMQ in near real-time.
  • Prometheus doesn’t have a great UI. Fortunately, you can integrate it with Grafana and use the same functions there.

How we use

We use Prometheus as a query provider for Grafana. Since Grafana integrates well with Prometheus, we can access all Prometheus functions on Grafana.

1.5 Delivery Alerts

This is a scheduler project we’ve developed in Java 11 to define custom alerts that the other tools don’t provide.

  • This project is listed to point out that using an off-the-shelf tool is not always the best solution. You can easily develop your own project for your alerting requirements.
  • Each day, the scheduler runs custom queries on Elasticsearch, produces reports and publishes them to a dedicated Slack channel.
  • We can also define specific alerts for the MSSQL database, since Grafana’s SQL-like queries are not that great in terms of development and ease of use.
  • We also integrate some of our other projects with Slack, so in some cases we send a Slack message when a custom exception is thrown.
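To illustrate the idea of publishing a report to Slack, here is a minimal sketch in Python. It is not the code of our scheduler (which is written in Java); it only assumes a Slack incoming webhook, which accepts a JSON body with a "text" field. The webhook URL, the function names and the numbers are hypothetical.

```python
import json
import urllib.request

# Hypothetical placeholder; a real incoming-webhook URL comes from your Slack workspace.
SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_report_message(title, counts):
    """Format a report as a Slack incoming-webhook payload ({"text": ...})."""
    lines = [f"*{title}*"]
    for name, count in sorted(counts.items()):
        lines.append(f"- {name}: {count}")
    return {"text": "\n".join(lines)}


def build_slack_request(payload, webhook_url=SLACK_WEBHOOK_URL):
    """Build (but do not send) the HTTP POST request for the webhook."""
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Illustrative numbers only, not real data.
    payload = build_report_message("Daily shipment report", {"Cargo A": 12, "Cargo B": 7})
    request = build_slack_request(payload)
    print(request.get_method(), request.full_url)
    print(payload["text"])
    # To actually send the message: urllib.request.urlopen(request)
```

Keeping the message formatting separate from the HTTP call makes the report easy to test without touching Slack.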

1.6 Grafana

Grafana is a popular tool for analyzing and visualizing data and for alerting on it.

  • It’s an open-source, freemium tool with approximately 40k stars on GitHub.
  • It integrates with a wide variety of data sources, including PostgreSQL, Elasticsearch, MongoDB, GitHub and even Google Sheets.
  • It is used by Booking, Stack Overflow, eBay, Red Hat and many more.

How we use

We monitor tens of metrics on Grafana and have defined about as many alerts. We monitor and alert on RabbitMQ server status, resource usage and queue metrics, Kubernetes pod restarts, MSSQL database status and some SQL queries.

In case you’re wondering which tool is the best, there’s no single tool that fits every use case. You may need more than one. You can monitor application throughput with New Relic, pod restarts with Grafana, and 3rd party services’ endpoints with PRTG.

Without further ado, let’s dive deep into Grafana.

2. Grafana

First we’ll take a look at a sample dashboard. Then we’ll configure it from zero and add alerts to it, which will be published to the Slack channel of your choice. Along the way we’ll see what Grafana can and can’t do.

For the examples, we will mainly focus on RabbitMQ. If the technology you want to monitor is not RabbitMQ, the same practices apply to your tool.

Here’s our sample RabbitMQ dashboard.

Each panel above shows a metric we monitor. In near real-time, we can monitor RabbitMQ server status, channel and consumer counts, total message count, memory usage and free disk space.

As you can see, there are 2 types of panels in this dashboard. The ones at the top show a single value (“Up”, 6763, 3.3 Mil etc.) and are called singlestat panels. They display one value out of a series of data. The ones at the bottom are graph panels.

There’s an important difference between these panel types that you should know. If you want to add alerts, you have to use graph panels. Graph panels provide good visuals and historical data, but you’ll need to add one for every alert, even when the graph itself is of no use to you. This is a missing feature in Grafana.

2.1 Monitoring with Grafana

Configuring a dashboard

How do we configure such a dashboard from zero?

It may look a bit complex at first but notice that a dashboard is just a collection of panels. We’ll first configure a single panel and the rest is about repetition of this process with minor changes.

Let’s start with the configuration of a single panel.

Grafana provides highly customizable panels as well as high quality visuals.

To configure your metric from zero, you’ll need to:

  • Get data: For that we’ll add a data source
  • Configure query: Pick a function that provides data for your metric
  • Customize panel: To make our metric more expressive

Get data

To get your data, you should define a data source. This is a one-time process: after adding a data source, you can use it in all your panels without any extra effort.

To add a data source, head over to the data sources page. Select your data source from the list and enter the required information. Grafana supports various data sources and each of them has different options. You can check the data sources page to find information about the data source you’d like to use.

If you can’t find your data source on the page above, the plugin may be missing. To solve this, head over to the plugins page and install the related plugin on your Grafana.
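If you prefer scripting over clicking, a data source can also be created through Grafana’s HTTP API. The following is a minimal sketch, not the way we set ours up. It assumes the POST /api/datasources endpoint with an API key sent as a Bearer token; the Grafana URL, the key, the Prometheus address and the function names are hypothetical.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"   # assumption: a local Grafana instance
API_KEY = "replace-with-your-api-key"   # assumption: an API key with admin rights


def prometheus_datasource(name, prometheus_url):
    """Describe a Prometheus data source as a JSON-serializable dict."""
    return {
        "name": name,
        "type": "prometheus",
        "url": prometheus_url,
        "access": "proxy",  # queries go through the Grafana backend
    }


def build_create_request(datasource):
    """Build (but do not send) the POST /api/datasources request."""
    return urllib.request.Request(
        f"{GRAFANA_URL}/api/datasources",
        data=json.dumps(datasource).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_create_request(
        prometheus_datasource("Prometheus", "http://prometheus.example.internal:9090")
    )
    print(request.get_method(), request.full_url)
    # To actually create the data source: urllib.request.urlopen(request)
```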

We’ll use Prometheus for our RabbitMQ metrics because:

  • Prometheus integrates well with RabbitMQ.
  • It provides a vast amount of ready-to-use queries.

Prometheus has its own UI but it plays a better role as a backend for Grafana since Grafana provides better UI and UX.

Configure query

Now define the query for your metric. To monitor the server status, we use the rabbitmq_up query. Grafana refreshes the panel automatically, so you don’t need to.

Here’s the result.

Customize panel

What we see is the current value, which is 1. So the server is up. 1 or 0 is understandable but we should make it more expressive.

Click the “Show options” button at the top right to see the available customizations.

The two most useful customizations for panels like this are value mappings and thresholds. Value mappings let you replace certain values with text (for example, 1 with “Up” and 0 with “Down”), and thresholds let you define colors for value ranges. That’s all for our server status panel.
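Conceptually, value mappings and thresholds are just two small lookups applied to the current value. The sketch below shows the idea in plain Python; it is not how Grafana implements it, and the names and the red/green thresholds are our own illustrative choices.

```python
# Value mappings: replace specific values with display text.
VALUE_MAPPINGS = {1: "Up", 0: "Down"}

# Thresholds: (lower bound, color) pairs, sorted by lower bound.
# A value gets the color of the highest lower bound it reaches.
THRESHOLDS = [(0, "red"), (1, "green")]


def display_text(value):
    """Return the mapped text, or the value itself when no mapping exists."""
    return VALUE_MAPPINGS.get(value, str(value))


def display_color(value):
    """Return the color of the last threshold whose lower bound is reached."""
    color = THRESHOLDS[0][1]
    for lower_bound, threshold_color in THRESHOLDS:
        if value >= lower_bound:
            color = threshold_color
    return color


if __name__ == "__main__":
    for status in (1, 0):
        print(status, display_text(status), display_color(status))
```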

From now on, we can add the rest of the panels just by changing the query field.

The queries we use for the panels:

  • Server status: rabbitmq_up
  • Channel count: rabbitmq_channelsTotal
  • Node up stats: rabbitmq_running
  • Consumer count: rabbitmq_consumersTotal
  • Free disk space: rabbitmq_node_disk_free
  • Total messages: sum(rabbitmq_queue_messages)

You can also combine queries with functions, and even use regular expressions, to build your own custom queries:

  • Kubernetes pod restart counts: sum(kube_pod_container_status_restarts_total{namespace="your-name-space", pod=~"your-pod-name"})
  • RabbitMQ memory usage: 100 * (rabbitmq_node_mem_used / rabbitmq_node_mem_limit)
  • Top 5 defer queues by message count: topk(5, rabbitmq_queue_messages{queue=~".*defer"})
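The queries above can also be tried outside Grafana, directly against Prometheus. The sketch below uses Prometheus’ instant query endpoint (/api/v1/query), which returns each series as a pair of labels and a [timestamp, value] sample. The Prometheus address, the function names and the sample response are assumptions made for illustration.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a local Prometheus


def build_query_url(promql, base_url=PROMETHEUS_URL):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(body):
    """Return (labels, value) pairs from an instant-query JSON response."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body.get('error', 'unknown error')}")
    return [
        (item["metric"], float(item["value"][1]))  # value is [timestamp, "string"]
        for item in body["data"]["result"]
    ]


def run_query(promql):
    """Send the query to Prometheus and parse the result (needs a live server)."""
    with urllib.request.urlopen(build_query_url(promql)) as response:
        return parse_instant_vector(json.load(response))


if __name__ == "__main__":
    # Hand-written sample in the shape of an instant-query response.
    sample = {
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {"__name__": "rabbitmq_up"}, "value": [1608800000, "1"]}],
        },
    }
    print(build_query_url('topk(5, rabbitmq_queue_messages{queue=~".*defer"})'))
    print(parse_instant_vector(sample))
```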

So far so good. We have a server status panel that shows a green “Up” or a red “Down” message.

2.2 Alerting with Grafana

We want to get notified when the server is down. How do we do it?

Alerting is the most important feature of Grafana in my opinion. Being able to check metrics whenever you want is great, but you should not need to constantly watch them to find out if something is wrong. Instead, Grafana should tell you that something is wrong, and then you can check the metrics.

Adding an alert from zero is a two-step process:

  • Define a notification channel
  • Add alert for your panel

Let’s see.

Define a notification channel

You can get notifications via Email, Discord, Slack, LINE and so on.

We like Slack at Trendyol, so we’ve used Slack as the notification channel. When adding a channel, you provide your Slack Webhook URL and your channel name. Write your Slack channel name (say “delivery-alerts”) in the recipient section.

The two options worth mentioning are send reminders and include images. If you’re new to Grafana, these can be a bit too detailed. You can skip them at the beginning but may need them in the future.

Send reminders

This feature periodically resends the alert message to remind you that something is still wrong.

Say you set the period to 3 minutes and you get a “Server is down” alert at 3:00 PM. You’ll get the “Server is down” alert again at 3:03 PM, 3:06 PM and so on, until the server is up.
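The reminder schedule is simple date arithmetic. As a small sketch (the function name is ours, and the date is just an example):

```python
from datetime import datetime, timedelta


def reminder_times(first_alert, period, count):
    """Return the times of the next `count` reminders after the first alert."""
    return [first_alert + period * i for i in range(1, count + 1)]


if __name__ == "__main__":
    first = datetime(2020, 12, 24, 15, 0)  # alert fired at 3:00 PM
    for moment in reminder_times(first, timedelta(minutes=3), 3):
        print(moment.strftime("%I:%M %p"))
```

With a 6-hour period instead of 3 minutes, the same function shows why long periods suit less critical alarms: only four reminders per day.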

This feature is also useful for less critical alarms; in that case you can set the period to something like 6 hours.

Beware that reminders are defined per notification channel, not per alert. Thus, you will get reminders for every alert you add to this notification channel.

Say you have 8 alerts defined for your Slack channel (“delivery-alerts”) but you only want to get reminders for a specific alert (e.g. “message count alert”). In this case, you should add a second Grafana notification channel that points to the same Slack channel.

Define two Grafana notification channels for your Slack channel

The two configurations are the same except for the send reminders option. When selecting a notification channel for an alert, we pick delivery-alerts-with-reminder if we want reminders.
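Expressed as data, the two notification channels could look like the sketch below. The field names (sendReminder, frequency, settings.url, settings.recipient) follow our understanding of the notification channel objects in Grafana’s alert notification API and should be treated as assumptions to verify against your Grafana version. The webhook URL is a placeholder and the helper names are hypothetical.

```python
import copy


def slack_channel(name, webhook_url, recipient):
    """Describe a Slack notification channel without reminders."""
    return {
        "name": name,
        "type": "slack",
        "sendReminder": False,
        "settings": {"url": webhook_url, "recipient": recipient},
    }


def with_reminder(channel, every):
    """Return a copy of the channel that resends alerts every `every` (e.g. "3m")."""
    variant = copy.deepcopy(channel)
    variant["name"] = channel["name"] + "-with-reminder"
    variant["sendReminder"] = True
    variant["frequency"] = every
    return variant


if __name__ == "__main__":
    base = slack_channel(
        "delivery-alerts",
        "https://hooks.slack.com/services/XXX/YYY/ZZZ",  # placeholder webhook URL
        "delivery-alerts",
    )
    reminder = with_reminder(base, "3m")
    # Both point to the same Slack channel; only the reminder settings differ.
    print(base["name"], base["sendReminder"])
    print(reminder["name"], reminder["sendReminder"], reminder["frequency"])
```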

Include images

Adding images to your alerts is possible but not that possible

You have the option to show images of your panels in your alerts. This is a useful feature, but it doesn’t work just by enabling it. This rather simple feature requires defining an image storage provider and a lot of configuration work. Grafana definitely needs an improvement here.

Adding alerts to your panels

How do we add our “if the server is down, send an alert to Slack” rule?

Remember we can’t add alerts to singlestat panels so we should use a graph panel to add alerts.

To change our panel type, head over to the options and pick the one you want. Notice that the “Alerts” tab becomes visible when we use a graph panel.

Here are the basics of alert customization:

  • Conditions: Add one or more conditions; when they evaluate to TRUE, your metric goes into the alerting state.
  • Notifications: Select which notification channel (a Slack channel in our case) your alert message should be sent to.
  • Message: Add extra information to your alert message.
  • Rule: “Evaluate every” is how often your alert condition is checked. “For” is how long the condition must hold before an alert is sent; 0m disables the wait.

Server Status Alert

This is the exact alert configuration we use in our Server Status panel. Every 5 seconds it checks whether the server status is 0 and, if so, sends a “Server is down” alert to the delivery-alerts Slack channel.

In the send to section, if we want periodic reminder messages, we can select the delivery-alerts-with-reminder channel that we discussed under the Send reminders feature.
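To make the interplay of “Evaluate every” and “For” more concrete, here is a simplified sketch of the state transitions. It is our own model, not Grafana’s implementation, and the exact boundary behaviour in Grafana may differ. The condition is checked at fixed intervals; while it has been true for less than the “For” duration the alert is pending, and afterwards it is alerting.

```python
def alert_states(condition_results, evaluate_every, for_duration):
    """Return one state per evaluation: "ok", "pending" or "alerting".

    condition_results: list of booleans, one per evaluation.
    evaluate_every, for_duration: durations in seconds.
    """
    states = []
    true_since = None  # time at which the condition first became true
    for index, is_true in enumerate(condition_results):
        now = index * evaluate_every
        if not is_true:
            true_since = None
            states.append("ok")
            continue
        if true_since is None:
            true_since = now
        if now - true_since >= for_duration:
            states.append("alerting")
        else:
            states.append("pending")
    return states


if __name__ == "__main__":
    checks = [False, True, True, True, False]
    # Evaluate every 5 seconds, wait 10 seconds before alerting.
    print(alert_states(checks, evaluate_every=5, for_duration=10))
    # With "For" set to 0 the alert fires on the first failing check.
    print(alert_states(checks, evaluate_every=5, for_duration=0))
```

A non-zero “For” trades a slower notification for fewer alerts caused by short blips.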

Condition functions

Let’s explore the conditions section.

  • In the query section, we define the metric we want to add an alert to. We also specify a time range.
  • The query section is structured as query(metricName, start, end), so query(A, 1m, now) means that it works on metric A and checks the range from 1 minute ago to now. A is the name assigned to the rabbitmq_up query we’ve added.
  • The last function picks the latest value in the range. Since our query is defined from 1 minute ago to now, it’ll just fetch the current server status.

Our alert on Slack
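A condition such as “last() of query(A, 1m, now) is below 1” can be read as three steps: select the samples in the time range, reduce them to one value, and compare that value with a threshold. The sketch below models these steps in plain Python. The function names and the “is below 1” comparison are assumptions for illustration, and the handling of a range with no data is simplified.

```python
def select_range(samples, start, end):
    """Keep the (timestamp, value) samples with start <= timestamp <= end."""
    return [(t, v) for t, v in samples if start <= t <= end]


def last(points):
    """Reducer: the most recent value in the range, or None if it is empty."""
    return points[-1][1] if points else None


def is_below(samples, now, window_seconds, threshold):
    """Evaluate "last() of query(A, window, now) is below threshold"."""
    value = last(select_range(samples, now - window_seconds, now))
    return value is not None and value < threshold


if __name__ == "__main__":
    # rabbitmq_up samples as (seconds, value); the server goes down at t=60.
    samples = [(0, 1), (30, 1), (60, 0)]
    print(is_below(samples, now=60, window_seconds=60, threshold=1))
    print(is_below(samples, now=30, window_seconds=60, threshold=1))
```

Swapping last for another reducer (an average or a maximum, for example) changes only the middle step.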

That’s all for adding an alert. Now we have two panels, one singlestat and one graph, and we’ll get a message on our Slack channel when the server is down. From this point on, you can repeat the same process with minor tweaks (such as changing your query and value mappings) to create your own dashboards and define alerts for them.

2.3 Additional features

Let’s briefly talk about two more useful features of Grafana.

Ready-to-use dashboards

There are more than 3000 dashboards that can be easily imported into your Grafana. When you find a dashboard you like, you just copy its id (e.g. 2343) and paste it into the import section of your Grafana. With little to no configuration, your dashboard will be ready. This is quite a powerful feature of Grafana.

Grafana API

Grafana has an HTTP API with a large number of endpoints. Considering that each panel and dashboard is a JSON object, this means there are countless ways to customize and manage your Grafana. For example, using the Dashboard API, you can execute CRUD operations on your dashboards.

It’s also worth mentioning that Grafana is evolving pretty quickly. New features and bug fixes are released actively. So it’s definitely worth hopping on the train.
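As a rough sketch of what CRUD on dashboards looks like, the snippet below builds requests for three Dashboard API endpoints: GET and DELETE on /api/dashboards/uid/:uid, and POST on /api/dashboards/db for creating or updating. The Grafana URL, the API key, the uid and the helper names are hypothetical, and the requests are built but not sent.

```python
import json
import urllib.request

GRAFANA_URL = "http://localhost:3000"   # assumption: a local Grafana instance
API_KEY = "replace-with-your-api-key"   # assumption: a valid API key


def _request(method, path, body=None):
    """Build an authenticated request against the Grafana HTTP API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    return urllib.request.Request(
        f"{GRAFANA_URL}{path}",
        data=data,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method=method,
    )


def get_dashboard(uid):
    """Read a dashboard by its uid."""
    return _request("GET", f"/api/dashboards/uid/{uid}")


def save_dashboard(dashboard, overwrite=False):
    """Create a dashboard, or update it when it already exists."""
    return _request("POST", "/api/dashboards/db",
                    {"dashboard": dashboard, "overwrite": overwrite})


def delete_dashboard(uid):
    """Delete a dashboard by its uid."""
    return _request("DELETE", f"/api/dashboards/uid/{uid}")


if __name__ == "__main__":
    for request in (
        get_dashboard("abc123"),
        save_dashboard({"title": "RabbitMQ"}),
        delete_dashboard("abc123"),
    ):
        print(request.get_method(), request.full_url)
    # To execute one of them: urllib.request.urlopen(request)
```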

2.4 Lack of features

Lastly, we’ll talk about some disadvantages and missing features you may encounter while using Grafana. We’ll also talk about which version to use.

A bug in an alerting function: Earlier versions of Grafana had a bug in the diff and percent_diff functions. Say you want to define an alarm for a 20% increase in message count over the last 30 minutes. The condition would be the following: percent_diff of query(A, 30m, now) is above 20. It works quite well for a 20% increase, but you also get the alert for a 20% decrease, which can result in pretty annoying unwanted alerts. Fortunately, this bug has been fixed.

Security issue: Older versions of Grafana had a security issue, so it’s advised to use version 7.0.2 or higher.

Messages are not customizable: There are very limited options for customizing your alert message. You can’t use styling or variables in the message section. You can’t add buttons or similar elements, even though Slack supports them.

No scheduling: You can’t turn off alerts on a schedule. You can manually pause them one by one, but you can’t schedule them or temporarily disable them for a time window. You will keep getting alerts after midnight or on Sunday morning.
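One way to picture the difference is to compare a signed percentage difference with one that drops the sign. The sketch below is our own illustration of the behaviour described above, not Grafana’s code; the function names and the message counts are made up.

```python
def percent_diff_signed(oldest, newest):
    """Percentage change from oldest to newest; negative for a decrease."""
    return (newest - oldest) * 100 / abs(oldest)


def percent_diff_absolute(oldest, newest):
    """Percentage change without its sign; increases and decreases look alike."""
    return abs(percent_diff_signed(oldest, newest))


def is_above(value, threshold):
    """The "is above" comparison used in the alert condition."""
    return value > threshold


if __name__ == "__main__":
    increase = (1000, 1300)  # message count grows by 30%
    decrease = (1000, 700)   # message count shrinks by 30%
    for oldest, newest in (increase, decrease):
        print(
            oldest, newest,
            is_above(percent_diff_signed(oldest, newest), 20),
            is_above(percent_diff_absolute(oldest, newest), 20),
        )
```

With the signed version only the increase triggers the alert, while the absolute version also fires on the decrease, which is the unwanted case.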

No warning state: Alerting is a boolean state. There’s no warning alert you can send to your Slack. It’s either alerting (red) or not (green). There’s a pending state too, but it’s not the same as warning state.

Lack of endpoint monitoring & alerting: You can’t simply check the status of your URL endpoints with Grafana. There’s a plugin for that, but it’s limited to 3 endpoints; beyond that you’ll need to pay $100 monthly. That’s why we use PRTG for this need.

3. Conclusion

As one of the most popular monitoring tools, Grafana is an extremely useful tool with great visuals. It has helped us monitor and manage alerts for various technologies (from RabbitMQ to Kubernetes, MSSQL and Elasticsearch) in one place. It’s easy to learn and use, so it’s no surprise that a lot of great companies are using it. Even though it’s evolving quickly, it still lacks some basic features, as we’ve discussed. Overall, Grafana is quite a powerful tool that you should definitely consider using in your company.

Happy coding!

Trendyol Tech

Trendyol Tech Team