Alert and Monitoring with Grafana
Trendyol is one of the most popular e-commerce websites in Turkey. At any given time, a vast number of customers are browsing the website and apps. This means that hundreds of thousands of requests are generated for our services every second.
To ensure high availability and a seamless customer experience, we need to monitor our services closely and intervene when there’s a problem. This is where alerting and monitoring tools come into play. Thanks to these tools, we can analyze the status of our services and get notified in case of any failure.
Today, we’ll take a glance at some of the most popular monitoring tools, their features, and our use cases, and then we’ll discuss Grafana in depth. By the end of the article, you will have been introduced to various monitoring tools and will have seen, with examples, what Grafana can and can’t do for you. This will help you decide whether or not you should use Grafana.
Topics
1. Alert and Monitoring tools
1.1 Kibana
1.2 New Relic
1.3 PRTG
1.4 Prometheus
1.5 Delivery Alerts
1.6 Grafana
2. Grafana
2.1 Monitoring with Grafana
2.2 Alerting with Grafana
2.3 Additional features
2.4 Lack of features
3. Conclusion
1. Alert & Monitoring tools
When you place an order at Trendyol, we assign a cargo company, such as Aras Kargo or Yurtiçi Kargo, to it. We then inform that company’s services about your order. Until your order is delivered, you can track it and see whether it has been shipped or has arrived at the cargo office.
This process requires the integration of multiple microservices. Technically, we have a custom microservice for each cargo company. Each of them listens to multiple RabbitMQ queues. So, we need to monitor the queues’ statuses, RabbitMQ server status, microservices’ pod statuses and resource utilization for this process.
For this and other similar needs, we use Kibana, Prometheus, New Relic, PRTG, and Grafana. Each one serves a different purpose in our team. We’ll briefly discuss what these tools are, what their pros and cons are, and how we use them.
1.1 Kibana
Kibana is a great UI tool for Elasticsearch. Elasticsearch is an excellent database if your primary use case is searching for data.
- Using Kibana, you can create tables, graphs, and dashboards for your data.
- Kibana is the best monitoring tool for Elasticsearch. However, it works only with Elasticsearch. So, you can’t integrate it with other data sources such as MSSQL, Couchbase, and RabbitMQ as you can with Grafana.
How we use it
Each of your orders is shipped to you in one or several packages (we call them “shipments”, or “gönderi” in Turkish). We use Kibana to visualize how many shipments we have had in recent weeks, both in total and per cargo company. Since there is a vast number of shipments, Elasticsearch enables us to query the data quickly, and Kibana is the best tool to visualize Elasticsearch data.
1.2 New Relic
New Relic is a pioneering monitoring and alerting tool that gives you access to a large amount of data about your services.
- You can monitor the response time, throughput, errors, and CPU and RAM usage of each Kubernetes pod, and much more. You can analyze your endpoints’ performance using just the Transactions feature.
- You can add alerts to your services and integrate them with other services. For example, when the error rate stays above 5% for 3 minutes, you can get a message in Slack with the error graph.
How we use it
We use New Relic for almost all of our services, and we make a point of adding alerts to them. On a regular day, New Relic notifies us via Slack whenever one of our services has a high error rate or a slow response time.
1.3 PRTG
PRTG is a network-oriented monitoring tool.
- You can monitor a wide variety of metrics with PRTG, for example uptime, load, CPU/RAM and disk usage, database requests, and endpoint status.
How we use it
We use PRTG to monitor cargo company services (to ensure you can track your shipment on the app) and resource usage.
1.4 Prometheus
Prometheus is an open-source monitoring system and time-series database that collects metrics from other systems (such as RabbitMQ and Kubernetes) and lets you query them.
- When you integrate Prometheus with other services, you get ready-to-use metrics and query functions. For example, you can see your total message count on RabbitMQ in near real-time.
- Prometheus doesn’t have a great UI. Fortunately, you can integrate it with Grafana and use its functions from there.
How we use it
We use Prometheus as a query provider for Grafana. Since Grafana integrates well with Prometheus, we can access all Prometheus functions in Grafana.
1.5 Delivery Alerts
This is a scheduler project we’ve developed in Java 11 to define custom alerts that the other tools don’t provide.
- We list this project to point out that an off-the-shelf tool is not always the best solution. You can develop your own project for your alerting requirements.
- Each day, the scheduler runs custom queries on Elasticsearch, produces reports, and sends them to a dedicated Slack channel.
- We can also define specific alerts for the MSSQL database, since Grafana’s SQL-like queries are not that great in terms of development and ease of use.
- We also integrate some of our other projects with Slack, so in some cases, we send a Slack message when a custom exception is thrown.
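To make the daily report idea more concrete, here is a minimal sketch of what such a job could look like. Our scheduler is written in Java 11; the sketch below uses Python only for brevity and is not our actual code. The field names "createdDate" and "cargoCompany" are assumptions for illustration, and the numbers in the sample response are made up.

```python
import json

# Hypothetical field names: "createdDate" and "cargoCompany" are assumptions.
def build_shipment_query(days=1):
    """Build a search body that counts shipments per cargo company."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the documents
        "query": {"range": {"createdDate": {"gte": f"now-{days}d/d"}}},
        "aggs": {"by_company": {"terms": {"field": "cargoCompany"}}},
    }

def format_report(response):
    """Turn the aggregation part of a search response into report text."""
    buckets = response["aggregations"]["by_company"]["buckets"]
    lines = [f"{b['key']}: {b['doc_count']} shipments" for b in buckets]
    return "Daily shipment report\n" + "\n".join(lines)

# A response shaped like a terms aggregation result (illustrative numbers).
sample_response = {
    "aggregations": {"by_company": {"buckets": [
        {"key": "Aras Kargo", "doc_count": 120},
        {"key": "Yurtiçi Kargo", "doc_count": 95},
    ]}}
}

print(json.dumps(build_shipment_query(), indent=2))
print(format_report(sample_response))
```

The report text produced by format_report is what the scheduler would then post to Slack.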
1.6 Grafana
Grafana is a popular tool for analyzing and visualizing data and for alerting.
- It’s an open source freemium tool that has approximately 40k stars on GitHub.
- It integrates with a wide variety of data sources, including PostgreSQL, Elasticsearch, MongoDB, GitHub, and even Google Sheets.
- It is used by Booking, Stack Overflow, eBay, Red Hat and many more.
How we use it
We monitor tens of metrics and have defined many alerts in Grafana. We monitor and alert on RabbitMQ server status, resource utilization and queue metrics, Kubernetes pod restarts, MSSQL database status, and the results of some SQL queries.
If you are wondering which tool is the best, no single tool fits every use case. You may need more than one. You can monitor application throughput with New Relic, pod restarts with Grafana, and 3rd party services’ endpoints with PRTG.
Without further ado, let’s dive deep into Grafana.
2. Grafana
First, we’ll take a look at a sample dashboard. Then, we’ll configure it from zero and add alerts to it, which will be published to your desired Slack channel. Along the way, we’ll see what Grafana can and can’t do.
For the examples, we will mainly focus on RabbitMQ. If the technology you want to monitor is not RabbitMQ, remember that the same principles apply to your tool.
Here’s our sample for the RabbitMQ dashboard.
Each panel given above displays a metric we monitor. In near real-time, we can monitor RabbitMQ server status, channel and consumer counts, total message count, memory usage, and free disk space.
As you can see, there are 2 types of panels in this dashboard. The ones at the top display a single value (“Up”, 6763, 3.3 Mil, etc.), and they are called singlestat panels. They display a single value from a series of data. The ones below are graph panels.
There’s a significant difference between these panel types that you should know. If you want to add alerts, you have to use graph panels. Graph panels provide good visuals and historical data, but you will need one for every alert even when you have no other use for the graph itself. This is a missing feature in Grafana.
2.1 Monitoring with Grafana
Configuring a dashboard
How do we configure such a dashboard from zero?
It may look a bit complex at first, but notice that a dashboard is just a collection of panels. First, we will configure a single panel; the rest is a repetition of this process with minor changes.
Let’s start with the configuration of a panel.
Grafana provides highly customizable panels as well as high-quality visuals.
To configure your metric from zero, you’ll need to:
- Get data: Add a data source
- Configure query: Pick a function that provides data for your metric
- Customize panel: Make the metric more expressive
Get data
To get your data, you should define a data source. This is a one-time process, so after adding a data source you can use it in all your panels without any extra effort.
To add a data source, head over to the data sources page. Select your data source from the list and enter the required information. Grafana supports various data sources, and each of them has different options. You can check the data sources page to find information for the data source you’d like to use.
If you can’t find your data source on the page above, the plugin may be missing. To solve this, head over to the plugins page and install the relevant plugin in your Grafana.
We’ll use Prometheus for our RabbitMQ metrics because:
- Prometheus integrates well with RabbitMQ.
- It provides a vast amount of ready-to-use queries.
Prometheus has its own UI, but it plays a better role as a backend for Grafana since Grafana provides better UI and UX.
Configure query
Now, define the query for your metric. To monitor the server status, we use the rabbitmq_up query. Grafana refreshes the panel automatically, so you don’t need to do it yourself.
Here’s the result.
Customize panel
What we see is the current value, which is 1. So, the server is up. 1 or 0 is understandable, but we should make it more expressive.
Click the “Show options” button at the top right to see the available customizations.
Two great customizations you’ll need in similar panels are value mappings and thresholds. Value mappings replace certain values with text (here, 1 with “Up” and 0 with “Down”), and thresholds define colors for value ranges. That’s all for our server status panel.
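If it helps to think of these two options in code, here is a small sketch of the logic they express. This only illustrates the idea; Grafana applies these rules through the panel options, and the names VALUE_MAPPINGS, THRESHOLDS, and render are made up for this example.

```python
# Illustrative only: in Grafana these rules are set in the panel options UI.
VALUE_MAPPINGS = {1: "Up", 0: "Down"}

# (lower bound, color) pairs, checked from the highest bound down.
THRESHOLDS = [(1, "green"), (0, "red")]

def render(value):
    """Return the (text, color) pair a single-value panel would show."""
    text = VALUE_MAPPINGS.get(value, str(value))
    # Pick the color of the first threshold the value reaches; default to red.
    color = next((c for bound, c in THRESHOLDS if value >= bound), "red")
    return text, color

print(render(1))  # ('Up', 'green')
print(render(0))  # ('Down', 'red')
```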
From now on, we can add the rest of the panels just by changing the query field.
Queries we use for the panels:
- Server status:
rabbitmq_up
- Channel count:
rabbitmq_channelsTotal
- Node up stats:
rabbitmq_running
- Consumer count:
rabbitmq_consumersTotal
- Free disk space:
rabbitmq_node_disk_free
- Total messages:
sum(rabbitmq_queue_messages)
You can also combine queries with functions and even use regex to build your own custom queries.
- Kubernetes pod restart counts:
sum(kube_pod_container_status_restarts_total{namespace="your-name-space", pod=~"your-pod-name"})
- RabbitMQ memory usage:
100 * (rabbitmq_node_mem_used / rabbitmq_node_mem_limit)
- Top 5 defer queues by message count:
topk(5, rabbitmq_queue_messages{queue=~".*defer"})
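Grafana runs these queries for you, but you can also try them directly against the Prometheus HTTP API, which exposes instant queries under /api/v1/query. The sketch below builds such a request URL and reads the first value out of a response. The host name prometheus:9090 and the sample response are assumptions for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

def first_value(response):
    """Return the first sample value of an instant query response, or None."""
    if response.get("status") != "success":
        return None
    result = response["data"]["result"]
    if not result:
        return None
    # Each sample is [unix_timestamp, "value as string"].
    return float(result[0]["value"][1])

# A response shaped like Prometheus's instant vector output (illustrative).
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {"__name__": "rabbitmq_up"}, "value": [1600000000, "1"]}],
    },
}

print(instant_query_url("http://prometheus:9090", "rabbitmq_up"))
print(first_value(sample))  # 1.0
```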
So far, so good. We have a server status panel. It displays “Up” in green or “Down” in red.
2.2 Alerting with Grafana
We want to get a notification when the server is down. How do we do it?
In my opinion, alerting is the most important feature of Grafana. The ability to check metrics whenever you want is great, but you should not need to constantly check metrics to find out if something is wrong. Instead, Grafana should tell you something is wrong and you can check out the metrics then.
Adding an alert from zero is a two-step process:
- Define a notification channel
- Add an alert for your panel
Let’s see.
Define a notification channel
You can get notifications via email, Discord, Slack, LINE, and so on.
We like Slack at Trendyol. So, we’ve used Slack as the notification channel. When adding a channel, you provide your Slack Webhook URL and your channel name. Write your Slack channel name (say “delivery-alerts”) in the recipient field.
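Before wiring the webhook into Grafana, it can be handy to check that it works on its own. Here is a minimal sketch that builds the HTTP request a Slack incoming webhook expects: a JSON body with a "text" field. The webhook URL is a placeholder, and the optional "channel" override is an assumption that holds only for webhook types that allow it. The request is built but not sent.

```python
import json
import urllib.request

def build_slack_request(webhook_url, text, channel=None):
    """Build (but do not send) a POST request for a Slack incoming webhook."""
    payload = {"text": text}
    if channel:
        # Assumption: only some webhook types honor a channel override.
        payload["channel"] = channel
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder URL; replace it with your own webhook URL.
req = build_slack_request(
    "https://hooks.slack.com/services/T000/B000/XXXX",
    "Server is down",
    channel="#delivery-alerts",
)
print(req.get_method(), req.data.decode("utf-8"))
# To actually send it: urllib.request.urlopen(req)
```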
Two options worth mentioning are sending reminders and including images. If you’re new to Grafana, these can be a bit detailed. You can skip them at the beginning, but you may need them in the future.
Send reminders
This feature periodically resends alert messages to remind you that something is still wrong.
Say you set the period to 3 minutes and you get an alert at 3:00 PM indicating that a server is down. You will get the “Server is down” alert again at 3:03 PM, 3:06 PM, and so on, until the server is up.
This feature is also useful for less critical alarms; in that case, you can set the period to something longer, such as 6 hours.
Beware that you define reminders per notification channel, not per alert. Thus, you will get reminders for every alert you add to this notification channel.
Say that you have 8 alerts defined for your Slack channel (“delivery-alerts”) but you only want to get a reminder for a specific alert (e.g. “message count alert”). In this case, you should add another notification channel for your Slack channel.
The two configurations are the same except for the send reminders option. When selecting an alert channel, we can pick delivery-alerts-with-reminder if we want reminders.
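To get a feel for how chatty a reminder period will be, you can compute the message times up front. The sketch below is plain arithmetic, not Grafana code; the function name reminder_times and the dates are made up for the example.

```python
from datetime import datetime, timedelta

def reminder_times(first_alert, period, resolved_at):
    """List the times an alert message is sent until the problem is resolved."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    times = []
    current = first_alert
    while current < resolved_at:
        times.append(current)
        current += period
    return times

# Server goes down at 3:00 PM and comes back at 3:08 PM, reminders every 3 minutes.
down_at = datetime(2021, 1, 1, 15, 0)
up_at = datetime(2021, 1, 1, 15, 8)
times = reminder_times(down_at, timedelta(minutes=3), up_at)
print([t.strftime("%H:%M") for t in times])  # ['15:00', '15:03', '15:06']
```

With a 6-hour period, the same outage would produce only the first message.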
Include images
Adding images to your alerts is possible, but not that possible
You have the option to show images of your panels on your alerts. This is a helpful feature, but it doesn’t work just by enabling it. This rather simple feature requires defining an image storage provider and a lot of configuration work. Grafana definitely needs improvement here.
Add an alert to your panels
How do we add our “If server is down, send an alert to Slack” rule?
Remember that we can’t add alerts to singlestat panels, so we should use a graph panel to add alerts.
To change our panel type, head over to the options and pick the one you want. Notice that the “Alerts” tab becomes visible when we use a Graph panel.
Here are the basics of alert customization:
- Conditions: Add one or more conditions. When they evaluate to TRUE, your metric goes into the alerting state.
- Notifications: Select the notification channel (a Slack channel in our case) to which your alert message should be sent.
- Message: Add some extra information to your alert message.
- Rule: “Evaluate every” is how often your alert condition is checked. “For” is how long the condition must stay true before an alert is sent; 0m disables the wait.
This is the exact alert configuration we use in our Server Status panel. Every 5 seconds, it checks if server status is 0, and when so, it sends a “Server is down” alert to the delivery-alerts Slack channel.
In the send to section, if we want periodic reminder messages, we can select the delivery-alerts-with-reminder option that we discussed under the Send reminders feature.
Let’s explore the conditions section.
- In the query section, we define the metric to which we want to add an alert. We also specify a range.
- The query section is structured as query(metricName, start, end), and query(A, 1m, now) means that it will work on metric A and check the range from 1 minute ago to now. A is the name assigned to the rabbitmq_up metric we’ve added.
- The last function picks the latest value in the range. Since our query is defined from 1 minute ago to now, it’ll just fetch the current server status.
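The evaluation described above can be sketched in a few lines. This is a simplified model and not Grafana’s implementation: it assumes a condition like “last() of query(A, 1m, now) is below 1”, treats missing data as “not alerting”, and uses made-up function names.

```python
def last_in_range(samples, start, end):
    """samples: list of (timestamp_seconds, value) pairs sorted by time."""
    values = [v for t, v in samples if start <= t <= end]
    return values[-1] if values else None

def condition_is_true(samples, now, window=60, threshold=1):
    """Model of: last() of query(A, 1m, now) is below threshold."""
    value = last_in_range(samples, now - window, now)
    # Assumption: no data in the range counts as "condition not met".
    return value is not None and value < threshold

def next_state(condition, true_since, now, for_seconds=0):
    """Return (state, true_since) after one evaluation of the rule."""
    if not condition:
        return "ok", None
    true_since = now if true_since is None else true_since
    # "For" is the time the condition must stay true before alerting.
    if now - true_since >= for_seconds:
        return "alerting", true_since
    return "pending", true_since

# rabbitmq_up was 1 at t=0 and t=30, then dropped to 0 at t=55.
samples = [(0, 1), (30, 1), (55, 0)]
print(condition_is_true(samples, now=60))                  # True
print(next_state(True, None, now=60, for_seconds=0))       # ('alerting', 60)
print(next_state(True, None, now=60, for_seconds=120))     # ('pending', 60)
```

With “For” set to 0, the first failing evaluation already alerts; with a longer duration, the rule stays pending until the condition has held that long.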
That’s all for adding an alert. Now, we have two panels: one singlestat and one graph. We’ll get a message on our Slack channel when the server is down. From this point on, you can repeat the same process with minor tweaks (such as changing your query and value mappings) to create your own dashboards and add alerts to them.
2.3 Additional features
Let’s briefly talk about two of Grafana’s beneficial features.
Ready-to-use dashboards
There are more than 3000 dashboards that can be easily imported into your Grafana. When you find a dashboard you like, you just copy its ID (e.g. 2343) and paste it into the import section of your Grafana. With little-to-no configuration, your dashboard will be ready. This is quite a powerful feature of Grafana.
Grafana API
Grafana has an HTTP API with a large number of endpoints. Considering that each panel and dashboard is a JSON object, there are countless ways to customize and manage your Grafana. For example, you can execute CRUD operations on your dashboards using the Dashboard API.
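As a rough sketch of what working with the Dashboard API can look like, the snippet below builds two requests: one that fetches a dashboard by its UID and one that saves a dashboard JSON object. The base URL, the token, and the UID are placeholders, and the requests are built but not sent, so you can inspect them before pointing them at a real Grafana.

```python
import json
import urllib.request

def get_dashboard_request(base_url, api_token, uid):
    """Build a request that reads a dashboard by UID."""
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/dashboards/uid/{uid}",
        headers={"Authorization": f"Bearer {api_token}"},
        method="GET",
    )

def save_dashboard_request(base_url, api_token, dashboard, overwrite=False):
    """Build a request that creates or updates a dashboard."""
    body = {"dashboard": dashboard, "overwrite": overwrite}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/dashboards/db",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# Placeholders: use your own Grafana URL, API token, and dashboard UID.
read_req = get_dashboard_request("http://grafana.example", "MY_TOKEN", "abc123")
save_req = save_dashboard_request(
    "http://grafana.example", "MY_TOKEN",
    {"uid": "abc123", "title": "RabbitMQ", "panels": []},
    overwrite=True,
)
print(read_req.get_method(), read_req.full_url)
print(save_req.get_method(), save_req.full_url)
# To actually send one of them: urllib.request.urlopen(read_req)
```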
It’s also worth mentioning that Grafana is evolving pretty quickly. New features and bug fixes are released actively. So, it’s definitely worth hopping on the train.
2.4 Lack of features
Lastly, we’ll talk about some disadvantages and missing features you can encounter when using Grafana. We’ll also talk about which version to use.
A bug on an alerting method: Earlier versions of Grafana have a bug in the diff and percent_diff functions. Say that you want to define an alarm for a 20% increase in message count over the last 30 minutes. The condition will be as follows: percent_diff of query(A, 30m, now) is above 20. It works quite well for a 20% increase, but you will also get the alert for a 20% decrease, which can result in pretty annoying unwanted alerts. Fortunately, this bug has been fixed.
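We haven’t looked at how Grafana computes this internally, so the sketch below is only a guess at how such a symptom can arise: if the magnitude of the change is compared instead of the signed change, a drop looks the same as a rise. The function names and numbers are made up for the example.

```python
def percent_diff(first, last):
    """Signed percentage change from the first to the last value in the range."""
    if first == 0:
        raise ValueError("percentage change from 0 is undefined")
    return (last - first) / abs(first) * 100

def fires_signed(first, last, threshold):
    """Expected behavior: fire only when the value rose by more than threshold."""
    return percent_diff(first, last) > threshold

def fires_on_magnitude(first, last, threshold):
    """Buggy-looking behavior: the direction of the change is ignored."""
    return abs(percent_diff(first, last)) > threshold

# Message count goes from 1000 to 1250 (+25%) or from 1000 to 750 (-25%).
print(fires_signed(1000, 1250, 20), fires_on_magnitude(1000, 1250, 20))  # True True
print(fires_signed(1000, 750, 20), fires_on_magnitude(1000, 750, 20))    # False True
```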
Security issue: Older versions of Grafana had a security issue. Thus, it’s advised to use 7.0.2 or higher.
Messages are not customizable: There are very limited options for customizing your alert message. You can’t use styling or variables in the message section. You can’t add buttons or other elements, even though Slack supports these.
No scheduling: You can’t turn alerts off on a schedule. You can manually pause them one by one, but you can’t schedule them or disable them for a period of time. You will get alerts after midnight or on Sunday morning.
No warning state: Alerting is a boolean state. There’s no warning alert you can send to your Slack. It’s either alerting (red) or not (green). There’s a pending state too, but it’s not the same as a warning state.
Lack of endpoint monitoring & alert: You can’t simply check the status of your URL endpoints with Grafana. There’s a plugin for that, but it’s limited to 3 endpoints. Otherwise, you’ll need to pay $100 monthly. That’s why we use PRTG for this need.
3. Conclusion
Being one of the most popular monitoring tools, Grafana is an extremely beneficial tool with great visuals. It has helped us monitor and manage the alerts of various technologies (ranging from RabbitMQ to Kubernetes, MSSQL, and Elasticsearch) in one place. It’s easy to learn and use. It is no surprise that a lot of great companies use it. Even though it’s evolving quickly, it still lacks some basic features, as we’ve discussed. Overall, Grafana is a very powerful tool that you should consider using in your company.
Happy coding!