Published in Trendyol Tech

Alert and Monitoring with Grafana

Trendyol is one of the most popular e-commerce websites in Turkey. At any given time, there are a vast number of customers surfing on the website and apps. This means that hundreds of thousands of requests are generated for our services every second.

To ensure high availability and a seamless customer experience, we have to monitor our services closely and intervene when there’s a problem. This is where alerting and monitoring tools come into play. Thanks to these tools, we can analyze the status of our services and get notified of any failure.

Today, we’ll take a glance at some of the most popular monitoring tools, their features, and our use cases, and then discuss Grafana in depth. By the end of the article, you will have been introduced to various monitoring tools and will have seen, with examples, what Grafana can and can’t do for you. This will help you decide whether or not you should use Grafana.

Topics
1. Alert and Monitoring tools
1.1 Kibana
1.2 New Relic
1.3 PRTG
1.4 Prometheus
1.5 Delivery Alerts
1.6 Grafana
2. Grafana
2.1 Monitoring with Grafana
2.2 Alerting with Grafana
2.3 Additional features
2.4 Lack of features
3. Conclusion

1. Alert and Monitoring tools

When you place an order at Trendyol, we assign a cargo company, such as Aras Kargo or Yurtiçi Kargo, to it. We then inform that company’s services about your order. Until your order is delivered, you can track whether it has been shipped or has arrived at the cargo office.

This process requires the integration of multiple microservices. Technically, we have a custom microservice for each cargo company. Each of them listens to multiple RabbitMQ queues. So, we need to monitor the queues’ statuses, RabbitMQ server status, microservices’ pod statuses and resource utilization for this process.

For this and similar needs, we use Kibana, Prometheus, New Relic, PRTG, and Grafana. Each one serves a different purpose in our team. We’ll briefly discuss what these tools are, their pros and cons, and our use cases.

1.1 Kibana

Kibana is a great UI tool for Elasticsearch. Elasticsearch is an excellent database if your primary use case is searching for data.

  • Using Kibana, you can create tables, graphs, and dashboards for your data.
  • Kibana is the best monitoring tool for Elasticsearch. However, it works only with Elasticsearch, so you can’t integrate it with other data sources such as MSSQL, Couchbase, and RabbitMQ as you can with Grafana.

How we use

Each of your orders is shipped to you in one or several packages (we call them “shipments”, or “gönderi” in Turkish). We use Kibana to visualize how many shipments we have had in recent weeks, both overall and per cargo company. Since there is a vast number of shipments, Elasticsearch enables us to query the data quickly, and Kibana is the best tool to visualize Elasticsearch data.

1.2 New Relic

New Relic is a pioneering monitoring & alert tool by which you can access a large amount of data about your services.

  • You can monitor response time, throughput, errors, CPU and RAM usage of each Kubernetes pod, and much more. You can analyze your endpoints’ performance just by using the Transactions feature.
  • You can add alerts to your services and integrate them with other services. For example, when the error rate stays above 5% for 3 minutes, you can get a message on Slack with the error graph.

How we use

Error percentage alert from New Relic on Slack

We use New Relic for almost all of our services. We also pay attention to adding alerts to our services. During a regular day, New Relic notifies us via Slack messages concerning whether there’s a high error rate or slow response time for each of our services.

1.3 PRTG

PRTG is a network-oriented monitoring tool.

  • You can monitor a wide variety of metrics with PRTG, for example uptime, load, CPU/RAM and disk usage, database requests, and endpoint status.

How we use

Free memory alert from PRTG on Slack

We use PRTG to monitor cargo company services (to ensure you can track your shipment in the app) and resource usage.

1.4 Prometheus

Prometheus is a monitoring tool that collects metrics from other systems (such as RabbitMQ and Kubernetes) and lets you query them.

  • When you integrate Prometheus with other services, you get ready-to-use metrics and query functions. For example, you can see your total message count on RabbitMQ in near real-time.
  • Prometheus doesn’t have a great UI. Fortunately, you can integrate it with Grafana and use the same functions there.

How we use

We use Prometheus as a query provider for Grafana. Since Grafana integrates well with Prometheus, we can access all Prometheus functions on Grafana.

1.5 Delivery Alerts

This is a scheduler project we’ve developed in Java 11 to define custom alerts that other tools don’t provide.

  • This project is listed to point out that using an off-the-shelf tool is not always the best solution. You can develop your own project for your alerting requirements.
  • Each day, the scheduler runs custom queries on Elasticsearch, produces reports, and sends them to a dedicated Slack channel.
  • We can also define specific alerts for the MSSQL database, since Grafana’s SQL-like queries are not that great in terms of development and ease of use.
  • We also integrate some of our other projects with Slack, so in some cases, we send a Slack message when a custom exception is thrown.
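
To make the “query, report, send to Slack” flow more concrete, here is a minimal sketch. Our scheduler is written in Java; Python is used here only for brevity. The webhook URL, the shipment counts, and the function names are all hypothetical, and the request is only built, not sent.

```python
import json
import urllib.request

# Hypothetical placeholder; a real incoming-webhook URL comes from your Slack workspace.
WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE"


def format_report(counts_by_company):
    """Turn {cargo company: shipment count} into a plain-text report."""
    lines = ["Daily shipment report"]
    for company, count in sorted(counts_by_company.items()):
        lines.append(f"- {company}: {count}")
    return "\n".join(lines)


def build_slack_request(text, url=WEBHOOK_URL):
    """Build (but do not send) a POST request with a JSON body holding a `text` field."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Made-up counts, standing in for the result of a custom Elasticsearch query.
report = format_report({"Yurtiçi Kargo": 95, "Aras Kargo": 120})
request = build_slack_request(report)
# Sending it would be: urllib.request.urlopen(request)
```

Keeping the formatting and the request building in separate functions makes it easy to reuse the same Slack helper for other alerts, such as the custom exception messages mentioned above.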

1.6 Grafana

Grafana is a popular tool for analyzing and visualizing data and for alerting on it.

  • It’s an open-source, freemium tool with approximately 40k stars on GitHub.
  • It integrates with a wide variety of data sources, including PostgreSQL, Elasticsearch, MongoDB, GitHub, and even Google Sheets.
  • It is used by Booking, Stack Overflow, eBay, Red Hat and many more.

How we use

We monitor tens of metrics and have many alerts defined on Grafana. We monitor and alert on RabbitMQ server status, resource utilization and queue metrics, Kubernetes pod restarts, MSSQL database status, and the results of some SQL queries.

If you are wondering which tool is the best, there’s no single tool that fits every use case. You may need more than one. You can monitor application throughput with New Relic, pod restarts with Grafana, and third-party services’ endpoints with PRTG.

Without further ado, let’s dive deep into Grafana.

2. Grafana

First, we’ll take a look at a sample dashboard. Then we’ll configure it from scratch and add alerts to it, which will be published to a Slack channel of your choice. Along the way, we’ll see what Grafana can and can’t do.

For the examples, we will mainly focus on RabbitMQ. If you want to monitor a different technology, remember that the same principles apply to it.

Here’s our sample for the RabbitMQ dashboard.

Each panel given above displays a metric we monitor. In near real-time, we can monitor RabbitMQ server status, channel and consumer counts, total message count, memory usage, and free disk space.

As you can see, there are two types of panels in this dashboard. The ones at the top (“Up”, 6763, 3.3 Mil, etc.) are called singlestat panels: each reduces a series of data to a single displayed value. The ones below are graph panels.

There’s a significant difference between these panel types that you should know about: alerts can be added only to graph panels. Graph panels provide good visuals and historical data, but you will have to add one for every alert you want, even if you have no other use for the graph. This is a missing feature in Grafana.

2.1 Monitoring with Grafana

Configuring a dashboard

How do we configure such a dashboard from zero?

It may look a bit complex at first, but notice that a dashboard is just a collection of panels. First, we will configure a single panel; the rest is a repetition of this process with minor changes.

Let’s start with the configuration of a panel.

Server Status Panel

Grafana provides highly customizable panels as well as high-quality visuals.

To configure your metric from scratch, you’ll need to:

  • Get data: For this we will add a data source
  • Configure query: Pick a function that provides data for your metric
  • Customize panel: To make our metric more expressive

Get data

To get your data, you should define a data source. This is a one-time process: after adding a data source, you can use it in all your panels without any extra effort.

Grafana supports various data sources

To add a data source, head over to the data sources page. Select your data source from the list and enter the required information. Each data source has different options. You can check the data sources page to find information about the one you’d like to use.

If you can’t find your data source on that page, the plugin may be missing. To solve this, head over to the plugins page and install the related plugin in your Grafana.

We’ll use Prometheus for our RabbitMQ metrics because:

  • Prometheus integrates well with RabbitMQ.
  • It provides a vast amount of ready-to-use queries.

Prometheus has its own UI, but it plays a better role as a backend for Grafana since Grafana provides better UI and UX.

Configure query

Now, define the query for your metric. To monitor the server status, we use the rabbitmq_up query. Grafana refreshes the panel automatically, so you don’t need to reload it yourself.

Here’s the result.

Adding a query to our panel

Customize panel

What we see is the current value, which is 1. So, the server is up. 1 or 0 is understandable, but we should make it more expressive.

Click the “Show options” button at the top right to see the available customizations.

Customizations for our panel

Two great customizations you need in similar panels are value mappings and thresholds. You can map certain values and define colors for value ranges. That’s all for our server status panel.
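
To illustrate what these two customizations do conceptually, here is a small sketch. It only mimics the idea in plain Python and is not how Grafana implements it; the mapping texts and colors are the ones we chose for our server status panel, and the function names are made up.

```python
# Value mapping: raw metric value -> display text.
VALUE_MAPPINGS = {1: "Up", 0: "Down"}

# Thresholds: (lower bound, color), sorted by ascending lower bound.
THRESHOLDS = [(0, "red"), (1, "green")]


def display_text(value):
    """Return the mapped text, or the raw value as text when no mapping exists."""
    return VALUE_MAPPINGS.get(value, str(value))


def display_color(value):
    """Return the color of the highest threshold whose lower bound is <= value."""
    color = THRESHOLDS[0][1]
    for lower_bound, threshold_color in THRESHOLDS:
        if value >= lower_bound:
            color = threshold_color
    return color


# rabbitmq_up returns 1 when the server is up and 0 when it is down.
up_panel = (display_text(1), display_color(1))
down_panel = (display_text(0), display_color(0))
```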

From now on, we can add the rest of the panels just by changing the query field.

Queries we use for the panels:

  • Server status: rabbitmq_up
  • Channel count: rabbitmq_channelsTotal
  • Node up stats: rabbitmq_running
  • Consumer count: rabbitmq_consumersTotal
  • Free disk space: rabbitmq_node_disk_free
  • Total messages: sum(rabbitmq_queue_messages)

You can also combine queries with functions, and even use regular expressions, to build your own custom queries.

  • Kubernetes pod restart counts: sum(kube_pod_container_status_restarts_total{namespace="your-name-space", pod=~"your-pod-name"})
  • RabbitMQ memory usage: 100 * (rabbitmq_node_mem_used / rabbitmq_node_mem_limit)
  • Top 5 defer queues by message count: topk(5, rabbitmq_queue_messages{queue=~".*defer"})
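
If you want to try such queries outside Grafana, Prometheus also exposes them over its HTTP API. The sketch below builds an instant-query URL for the /api/v1/query endpoint and reads a value from a response. The server address is hypothetical, and the sample response is hand-written to show the general shape of an instant-vector result; it is not real data.

```python
import json
from urllib.parse import urlencode

# Hypothetical address; replace it with your own Prometheus server.
PROMETHEUS_URL = "http://prometheus.example.internal:9090"


def instant_query_url(promql, base_url=PROMETHEUS_URL):
    """Build the URL for an instant query; the PromQL text is URL-encoded."""
    query_string = urlencode({"query": promql})
    return f"{base_url}/api/v1/query?{query_string}"


def first_value(response_text):
    """Extract the first sample value from an instant-vector response, or None."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        return None
    results = payload["data"]["result"]
    if not results:
        return None
    _timestamp, value = results[0]["value"]  # the value arrives as a string
    return float(value)


url = instant_query_url('topk(5, rabbitmq_queue_messages{queue=~".*defer"})')

# Hand-written sample response for the rabbitmq_up query.
sample = (
    '{"status": "success", "data": {"resultType": "vector", "result": '
    '[{"metric": {"__name__": "rabbitmq_up"}, "value": [1600000000, "1"]}]}}'
)
server_is_up = first_value(sample) == 1.0
```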

So far, so good. We have a server status panel that displays a green “Up” or a red “Down” message.

2.2 Alerting with Grafana

We want to get a notification when the server is down. How do we do it?

In my opinion, alerting is the most important feature of Grafana. The ability to check metrics whenever you want is great, but you should not need to constantly check metrics to find out if something is wrong. Instead, Grafana should tell you something is wrong and you can check out the metrics then.

Adding an alert from scratch is a two-step process:

  • Define a notification channel
  • Add an alert for your panel

Let’s see.

Define a notification channel

You can get notifications via email, Discord, Slack, LINE, and so on.

Adding a notification channel

We like Slack at Trendyol, so we’ve used it as the notification channel. When adding a channel, you provide your Slack Webhook URL and your channel name. Write your Slack channel name (say “delivery-alerts”) in the recipient field.

Two options worth mentioning are send reminders and include images. If you’re new to Grafana, these may be more detail than you need. You can skip them at the beginning, but you may need them in the future.

Send reminders

A feature that periodically resends alert messages to remind you that something is still wrong.

Say you set the period to 3 minutes and you get a “Server is down” alert at 3:00 PM. You will get the same alert again at 3:03 PM, 3:06 PM, and so on, until the server is up.
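
The timing in this example can be sketched as follows. This is a simplification that ignores how often the alert rule itself is evaluated; the recovery time and the date are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def reminder_times(first_alert, period, recovered_at):
    """Return the initial alert time plus every reminder sent before recovery."""
    times = []
    current = first_alert
    while current < recovered_at:
        times.append(current)
        current += period
    return times


first_alert = datetime(2020, 1, 1, 15, 0)    # 3:00 PM; the date is arbitrary
recovered_at = datetime(2020, 1, 1, 15, 10)  # assumed: the server is back at 3:10 PM
schedule = reminder_times(first_alert, timedelta(minutes=3), recovered_at)
labels = [moment.strftime("%H:%M") for moment in schedule]
```

With a 3-minute period, a 10-minute outage produces one alert and three reminders, which shows how quickly a short period can fill a channel.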

This feature is also useful for less critical alarms; in that case, you can set the period to something longer, such as 6 hours.

Beware that reminders are defined per notification channel, not per alert. Thus, you will get reminders for every alert you add to this notification channel.

Say that you have 8 alerts defined for your Slack channel (“delivery-alerts”) but you only want to get a reminder for a specific alert (e.g. “message count alert”). In this case, you should add another notification channel for your Slack channel.

Define two Grafana notification channels for your Slack channel

The two configurations are the same except for the send reminders option. When selecting an alert channel, we can pick delivery-alerts-with-reminder if we want reminders.
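
As a rough sketch, the two channel definitions could look like the dictionaries below. The field names (sendReminder, frequency, settings.recipient, and so on) are our assumption based on the JSON used by Grafana’s alert notification API at the time of writing, so verify them against your Grafana version. The webhook URL is a placeholder.

```python
import copy

# Channel without reminders. Field names are assumptions; check your Grafana version.
base_channel = {
    "name": "delivery-alerts",
    "type": "slack",
    "sendReminder": False,
    "settings": {
        "url": "https://hooks.slack.com/services/EXAMPLE/EXAMPLE/EXAMPLE",  # placeholder
        "recipient": "delivery-alerts",
    },
}

# Same Slack channel and webhook, but with reminders turned on.
reminder_channel = copy.deepcopy(base_channel)
reminder_channel.update(
    name="delivery-alerts-with-reminder",
    sendReminder=True,
    frequency="3m",  # reminder period, 3 minutes as in the earlier example
)

# Which top-level fields actually differ between the two definitions?
differing_keys = sorted(
    key
    for key in set(base_channel) | set(reminder_channel)
    if base_channel.get(key) != reminder_channel.get(key)
)
```

Both definitions point at the same Slack channel; only the name and the reminder fields differ.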

Include images

Adding images to your alerts is possible, but not that possible

You have the option to show images of your panels on your alerts. This is a helpful feature, but it doesn’t work just by enabling it. This rather simple feature requires defining an image storage provider and a lot of configuration work. Grafana definitely needs improvement here.

Add an alert to your panels

How do we add our “If the server is down, send an alert to Slack” rule?

Remember that we can’t add alerts to singlestat panels, so we should use a graph panel to add alerts.

To change our panel type, head over to the options and pick the one you want. Notice that the “Alerts” tab becomes visible when we use a Graph panel.

Adding an alert to a panel

Here are the basics of alert customization:

  • Conditions: Add one or more conditions; when they evaluate to TRUE, your metric goes into the alerting state.
  • Notifications: Select the notification channel (a Slack channel in our case) to which your alert message should be sent.
  • Message: Add some extra information to your alert message.
  • Rule: “Evaluate every” is how often your alert condition is checked. “For” is how long the condition must hold before an alert is sent; 0m disables the wait.

Server Status Alert

This is the exact alert configuration we use in our Server Status panel. Every 5 seconds, it checks whether the server status is 0, and if so, it sends a “Server is down” alert to the delivery-alerts Slack channel.

In the “Send to” section, if we want periodic reminder messages, we can select the delivery-alerts-with-reminder option that we discussed in the Send reminders section.
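
The interplay between “Evaluate every” and “For” can be sketched as a small state machine. This is a simplified model written for illustration, not Grafana’s implementation; the function name and the state labels are our own.

```python
def alert_states(condition_results, evaluate_every_s, for_s):
    """Map a sequence of condition results (one per evaluation) to alert states.

    A true condition yields "pending" until it has held for `for_s` seconds,
    then "alerting". With for_s == 0 the pending phase is skipped.
    """
    states = []
    held_for = None  # seconds the condition has been continuously true
    for is_true in condition_results:
        if not is_true:
            held_for = None
            states.append("ok")
            continue
        held_for = 0 if held_for is None else held_for + evaluate_every_s
        states.append("alerting" if held_for >= for_s else "pending")
    return states


# The server goes down for three consecutive evaluations, then recovers.
checks = [False, True, True, True, False]
with_wait = alert_states(checks, evaluate_every_s=5, for_s=10)
without_wait = alert_states(checks, evaluate_every_s=5, for_s=0)
```

A non-zero “For” trades a short delay for fewer false alarms caused by brief glitches.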

Condition functions

Let’s explore the conditions section.

  • In the query section, we define the metric to which we want to add an alert. We also specify a time range.
  • The query section is structured as query(metricName, start, end), so query(A, 1m, now) means that it works on metric A and checks the range from 1 minute ago to now. A is the name assigned to the rabbitmq_up query we’ve added.
  • The last function picks the latest value in the range. Since our query is defined from 1 minute ago to now, it just fetches the current server status.
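
Conceptually, the condition above boils down to “take the points in the time window, reduce them to one value, compare it”. Here is a minimal sketch of that idea; the sample series is made up, and the functions only imitate the names used in the condition editor.

```python
def query(series, window_s, now_s):
    """Return the (timestamp, value) points that fall in [now - window, now]."""
    return [(ts, value) for ts, value in series if now_s - window_s <= ts <= now_s]


def last(points):
    """Reducer: the most recent value in the range, or None if the range is empty."""
    if not points:
        return None
    return max(points, key=lambda point: point[0])[1]


# Made-up rabbitmq_up samples as (seconds, value); the server goes down at t=90.
series_a = [(0, 1), (30, 1), (60, 1), (90, 0)]

window = query(series_a, window_s=60, now_s=100)  # like query(A, 1m, now)
current_status = last(window)
server_is_down = current_status == 0
```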
Our alert on Slack

That’s all for adding an alert. Now we have two panels: one singlestat and one graph. We’ll get a message on our Slack channel when the server is down. From this point on, you can repeat the same process with minor tweaks (such as changing your query and value mappings) to create your own dashboards and add alerts to them.

2.3 Additional features

Let’s briefly talk about two other useful features of Grafana.

Ready-to-use dashboards

There are more than 3000 dashboards that can be easily imported into your Grafana. When you find a dashboard you like, you just copy its ID (e.g. 2343) and paste it into the import section of your Grafana. With little to no configuration, your dashboard will be ready. This is quite a powerful feature of Grafana.

Grafana API

Grafana has an HTTP API with a large number of endpoints. Since each panel and dashboard is a JSON object, there are many ways to customize and manage your Grafana programmatically. For example, you can execute CRUD operations on your dashboards using the Dashboard API.
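
As an illustration of “a dashboard is just JSON”, the sketch below assembles a request body of the kind the Dashboard API accepts when creating a dashboard (to our knowledge, via POST /api/dashboards/db). The panels are heavily trimmed: real panels carry many more fields, and the helper names are hypothetical.

```python
import json


def singlestat_panel(panel_id, title, expr):
    """A minimal panel description; real panels carry many more fields."""
    return {
        "id": panel_id,
        "type": "singlestat",
        "title": title,
        "targets": [{"refId": "A", "expr": expr}],
    }


def dashboard_payload(title, panels):
    """Wrap panels in the body shape used when creating a dashboard over HTTP."""
    return {
        "dashboard": {"id": None, "uid": None, "title": title, "panels": panels},
        "overwrite": False,
    }


payload = dashboard_payload(
    "RabbitMQ",
    [
        singlestat_panel(1, "Server status", "rabbitmq_up"),
        singlestat_panel(2, "Total messages", "sum(rabbitmq_queue_messages)"),
    ],
)
body = json.dumps(payload, indent=2)  # this string would be the HTTP request body
```

Generating dashboards this way keeps them in version control and makes it easy to create the same set of panels for several environments.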

It’s also worth mentioning that Grafana is evolving pretty quickly. New features and bug fixes are released actively, so it’s definitely worth hopping on the train.

2.4 Lack of features

Lastly, we’ll talk about some disadvantages and lack of features you can encounter when using Grafana. We’ll also talk about which version to use.

A bug in an alerting method: Earlier versions of Grafana had a bug in the diff and percent_diff functions. Say that you want to define an alarm for a 20% increase in message count over the last 30 minutes. The condition would be: percent_diff of query(A, 30m, now) is above 20. It works well for a 20% increase, but you also get the alert for a 20% decrease, which results in pretty annoying unwanted alerts. Fortunately, this bug has since been fixed.
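
The difference between the two behaviours can be sketched like this. It is our reading of the symptom (the sign of the change being ignored), not Grafana’s source code, and the message counts are made up.

```python
def percent_diff(first, last):
    """Signed percentage change from the first to the last value in the range."""
    # Assumes first != 0; a real implementation has to handle that case.
    return (last - first) * 100 / abs(first)


def buggy_is_above(first, last, threshold):
    """Symptom described above: the sign is lost, so decreases also fire."""
    return abs(percent_diff(first, last)) > threshold


def fixed_is_above(first, last, threshold):
    """Expected behaviour: only an increase beyond the threshold fires."""
    return percent_diff(first, last) > threshold


# Message count 30 minutes ago vs. now (made-up numbers).
increase = (1000, 1300)  # +30%
decrease = (1000, 700)   # -30%

fires_on_increase = (buggy_is_above(*increase, 20), fixed_is_above(*increase, 20))
fires_on_decrease = (buggy_is_above(*decrease, 20), fixed_is_above(*decrease, 20))
```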

Security issue: Older versions of Grafana had a security issue, so it’s advised to use 7.0.2 or higher.

Non-customizable messages: There are very limited options for customizing your alert message. You can’t use styling or variables in the message section. You can’t add buttons or other interactive elements, even though Slack supports them.

No scheduling: You can’t turn alerts off on a schedule. You can manually pause them one by one, but you can’t schedule them or temporarily disable them all. You will get alerts after midnight or on a Sunday morning.

No warning state: Alerting is a boolean state. There’s no warning alert you can send to your Slack. It’s either alerting (red) or not (green). There’s a pending state too, but it’s not the same as a warning state.

Lack of endpoint monitoring & alerting: You can’t simply check the status of your URL endpoints with Grafana. There’s a plugin for that, but it’s limited to 3 endpoints. Beyond that, you’ll need to pay $100 monthly. That’s why we use PRTG for this need.

3. Conclusion

As one of the most popular monitoring tools, Grafana is an extremely useful tool with great visuals. It has helped us monitor various technologies (ranging from RabbitMQ to Kubernetes, MSSQL, and Elasticsearch) and manage their alerts in one place. It’s easy to learn and use, so it is no surprise that a lot of great companies use it. Even though it’s evolving quickly, it still lacks some of the basic features we’ve discussed. Overall, Grafana is a very powerful tool that you should consider using in your company.

Happy coding!
