Hey, something’s going on…

Franco martin
Getting started with the ELK Stack
5 min read · Oct 27, 2020

The current state of affairs

So you got this far into the series, nice. You have your Elasticsearch cluster, your Beats are sending data and you have Kibana all set up with pretty dashboards. That's all fine and dandy, but it's a Saturday night, a bad system update brought down one of your nodes and the other two are barely handling the load.

Some background first

ODFE (Open Distro for Elasticsearch) comes with the Alerting plugin preinstalled. This allows us to set up alerts that fire when certain conditions are met.

There are three ways to set up alerts in ODFE: Visual graphs, Extraction queries and the anomaly detector.

The first one is pretty self-explanatory, so we will focus on extraction queries. The anomaly detector is an interesting subject, but it is well beyond the scope of this series.

Alerts have three important parts: monitors, triggers and actions. A monitor is a query that runs on a schedule, and each monitor has a number of triggers. A trigger is a condition that is evaluated against the response of the monitor's extraction query. When a trigger condition evaluates to true, its actions are executed.

In simple terms:

  • Monitor: How many unread emails do I have?
  • Trigger: Emails > 5
  • Action: Send slack message
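
Behind the UI, all three of these end up in a single JSON monitor definition that you could also create through the Alerting API. The sketch below shows roughly how the pieces fit together; the index, field names and thresholds follow the email example above and are purely illustrative:

    POST _opendistro/_alerting/monitors
    {
      "type": "monitor",
      "name": "unread-emails",
      "enabled": true,
      "schedule": { "period": { "interval": 1, "unit": "MINUTES" } },
      "inputs": [{
        "search": {
          "indices": ["mailbox-*"],
          "query": { "size": 0, "query": { "term": { "read": false } } }
        }
      }],
      "triggers": [{
        "name": "too-many-unread",
        "severity": "1",
        "condition": {
          "script": {
            "source": "ctx.results[0].hits.total.value > 5",
            "lang": "painless"
          }
        },
        "actions": [{
          "name": "notify-slack",
          "destination_id": "<your destination id>",
          "subject_template": { "source": "Unread emails" },
          "message_template": { "source": "You have more than 5 unread emails." }
        }]
      }]
    }

Don't worry about writing this by hand; the rest of this article builds the same thing through the Kibana UI.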

Creating a destination

Before we begin, we need to set up a destination for notifications. In the Other useful information section you will find a link for adding an Incoming Webhook app to Slack. Take note of the webhook URL.

Go to the Alerting section, open the Destinations tab and click on “Add destination”. Give it a name, select “Slack” as the type and paste your webhook URL there.

ODFE recently released email notifications, and it also supports custom webhooks for other messaging apps such as Teams.

Once you are done, click on “Create”.

Setting up a Monitor

To begin with alerts we need a query. A basic Elasticsearch query looks something like this:

Average of memory usage in the last 5 minutes
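
Something along these lines: an avg aggregation over system.memory.actual.used.pct (Metricbeat's memory usage field), filtered to the last five minutes. This is a sketch; the aggregation is keyed “1” to match the name Kibana will generate for it later:

    GET metricbeat-*/_search
    {
      "size": 0,
      "query": {
        "bool": {
          "filter": [
            { "range": { "@timestamp": { "gte": "now-5m", "lte": "now" } } }
          ]
        }
      },
      "aggs": {
        "1": { "avg": { "field": "system.memory.actual.used.pct" } }
      }
    }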

Fortunately, you don't need to write all this by hand for every query. The visualization section is basically a visual query editor.

Since what we want is a single number, we can use the Metric visualization. Change the aggregation to “Average”, select “system.memory.actual.used.pct” as the field, click “Update” and you are basically done.

Now we need to get the query. Next to the “Save” button there is an “Inspect” button. Click it, then open the “View: Data” dropdown in the top right corner and select “View: Requests”. Kibana will show you three tabs: Statistics (very useful for measuring query performance), Request and Response. Click on Request, then click the copy icon in the top right section of the Request tab.

Go ahead and paste that query into your favourite code editor.

Now we can proceed with creating the alert. Go to the Alerting section and click on the Monitors tab. Click on “Create monitor”, give it a name and select “Define using extraction query” from the “Method of definition” dropdown.

In the index dropdown, type “metricbeat-*” and press enter. This will split the space below into two. On the left you will define a query and on the right you’ll see the response after you click on “Run”.

Replace the existing query with the one you got from the visualization and click on the “Run” button on the top right corner.

Your response should look something like this:
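
Something like this, trimmed down; the document count and the aggregation value will of course differ on your cluster:

    {
      "took": 12,
      "timed_out": false,
      "hits": {
        "total": { "value": 1800, "relation": "eq" },
        "max_score": null,
        "hits": []
      },
      "aggregations": {
        "1": { "value": 0.6312 }
      }
    }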

The problem with this is that if you run it an hour from now, the result will be the same. This is because in our extraction query, the dates are static.

Here is our problem:

Static time
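
The range filter that Kibana generated uses absolute timestamps, something like this (the exact values depend on when you copied the request):

    "range": {
      "@timestamp": {
        "gte": 1603813075000,
        "lte": 1603813375000,
        "format": "epoch_millis"
      }
    }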

To fix this, modify the query to look something like this:

Dynamic time
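
Using date math instead of absolute timestamps makes the time window relative to whenever the monitor runs:

    "range": {
      "@timestamp": {
        "gte": "now-5m",
        "lte": "now"
      }
    }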

So if you press the “Run” button several times, you’ll see that the value changes.

Scroll down and click on “Create”; this will take you to the “Create Trigger” screen.

Creating a Trigger

To begin, set the trigger name and select a Severity level (this is arbitrary). Below you’ll see the response for the Monitor query, and just under the response you’ll have the Trigger Condition. By default the trigger condition will be “ctx.results[0].hits.total.value > 0”.

For now we will focus on everything that comes after “ctx.results[0]”. The default condition checks whether the number of documents matched by the query is greater than 0. That is of no use to us; we need the memory usage. As you can see, our response has the field aggregations.1.value, which is exactly what we want, so change your trigger condition to this:

“ctx.results[0].aggregations.1.value > 0”

Scroll down a bit, click on the “Run” button and watch the trigger condition response change to “true”. Now set the value after the “>” sign to something like 0.8 (80%), since memory usage will always be above 0%.
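
The resulting condition looks like this:

    ctx.results[0].aggregations.1.value > 0.8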

Setting up an Action

Below the “Run” button you will be able to configure actions. Here you can notify a Destination with a message.

First, give the action a name and select the destination you created earlier. Type a subject for the message and scroll down a bit. Here you will see the message preview.

We want to include variables in our messages so we don't need to create a trigger for every server we have. To include a variable from the response in the message, use the following template:

Sample message for memory usage
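
Message templates use Mustache syntax, and the ctx object (monitor, trigger, period and query results) is available inside them. Here is a sketch of a message for our memory monitor; the last two lines are illustrative, and the memory usage variable mirrors the aggregation path we used in the trigger condition, so adjust it if your response looks different:

    Monitor {{ctx.monitor.name}} just entered alert status.
    - Trigger: {{ctx.trigger.name}}
    - Severity: {{ctx.trigger.severity}}
    - Memory usage: {{ctx.results.0.aggregations.1.value}}
    - Period: {{ctx.periodStart}} to {{ctx.periodEnd}}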

On the right side of the screen there is a “Send test message” button to test your destination.

Finally, to avoid a notification flood when something goes wrong, check the “Enable action throttling” checkbox and enter a reasonable number of minutes to wait before sending another notification.

To conclude, click on “Create”.

Next steps

In the next and final installment of this series we will review some performance and availability settings and I’ll share some things I learned the hard way.

See other related articles.

Other useful information

Slack Incoming Webhooks
