Alerting and how 50 lines of code changed how we do it.

Voytek Pituła
SwissBorg Engineering
Jan 26, 2023

Sometimes things go bad; that’s just how it is, regardless of how much effort we put into preventing it. And when they do, it’s crucial to be notified and to react as quickly as possible to mitigate the impact.

This article briefly describes how we handle this aspect of software development at SwissBorg and how a few lines of code changed our approach.

TL;DR

  • SwissBorg relies on the Prometheus Alertmanager and ElastAlert for alerting.
  • Historically, each ElastAlert alert had to be configured separately.
  • We introduced a small utility that allowed us to increase the number of alerting cases by two orders of magnitude.

Infrastructure

This article focuses mostly on alerting from the perspective of app development. To illustrate this properly, we first need to describe the infrastructure enabling all further efforts.

We rely on two main alert sources: metrics and logs. Metrics are handled by Prometheus which triggers alerts through its Alertmanager. Logs are stored in OpenSearch and trigger alerts through ElastAlert 2. Below you can see what the configuration of these alerts looks like.

parsing-errors:
  name: parsing errors alert
  index: ...
  type: any
  filter:
  - query_string:
      query: (message:CirceDecodingError OR message:DecodingFailure)
  alert:
  - opsgenie:
      opsgenie_priority: P2
      opsgenie_key: ...
      opsgenie_subject: "{} - {} decoding errors"
      opsgenie_subject_args:
      - "kubernetes.pod_namespace"
      - "kubernetes.pod_labels.app"
  alert_text_type: exclude_fields
  alert_text: |-
    !!Parsing error detected!!
    Message: {0}
    stack_trace:
    {1}
  alert_text_args:
  - "message"
  - "stack_trace"
  realert:
    hours: 24 # we re-alert every day, as this is the expected timeframe for parsing/decoding adjustments

This alert notifies the team about any parsing errors coming from the circe library that were not handled explicitly and showed up in the logs. As you can see, we rely on OpsGenie for notifications and the later parts of the alerting process.
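For context, here is a hypothetical sketch of the kind of code that produces such a logline; the Payment type and the handler are made up for illustration, but the logged message contains ‘DecodingFailure’, which the query above matches.

import com.typesafe.scalalogging.Logger
import io.circe.generic.auto._
import io.circe.parser.decode

val logger = Logger("payments")

final case class Payment(id: String, amount: BigDecimal) // hypothetical payload

def handlePayment(rawJson: String): Unit =
  decode[Payment](rawJson) match {
    case Right(_)      => () // happy path: process the payment (omitted here)
    case Left(failure) =>
      // The message contains "DecodingFailure", so the parsing-errors alert picks it up
      logger.error(s"DecodingFailure while parsing payment: ${failure.getMessage}")
  }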

- alert: OutgoingUnexpectedTransactions
  expr: outgoing_transactions_terminal{state=~"REJECTED|TIMEOUT"} > 0
  for: 0m
  labels:
    severity: info
    team: payment-engineers
  annotations:
    Summary: Transactions in unusual state {{ $labels.state }}
    Description: |
      Number of txs: {{ $value }}
      Normally transactions in those states don’t show up. We should investigate.

This alert monitors the number of transactions we receive in unexpected states and notifies us when any appear. The OpsGenie configuration is done separately and shared between alerts.

With these two channels in place, whenever we want to raise a new alert we have to publish a relevant metric or output a logline, and then configure the related tool to pick it up and raise the alert.
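As an illustration of the metrics channel, here is a minimal sketch assuming the Prometheus Java simpleclient; the metric name mirrors the rule above, but the actual instrumentation in our services may differ.

import io.prometheus.client.Counter

// Counts transactions reaching a terminal state, labelled by state;
// the Prometheus rule above fires when REJECTED or TIMEOUT values appear.
val outgoingTransactionsTerminal: Counter = Counter
  .build()
  .name("outgoing_transactions_terminal")
  .help("Outgoing transactions that reached a terminal state")
  .labelNames("state")
  .register()

// Somewhere in the transaction handling code:
outgoingTransactionsTerminal.labels("REJECTED").inc()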

When this is not enough

There is one ultimate truth that will accompany us through the rest of this article:

“It doesn’t matter what is possible but what actually happens.” — The Universe

With only the basic capabilities shown above, every new alert requires dedicated configuration work, which adds up to a significant, continuous effort.

At the same time, each of our services has tens, sometimes hundreds, of places that should notify the engineers. When writing applications we need to make a number of assumptions about the world, our service providers and the behavior of other components.

Let’s see a few examples of these assumptions.

  • The deposited amount must be a positive number, i.e. the deposited balance must be greater than 0. If it isn’t, there is a bug somewhere or a scenario we haven’t accounted for.
  • Data returned by the providers is valid — e.g. IBANs follow the IBAN format. If the data is not valid we need to understand why.
  • We know all the possible values of the ‘transaction status’ field. If we see a new value we should account for it, so as not to miss any potential scenarios.
  • Dynamic configuration must match the static configuration. We use two sources of configuration for different purposes, but sometimes we need to cross-validate the two during the lifetime of the application.

Keep in mind that in all these and similar cases, we have to keep the service alive and functioning through various means. But this doesn’t change the fact that we still want to know whenever these problems occur. The bottom line is this: in the long term we would like to be notified about any broken assumptions we’ve made.
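To make this concrete, here is a hypothetical sketch of the ‘unknown transaction status’ case (the types are made up for illustration): the service keeps working, but the broken assumption only surfaces as an ordinary logline.

import com.typesafe.scalalogging.Logger

// Hypothetical model, invented for this example
sealed trait TransactionStatus
object TransactionStatus {
  case object Completed                 extends TransactionStatus
  case object Failed                    extends TransactionStatus
  final case class Unknown(raw: String) extends TransactionStatus
}

val logger = Logger("payments")

def handleStatus(status: TransactionStatus): Unit = status match {
  case TransactionStatus.Completed    => () // normal processing (omitted)
  case TransactionStatus.Failed       => () // normal processing (omitted)
  case TransactionStatus.Unknown(raw) =>
    // We keep the service alive, but today this broken assumption is just a logline
    logger.warn(s"Unknown transaction status: $raw")
}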

The simplest solution to this conundrum is to define an alert for each problem by configuring a relevant OpenSearch query. The issue with this approach is that it’s still a lot of work and error prone, as the alerts would have to match log messages or other parts of log entries that can change over time. With this approach you could end up in a situation where you’ve improved a logline but inadvertently broken the alerting.

An alternative solution is to simply alert on any ‘ERROR’ log. Unfortunately, this approach is quite inflexible when it comes to separating errors that are ‘expected’ — e.g. external API outages — from those that are not.

The alternative approach we have taken is to bring alerts closer to code.

How we log

Before we dive into the solution, we need to shed some light on a few other pieces of infrastructure we use for logging.

First of all, we use structured logging with JSON as the format, and we do this with logstash-logback-encoder. This Logback plugin gives us the ability to output JSON messages and control how they look. We use Logback through its Slf4j API, with scala-logging as a wrapper on top of it.

All this being stated, we now have enough context to further explore our alerting solution.

Bringing alerts closer to code

What was our goal again? Ah yes, to be able to raise new alerts and control their behavior, with as little effort as possible.

To achieve this we leverage LogstashMarker, which is an implementation of the Slf4j Marker interface.

What does this mean? It means we can write:

import com.typesafe.scalalogging.Logger
import net.logstash.logback.marker.Markers

val logger = Logger("my-logger")
logger.warn(Markers.append("alert", "true"), "Message")

And the output JSON will contain an alert field with the value true. The next step is to create a generic alert that will react to such marked log messages; its definition follows the sample logline below. (In the alert example we’ve omitted some irrelevant fields.)
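A minimal, hypothetical example of such a logline (the exact set of fields depends on the encoder configuration):

{"@timestamp": "2023-01-26T10:00:00.000Z", "level": "WARN", "logger_name": "my-logger", "message": "Message", "alert": "true"}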

payments-generic-alert:
  name: Generic log alert
  filter:
  - query_string:
      query: >-
        _exists_:alert
  alert:
  - opsgenie:
      opsgenie_priority: P1
      opsgenie_subject: "Generic: {}"
      opsgenie_subject_args:
      - "message"
  alert_text_type: exclude_fields
  alert_text: |-
    *{0}*
  alert_text_args:
  - "message"
  realert:
    hours: 2

Success! Now each properly annotated logline will be delivered to OpsGenie!

With the PoC done, let’s now explore what else we can do to make it actually usable.

The ultimate solution

Building on the example from the previous section, a basic concept presented itself: a very flexible and powerful alerting utility.

Let’s first introduce the implemented features, which give context to the code presented later in the article.

  1. Control priority from code. Alerts have different priorities — from P1 to P5 in OpsGenie — and we would like to control them from the same place where we mark the logline.
  2. Support for separate teams. We work in self-organising teams, each of which is free to put in place different alerting policies and processes. This means separate OpsGenie teams and dedicated alert definitions.
  3. Support for service ownership. We work in a multi-service environment with some teams owning certain services. Routing the alerts from these services to their dedicated teams should be simple.
  4. Support for shared services. Some services are shared between different teams, and in this case we want to have the ability to route particular alerts to the responsible team.
  5. Alerts separation. With the naive example above, only one OpsGenie alert will be created every 2 hours, even if the application outputs different loglines in this time frame. We want each logline to be treated separately.
  6. Direct link to log browser. Once the alert is raised, it’s nice to have the ability to quickly navigate to Kibana, which we use to see all the remaining details.
  7. Direct link to trace browser. Similarly to the above, we want to be able to navigate to Grafana Tempo/Jaeger to see the whole trace that resulted in the alert.
  8. Providing details for an alert. Sometimes there is contextual information that should be delivered alongside the alert but which would clutter the log message if placed there.

To facilitate these requirements we implemented a tiny wrapper around Logstash markers and an accompanying alert template. Let’s look at these two pieces of code.

First the Scala part:

import enumeratum._
import net.logstash.logback.marker.{LogstashMarker, Markers}
import sourcecode.{FileName, Line}

object AlertMarker {

  def p1: Builder = new Builder(Priority.P1)
  def p2: Builder = new Builder(Priority.P2)
  def p3: Builder = new Builder(Priority.P3)
  def p4: Builder = new Builder(Priority.P4)
  def p5: Builder = new Builder(Priority.P5)

  case class Team(value: String) extends AnyVal

  object Team {
    val payments: Team = Team("payments")
    val platform: Team = Team("platform")
  }

  class Builder(priority: Priority) {
    def apply(
        alertIdentifier: String = null,
        details: String = null,
        team: Team = Team(null)
    )(implicit file: FileName, line: Line): LogstashMarker = {
      // Defaults the identifier to "file:line", so every call site gets a stable, unique key
      val source = s"${file.value}:${line.value}"
      val effectiveIdentifier = Option(alertIdentifier).getOrElse(source)
      Markers
        .aggregate(
          Internal.priority(priority),
          Internal.identifier(effectiveIdentifier),
          Internal.source(source),
          Option(details).fold(Markers.empty())(Internal.details),
          Option(team.value).fold(Markers.empty())(Internal.team)
        )
    }
  }

  object Internal {
    def priority(priority: Priority): LogstashMarker = Markers.append("sb_alert_priority", priority.entryName)
    def identifier(alertIdentifier: String): LogstashMarker = Markers.append("sb_alert_identifier", alertIdentifier)
    def source(source: String): LogstashMarker = Markers.append("sb_alert_source", source)
    def details(details: String): LogstashMarker = Markers.append("sb_alert_details", details)
    def team(team: String): LogstashMarker = Markers.append("sb_alert_team", team)
  }

  sealed trait Priority extends EnumEntry
  object Priority extends Enum[Priority] {
    case object P1 extends Priority
    case object P2 extends Priority
    case object P3 extends Priority
    case object P4 extends Priority
    case object P5 extends Priority

    override def values: IndexedSeq[Priority] = findValues
  }

}

This utility allows us to easily mark the alerting loglines, for example:

logger.warn(AlertMarker.p2(), "This is bad")

It is important to note that this utility is just a helper for creating ‘Marker’ instances, and can therefore be used with many different logging libraries. This is an obvious advantage when compared to creating a dedicated wrapper for each logging library.
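For instance, the very same marker could be passed to a plain Slf4j logger, without scala-logging in between (a minimal sketch):

import org.slf4j.LoggerFactory

// AlertMarker produces a standard Slf4j Marker, so any Slf4j-compatible API accepts it
val plainLogger = LoggerFactory.getLogger("my-logger")
plainLogger.warn(AlertMarker.p3(), "Plain Slf4j call with the same alert marker")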

We will discuss it in detail soon, but let’s first see the ElastAlert part:

payments-generic-alert:
  name: Generic log alert
  index: vector-*
  type: any
  filter:
  - query_string:
      query: >-
        _exists_:sb_alert_priority
        AND (kubernetes.container_name:"foo-service" OR kubernetes.container_name:"bar-service" OR sb_alert_team:"baz-team")
  alert:
  - opsgenie:
      opsgenie_priority: "{sb_alert_priority}"
      opsgenie_key: ...
      opsgenie_subject: "Generic: {}"
      opsgenie_subject_args:
      - "message"
  alert_text_type: exclude_fields
  alert_text: |-
    *{0}*
    {6}
    <https://jaeger/trace/{3}|Trace>, Kibana link available in OpsGenie
    Service: {5}
    Source location: {2}
    Alert identifier: {4}
    Stack trace:
    {1}
  alert_text_args:
  - "message"
  - "stack_trace"
  - "sb_alert_source"
  - "ctx.trace_id"
  - "sb_alert_identifier"
  - "kubernetes.container_name"
  - "sb_alert_details"
  opsgenie_details:
    Kibana Url: { field: kibana_discover_url }
  query_key: sb_alert_identifier
  generate_kibana_discover_url: True
  kibana_discover_app_url: …
  kibana_discover_index_pattern_id: ….
  kibana_discover_version: …
  realert:
    hours: 2

Now that everything’s in place, let’s explain how this machinery fulfils the previously listed requirements (a combined usage example follows the list).

  1. Control priority from code
    ‘AlertMarker’ exposes the ‘p1’ to ‘p5’ methods that populate the ‘sb_alert_priority’ field, which is then passed directly to ElastAlert in the ‘opsgenie_priority’ field.
  2. Support for separate teams
    Each team is expected to put in place its own copy of the alert template and tailor it to its needs. At this point they can set a dedicated ‘opsgenie_key’ that will deliver the alert to the relevant team in OpsGenie.
  3. Support for service ownership
    The ‘filter’ field of the alert definition can be configured so that all messages from the given services are captured. As before, the filter is specific to the given team. The name of the service is provided automatically by the log collector.
  4. Support for shared services
    ‘AlertMarker’ methods accept a ‘team’ parameter that is then exposed as the ‘sb_alert_team’ JSON field. This field is then included in the ‘filter’ to select all the alerts meant for a particular team.
  5. Alerts separation
    ElastAlert has a configuration field ‘query_key’ that makes alerts be treated separately — e.g. the ‘realert’ setting is applied separately to each distinct key. We populate this field with the JSON field ‘sb_alert_identifier’. Its value can be set manually, but this would be very cumbersome. Instead we rely on the com-lihaoyi/sourcecode library, with which we automatically set the identifier to ‘file_name:line_number’.
  6. Direct link to log browser
    ElastAlert comes with a way to generate Kibana links, which we use.
  7. Direct link to trace browser
    We put the trace_id in the logged JSON, making it really simple to generate the right URL.
  8. Providing details for an alert
    This is simply another parameter to our wrapper that is propagated to the JSON and can be embedded in the alert body.
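Putting several of these together, a single call site might look like this (the identifier, details and message are made up for illustration; ‘logger’ is the scala-logging logger from the earlier examples):

val iban = "XX00INVALID" // hypothetical value returned by a provider

logger.error(
  AlertMarker.p1(
    alertIdentifier = "invalid-iban-from-provider", // stable key instead of file:line (point 5)
    details = s"Provider returned IBAN: $iban",     // extra context in the alert body (point 8)
    team = AlertMarker.Team.payments                // routed to the payments team (point 4)
  ),
  "Provider returned an invalid IBAN"
)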

Summary

As you can see, we have glued together many different pieces of infrastructure with a minimal amount of code to build a utility that allows us to increase the alert coverage of the system by an order of magnitude.

The result of this thought process and building is a simpler, faster and less cumbersome alert system that fits our needs and wants.

Sometimes the better way is to break things down into their elements and build them back up, with as little as 50 lines of code. 🛎️

