Metrics @ Robinhood — Part Three

Aravind Gottipati
Robinhood
Apr 25, 2017


In the previous post, we described our monitoring and alerting setup, and how Riemann fits into the mix. In the final post of this series, we will explain our Riemann setup, and go over how our engineers use it. This post assumes that you have an understanding of how Riemann works and how it consumes and processes events. Familiarity with the Riemann API, Events, and Streams will help you get the most out of this post.

We are releasing our Riemann configs in the hope that they will be useful to the community. We are not Clojure experts, so please be gentle with your feedback. Pull requests and suggestions to improve this are welcome. Our configuration is heavily geared towards examining metric streams from TCollector. You should be able to adapt it to other metric streams, but we currently assume the TCollector output format in the metric stream. We expect the metric stream to contain messages that look like this:

put cpu.utilization 1491003331 95.30 type=idle host=foo.example.com roles=dev-box

Event ingestion

Riemann ships with a few different transports, which are its endpoints for ingesting metric data. The built-in transports include Graphite, OpenTSDB, raw TCP and UDP sockets, and a websockets endpoint. As described in the first post, our metrics are stored in Kafka. We use a console consumer, do some sanity checks, and pipe the output into Riemann using netcat. Most of the Riemann transports let you specify a custom parser function, which gives you the ability to transform incoming messages into arbitrary event structures as you see fit. The default OpenTSDB parser in Riemann transforms the above metric line into an Event structure that looks like this.

{
:service "cpu.utilization type=idle roles=dev-box",
:metric 95.30,
:host "foo.example.com",
:roles "dev-box",
:type "idle",
:description "cpu.utilization",
:time 1491003331
}

While this Event structure is fine to use in Riemann directly, our developers are used to thinking of the above metric as an event with the name “cpu.utilization”. We first plot these metrics in Grafana and use those graphs to arrive at alert thresholds. Similar to OpenTSDB, we treat the type, host and roles as arbitrary tag and value pairs associated with this metric event.

To support filtering metrics by their name, we add a "metric-name" field to the Event. In our environment, each machine is assigned a role (web server, database server, etc.), and in some cases machines have multiple roles. These roles are tacked onto every single event generated by the TCollector daemon on the box. We transform the roles in the Events into a list of Riemann Tags: this makes it easier to filter against them and to use the built-in tag functions in Riemann. We use a custom opentsdb-parser function to add these fields to the Events generated by the default Riemann OpenTSDB transport.
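To illustrate the idea (this is not our exact production code), a parser along these lines could be hooked into the OpenTSDB transport, assuming it accepts a :parser-fn option the way the Graphite transport does; the tcollector-parser name and the comma-splitting of roles are assumptions made for this sketch.

; Sketch of a custom parser function (illustrative only). It assumes the
; :parser-fn hook receives the event map produced by Riemann's default
; OpenTSDB parsing, as shown above.
(require '[clojure.string :as str])

(defn tcollector-parser
  "Add a :metric-name field and turn the roles value into Riemann tags."
  [event]
  (let [metric-name (:description event)                   ; e.g. "cpu.utilization"
        roles       (some-> (:roles event) (str/split #","))]
    (-> event
        (assoc :metric-name metric-name)
        (update :tags #(vec (concat % roles))))))

; Hooking it into the OpenTSDB transport in riemann.config:
; (opentsdb-server {:host "0.0.0.0" :parser-fn tcollector-parser})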

Rule definitions

Riemann is usually set up and configured using Clojure code. You use the Riemann API to define Streams that can filter and react to events. Our developers are not familiar with Clojure, and asking folks to learn Clojure just to write rules for their applications was a non-starter.
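To give a sense of what writing rules directly against the Riemann API looks like, here is a small hand-written stream; the threshold and the use of prn as the child stream are made up for illustration.

; A hand-written Riemann stream (illustrative only): print every event whose
; metric-name is cpu.utilization and whose value exceeds 95.
(streams
  (where (and (= (:metric-name event) "cpu.utilization")
              (> metric 95))
    prn))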

To help with adoption, we have defined a small DSL that wraps some of the Riemann primitives to make it easier for our developers to define the alerts they need. Our metric rules now look like Clojure hash maps (at least the simple ones do). We have helper functions that take these maps and generate Riemann Streams from them. As with most DSLs, the downside is that it is fairly limited in what you can express. In our case, it allows our developers to easily define rules for the majority of our use cases without having to understand much Clojure or Riemann. It also gives us a uniform platform that abstracts away the details of consuming metric events, defining custom alerts, and so on. The DSL syntax and usage are explained in this presentation; we use the same presentation internally to introduce Riemann and the DSL rules to engineers looking to write new monitoring rules.

Rule example

In one of our applications, we track the number of trades that have been processed with a Statsd counter named nlsreader.recognized_trades. Statsd calculates the “count_rate” automatically for this counter and generates a metric line that looks like this.

put nlsreader.recognized_trades 1491003331 124 value_type=count_rate host=marketserver.internal roles=market-server

The following is a Riemann DSL rule to filter and alert on events like this.

{
:metric-name "nlsreader.recognized_trades"
:metric [100 200]
:value_type "count_rate"
:thread-through [when-lower-r market]
:alert [(slack-alert "#markets")]
:modifiers {
:description "NASDAQ trades per second"
:flap-threshold 3
}
}

The above rule definition is a valid Clojure hash map, and this is about the extent of Clojure a developer needs to know to define simple alerts in our environment. The :thread-through key in the hash map above defines a sequence of filters that have to match for this rule to alert. This rule is translated into a Riemann Stream with the following properties (a simplified sketch of that translation follows the list):

  • It filters for events which have metric-name set to nlsreader.recognized_trades
  • The Event should have the :value_type field set to “count_rate”.
  • The filter is active only during US NYSE market hours.
  • The filter alerts when the value is lower than 100, and recovers when the value is greater than 200.
  • The final alert will have a description “NASDAQ trades per second”.
  • The value has to cross the threshold three consecutive times for the alert to fire/recover.
  • It posts a message in our #markets Slack channel when the alert triggers or recovers.
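To give a flavor of that translation, a stripped-down helper along these lines could turn a rule map into a Riemann stream. The rule->stream name and the simplified threshold logic below are ours for illustration; the real DSL also handles the market-hours filter, flap thresholds, and notification routing.

; Illustrative sketch only: turn a simplified rule map into a Riemann stream.
; Assumes events carry the :metric-name and :value_type fields we add at
; ingest time, and that notify! stands in for the real Slack/paging helpers.
(defn rule->stream
  [{:keys [metric-name value_type] [low high] :metric} notify!]
  (where* (fn [event]
            (and (= (:metric-name event) metric-name)
                 (= (:value_type event) value_type)))
    (fn [event]
      (cond
        (< (:metric event) low)  (notify! (assoc event :state "critical"))
        (> (:metric event) high) (notify! (assoc event :state "ok"))))))

; Usage inside (streams ...), e.g. printing instead of alerting:
; (streams (rule->stream {:metric-name "nlsreader.recognized_trades"
;                         :value_type  "count_rate"
;                         :metric      [100 200]} prn))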

We have a few other Riemann primitives wrapped in helper functions to handle our common use cases; take a look at base rules, folds, percentiles, and hours. The DSL also has the ability to combine multiple single metric rules into a more complex Stream. Take a look at our nginx error rate rule for an example of that. There are notification helpers for paging, logging and e-mailing as well. Any combination of these notification helpers can be used in the :alert line in the definition above.
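For reference, the kind of built-in Riemann primitives these helpers build on look like the following; the api.latency metric name, the window lengths, and the use of prn instead of a real alert are made up for this sketch.

; Built-in Riemann streams of the sort our helpers wrap (illustrative only):
; emit 50th/95th/99th latency percentiles every 5 seconds, and the mean over
; 10-second windows, for a hypothetical api.latency metric.
(streams
  (where (= (:metric-name event) "api.latency")
    (percentiles 5 [0.5 0.95 0.99]
      prn)
    (fixed-time-window 10
      (smap folds/mean
        prn))))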

Event viewer

Riemann recommends a pretty bare-bones dashboard called Riemann-dash to view the metrics flowing into the Riemann Index. For the most part, we don't use the Riemann index directly; we don't have rules or alerts that query or manipulate it. We also found the dashboard hard to work with, so instead we run a local, single-node Elasticsearch instance on the Riemann server to view events. We push 50% of all the events flowing into Riemann (from Kafka) into this Elasticsearch instance. We don't need to record the entire flow, since we only use it occasionally to look at the event flow into the system. We run this (Riemann + Elasticsearch) on an m4.4xlarge instance in EC2; we push about 20k events/s through Riemann at peak, and sending the entire flow into the local Elasticsearch instance was too much for it to handle. The data we really care about in Elasticsearch is the current state of our alerts, and we push all of those events (100%) into Elasticsearch. A few things to note about our Elasticsearch module (a simplified sketch of the output follows this list):

  • We overwrite documents in the Elasticsearch index. This keeps the size of the Elasticsearch index in check and also gives us a "state of the world now" kind of view into our checks/alerts.
  • This also means that there is no notion of historical status/view into our events. We use OpenTSDB/Grafana for looking at longer term metric trends.
  • Our Elasticsearch module uses async queues to keep it from slowing down the main Riemann process.
  • Our current TTL for events is 5 minutes; we throw away events that arrive more than 5 minutes after they were generated.
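For a sense of what such an output can look like, here is a simplified sketch built on Riemann's async-queue! with clj-http and cheshire; our actual module differs, and the index name, document-id scheme, and sampling approach below are assumptions made for illustration.

; Simplified sketch of an Elasticsearch output for Riemann (illustrative only).
; Assumes clj-http and cheshire are available on the classpath.
(require '[clj-http.client :as http]
         '[cheshire.core :as json])

(defn es-upsert
  "Index an event, overwriting any previous document for the same host/service
   so the index only holds the current state of the world."
  [event]
  (let [doc-id (str (:host event) "_" (:service event))]
    (http/put (str "http://localhost:9200/riemann/events/" doc-id)
              {:body (json/generate-string event)
               :content-type :json})))

; Wrap the writer in an async queue so slow Elasticsearch calls never block
; the main Riemann streams, and sample roughly half of the event flow.
(def es-writer
  (async-queue! :es {:queue-size 10000 :core-pool-size 4 :max-pool-size 16}
    (fn [event]
      (when (< (rand) 0.5)
        (es-upsert event)))))

; Usage inside (streams ...):
; (streams es-writer)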

We hope this series of posts on our metric and alerting setup has been useful. As we mentioned initially, this setup is geared towards OpenTSDB-style metrics and alerting based on such metric streams. You should be able to adapt this system to other metric styles like Graphite. The Riemann configs released by the Guardian show good examples of how you could set up Streams directly. We chose to wrap the Stream setup and instead expose a DSL (helper functions) to our developers. If yours isn't a Clojure shop, we hope this makes integrating Riemann into your stack easier. If you find this kind of stuff interesting, we are hiring :)
