Keeping your infrastructure alive with Riemann
Riemann (http://riemann.io) is an awesome tool, there’s no doubt about it. However, its stiff learning curve and the need for Clojure knowledge just to configure it make it rather unpopular.
At Mintly we managed to integrate it as our monitoring and alerting framework, and it works flawlessly. A decent amount of blood and sweat went into setting it up, though. The point of this blog post is to guide potential users.
Gaining knowledge about Riemann
The first major problem you will encounter is the lack of documentation. I mean a serious lack of it.
I have found this book: https://www.artofmonitoring.com to be ridiculously helpful. For anyone considering Riemann, it is a must-read.
After (and only after) this introduction to Riemann, it is worth watching the talk by the creator himself: https://vimeo.com/131385889 . The reason I recommend watching it later is that Kyle is quite a fast speaker and there is a lot of information flying by (illustrated with really nice pictures); it is hard to absorb it all correctly without some actual background.
How is Riemann different?
This is the first and by far the most frequent question. Before getting into it, some theory needs to be put on the table.
There are three types of monitoring [1]:
- No monitoring/manual monitoring: you check whether the system is alive with manual tooling.
- Reactive monitoring: a service periodically polls and checks whether a concrete server is alive. Nagios is one of the most popular examples.
- Proactive monitoring: each service periodically reports its own state to some central machine.
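To make the proactive model concrete, here is a minimal sketch of a service pushing its own heartbeat, using what I believe is Riemann's bundled Clojure client (the host and service names are made up):

```clojure
; A sketch of proactive monitoring: the service itself pushes its state
; to a central Riemann node instead of being polled.
(require '[riemann.client :as client])

(def c (client/tcp-client :host "riemann.example.com"))

; Report a heartbeat from the service's main loop:
(client/send-event c {:host    "web-1"
                      :service "orders heartbeat"
                      :state   "ok"
                      :metric  1
                      :ttl     30}) ; if no event arrives within 30s, it expires
```

If the heartbeats stop arriving, the central node notices the expiry on its own: nobody has to poll the service.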
Monitoring strategy depends on combined maturity of:
- Product
- Team
- Company
Only when all of these parts are mature (the company cares, the developers care, and the product itself is valuable enough) does a proactive monitoring solution get used. And such solutions are quite popular these days: most developers have heard of the ELK stack and Graphite by now.
The interesting (and far less known) part is that alerting follows the same rules. And while there are lots of tools for proactive monitoring, only a few exist for actual proactive alerting, and they all tend to be fairly complex. Perhaps the best known (although the feature itself stays mostly in the shadows) is Logstash, which is capable of it given some extra setup. From this perspective, Riemann brings proactive alerting that is both flexible and fast.
How do proactive and reactive alerting differ?
In short, with reactive alerting you can only alert on trends. You cannot (for the most part) react to a single event, because you are working with processed data that is often aggregated. Lots can be debated on this topic, and in most cases this is more than enough: the usual data you want alerts on is all about trends. E.g. HDD space is running low: alert; server load is too high: alert; the service’s 99th percentile response time is too high: alert. But if you have, say, 1000 events per second, at the 99th percentile you are losing 10 events. Even at the 99.99th percentile you can potentially lose events. This may be fine in most cases, but what if every event is business critical? I guess banks wouldn’t agree with ignoring a few events once in a while.
This is where proactive alerting comes in. It works like a filter: every single event goes through it.
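In Riemann terms such a filter is just a stream; a minimal sketch (the service name and email address are made up):

```clojure
; Every single event flows through this filter; any critical payment
; event triggers an alert immediately, with no aggregation in between.
(streams
  (where (and (service "payments") (state "critical"))
    (email "oncall@example.com")))
```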
It’s time to clear up the confusion
Riemann itself is not yet another monitoring framework. It is actually an event processing and routing tool which, by standing on the shoulders of giants (ELK, Graphite etc.), can form a monitoring infrastructure, and which can alert on raw events, again by standing on the shoulders of giants: writing a message to Slack/Hipchat, sending an email, or triggering PagerDuty. So it is usually an addition to the system rather than a change.
Problems integrating into existing environment
So the main concern is where to put it. If the infrastructure is fresh, there should be no problem adding a Riemann client to each piece of software and sending everything directly. The more likely scenario, however, is that you have an aggregator speaking the StatsD protocol. Riemann is kind of a replacement for StatsD (it does everything StatsD does, plus more), but it doesn’t accept messages in that format (it accepts the Graphite format). If you’re not afraid of introducing a JVM stack, Riepete can be a solution.
Working with passive (pull) type metrics
In some tools, pushing metrics to Riemann is harder than in others, for example MySQL, PostgreSQL and Redis, yet these metrics can be very valuable. For such cases Riemann itself is a bad fit (the same goes for Graphite and ELK). The solution might be more trivial than you think: you simply need a data collector. CollectD does an awesome job here. For things you cannot get from CollectD (e.g. a plugin is missing, or doesn’t work the way you want), there is Riemann Tools, basically a set of Ruby apps querying metrics from various sources.
Making it simple
Most of the stuff mentioned above can (and should) sound a bit frightening: certainly too much overhead for small projects (remember the maturity rule? The project needs to be mature as well). To make it simpler (and to avoid running a JVM + Ruby stack just for monitoring), we created Oshino.
Oshino is StatsD + CollectD + Riemann Tools in one (although CollectD is still relevant for hardware metrics). It can accept push metrics (in StatsD format) and gather pull metrics (by querying various sources).
It may be a bit opinionated, because there are no outside contributors at the moment.
Riemann’s setup
This is the most controversial part, since the config is infamously written as Clojure code. However, it is not as bad as it sounds: the Clojure syntax involved is no bigger than most DSLs, all the magic happens in the logic, and the author did a great job making that logic flow as simple as possible.
Full setup goes like this:
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server {:host host})
  (repl-server {:host "127.0.0.1"}))
; Expire old events from the index every 15 seconds.
(periodically-expire 15 {:keep-keys [:host :service :tags :metric]})
(require '[mintly.etc.influxdb :refer :all])
(require '[mintly.etc.email :refer :all])

(let [index (index)]
  (streams
    (default :ttl 60
      index
      (where (tagged "deploy")
        index
        #(info %)
        persist-influxdb)
      (where (tagged-any ["statsd" "riepete" "sincity" "view" "collectd" "db" "auth" "api"])
        index
        persist-influxdb)
      (where (tagged "traffic")
        index
        persist-influxdb-w-tags)
      (where (tagged "discovery")
        (changed :state
          (rollup 2 3600
            (email "alerts@mintly.eu"))))
      (where (tagged-all ["oshino" "heartbeat"])
        (changed :state
          (email "alerts@mintly.eu"))))))
Yes, I know, it’s very confusing at first sight. So let’s break it down.
First of all
(let [index (index)]
(streams
...
))
Consider this just the setup, like the “main” function in other programming languages.
Another thing: each expression looks like this
(<function> <arg1> <arg2> ... <arg#>)
So the default expression is actually a function, “default”, which sets the :ttl (time to live) field to 60 if it is not already set:
(default :ttl 60 ...)
The actual logic begins with the “where” expressions: at this point you branch each event into separate streams, where it is processed accordingly.
The “(where (tagged …” expressions seem self-explanatory. A more interesting one is:
(changed :state ...)
With “:state” you usually pass information about the state of an object; e.g. a service can be “down” or “up”, or you can measure the average request/response time and split it into “error”, “warning” and “ok”. While the state remains constant, no events are triggered, but when it changes you get notified. This way you see when your service goes down and when it comes back up. For us it has proven to be an extremely useful mechanism.
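A sketch of the pattern (the service name is made up; the :init option, as I understand the standard (changed …) stream, keeps the very first event from triggering a false alert):

```clojure
; Alert only on transitions: "up" -> "down" and "down" -> "up".
; :init "up" treats the first event as if the previous state was "up".
(where (service "api health")
  (changed :state {:init "up"}
    (email "alerts@mintly.eu")))
```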
These alerts can quickly turn into spam. To mitigate that, there are two awesome functions: (throttle <events-count> <time-span>) and (rollup <events-count> <time-span>).
Both allow X events to be sent within a defined time period. The difference is subtle: throttle drops overflow events completely, while rollup aggregates them and later sends them as a list. Rollup sounds nicer, but at a cost: each stored event eats some RAM. More advanced scenarios combine the two, letting you aggregate a bounded number of events (e.g. 1000) without overflowing memory.
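A sketch of that combination (the tag is made up, the address is from our config above):

```clojure
; throttle bounds how many events rollup ever has to store (memory stays
; capped at roughly 1000 events per hour), while rollup turns those into
; at most two digest emails per hour instead of a flood of single alerts.
(where (tagged "alerts")
  (throttle 1000 3600
    (rollup 2 3600
      (email "alerts@mintly.eu"))))
```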
(require '[mintly.etc.influxdb :refer :all])
(require '[mintly.etc.email :refer :all])
These are our custom code imports: the influxdb part consists of write functions to the database, and the email one uses our external API to send emails.
NOTE: InfluxDB is used here instead of Graphite for reasons that are not relevant at the moment; the two can be used in the same manner.
Self Healing Infrastructure
One somewhat unique feature of Riemann is that you can use it to heal your infrastructure, simply by executing commands from an external JAR under certain conditions. That JAR can be written in any JVM-based language, but of course Clojure or Java cause the fewest problems.
Extra JARs can be added by updating Riemann’s sysconfig file with a line like this:
EXTRA_CLASSPATH=/path/to/your/jar
You then include this code via Clojure’s import statement and call it like normal functions.
One actual use case could be spawning an extra AWS instance at peak times to handle the load. With a little bit of API juggling it is quite easy to do and brings a lot of value. However, be aware that blocking procedures can stall Riemann’s event cycle; to prevent that, you need to write code that deals with this in an async fashion.
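As a sketch (every name here is hypothetical; the point is the (future …) wrapper keeping the blocking AWS call off the stream thread):

```clojure
; com.example.Scaler is a made-up class from a JAR on EXTRA_CLASSPATH.
(import 'com.example.Scaler)

(where (and (service "api load") (> metric 0.9))
  (throttle 1 600 ; at most one scaling action per 10 minutes
    (fn [event]
      ; future runs the blocking API call asynchronously, so the
      ; event cycle keeps flowing while AWS spins up the instance
      (future (Scaler/spawnInstance "web-tier")))))
```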
Scaling Riemann
It’s quite clear that a single node will not quite cut it; no one wants a nexus point. However, Riemann does not provide any trivial path for that, it only gives you the ability to build it yourself. This ability comes from the fact that Riemann can both receive and send on a TCP port. Looking at it from a slightly different angle, you realize there is nothing stopping you from sending the same event from node A to node B.
In general you want a master-slave type structure. You add slave instances on each machine, make them react to local problems (e.g. HDD space running low on that node) and send the more global ones to the master node. To make it more resilient you can run several master nodes, load balance them, set up replication between them, etc. It’s up to your knowledge and imagination.
Config usually looks like this:
(let [index (index)
      downstream (async-queue!
                   :agg-forwarder                 ; a name for the forwarder
                   {:queue-size     1e4           ; 10,000 events max
                    :core-pool-size 4             ; minimum 4 threads
                    :max-pool-size  100}          ; maximum 100 threads
                   (forward
                     (riemann.client/tcp-client :host "<address of riemann master node>")))]
All this “async-queue!” stuff is a bit confusing. The structure looks like this:
(async-queue! :<name of this queue> {<queue parameters>}
(<code to execute>))
Later, you simply execute (either in main code branch, or after some “where” statement):
(batch 100 1/10 downstream)
The “downstream” stream on its own is good enough to deliver events to the master; to do it more efficiently, though, we group events into batches (here up to 100 events, flushed at least every 1/10 of a second).
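For example, a sketch of a slave that handles everything locally but forwards only error events to the master (assuming the “downstream” binding from the let above):

```clojure
; Only error events are batched (up to 100, flushed at least every
; 0.1 s) and forwarded upstream; the rest stays on this node.
(streams
  (where (state "error")
    (batch 100 1/10 downstream)))
```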
Security
At this point some people may start worrying about security, since you could be sending valuable information. Riemann’s strategy here is very simple: it accepts TCP, UDP and WebSockets, and allows TLS-encrypted connections. That’s all.
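A sketch of a TLS-enabled TCP server (the file paths are placeholders; as far as I know, the key must be in PKCS#8 format):

```clojure
; TLS variant of the tcp-server from the setup above.
(tcp-server {:host    "0.0.0.0"
             :port    5554
             :tls?    true
             :key     "/etc/riemann/server.pkcs8"
             :cert    "/etc/riemann/server.crt"
             :ca-cert "/etc/riemann/ca.crt"})
```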
Benchmarks a.k.a. I want to shoot all my events to Riemann
I have yet to see better-optimized JVM usage; Riemann squeezes a lot out of the Netty framework. Still, it all boils down to the limitations of the hardware, the OS and the JVM itself, and it also depends on your config. Semi-official benchmarks can be found here: https://aphyr.com/posts/279-65k-messages-sec . Legend says that on production-tier hardware a single node can handle up to a few million requests per second. I can neither prove nor deny that until I see it with my own eyes.
To put it bluntly: if you run a Riemann instance per node, it is very unlikely that you’ll hit its peak before your service does. Still, it would be nice to hear about any experiences with that.
Sum Up
In general, Riemann is a very flexible and useful tool, but it requires some love. A basic setup is mostly easy; for a complex one, however, you might need somewhat deeper Clojure knowledge.
I would recommend it in situations where there is a real need to make sure every single event is handled correctly. Not many products require that; when one does, however, you can expect to be seriously challenged.
[1] “The Art of Monitoring”, James Turnbull