riemann.io

I’ve recently started to look into Riemann and how this “thing” works. Since we’re already using it at work, it makes sense having some kind of “basic” understanding of what’s going on and how certain things work.

To keep things simple, I’d like to do the usual basic things…

  • I want to be able to see the metrics in a dashboard, per service only (currently not interested in per-host metrics for the dashboard)
  • When a metric comes in from a service (single or multiple hosts), I want to aggregate that metric and show the 50, 75, 90, 99 and 99.9 percentiles.
  • When the 99th percentile exceeds a certain threshold, I’d like riemann to set the state to “critical” and log the event (for now); when it drops back below that threshold, it should set the state back to “ok” and log that event as well
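To make these goals a bit more concrete, here is a rough Python sketch of what the aggregation and the threshold check boil down to. This is not how Riemann does it internally; the function names (percentile, summarize, state_for) are made up for illustration, and the nearest-rank percentile method is an assumption on my part.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest sample with at least p% of all samples at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    # Rank is 1-based; clamp to 1 so very small p still returns the minimum.
    rank = max(1, math.ceil(p * len(ordered) / 100.0))
    return ordered[rank - 1]


def summarize(values, points=(50, 75, 90, 99, 99.9)):
    """Map each requested percentile to its value for one batch of metrics."""
    return {p: percentile(values, p) for p in points}


def state_for(p99, threshold):
    """'critical' while the 99th percentile is above the threshold, otherwise 'ok'."""
    return "critical" if p99 > threshold else "ok"


# Metrics from all hosts of one service, thrown into a single batch.
samples = list(range(1, 101))
summary = summarize(samples)
print(summary)
print(state_for(summary[99], threshold=80))
```

The point is only that metrics from all hosts of a service go into one batch, and the state is derived from the 99th percentile of that batch.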

(!) Work in progress (!)

things I write / say here may be completely whack, go against any good practice in programming / operations, clojure and/or riemann configuration and may be incomplete.. you’ve been warned — so use the below information with caution, please. DO NOT (in like ever) copy and run stuff from this page on a production Riemann ;)

Riemann is written in Clojure (a Lisp dialect, apparently) and is largely functional: it doesn’t keep state (it does, but it doesn’t; more on that potentially later).

It really helps if you already know Clojure. If, like me, you’re not a developer and don’t know anything about Clojure, it’s rather hard to understand how things work within Riemann.

For simplicity, and my sanity, let’s just say that everything that happens in Riemann is a sequence of nested functions applied to “data” (metrics to be precise). These functions don’t keep state, but they can create new “data” as a result.

data -> 
function(data) ->
function(data) ->
newdata ->
function(newdata) -> expired
data ->
function(data) -> expired

In Clojure, that can look roughly like this (pseudocode, not a working config):

(function data
  (function data
    (let[newdata] rate 5 (with :data))))
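If the nested parentheses don't click yet, the same idea can be sketched in Python: every "stream" is a function that takes an event and hands it (or a modified copy) to its children. The helpers where and with_fields below are hypothetical Python stand-ins I made up for this sketch, not Riemann's actual API.

```python
def where(predicate, *children):
    """Pass an event on to the children only if the predicate matches."""
    def stream(event):
        if predicate(event):
            for child in children:
                child(event)
    return stream


def with_fields(fields, *children):
    """Pass a modified copy of the event on; the original stays untouched."""
    def stream(event):
        new_event = dict(event, **fields)
        for child in children:
            child(new_event)
    return stream


seen = []  # stands in for "the index" or "a log"

# Nested functions, read from the outside in, just like the Clojure above.
pipeline = where(lambda e: e["service"] == "dummy.time",
                 with_fields({"state": "ok"}, seen.append))

original = {"service": "dummy.time", "metric": 42}
pipeline(original)
pipeline({"service": "other", "metric": 1})  # filtered out by where
print(seen)
```

Note that no function changes the event it receives; with_fields creates new "data" instead, which is the "doesn't keep state, but can create new data" part.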

OK, that’s already starting to get confusing, so let’s skip ahead and just do something to get started.

Oh, before we do, you probably want to buy the book The Art of Monitoring by James Turnbull. I only bought it for the Riemann info in there, and it is quite cheap (but great quality!)

To make things easy, I’ve just run the Docker container that’s provided locally (using docker-machine) with some helper apps to push data to it.

You’ll either need Docker already installed and running and know how to do basic things with it, or you can follow the installation instructions on the riemann.io website.

run riemann docker

Create the directory $HOME/riemann/etc and put a file riemann.config in that directory with this content

; -*- mode: clojure; -*-
; vim: filetype=clojure

(logging/init :file "riemann.log")

; Listen on all interfaces over TCP (5555), UDP (5555), and websockets
; (5556)
(let [host "0.0.0.0"]
  (tcp-server {:host host})
  (udp-server {:host host})
  (ws-server {:host host}))

; Expire old events from the index every 60 seconds.
(periodically-expire 60)

(let [index (index)]
  ; Inbound events will be passed to these streams:
  (streams

    (default :ttl 60
      ; Index all events immediately.
      index
      ; print everything there is because we have no idea about whats going on
      prn)))
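To get a feeling for what the :ttl default and periodically-expire do together, here is a small Python analogy of an index that forgets events once their TTL has passed. It is only a sketch under my own assumptions (the Index class, the host/service key, and the explicit now parameter are made up for illustration), not Riemann's implementation.

```python
import time


class Index:
    """Toy index: keeps the latest event per (host, service) until its TTL runs out."""

    def __init__(self, default_ttl=60):
        self.default_ttl = default_ttl
        self._events = {}

    def update(self, event, now=None):
        now = time.time() if now is None else now
        stamped = dict(event)
        # Like (default :ttl 60 ...): only fill in a TTL if the event has none.
        stamped.setdefault("ttl", self.default_ttl)
        stamped.setdefault("time", now)
        self._events[(stamped["host"], stamped["service"])] = stamped

    def expire(self, now=None):
        """Remove and return every event whose time + ttl lies in the past."""
        now = time.time() if now is None else now
        stale = [key for key, e in self._events.items()
                 if e["time"] + e["ttl"] < now]
        return [self._events.pop(key) for key in stale]

    def __len__(self):
        return len(self._events)


idx = Index()
idx.update({"host": "host1", "service": "dummy.time", "metric": 1}, now=0)
idx.update({"host": "host2", "service": "dummy.time", "metric": 2, "ttl": 300}, now=0)
early = idx.expire(now=30)   # nothing is older than its TTL yet
gone = idx.expire(now=61)    # host1's event (default TTL of 60) is now stale
print([e["host"] for e in gone], len(idx))
```

In the real config, the expiry sweep would be what periodically-expire schedules every 60 seconds.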

now run the docker container …

docker run --rm -ti -p 5555:5555 -p 5556:5556 -p 5555:5555/udp -v $HOME/riemann/etc:/app/etc rlister/riemann

you should see some output similar to this

INFO [2016-06-21 21:11:43,170] main - riemann.bin - PID 1
INFO [2016-06-21 21:11:43,477] clojure-agent-send-off-pool-3 - riemann.transport.websockets - Websockets server 0.0.0.0 5556 online
INFO [2016-06-21 21:11:43,564] clojure-agent-send-off-pool-4 - riemann.transport.tcp - TCP server 0.0.0.0 5555 online
INFO [2016-06-21 21:11:43,571] clojure-agent-send-off-pool-1 - riemann.transport.udp - UDP server 0.0.0.0 5555 16384 -1 online
INFO [2016-06-21 21:11:43,573] main - riemann.core - Hyperspace core online

riemann dashboard

We naturally want to see what’s going in/out of Riemann, so let’s also run riemann-dash from a container

docker run --rm -ti --name riemann-dash -p 4567:4567 rlister/riemann-dash

you should see

INFO [2016-06-21 21:11:43,170] main - riemann.bin - PID 1
INFO [2016-06-21 21:11:43,477] clojure-agent-send-off-pool-3 - riemann.transport.websockets - Websockets server 0.0.0.0 5556 online
INFO [2016-06-21 21:11:43,564] clojure-agent-send-off-pool-4 - riemann.transport.tcp - TCP server 0.0.0.0 5555 online
INFO [2016-06-21 21:11:43,571] clojure-agent-send-off-pool-1 - riemann.transport.udp - UDP server 0.0.0.0 5555 16384 -1 online
INFO [2016-06-21 21:11:43,573] main - riemann.core - Hyperspace core online

open the dashboard

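If you’re using Docker on a Mac (I guess by now it’s the same everywhere), just go to http://127.0.0.1:4567. If you run Docker through docker-machine, you may need the VM’s IP instead (docker-machine ip shows it).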

we’ll come to editing the dashboard in a few moments, let’s put some data into riemann first

I’ve found some python script on the vast internet world and amended it a little so I can get random metrics (pseudo random, really) and set a different host (for later to test aggregation, etc.)

This is the script I use to pump some metrics to riemann, save it as rie.py

import random
import socket
import sys
import time
from functools import wraps

import bernhard


def wrap_riemann(metric, _host, client=None, tags=None):
    """Decorator that reports each call of the wrapped function to Riemann."""
    # Build the defaults here instead of sharing mutable default arguments.
    client = client or bernhard.Client()
    tags = tags or ['python']

    def riemann_decorator(f):
        @wraps(f)
        def decorated_function(*args, **kwargs):
            # Fall back to the local hostname when no host was given.
            host = _host or socket.gethostname()
            try:
                response = f(*args, **kwargs)
            except Exception as e:
                # Report the failure as a critical event, then re-raise it.
                client.send({'host': host,
                             'service': metric + ".exceptions",
                             'description': str(e),
                             'tags': tags + ['exception'],
                             'state': 'critical',
                             'metric': 1})
                raise

            # Fake the duration with a pseudo-random value so the graphs
            # have something to show; the real elapsed time is ignored.
            duration = random.randrange(0, 1000)
            print(duration)
            client.send({'host': host,
                         'service': metric + ".time",
                         'description': 'time metric',
                         'tags': tags + ['duration'],
                         'state': 'ok',
                         'metric': duration / 10})
            return response

        return decorated_function

    return riemann_decorator


# dummy data

# Address of the Riemann server; adjust to wherever your container is reachable.
riemann = bernhard.Client(host='192.168.99.101')


def stuff(host='host1'):
    @wrap_riemann('dummy', _host=host, client=riemann)
    def send_metric():
        time.sleep(1)

    send_metric()


if __name__ == '__main__':
    # Use the first argument as the host name; without one, the local
    # hostname is used instead of crashing with an IndexError.
    hostname = sys.argv[1] if len(sys.argv) > 1 else None
    while True:
        try:
            stuff(host=hostname)
        except bernhard.TransportError:
            print("Exception caught...")

Run it with

python rie.py host1 & 
python rie.py host2 &

It will output an Exception caught... when the connection to riemann didn't work.

So, riemann-dashboard is anything but intuitive, so here are some quick pointers for when you’re on a Mac

  • edit the current dashboard: CMD + Click on “Riemann” (or the bottom half where the help text is). This is a “view”, btw, and it’ll be highlighted (“darkened”). Now press e and you’ll see a pop-up with a drop-down that allows you to change what that part of the dashboard represents (a graph, text, title, log, etc.)
  • resize a “view”: when highlighted, press ALT + SHIFT and - or ALT + SHIFT and +
  • resize a “view” vertically: when highlighted, press SHIFT + CTRL and the left/right arrow
  • resize a “view” horizontally (halves it): when highlighted, press v
  • move “views”: when highlighted, use the arrow keys (left, right, up, down) and have some patience ;)
  • save the “dashboard”: when nothing is highlighted, press s

That’s roughly all I know about Riemann-Dash right now.. maybe all I (and maybe you?) ever need to know!

flot

Let’s create a flot that displays all our dummy.time metrics so we can see “something” (other than staring at the console log).

  • edit the bottom half of the dashboard by CMD + Click on the “Help” text, press e
  • from the drop-down menu, select flot
  • give it a title, “flot” might be useful here
  • in the query, type: service =~ "dummy.time"
  • that’s it, click on ‘Apply’, then hit [ESC]

within a few seconds, you should see lines appearing, and a legend in the top left with the “hostnames” that sent the events.

There are obviously other useful visualisations that riemann-dashboard has to offer, but I’ll let you experiment with those.

To quickly explain what I understood about these data queries: % seems to act like the regex wildcard .*, so you could use a wildcard query for our dummy.time service (to later see the 99th percentiles): service =~ "dummy.time %" instead of just service =~ "dummy.time". Remember this, we'll use it again later.
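Here is how I picture that % matching, as a small Python sketch. The like function is made up for illustration, and I'm assuming that % matches any run of characters while everything else is taken literally, and that percentile events will end up with service names such as "dummy.time 0.99".

```python
import re


def like(pattern, value):
    """True if value matches pattern, where % stands for 'any characters'."""
    # Escape the literal parts so that '.' in 'dummy.time' is not a regex dot.
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value) is not None


print(like("dummy.time %", "dummy.time 0.99"))
print(like("dummy.time", "dummy.time 0.99"))
print(like("dummy.time", "dummy.time"))
```

So without the %, only the plain dummy.time service matches; with it, anything that starts with "dummy.time " does.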