TL;DR: Copy and paste the gists below to add application monitoring to your Phoenix app.
Edited July 11, 2015
Updated the plug gist (fixed typo) and repo gist (adjusted to the latest ecto version).
Like many engineers in the Elixir community, I have a Ruby background. Rails used to be my go-to MVC framework.
About half a year ago I started using the amazing Phoenix web framework for Elixir. It lets you build super fast web applications, supports WebSockets out of the box (including native client libraries), and is so well designed that you are super productive using it.
Yet there is one big drawback compared to Rails: no SaaS monitoring tools. All my Rails apps report to either New Relic or skylight.io, and neither works with Erlang or Elixir (yet).
This is a big problem: metrics are extremely important. When something goes wrong, they are my first stop for finding out where the problem may lie. They also let me measure how fast I respond to requests, which is important for maintaining high quality. Most importantly, they give me an idea of my app’s health. Good metrics can prevent problems or even downtime by acting as an early warning system. Without metrics you are flying blind.
Custom monitoring for Phoenix
Last December I was in Berlin for Erlang Factory 2014, where I was lucky enough to attend a talk by Brian Troutwine from AdRoll. It convinced me that Exometer would be the perfect library for the job. You can find his talk here: http://www.erlang-factory.com/berlin2014/brian-troutwine
Exometer is an Erlang package that allows you to collect different kinds of stats. There are two ways to record metrics: (1) tell Exometer to update a value, or (2) have it call functions regularly itself. It uses ETS tables to store the metrics in the Erlang VM. Before exporting values it can aggregate them, which is crucial if you are recording thousands of values per second (like hit counts).
Exometer knows a couple of different metric types. Most of these types have counterparts in other metric engines and services. You can read more about them in the Exometer README. I’ll introduce the ones we will be using:
Gauge: This is by far the simplest one. It holds a single numeric value. You can update it repeatedly, but only the last value is used when exporting. This works well for metrics like memory usage or disk space, where we do not care about down-to-the-nanosecond values. As long as we aggregate regularly enough, the latest value is all we are interested in.
Histogram: Histograms are a collection of time-series data points. A good example for this is response times. Histogram values are all recorded and then aggregated. Common aggregations are minimum values, maximum values, averages and percentiles.
Spiral: This type is Exometer-specific. It lets you increase a value over time. When the value is reported, it returns the total and the metric resets. This metric type is great for metrics like hits per minute: you increase the value by 1 on every hit until it is exported.
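To make the three types concrete, here is a sketch of how each is updated through Exometer’s Erlang API. The metric names are hypothetical, and each metric must be defined before it is updated:

```elixir
# Hypothetical metric names; assumes the metrics were defined beforehand.
:exometer.update([:webapp, :mem_usage], 123_456)  # gauge: only the last value survives
:exometer.update([:webapp, :resp_time], 342)      # histogram: every value is aggregated
:exometer.update([:webapp, :hits], 1)             # spiral: sums up, resets after each report
```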
You have to define Exometer metrics before you can use them. There are two ways of creating metrics: (1) dynamic definition (using functions at runtime) or (2) static definition (using a config file). We will use the second, as we know beforehand what we want to collect.
Exometer consists of three parts: The metric types, the storing service and the exporters. I will explain later how to tie them together.
The true power of Exometer is that it is super robust and lives within the Erlang VM. Thus you can record as many metrics as you like without slowing your application down. You don’t want monitoring to take down your application.
StatsD is an agent that receives metric values over UDP and forwards them to other services; it acts as an intermediary. The metrics have to be formatted in a way StatsD understands. It performs aggregation as well.
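The wire format is simple text over UDP: each line is `name:value|type`, where the type is `c` for counters, `g` for gauges and `ms` for timings. As a sketch, here is how you could talk to a local agent straight from Elixir, assuming StatsD’s default port 8125:

```elixir
# Send one counter increment to a StatsD agent listening on localhost:8125.
# UDP is fire-and-forget, so the send succeeds even if no agent is running.
{:ok, socket} = :gen_udp.open(0)
:ok = :gen_udp.send(socket, '127.0.0.1', 8125, "my_awesome_app.webapp.hits:1|c")
:gen_udp.close(socket)
```

Exometer’s StatsD reporter produces exactly these kinds of lines for us, so we never have to build them by hand.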
There are many agents out there and some SaaS monitoring tools offer custom agents. It is possible to have one StatsD agent forward stats to another agent. That way you can use different monitoring tools at the same time.
DataDog is a SaaS that lets you send it stats and events and displays them in highly customisable graphs. You can create dashboards with graphs and gauges. Finally, you can define alerts, e.g. when your server’s CPU load gets too high or you perform too many DB queries per second.
Now that we know the ingredients, let us use them to get insights into your Phoenix app. First add the dependencies to your mix.exs. I used PSPDFKit’s forks as they work better with mix:
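As a sketch, the dependency section looks roughly like the following. The exact fork locations and version numbers here are my assumptions, so double-check them against the gist:

```elixir
# mix.exs -- fork locations and versions are assumptions, verify before use.
defp deps do
  [{:phoenix, "~> 0.14"},
   {:exometer, github: "PSPDFKit-labs/exometer"}]
end

# Exometer also has to be started with your app.
def application do
  [mod: {MyAwesomeApp, []},
   applications: [:phoenix, :logger, :exometer]]
end
```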
Defining the metrics and reports
We use an application config to configure Exometer. It defines the metrics we want to record and where we would like to report it. This is by far the hardest part. Let me show you a gist and then walk you through it:
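As a sketch of that config’s shape: the app name, metric names, intervals and Exometer option keys below are my assumptions and may differ between Exometer versions, so treat this as an outline rather than a drop-in file.

```elixir
# config/config.exs -- a sketch of the Exometer setup; names and option
# keys are assumptions, adjust them to your app and Exometer version.
app = :my_awesome_app
interval = 1_000                                  # report once per second
histogram_stats = [:min, :max, :mean, 95, 90]
memory_stats = [:total, :processes, :ets, :binary, :atom]

config :exometer,
  predefined: [
    # {name, type, options}
    {[:erlang, :memory],
     {:function, :erlang, :memory, [], :proplist, memory_stats}, []},
    {[app, :webapp, :active_sessions], :gauge, []},
    {[app, :webapp, :resp_time], :histogram, [time_span: interval]},
    {[app, :webapp, :hits], :spiral, [time_span: interval]}
  ],
  reporters: [
    exometer_report_statsd: [
      hostname: '127.0.0.1',
      port: 8125,
      # StatsD has no spiral type, so export it as a gauge.
      type_map: [{[app, :webapp, :hits], :gauge}]
    ]
  ],
  subscriptions: [
    {:exometer_report_statsd, [:erlang, :memory], memory_stats, interval},
    {:exometer_report_statsd, [app, :webapp, :resp_time], histogram_stats, interval},
    {:exometer_report_statsd, [app, :webapp, :hits], :one, interval}
  ]
```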
First we set some ‘constants’: What do you want to call your app? How frequently do you want to submit stats? Which histogram stats are you interested in? (I report minimums, maximums, averages, and the 95th and 90th percentiles.) Which memory stats would you like to record? The ensuing config consists of three parts:
The first part is the predefined metrics we want to use. Each metric is described by a tuple of three entries: a name (a list of atoms), a type and options. We use four different metric types. The first one is function, which instructs Exometer to use the return value of an Erlang function. The other three are gauges, histograms and spirals, as described above.
The second part defines reporters. Reporters send the metric values along to different services. One reporter that is helpful for debugging simply dumps the values to Lager. As you can see, we are using the StatsD reporter. It formats values in the StatsD format and sends them to a StatsD agent via UDP. The type_map tells it how to ‘translate’ types where needed. For example, the spiral metric type is unknown to StatsD, so we tell the reporter to export it as a gauge to achieve the effect outlined above. The name of a metric in StatsD is the name list joined with full stops (e.g. “my_awesome_app.webapp.resp_time”).
Lastly, we define reports. Reports tie metrics and reporters together: they define the intervals at which we send the collected metrics onward. I chose to report metrics every second.
Now as you can see we poll the Erlang VM metrics in our config. That’s super helpful. But we still need to record metrics like hit count and response time. If you’re using Ecto you might also want to record query count and execution times. Let’s start with hit counts and response times. I wrote a plug that does it:
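A sketch of what such a plug can look like; the module name and metric names are assumptions:

```elixir
defmodule MyAwesomeApp.Plugs.Metrics do
  # Records a hit and the response time for every request.
  @behaviour Plug
  import Plug.Conn, only: [register_before_send: 2]

  def init(opts), do: opts

  def call(conn, _opts) do
    start = :os.timestamp()

    register_before_send(conn, fn conn ->
      # Microseconds between receiving the request and sending the response.
      diff = :timer.now_diff(:os.timestamp(), start)
      :ok = :exometer.update([:my_awesome_app, :webapp, :resp_time], diff)
      :ok = :exometer.update([:my_awesome_app, :webapp, :hits], 1)
      conn
    end)
  end
end
```

Plug the module into your router or endpoint pipeline so it runs for every request.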
Elixir’s plugs make it super straightforward to record metrics like these. Please note that I am matching on the :ok I expect Exometer to return. That is a bit audacious, as it will result in a 500 error if recording the metric fails. Decide for yourself whether you want to do that.
Now for Ecto — again great design makes our lives super easy:
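A sketch of such a Repo follows. Be warned that the log callback’s signature changed between Ecto versions; this assumes the one-argument form that receives an Ecto.LogEntry after each query, and the metric names are assumptions:

```elixir
defmodule MyAwesomeApp.Repo do
  use Ecto.Repo, otp_app: :my_awesome_app

  # Overrides Ecto's default logging; the log callback's shape varies
  # across Ecto versions, so check the docs for the version you use.
  def log(entry) do
    :ok = :exometer.update([:my_awesome_app, :db, :query_count], 1)

    # query_time is nil for queries that were not timed.
    if entry.query_time do
      :ok = :exometer.update([:my_awesome_app, :db, :query_time], entry.query_time)
    end

    super(entry)
  end
end
```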
Ecto defines a log function you can override. The code we use to record execution times and query count is similar to what we use in the plug.
Using your metrics
There are many ways of using your metrics; the most common one is to visualise them. Since we are using Exometer and StatsD, we can export our metrics in a variety of formats. One of them is the format used by the most popular graphing solution, Graphite.
I used DataDog. It comes with its own StatsD agent that automatically sends the data to DataDog. The agent is straightforward to set up. Plus, DataDog provides alerts, as mentioned above.
The first thing you need to do is register and install the agent somewhere. The tutorial on the DataDog site is self-explanatory, just follow it.
Then you can use the metrics explorer to see which metrics we report. Build dashboards to your liking following the guides DataDog has on its homepage.
Currently I am only collecting the average response time. It might be interesting to record response times per method and path to identify the worst offenders. Thanks to Exometer’s dynamic metric definition interface that is not a problem; I just haven’t gotten around to doing it. The same goes for Ecto: recording which queries take the longest might be smart, but I haven’t gotten around to that yet.
Please feel free to post questions and feedback. This is the result of my first stab at metrics collection so there are likely things I could’ve done better.