metronome


Abstract

This article describes metronome, a library for working with Graphite metrics in Erlang, and compares it with existing libraries: synrc/mtx, boundary/folsom, campanja/folsomite and feuerlabs/exometer. It also shows how to use the library as part of a complete Erlang project.


Introduction

In the life cycle of most projects there comes a time when writing log files is no longer enough: you also need to instrument the code and business logic, collect various metrics at runtime and present them graphically. The advantages of this approach are obvious: the human eye perceives visual information much more easily than text. Charts come in many forms (line, bar, pie, etc.), they make it easy to follow a trend, and they can cover a long period of time and be scrolled both forward and backward, which log files, of course, do not allow.

Collecting metrics and building charts is only half the battle; the main thing is to do it right. It is important to understand the nature of each metric in order to choose the correct aggregation function and, on that basis, to display the metric properly. According to Coda Hale, there are five types of metrics: counters, meters, gauges, histograms and timers. This article focuses on the first three.

  • counters: ordinary counters that represent a simple increment or decrement of a scalar value. As an abstraction, imagine a talking coffee machine that has lost its memory: all it remembers is, say, the last 5 seconds and the number of cups of fragrant coffee it filled during that time. As a result, everything the talking coffee machine produces (apart from coffee, of course) is phrases like “0 cups of coffee poured, 1 cup of coffee poured, 0 cups of coffee poured, 1 cup of coffee poured!” and so on. Given the nature of this type of metric, the best function for aggregating it over a time interval, for example 60 seconds, is the sum.
  • meters: the same kind of counter, but with a monotonically increasing scalar value. An intuitive picture is a water meter, whose value is measured in cubic meters and grows constantly as water is consumed. Or it is the same talking coffee machine, now repaired and with a good memory: besides making coffee, it remembers the number of cups filled since it was plugged in. As a result, the coffee machine says: “0 cups of coffee poured, 0 cups of coffee poured, 1 cup of coffee poured, 2 cups of coffee poured, …, 100 cups of coffee poured!”. After the machine is switched off and on again: “0 cups of coffee poured, 1 cup of coffee poured, 2 cups of coffee poured, …, 15 cups of coffee poured…”. The proper function depends on how the value is used; for example, to learn how many cubic meters of water are consumed during a day, it is enough to take the derivative.
  • gauges: an instantaneous measurement tied to a certain point in time. An intuitive picture is the speedometer in a car. Suppose the speedometer reading is written down every 5 seconds: at time T1 the speed is 30 km/h and at time T2 it is 50 km/h. Two seconds later the speed may well rise to 70 km/h, but if three seconds after that, at time T3, it is back at 50 km/h, then 50 km/h is exactly what will be recorded. Picking the correct function for this type of metric is no big deal: it is enough to calculate the average or the median over a time interval.
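The aggregation functions suggested for the three metric types can be sketched in a few lines of Erlang. This is only an illustration: the module name metric_agg and its functions are hypothetical and are not part of any of the libraries discussed here, and treating a drop in a meter reading as a restart from zero is an assumption taken from the coffee machine example.

```erlang
-module(metric_agg).
-export([counter_sum/1, meter_deltas/1, gauge_mean/1, gauge_median/1]).

%% Counters: each sample is the number of events in its own short window,
%% so a longer interval is aggregated by summing the samples.
counter_sum(Samples) ->
    lists:sum(Samples).

%% Meters: samples grow monotonically, so the consumption between two
%% readings is their difference (a discrete derivative). A drop in value is
%% treated as a restart from zero (an assumption), so the new reading
%% itself becomes the delta.
meter_deltas([]) ->
    [];
meter_deltas([First | Rest]) ->
    {Deltas, _Last} =
        lists:mapfoldl(fun(Cur, Prev) when Cur >= Prev -> {Cur - Prev, Cur};
                          (Cur, _Prev) -> {Cur, Cur}
                       end, First, Rest),
    Deltas.

%% Gauges: instantaneous readings are summarised by the mean or the median.
gauge_mean([_ | _] = Samples) ->
    lists:sum(Samples) / length(Samples).

gauge_median([_ | _] = Samples) ->
    Sorted = lists:sort(Samples),
    N = length(Sorted),
    case N rem 2 of
        1 -> lists:nth(N div 2 + 1, Sorted);
        0 -> (lists:nth(N div 2, Sorted) + lists:nth(N div 2 + 1, Sorted)) / 2
    end.
```

For the meter readings from the coffee machine example, metric_agg:meter_deltas([0, 0, 1, 2, 100]) returns [0, 1, 1, 98], and for the speedometer readings metric_agg:gauge_median([30, 50, 50]) returns 50.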

Motivation

At a certain point (about a year and a half ago) there was a need to instrument the code and business logic, and also to collect various metrics describing the behaviour of the Erlang virtual machine, such as memory consumption, process count, garbage collection and context switching. As a quick solution, a library was developed that consists of a collector of predefined metrics based on arbitrary user functions and a client that sends metrics to the system responsible for storing, processing and displaying them as charts (hereinafter the target system). The solution was meant to be temporary: it was assumed that it would later be replaced with boundary/folsom together with a reliable client sending metrics to the target system over TCP. Over time, however, the library was integrated into several internal Erlang projects and coped with its tasks successfully. Some time ago business requirements were placed on the library's functionality: it has to accept user metrics at runtime and, depending on the type of the metric, accumulate values over time and transfer them to the target system. Technical requirements follow from the business ones. The first is a reliable client that delivers metrics to the target system over TCP. If the graphs built from the sent metrics only need to show a trend, reliability of delivery can be neglected to some extent. If, on the other hand, the trend has to be as close to reality as possible and the error minimal, reliability of delivery cannot be neglected. The second technical requirement follows from the business requirement to accumulate metrics: aggregation of the accumulated metrics over a certain time interval. This requirement becomes highly relevant when the number of servers exceeds 100 and a simple Graphite server dies under the load.
And the last requirement: after transfer to the target system, metrics have to be invalidated automatically, but only if the data were delivered without errors.

Now it is necessary to understand how networking is implemented in Erlang and why this matters. Neither the graphite nor the statsd protocol provides delivery guarantees of its own: both are text protocols that run on top of TCP/IP and rely on it entirely. In Erlang, networking is implemented through a network port driver (hereinafter the network driver). The network driver establishes a connection to the remote side and works directly with a socket, which has its own send and receive queues. Work with the network driver goes through the prim_inet module, using the send function for TCP and sendto for UDP. After handing data to the network driver, the function blocks and waits for an {inet_reply, S, Status} message. If no send timeout was specified when the connection was set up, the function can block forever, waiting for a message that will never come. This can happen, for example, if the remote side stops reading data from its network buffers. First, the receive buffer on the remote side overflows. Then TCP flow control throttles the sender by “closing the window”, that is, by setting the window size to 0. After that, data start to accumulate in the send buffer on the sending side. Once the send buffer is full, the network driver no longer replies with {inet_reply, S, Status} after data are handed to it for sending. The result is data loss, because the message that was passed to the driver has “settled” in its network buffer and will probably never be sent.

The only solution is more careful control of the network connection. First of all, a timeout for sending data to the socket has to be set with the option {send_timeout, TimeoutInMs}. Then, if prim_inet:send blocks for any reason, in most cases the function will return {error, timeout} once the specified timeout expires. Two more options have to be set: {delay_send, false} and {nodelay, true}. The first tells the driver not to queue data but to send them to the socket as soon as possible. The second is a socket option that turns on TCP_NODELAY, which makes the socket transmit data to the remote side immediately (even small amounts) without buffering. Setting up the connection this way considerably reduces the probability of losing data on the way to the target system.
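A connection set up with these options might look roughly like the sketch below. The module graphite_send_sketch and its functions are hypothetical and are not metronome's actual code. The sketch assumes the Graphite plaintext format, one "path value timestamp" line per metric, and it closes the socket after any failed send so that the caller can reconnect and retry later.

```erlang
-module(graphite_send_sketch).
-export([connect/3, send_metric/4, format_line/3]).

%% Open a TCP connection with the options discussed above.
connect(Host, Port, SendTimeoutMs) ->
    Opts = [binary,
            {active, false},
            {send_timeout, SendTimeoutMs}, % a blocked send returns {error, timeout}
            {delay_send, false},           % do not queue data in the driver
            {nodelay, true}],              % TCP_NODELAY: send small packets at once
    gen_tcp:connect(Host, Port, Opts, SendTimeoutMs).

%% One line of the Graphite plaintext protocol: "path value timestamp\n".
format_line(Name, Value, Timestamp) when is_number(Value), is_integer(Timestamp) ->
    iolist_to_binary([Name, $\s, io_lib:format("~w", [Value]), $\s,
                      integer_to_binary(Timestamp), $\n]).

%% Send one metric. On any error, including a send timeout, the socket is
%% closed and the error is returned, so the caller knows the metric must be
%% kept and sent again over a new connection.
send_metric(Socket, Name, Value, Timestamp) ->
    case gen_tcp:send(Socket, format_line(Name, Value, Timestamp)) of
        ok ->
            ok;
        {error, _Reason} = Error ->
            gen_tcp:close(Socket),
            Error
    end.
```

With this sketch, graphite_send_sketch:format_line(<<"foo.bar.baz">>, 2, 1443528270) produces the binary <<"foo.bar.baz 2 1443528270\n">>.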


Comparison

Below is a short overview of the strengths and weaknesses of popular libraries for working with metrics in Erlang.

synrc/mtx:

The library delivers metrics over UDP, which in itself does not guarantee reliable delivery. Delivery is not guaranteed at the library level either: when there are problems sending data, the process crashes and is restarted by its supervisor (the “let it crash” strategy), so data are lost in both cases. Sending a metric to mtx uses an asynchronous gen_server:cast call, which results in the data being passed to the network driver immediately. The library provides no way to accumulate data. Sending a large number of metrics, for example more than 1,000 per second, can cause the library code to block (and, as a result, its message queue to grow) in a call to gen_udp:send, which waits for a reply from prim_inet:sendto, which in turn does a selective receive waiting for a message from the network driver. If the message queue of the mtx process is large, this can affect the performance not only of the library but of the whole application.

pros:

  • Supports all types of metrics (counters, meters, gauges, histograms, timers);
  • Supports the statsd protocol.

cons:

  • Uses UDP as the transport;
  • No guarantee of correct data delivery at the library level;
  • No support for accumulating metrics and sending them at a time interval;
  • No support for predefined metrics based on arbitrary user functions;
  • No support for collecting statistics of the Erlang virtual machine.

boundary/folsom:

The library fully supports all types of metrics (counters, meters, gauges, histograms, timers) and can accumulate them over time. However, it has no transport at all: selecting metrics and transferring them to the target system is expected to be implemented by third-party libraries or by the developer. As a consequence of this model of working with metrics, the library does not support automatic invalidation of metrics.

pros:

  • Supports all types of metrics (counters, meters, gauges, histograms, timers);
  • Supports accumulation of metrics;
  • Supports collecting statistics of the Erlang virtual machine.

cons:

  • No built-in graphite/statsd client;
  • No support for automatic invalidation of metrics;
  • No support for predefined metrics based on arbitrary user functions.

campanja/folsomite:

This library is just a wrapper over boundary/folsom. Metrics are delivered over TCP, on top of which the text graphite protocol is implemented. The accumulated metrics are sent with an asynchronous gen_server:cast call. Here, too, the library code can block in a call to gen_tcp:send, which waits for a reply from prim_inet:send, which in turn does a selective receive waiting for a message from the driver, because the library does not set a send timeout when it establishes the connection. If sending fails, the library terminates the process in whose context the connection to the target system was established. However, thanks to the architecture of boundary/folsom, data that were handed to the network driver will not be lost and will be sent on the next iteration, after the time interval. The library does not provide predefined metrics based on arbitrary user functions.

pros:

  • Supports all types of metrics (counters, meters, gauges, histograms, timers);
  • Supports accumulating metrics and sending them at a time interval;
  • Supports collecting statistics of the Erlang virtual machine;
  • Uses TCP as the transport;
  • Supports the graphite protocol.

cons:

  • No guarantee of correct data delivery at the library level;
  • No support for automatic invalidation of metrics;
  • No support for predefined metrics based on arbitrary user functions.

feuerlabs/exometer:

The library has a large set of features, supports all types of metrics (counters, meters, gauges, histograms, timers) and provides an interface for working with metrics from the boundary/folsom library. Metrics are delivered over TCP, on top of which the text graphite and statsd protocols are supported. Like the other libraries considered, it does not set a send timeout, so the library code can block in a gen_tcp:send call. If sending fails, the library attempts to re-establish the connection; if a new connection is established, the library behaves as if the data had been sent successfully, which is very strange. The library does not provide automatic invalidation of sent metrics.

pros:

  • Supports all types of metrics (counters, meters, gauges, histograms, timers);
  • Supports accumulating metrics and sending them at a time interval;
  • Supports collecting statistics of the Erlang virtual machine;
  • Uses TCP as the transport;
  • Supports the graphite and statsd protocols;
  • Supports predefined metrics based on arbitrary user functions.

cons:

  • No guarantee of correct data delivery at the library level;
  • No support for automatic invalidation of metrics.

Since none of the libraries considered met all of the requirements, it was decided to add the required functionality to our own library and to maintain it from then on.

juise/metronome:

The library supports three types of metrics: counters, meters and gauges. Metrics are delivered over TCP, on top of which the text graphite protocol is implemented. The connection to the target system is established following the recommendations given in the “Motivation” section. After data are sent successfully, the library automatically invalidates the sent metrics; metrics that were not sent will be sent on the next iteration, after the time interval. Statistics of the Erlang virtual machine are collected with the mechanism of predefined metrics based on arbitrary user functions.

pros:

  • Supports accumulating metrics and sending them at a time interval;
  • Supports collecting statistics of the Erlang virtual machine;
  • Uses TCP as the transport;
  • Supports the graphite protocol;
  • Supports predefined metrics based on arbitrary user functions;
  • Guarantees correct data delivery at the library level;
  • Supports automatic invalidation of metrics.

cons:

  • feedback is welcome!

Architecture and use cases

The core of the library consists of three modules: metronome, metronome_core and metronome_graphite. The metronome module provides the basic interface for working with metrics: adding, updating, viewing and removing them. It also provides a set of functions for collecting statistics of the Erlang virtual machine. When a metric is added, the time corresponding to the beginning of the current time interval is recorded together with its name and value. If the metric named <<"foo.bar.baz">> is updated several times during one time interval, its value is simply updated; otherwise a new metric is added with the time corresponding to the beginning of the new interval (see the example below). After a metric is added, it goes into an internal ets table that lives in the context of the metronome_core module. The metronome_core module prepares the metrics: once per time interval it automatically collects the predefined metrics, applies the user-defined inline substitutions to the accumulated metrics and passes them to the metronome_graphite module for sending to the target system. The metronome_graphite module is the transport that sends metrics to the target system over TCP. If the data are sent successfully, the metronome_core module invalidates the sent metrics. If an error occurs while sending, the metronome_graphite module re-establishes the connection to the target system, and the metrics that failed to be sent will be sent on the next iteration, after the time interval.
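The two key ideas of this architecture are keying metrics by the start of their time interval and invalidating them only after a successful send. The sketch below illustrates both. It is not the library's implementation: the module interval_store_sketch, the table layout and the SendFun callback are assumptions made for the example, the period is given in seconds rather than milliseconds, and the current time is passed in explicitly to keep the example deterministic. It also assumes that a single process updates and flushes the table.

```erlang
-module(interval_store_sketch).
-export([new/0, update_counter/4, flush/2]).

%% Create the table; keys are {Name, IntervalStart} pairs.
new() ->
    ets:new(metrics_sketch, [set, public]).

%% Beginning of the interval that NowSec falls into, for a period in seconds.
interval_start(NowSec, PeriodSec) ->
    NowSec - (NowSec rem PeriodSec).

%% Updates within one interval hit the same key and are summed; the first
%% update in a new interval creates a new entry. Returns the new value.
update_counter(Tab, Name, Incr, {NowSec, PeriodSec}) ->
    Key = {Name, interval_start(NowSec, PeriodSec)},
    ets:update_counter(Tab, Key, Incr, {Key, 0}).

%% Hand every accumulated metric to SendFun and delete (invalidate) it only
%% when SendFun returns ok; failed entries stay in the table for the next
%% flush. Returns {SentCount, KeptCount}.
flush(Tab, SendFun) ->
    lists:foldl(
      fun({Key, _Value} = Metric, {Sent, Kept}) ->
              case SendFun(Metric) of
                  ok ->
                      ets:delete(Tab, Key),
                      {Sent + 1, Kept};
                  {error, _Reason} ->
                      {Sent, Kept + 1}
              end
      end, {0, 0}, ets:tab2list(Tab)).
```

With a 10-second period, two updates at 1443528270 and 1443528275 land in the same entry, while an update at 1443528390 creates a second one, which mirrors the two #metric records shown in the shell session below.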

To create or update a metric, call the update/3 function of the metronome module with the metric name (Name), value (Value) and type (Type: counter, meter or gauge), or use the alias functions update_counter/2, update_meter/2 or update_gauge/2, which take the same parameters except the type:

1> metronome:update(<<"foo.bar.baz">>, 1, counter).
true
2> metronome:update_counter(<<"foo.bar.baz">>, 1).
true

and after 120 seconds

3> metronome:update_counter(<<"foo.bar.baz">>, 1).
true

To view the values of the metric <<"foo.bar.baz">>, call the get/2 function of the metronome module with the metric name (Name) and type (Type: counter, meter or gauge), or use the alias functions get_counter/1, get_meter/1 or get_gauge/1, which take the same parameters except the type:

1> metronome:get(<<"foo.bar.baz">>, counter).
[#metric{name = {<<"foo.bar.baz">>, 1443528270}, value = 2, type = counter},
 #metric{name = {<<"foo.bar.baz">>, 1443528390}, value = 1, type = counter}]
2> metronome:get_counter(<<"foo.bar.baz">>).
[#metric{name = {<<"foo.bar.baz">>, 1443528270}, value = 2, type = counter},
 #metric{name = {<<"foo.bar.baz">>, 1443528390}, value = 1, type = counter}]

To view all accumulated metrics of a certain type, call the get/1 function of the metronome module with the metric type (Type: counter, meter or gauge), or use the alias functions get_counter/0, get_meter/0 or get_gauge/0:

1> metronome:get(counter).
[#metric{name = {<<"foo.bar.baz">>, 1443528270}, value = 2, type = counter},
 #metric{name = {<<"foo.bar.baz">>, 1443528390}, value = 1, type = counter},
 #metric{name = {"foo.bar.baz", 1443528780}, value = 3, type = counter},
 #metric{name = {<<"foo.bar">>, 1443528780}, value = 9, type = counter}]

To remove the metric named <<"foo.bar.baz">>, call the delete/2 function of the metronome module with the metric name (Name) and type (Type: counter, meter or gauge), or use the alias functions delete_counter/1, delete_meter/1 or delete_gauge/1. The function returns the number of removed metrics. To remove all metrics of a given type, call the delete/1 function of the metronome module with the metric type (Type: counter, meter or gauge):

1> metronome:delete(<<"foo.bar.baz">>, counter).
2
2> metronome:delete(counter).
2
3> metronome:delete(meter).
0

A sample basic metronome configuration (sys.config):

{metronome, [
    {period, 10000},
    {inline, [{"%local%", "local"},
              {"%global%", "global"}]},
    {graphite_host, "127.0.0.1"},
    {graphite_port, 2003},
    {predefined, [
        {"%local%.%node%.erlang.memory.ets.gauge",
         {erlang, memory, [], ets}, gauge},

        {"%local%.%node%.erlang.processes.gauge",
         {erlang, system_info, [process_count]}, gauge},

        {"%global%.%node%.erlang.gc.meter",
         {metronome, system_status, [garbadge_collection], gc_count}, meter},

        {"%local%.%node%.erlang.lhttpc.conn.gauge",
         {myapp, lhttpc_connections_cnt, []}, gauge},

        {"%local%.%node%.erlang.cowboy.conn.gauge",
         {fun() -> ranch_server:count_connections(http) end}, gauge}
    ]}
]}

Here the “period” parameter defines the time interval (in milliseconds) at which metronome transfers the accumulated metrics to the target system. The “graphite_host” and “graphite_port” parameters define the host and port of the target system respectively. The “inline” parameter allows arbitrary user substitutions in metric names. To substitute the host name, use the built-in substitution %node%. The “predefined” parameter defines metrics whose values are gathered automatically by arbitrary user functions at the interval given by the “period” parameter. An entry in “predefined” has one of the following formats:

{Name, {F}, Type}
{Name, {F, E}, Type}

or

{Name, {M, F, A}, Type}
{Name, {M, F, A, E}, Type}

where Name is the metric name, M is the module, F is the function, A is the list of arguments, and E is the proplist key whose value should be used if function F returns a proplist.
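To make the four formats and the name substitutions more concrete, here is a minimal sketch of how one predefined entry could be resolved. It is an illustration only: the module predefined_sketch and its functions are hypothetical, and the way metronome itself expands %node% and evaluates the function is not shown in this article. The sketch relies on string:replace/4, which is available in OTP 20 and later.

```erlang
-module(predefined_sketch).
-export([expand_name/2, collect/1]).

%% Replace every placeholder (for example "%local%") in Name with its value.
%% Substitutions is a list of {Placeholder, Value} string pairs.
expand_name(Name, Substitutions) ->
    lists:foldl(
      fun({Placeholder, Value}, Acc) ->
              lists:flatten(string:replace(Acc, Placeholder, Value, all))
      end, Name, Substitutions).

%% Evaluate a function description in any of the four documented shapes.
collect({F}) when is_function(F, 0) ->
    F();
collect({F, E}) when is_function(F, 0) ->
    %% F returns a proplist; take the value stored under key E.
    proplists:get_value(E, F());
collect({M, F, A}) when is_atom(M), is_atom(F), is_list(A) ->
    apply(M, F, A);
collect({M, F, A, E}) when is_atom(M), is_atom(F), is_list(A) ->
    %% M:F(A...) returns a proplist; take the value stored under key E.
    proplists:get_value(E, apply(M, F, A)).
```

For example, predefined_sketch:collect({erlang, memory, [], ets}) calls erlang:memory() and returns the number of bytes stored under the ets key, and expand_name("%local%.%node%.erlang.memory.ets.gauge", [{"%local%", "local"}, {"%node%", "myhost"}]) returns "local.myhost.erlang.memory.ets.gauge", where "myhost" is a made-up host name.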

For example, the lhttpc_connections_cnt function in the myapp module may look like this:

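lhttpc_connections_cnt() ->
    Children = supervisor:which_children(whereis(lhttpc_sup)),
    Pids = [Pid || {_, Pid, _, _} <- Children],
    %% Sum the sizes of the ets tables referenced by the child processes.
    lists:foldl(fun(Pid, Conns) ->
                        %% The 7th element of the process state is used as an ets table id.
                        try element(7, sys:get_state(Pid)) of
                            TableId ->
                                ets:info(TableId, size) + Conns
                        catch
                            _C:_R ->
                                %% Skip children whose state cannot be read.
                                Conns
                        end
                end, 0, Pids).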
