Usage and performance monitoring with Graphite

A key to improving, speeding up and running your applications more smoothly is knowing how they run right now. In this article, we will look into how we at Socialbakers monitor and observe the performance of our microservices.



Graphite is a well recognized monitoring tool that basically does two things:
It stores time-series data in an efficient way (in its Whisper database).
It provides an interface for basic visualisation of the stored data, along with mathematical functions to sum/group/scale the data in real time.


StatsD is a simple service written in node.js that listens on a UDP port and forwards requests to Graphite (via its TCP plaintext protocol). Here are the reasons why we decided to use StatsD:

It works on UDP, so the process that sends metrics to Graphite doesn't need to wait for an ACK. That means a message might not be delivered, but we are completely OK with that: we can afford to lose a small number of requests in exchange for better CPU utilization, and it will not affect our system in any way. And if some counters aren't 100 % accurate (we are talking about millions per hour, so logging 998 000 instead of 1M), it's really not that important.

It aggregates data in memory during a so-called flushInterval and only then sends the data to Graphite. This can really reduce the amount of data transmitted over the network: if an event occurs once per second, instead of sending a metric with counter +1 sixty times, it sends one request with +60.

There is better node.js library support for communicating with StatsD than with raw Graphite.
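
To make the fire-and-forget nature concrete, here is a minimal node.js sketch of what sending metrics to StatsD over UDP looks like. The host, port and metric names are placeholders; in practice a node.js StatsD client library (as mentioned above) does this for you, raw sockets are shown only to illustrate the wire protocol.

// Minimal sketch: StatsD plaintext metrics over UDP (fire-and-forget, no ACK).
// STATSD_HOST and the metric names are placeholders.
const dgram = require('dgram');

const socket = dgram.createSocket('udp4');
socket.unref(); // don't keep the process alive just because of this socket
const STATSD_HOST = ''; // would be the StatsD (proxy) host
const STATSD_PORT = 8125;

function send(message) {
  const buf = Buffer.from(message);
  // No response is expected; a lost packet simply means a slightly lower counter.
  socket.send(buf, 0, buf.length, STATSD_PORT, STATSD_HOST);
}

send('example.requests:1|c');        // counter: +1, aggregated per flushInterval
send('example.response_time:42|ms'); // timer: one call took 42 ms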


Even though Graphite itself provides an interface for visualizing the data, it's not the best tool. For example, you can't draw a column chart, and once you get it to render a static image, it's not so simple to customize the visible metrics or time ranges, zoom into the chart or, for example, dynamically group the data into intervals (= req/h into req/30min). But what Graphite does provide is a JSON API, which has allowed other tools to be created. One of them is Grafana, a tool that lets you configure different dashboards with predefined metrics, metric groups, etc. Plus all the charts are dynamic: you can zoom them, select only specific line(s) or change the chart type.
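
That JSON API is Graphite's render endpoint: ask for a target with format=json and you get back the raw datapoints that Grafana (or anything else) can draw. A rough node.js sketch, where the graphite-web host and the target expression are placeholders:

// Sketch: fetching raw datapoints from Graphite's render API as JSON.
// The graphite-web host and the target expression are placeholders.
const http = require('http');

const url = 'http://graphite-web.example/render'
  + '?target=application.counters.example.requests.rate'
  + '&from=-1h&format=json';

http.get(url, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    // Response shape: [{ target: '...', datapoints: [[value, timestamp], ...] }]
    const series = JSON.parse(body);
    series.forEach((s) => console.log(s.target, s.datapoints.length, 'points'));
  });
}).on('error', console.error);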

Our setup

Because we really like scaling (and utilizing each and every resource), even StatsD itself can be scaled. On one host dedicated exclusively to StatsD, we run the StatsD Cluster Proxy and five instances of StatsD itself. As I mentioned above, StatsD optimizes network bandwidth, and this configuration helps us use all of the CPU cores on the StatsD server.

Next in the cascade are the Carbon daemons (Graphite itself consists of two parts: graphite-web and the backend Carbon daemon that persists the data). We use a Carbon relay (running on the same server as all of the StatsD processes) with two Carbon backends running on two separate machines. The reason for having so many things on one machine is simple: it is an AWS c3.2xlarge instance with 8 cores, so we have 5x StatsD, 1x Proxy, 1x Relay and 1 core still free for the OS.

The configuration files are really simple:

{
  nodes: [
    {host: '', port: 8127, adminport: 8128},
    {host: '', port: 8129, adminport: 8130},
    {host: '', port: 8131, adminport: 8132},
    {host: '', port: 8133, adminport: 8134},
    {host: '', port: 8135, adminport: 8136}
  ],
  udp_version: 'udp4',
  host: '',
  port: 8125,
  forkCount: 5,
  checkInterval: 1000,
  cacheSize: 20000
}

{
  graphitePort: 2003,
  graphiteHost: "localhost",
  debug: false,
  dumpMessages: false,
  port: <%= @port %>,
  mgmt_port: <%= @mgmt_port %>,
  flushInterval: 60000,
  deleteIdleStats: true,
  graphite: {
    legacyNamespace: false,
    globalPrefix: "application"
  }
}


LOCAL_DATA_DIR = /opt/graphite/storage/whisper/

As you can see, the StatsD Proxy distributes everything across the local StatsD instances, StatsD itself sends everything to "localhost" (the Carbon relay) and finally the Carbon relay forwards the messages to the carbon{1,2} hosts.
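
For completeness, the relay side of carbon.conf essentially boils down to a list of destinations. A sketch of what such a section can look like; the relay method, ports and exact hostnames here are illustrative rather than our production values:

[relay]
LINE_RECEIVER_INTERFACE =
LINE_RECEIVER_PORT = 2003
RELAY_METHOD = consistent-hashing
DESTINATIONS = carbon1:2004, carbon2:2004

With consistent hashing, each metric name always lands on the same backend, so every Whisper file lives on exactly one of the Carbon hosts.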

And the last thing: how we keep the stored metrics over time (= Carbon retention periods):

[carbon]
pattern = ^carbon\.
retentions = 60:90d

[application]
pattern = ^application\.
retentions = 60s:24h,5m:90d,60m:5y
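
These retentions also determine how big each Whisper file is. Whisper stores roughly 12 bytes per datapoint, so for the application schema a quick back-of-the-envelope calculation (a sketch, ignoring the small archive headers) looks like this:

// Rough size of one Whisper file with retentions "60s:24h,5m:90d,60m:5y".
// Whisper stores ~12 bytes per datapoint (4-byte timestamp + 8-byte value).
const points =
    (24 * 60)        // 24h at 60s  ->  1 440 points
  + (90 * 24 * 12)   // 90d at 5m   -> 25 920 points
  + (5 * 365 * 24);  // 5y  at 60m  -> 43 800 points

console.log(points, 'points ≈', Math.round(points * 12 / 1024), 'KiB'); // ≈ 834 KiB

That is where the "approx. 800 kB" per-file figure used in the storage estimate further down comes from.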

Monitoring of HERA

So now that the setup of our monitoring platform is clear, we can move on to how we use it in real life.

Basically, we have 3 sources of our metrics:

Supervisor running on each application server reports metrics about the processes (workers, brokers, ...): uptime, CPU, RSS.

Brokers report the number of calls on each endpoint, response times, errors, timeouts, ...

Workers themselves report just one metric: the event loop delay. It is the only metric that is difficult to measure (in production), difficult to debug and that can really affect the application in a negative way. We would really like to get rid of this direct reporting from worker to StatsD (cleaner architecture, fewer dependencies, ...), but reporting this metric to the broker and then re-sending it to StatsD would add too much overhead, so we have postponed it for now.

Supervisor metrics

Supervisor reporting is done simply by Sensu checks emitting gauge metrics (in Graphite terminology).

The Sensu CPU check is a really simple bash script:

# Reports per-process CPU usage of all supervisor-managed apps in Graphite plaintext format.
SCHEME=server.$(hostname -s).supervisor
# "name|pid" pairs of all RUNNING processes under supervisor
APPS=$( supervisorctl status | awk '/RUNNING/ { gsub(/,/, ""); print $1 "|" $4}' )
TIMESTAMP=$( date +%s )
for APP in $APPS; do
  PID=$( awk -F'|' '{print $2}' <<< $APP )
  # replace "~" in process names with "_-_" so Graphite does not strip it
  NAME=$( awk -F'|' '{ gsub (/~/, "_-_") ; print $1}' <<< $APP )
  CPU_PARENT=$( ps --pid $PID -o pcpu --no-headers )
  CPU_CHILDREN=$( ps --ppid $PID S -o pcpu --no-headers | awk '{s+=$1} END {print s+0}')
  # prepend a leading zero to values between 0 and 1 (bc would print ".5" otherwise)
  CPU=$( echo "x=$CPU_PARENT + $CPU_CHILDREN; if ( x>0 && x<1 ) print 0; x" | bc )
  echo $SCHEME.$NAME.cpu_usage $CPU $TIMESTAMP
done

As you can see, the metric name contains the app server hostname, the application name and the metric itself. This allows us to render a chart like this:

You can see that the builder app consists of two different processes (backend and daemons), each deployed on three servers: app{2,3,4}. The definition of this metric is simple: they are gauges, one per process per server, and we don't need any post-processing:

aliasSub(*.supervisor.builder-new-suite_-_backend.cpu_usage, '*).supervisor.(.*).cpu_usage', '\1 \2')
We are creating an alias to transform metric names like …app{3,4}.supervisor.builder_-_process.cpu_usage into something more readable like app4 builder-daemons. That weird symbol combination _-_ is there because supervisor sets process names like builder~daemon, and the ~ character would get stripped when sent to Graphite. That's why we change it to _-_ in the Sensu script above.

Broker metrics

Brokers report the most metrics of all the HERA entities. Each request that goes through our API platform gets logged: counts (total/error/timeout) per endpoint, its response time, and the caller (who originally initiated the request).
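
In the broker this boils down to one timer and a few counters around every dispatched request. A simplified sketch of the idea using the hot-shots StatsD client; the prefix, metric names and the product/outcome dimensions are assumptions for illustration, not our exact naming:

// Sketch of broker-side reporting: counters and response-time timers per endpoint.
// Client, prefix and metric names are illustrative, not our exact implementation.
const StatsD = require('hot-shots');

const statsd = new StatsD({
  host: '',              // placeholder: the StatsD Cluster Proxy host
  port: 8125,
  prefix: 'sbks.hera.requests.',  // StatsD itself prepends the global "application" prefix
});

function reportCall(endpoint, product, durationMs, outcome) {
  // total / error / timeout counts per endpoint and product
  statsd.increment('endpoints.count.' + endpoint + '.' + product + '.' + outcome);
  // response time per endpoint and product (StatsD derives upper_90, mean, ...)
  statsd.timing('endpoints.time.' + endpoint + '.' + product, durationMs);
}

// e.g. after a request handled by the broker finishes:
reportCall('POST_0_metricsapi', 'analytics', 123, 'total');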

About the caller: as we mentioned, we have a use case where one API worker needs data from another worker (and this might get chained even deeper). And all those chained requests (we will hopefully cover them in one of the next articles) begin with a real customer sitting behind their computer, or with a daemon generating some report for that specific customer. Either way, we are always doing something for our clients; we don't cross-call our services just because we can... :) We originally wanted to track how much each of our customers is "using" our platform from each product (= how much of our servers' capacity each client consumes through Analytics, through Builder, etc.), since the Sales & Marketing guys might be interested in such reports.

So our metric naming convention would be something like:


We quickly found out that such detailed logging is impossible:
Let's say we have 1000 endpoints, 5 actively used products (as we recognize them on HERA) and 3000 accounts (customers). One Whisper file is approx. 800 kB large (given our retention periods mentioned above), and for the timer metrics Carbon generates 14 wsp files. That means: 1000 × 5 × 3000 × 800 kB × 14 ≈ 156 TB. Kind of a lot of space for just "logging" :). When we gave up the per-customer resolution, the disk usage went down to some 53 GB.

So the most important metrics the brokers report are the response-time timer for each endpoint (per product) and the total request counter per product. This allows us to draw charts like:

Request count vs response time
alias(hitcount(sumSeries(*.rate), '$interval'), "Total requests")
alias(averageSeries(application.timers.sbks.hera.requests.endpoints.time.POST_0_metricsapi.*.upper_90), "Response upper 90%")

You can see that in the first metric (left Y-axis), we are summing all the series (per product) to get the total count. Then we apply Graphite's hitcount function to get the counts per some larger interval (the rate itself is "per second"). On the right Y-axis, there are two timer metrics, and you obviously want to average them per product.

Or another chart, this time with the per-product visualization:

Request count per product

A nice thing to mention is showing the top 10 most "calling" applications and grouping the rest into "Others". We use Grafana's templating functionality for this (as with the hitcount grouping in the previous example).

aliasByNode(hitcount(highestAverage(groupByNode(*.rate, 8, "sumSeries"), $topN), '$interval'), 0)
alias(hitcount(diffSeries(sumSeries(*.rate), highestAverage(groupByNode(*.rate, 8, "sumSeries"), $topN)), '$interval'), "Others")

The first metric uses the highestAverage function to show only the top N series. The second one takes the sum of all series and subtracts the top N from it, which gives us the "Others" line.

Worker metrics

As mentioned in the beginning, each worker sends just one timer: its event loop delay. Nothing different from the previous examples.
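
Measuring the event loop delay is less obvious than the other metrics, though. One common approach (sketched here with a placeholder metric name and client; the exact technique in our workers may differ) is to schedule a repeating timer and report how late it actually fires:

// Sketch: event loop delay measured as how late a 1s timer fires.
// The StatsD client, host and metric name are placeholders.
const StatsD = require('hot-shots');
const statsd = new StatsD({ host: '', port: 8125 }); // placeholder host

const INTERVAL = 1000; // ms

let last =;
setInterval(() => {
  const now =;
  // When the loop is blocked, the callback fires later than scheduled;
  // the extra time is the event loop delay for this interval.
  const delay = Math.max(0, now - last - INTERVAL);
  statsd.timing('worker.event_loop_delay', delay);
  last = now;
}, INTERVAL);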


Past experience with debugging and monitoring has taught us that it is very important to have flexible and detailed monitoring. While metrics like timeouts per endpoint, average response time and total request count show us the general health of our HERA platform, there are also per-API metrics like event loop delay and per-endpoint response times / errors / timeouts which help us analyze what is happening in different parts of HERA. Together with Kibana (we also log the full request body and response of all calls that fail with an error into Kibana, but that's a topic for another article), we are able to see what caused a problem, and with token-based authorization we can quickly tell who could have caused it or who is affected.
