Aggregating Everything with collectd

G Gordon Worley III
AdStage Engineering
Dec 13, 2016

At AdStage, the Site Reliability team collects lots of system metrics with collectd: CPU stats, disk stats, context switch stats, network stats, process stats, even collectd stats. This generates a lot of time series data for us to aggregate, so much that we overloaded the then-current capacity of Sumo Logic metrics to aggregate our thousands of time series, and we had to switch to a tiered collectd setup.
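
On each host, the first tier just loads the usual plugins. A minimal sketch of what that part of collectd.conf might look like (this plugin list is illustrative, not our exact config):

## first tier
LoadPlugin cpu
LoadPlugin df
LoadPlugin disk
LoadPlugin interface
LoadPlugin contextswitch
LoadPlugin processes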

In a tiered collectd setup, collectd runs on each host, be it a container, virtual machine, or bare metal, and instead of shipping its data directly to Graphite (or in our case Sumo Logic), each instance sends its data to another collectd process. This central collectd process then aggregates the raw data before writing it out. It looks something like this:
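
Roughly (host names here are placeholders):

host-1 collectd --\
host-2 collectd ---+--> second-tier collectd (aggregation) --> Graphite / Sumo Logic
host-N collectd --/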

Setting up collectd on each host was fairly straightforward: there are lots of examples and documentation on how to configure most collectd plugins. Less clear was how to aggregate metrics in the second-tier collectd process. The best example we found came from the Librato blog, where they aggregate CPU metrics together to get a single CPU metric per host rather than CPU metrics for every processor, but we wanted to build something a bit grander. So with some hints from Librato and the ever-helpful collectd wiki, we set out to build our tiered collectd system.

The first step in any tiered collectd setup is getting collectd processes in the first tier to ship data to collectd processes in the second tier. Collectd ships with a few plugins that can help you do this, most obviously the network plugin. The trouble, at least for our setup, is that the network plugin operates over UDP. We perform service discovery by registering instances to load balancers with fixed endpoints, but our load balancer, ELB, only supports TCP. And since we wanted to run the second tier of collectd within our normal deployment environment, which uses ephemeral instances created by autoscaling groups, we would have had to introduce a new service discovery mechanism to support the network plugin.
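
For reference, the network plugin pairing we passed on looks roughly like this (the hostname and port are placeholders; 25826 is the plugin's usual default):

## first tier (what we did not use)
LoadPlugin network
<Plugin "network">
  Server "collectd-tier2.example.internal" "25826"
</Plugin>

## second tier (what we did not use)
LoadPlugin network
<Plugin "network">
  Listen "0.0.0.0" "25826"
</Plugin>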

Not wanting to increase our system’s complexity, we reviewed the other available plugins for an alternative to UDP and found the AMQP plugin. The purpose of this plugin is not, as you might expect, to monitor RabbitMQ, but to ship data between collectd processes using AMQP exchanges. And since we already use RabbitMQ in production, it was simple to set up another server to broker collectd messages. The first-tier and second-tier AMQP plugin configs look something like this:

## first tier
<Plugin "amqp">
  <Publish "collectd-amqp">
    Host "collectd.rabbitmq.private.adstage.io"
    Port "5672"
    VHost "/"
    User "guest"
    Password "guest"
    Exchange "collectd-amqp"
  </Publish>
</Plugin>
## second tier
<Plugin "amqp">
  <Subscribe "collectd-amqp">
    Host "collectd.rabbitmq.private.adstage.io"
    Port "5672"
    VHost "/"
    User "guest"
    Password "guest"
    Exchange "collectd-amqp"
    ExchangeType "fanout"
    Queue "collectd-amqp"
    QueueAutoDelete false
  </Subscribe>
</Plugin>

Update 2017-01-30: Note the use of a named queue on the second tier with auto-delete set to false. We had to add this to make the system robust to unexpected second-tier collectd restarts. Without it, we found that the automatically generated queues sometimes failed to delete and would continue to accumulate messages that were never processed, overloading our RabbitMQ server. Using a named, undeleted queue also lets you process messages sent while the second-tier collectd is not running, at the cost of those messages queuing up in RabbitMQ; for us that is desirable because it eliminates any gaps in monitoring during machine restarts. Collectd still manages the named queue for you automatically, so there is nothing to configure in RabbitMQ yourself.

With the first tier shipping data to the second tier, we now needed to aggregate it. This is where the CPU aggregation example came in handy. For each first-tier plugin (CPU, disk, etc.), we set up an aggregation plugin config to group together the hosts’ time series. Since we wanted to group our metrics by autoscaling group, we had each first-tier host’s collectd report its Host as the autoscaling group name rather than the actual host name (see the sketch after the aggregation config below). Then we could group by “Host” in our aggregations like so:

## second tier
<Plugin "aggregation">
  <Aggregation>
    Plugin "df"
    Type "df_complex"
    GroupBy "Host"
    GroupBy "TypeInstance"
    CalculateAverage true
    CalculateMinimum true
    CalculateMaximum true
  </Aggregation>
</Plugin>
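
The Host override on the first tier is just collectd’s global Hostname option. A minimal sketch, assuming your provisioning scripts know the autoscaling group name at boot (“my-asg” is a placeholder):

## first tier
Hostname "my-asg"
FQDNLookup false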

Now we had our second-tier collectd aggregating our first-tier metrics, but unfortunately it was still writing the first-tier metrics through to Graphite/Sumo Logic. In a smaller setup this might be okay, but we were generating tens of thousands of metrics per minute, and most of those were individual host metrics no one was ever going to look at, because we work in terms of autoscaling groups, not individual hosts. It would be nice to keep them, but for us it wasn’t economical. So we needed a way to filter out the first-tier metrics and send only the aggregation metrics generated by the second tier on to Sumo Logic.

Collectd’s filter chains are designed for exactly this scenario: they provide a generic way of defining how data flows between plugins in collectd. Following Librato’s CPU example, we initially created a rule for every plugin to capture its data and send it through aggregation.

## second tier
<Chain "PostCache">
  <Rule>
    <Match regex>
      Plugin "^df$"
    </Match>
    <Target write>
      Plugin "aggregation"
    </Target>
    Target stop
  </Rule>
  Target "write"
</Chain>

The trouble with rules like these is that you have to write one for every plugin you use in the first tier. They also blacklist rather than whitelist the metrics sent to write plugins, so if someone turns on a new plugin in the first tier it can write tens of thousands of metrics per minute into Sumo Logic unless they also configure aggregation correctly in the second tier. To address these issues, we switched to a chain that sends only aggregated data to Graphite and sends everything else to the aggregation plugin.

## second tier
<Chain "PostCache">
  <Rule>
    <Match regex>
      Plugin "^aggregation$"
    </Match>
    <Target write>
      Plugin "write_graphite/SumoLogic"
    </Target>
    Target stop
  </Rule>
  <Target write>
    Plugin "aggregation"
  </Target>
</Chain>

<Plugin "write_graphite">
  <Node "SumoLogic">
    Host "localhost"
    Port "2003"
    Protocol "tcp"
    LogSendErrors true
    StoreRates true
    AlwaysAppendDS false
    EscapeCharacter "_"
  </Node>
</Plugin>

Now when we turn on a new plugin in collectd, we only have to update the first-tier config and add an aggregation block in the second tier. The second-tier collectd process then ships the data to a Sumo Logic agent via the graphite protocol, and Sumo Logic gives us pretty charts.
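
For a sense of what actually goes over the wire, an aggregated df data point sent via the graphite plaintext protocol ends up looking roughly like this (exact naming depends on your group names, types, and escaping; “my-asg” is a placeholder):

my-asg.aggregation-df-average.df_complex-free 123456789.0 1481587200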

The entire setup process took us a few days, but hopefully you can set up a similar tiered collectd install in just a few hours by following our example. If you see any ways we could improve our system, please let us know in the comments. And if you’re interested in working on site reliability challenges like this for our growing marketing automation platform, be sure to check out the AdStage careers page.
