Log sampling with Papertrail

Tarjei Husøy · Megacool Blog · Aug 9, 2018
[Screenshot from Honeycomb showing a latency spike]

At Megacool we use Papertrail for log management, and we’re quite happy with their offering. The online viewer is easy to use for a quick glance at what’s going on, we can use the CLI for fancier filtering with grep and the other standard CLI tools, and it’s reasonably priced. We’re already frequent users of structured logging, using logfmt to format all of our logs. A sample log line looks like this:

request_id=99095c25f801dffd2acb53ed490d87c1 method=POST path="/events" status=201 total_time=0.165
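
For illustration, here’s a minimal Python sketch of how a service might emit a line like this. The logfmt helper is hypothetical and its quoting rules are simplified:

import time
import uuid

def logfmt(**fields):
    # Hypothetical helper: render key=value pairs, quoting values
    # that contain spaces or slashes (like the path above).
    rendered = []
    for key, value in fields.items():
        value = str(value)
        if " " in value or "/" in value:
            value = f'"{value}"'
        rendered.append(f"{key}={value}")
    return " ".join(rendered)

start = time.monotonic()
# ... handle the request ...
print(logfmt(
    request_id=uuid.uuid4().hex,  # pseudo-random 32-character hex string
    method="POST",
    path="/events",
    status=201,
    total_time=round(time.monotonic() - start, 3),
))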

We find that logfmt strikes a good balance between human- and machine-readable logs. It’s also easy to work with using hutils; to get a crude overview of status codes per endpoint, you could run papertrail | lcut status path | sort | uniq -c | sort -n:

 1 401 /events
12 400 /events
53 200 /shares
90 201 /events

We love using Honeycomb for most of these kinds of queries since it visualizes trends much better and enables alerting (and much more), but we still like having logs around. However, with decent traffic levels it’s easy to end up with huge log volumes, which forces you onto pricier plans and makes everything slower. To combat this you can skip logging certain events, or you can shorten each individual log line, but both of these diminish the value of the service.

Papertrail has support for inbound filters, which drop log lines before they’re processed, but the filters are all-or-nothing, and we want to always have some output from all of our services to know they’re still alive. Thus we’d like some kind of sampling, enabling us to say that a particular filter should only let 10% of the log lines through, ignoring the rest. While Papertrail doesn’t support sampling like this natively, we can achieve the same effect with their regex filters.

We can build this filter because we have the request_id field in our logs, a pseudo-random 32-character hex string. The presence of this field, combined with the easy greppability of logfmt, enables writing a filter like this:

request_id=[2-9a-f].* method=POST path="/events" status=2

This filter only lets through successful requests to /events with request_ids that start with 0 or 1, effectively only processing 12.5% of the incoming logs. If you need a lower sampling rate, you can require a longer run of characters, like request_id=([1-9a-f]|0[1-9a-f]).*, which only lets through request_ids that start with two consecutive 0s, effectively setting a sampling rate of 0.4%.
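
To sanity-check those numbers, here’s a quick simulation (hypothetical, not part of our setup) that applies the same patterns as drop filters to randomly generated request_ids:

import re
import uuid

# Papertrail drops lines that MATCH an inbound filter, so count the survivors.
drop_7_in_8 = re.compile(r'request_id=[2-9a-f]')
drop_255_in_256 = re.compile(r'request_id=([1-9a-f]|0[1-9a-f])')

lines = [f'request_id={uuid.uuid4().hex} method=POST path="/events" status=201'
         for _ in range(100_000)]

for name, pattern in [('12.5% filter', drop_7_in_8), ('0.4% filter', drop_255_in_256)]:
    kept = sum(1 for line in lines if not pattern.search(line))
    print(f'{name}: kept {kept / len(lines):.2%}')

# Prints roughly:
# 12.5% filter: kept 12.50%
# 0.4% filter: kept 0.39%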

A benefit of using a request id like this for filtering, instead of sampling each line at random, is that as long as all components in your system include the request id in their logs, you’ll have logs from all components for a given request. Thus if a request touches your edge servers, your auth service and your API server, logs from all three will always be present for any request that made it through the filter.
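
One way to picture the determinism (a hypothetical sketch, not how Papertrail evaluates it): every component can derive the same keep/drop decision from the id alone, so a sampled request keeps its logs end to end.

def passes_filter(request_id):
    # Mirrors the 12.5% filter above: only ids starting with
    # 0 or 1 survive (2 of 16 hex digits).
    return request_id[0] in '01'

# The sample request from earlier starts with '9', so every
# component's line for it is dropped; an id starting with '0'
# would be kept everywhere.
print(passes_filter('99095c25f801dffd2acb53ed490d87c1'))  # False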

While tweaking our setup, we found it useful to use the line-mode of pipe viewer (pv) combined with Papertrail’s CLI to estimate the impact of different filters. To estimate the impact of the filter above, first establish the current logging rate with a no-op request_id filter:

$ papertrail -f \
| egrep 'request_id=[0-9a-f].* path="/events" status=2' \
| pv --line-mode --average-rate >/dev/null
[ 20 /s]

Let this command run for a while until the rate stabilizes. Then adjust the filter to get a volume you’re happy with that fits within your quota:

$ papertrail -f \
| egrep 'request_id=[0-2].* path="/events" status=2' \
| pv --line-mode --average-rate >/dev/null
[ 3.6 /s]
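
As a sanity check, [0-2] matches 3 of the 16 possible leading hex digits, so at 20 lines/s unsampled we’d expect about 20 × 3/16 ≈ 3.8 lines/s to pass, which lines up with the observed 3.6/s.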

Just remember that this query shows what is let through, not what is actually blocked. So when you’ve found a sampling rate that gives a volume you’re happy with, flip the condition in the filter around, i.e. the request_id=[0-2] above should become request_id=[3-9a-f] as a Papertrail filter.
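
If writing the complement by hand feels error-prone, a tiny helper (hypothetical, just for illustration) can generate the drop class from the leading digits you want to keep:

HEX_DIGITS = '0123456789abcdef'

def papertrail_drop_class(keep):
    # Given the leading hex digits you want to KEEP, return the
    # character class of digits Papertrail should DROP.
    dropped = ''.join(c for c in HEX_DIGITS if c not in keep)
    return f'[{dropped}]'

print(papertrail_drop_class('012'))  # [3456789abcdef], i.e. [3-9a-f]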

It’s worth mentioning that these filters assume that the ordering of the fields is consistent. Heroku’s router already logs like this, it’s easy to get nginx to do the same with a logging format like request_id=$request_id method=$request_method .., and you can use our flask-events helper to emit logs like these from your Flask application and submit the data to Honeycomb as well.

This technique has helped us at Megacool keep our log volume under control. Do you have any other logging tricks up your sleeve? Let us know!
