Production logging tips & tricks

Jake Ginnivan · Pixel and Ink · Apr 25, 2020

Logging is often an afterthought, but it is a really important part of a production system. Whenever you have an issue in production, your logs are the first thing you will reach for.

Tools like CloudWatch, Sumo Logic and plenty of others make searching your logs really easy. There is a catch though: logs can get expensive, so it often feels like you are trading off logs you may need against cost.

This post collects a number of lessons and techniques we have put in place to keep our costs down while ensuring we can still find out about and debug production issues.

We are using Node.js, Express and pino in our products, but many of these tips can be applied to any stack or HTTP server. I will not be covering the tools for sorting and searching your logs in this post, just the things your development team should do to ensure your logs do not get out of hand.

tl;dr

  • Use structured logging / JSON log messages
  • Assign request correlation ids (and handle x-request-id/x-correlation-id)
  • Increase log level on error
  • Sample request logs
  • Use static log messages / do not use string interpolation in log messages
  • Log to stdout (do not log to a file; that’s the process manager’s job)
  • We have open sourced logging helpers in Etrigan Logging, like an express logging middleware which can sample and escalate the log levels on error

Use JSON log messages

To start with, you want your application to be writing structured log messages. It is much easier to start with structured information than to have to parse structure out of unstructured text later. We use pino, but bunyan, winston and plenty of other Node loggers make it easy to create JSON log messages.

We even use structured logging in our Lambdas instead of console.log for the same reasons.
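For example, with pino each log call becomes a single line of JSON:

```ts
// Minimal pino setup: structured JSON written straight to stdout
import pino from 'pino';

const logger = pino();

// Emits something like:
// {"level":30,"time":1587772800000,"msg":"Article created","articleId":42,...}
logger.info({ articleId: 42 }, 'Article created');
```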

Assign request correlation ids

Being able to quickly collect all the log messages related to a request is critical when debugging. Often the error alone is not enough context; you need the log messages leading up to it.

At the beginning of a request, ensure one of your first middlewares creates a child logger and assigns a request id.

It should also check for the X-Request-Id and X-Correlation-Id headers and use those values rather than assigning a new request id. This allows request ids to flow across service boundaries.
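A minimal sketch of such a middleware, assuming pino and Express (the middleware name and the req.log property are illustrative, not the Etrigan API):

```ts
import { v4 as uuid } from 'uuid';
import pino from 'pino';
import type { Request, Response, NextFunction } from 'express';

const logger = pino();

export function requestIdMiddleware(req: Request, res: Response, next: NextFunction) {
    // Prefer an incoming id so the trail continues across service boundaries
    const id =
        (req.headers['x-request-id'] as string) ||
        (req.headers['x-correlation-id'] as string) ||
        uuid();

    // The child logger stamps every message for this request with the id
    (req as any).log = logger.child({ requestId: id });
    res.setHeader('x-request-id', id);
    next();
}
```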

Increase log level on error

There is always a trade-off with logging: verbosity vs volume. When errors occur we want verbose logs so we can fully diagnose the issue and trace the conditions which led to it.

There is a Catch-22 here though: by the time the error happens, the context leading up to it has already been discarded by your logger.

Our solution to this is something we have called ‘log escalation’. We set our request log level to warn, and our logging middleware buffers our log messages.

When an error occurs, we set the request logger to debug level and flush the buffered logs. This way we only get info and debug level logs for the requests which had errors, ensuring we can still debug production issues. Note that we only set the request logger to warn, not the whole service: requests make up the bulk of our log volume and grow with traffic, while the rest of our application's logs are pretty stable.

Also, because the logs are buffered, you need some sort of scope with a clear beginning and end so the buffered logs do not cause a memory leak in your application.
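A simplified sketch of the idea, assuming pino and Express (the shape of req.log and the middleware name are illustrative, not the Etrigan implementation):

```ts
import pino from 'pino';
import type { Request, Response, NextFunction } from 'express';

const logger = pino();

type Buffered = { level: 'debug' | 'info'; obj: object; msg: string };

export function escalatingLogger(req: Request, res: Response, next: NextFunction) {
    const log = logger.child({ requestId: req.headers['x-request-id'] });
    const buffer: Buffered[] = [];

    (req as any).log = {
        // Quieter levels are buffered rather than written
        debug: (obj: object, msg: string) => buffer.push({ level: 'debug', obj, msg }),
        info: (obj: object, msg: string) => buffer.push({ level: 'info', obj, msg }),
        warn: (obj: object, msg: string) => log.warn(obj, msg),
        error: (obj: object, msg: string) => {
            // Escalate: switch the request logger to debug, replay the
            // buffered context, then log the error itself
            log.level = 'debug';
            for (const entry of buffer) log[entry.level](entry.obj, entry.msg);
            buffer.length = 0;
            log.error(obj, msg);
        },
    };

    // The request is the scope: when the response finishes, drop the buffer
    res.on('finish', () => {
        buffer.length = 0;
    });
    next();
}
```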

Sample request logs

Most requests are just business as usual, and they are where the bulk of your logs will come from. Their volume also increases as traffic increases.

By sampling your production logs you still get an idea of what is happening in your system without incurring unnecessary cost. Before sampling, our logs were in excess of 10 GB a day, and that is with a decent hit rate on our CDN. Your system or site may not receive this volume of requests, but it is worth putting sampling in early so that if you find yourself approaching the limits of your plan you can avoid a nasty bill.
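A sketch of one way to sample, assuming Express (the 5% rate and names are illustrative):

```ts
import pino from 'pino';
import type { Request, Response, NextFunction } from 'express';

const logger = pino();
const SAMPLE_RATE = 0.05; // keep roughly 1 in 20 business-as-usual requests

export function sampledRequestLog(req: Request, res: Response, next: NextFunction) {
    res.on('finish', () => {
        // Failed requests are always logged; healthy ones only when sampled
        if (res.statusCode >= 500 || Math.random() < SAMPLE_RATE) {
            logger.info(
                { status: res.statusCode, method: req.method, path: req.path },
                'Request completed',
            );
        }
    });
    next();
}
```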

Use static log messages

A common mistake we found in our logs was using interpolated strings for log messages. An example might be something like log.info(`Article ${article.id} created`).

Doing this greatly impacts your ability to extract metrics from your log messages, or search for all instances of a particular type of log message.

Instead, put any dynamic information in the log object: log.info({ articleId: article.id }, 'Article created').
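Side by side:

```ts
// Hard to search or graph: every message is unique
log.info(`Article ${article.id} created`);

// Easy to search or graph: one static message, data in structured fields
log.info({ articleId: article.id }, 'Article created');
```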

Log to standard out

We started off having our application write its own log files; it then became the responsibility of our application to ensure the file had the right locks and so on. It also made our config approach more complicated: the log file path was in the config, but if there was an error reading the config, that error could not be logged to the file.

Instead, have your applications simply log to standard out and ensure your process manager collects the logs. This also gives you a much nicer migration path to tools like Fargate.
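With pino, standard out is the default destination, so there is nothing extra to configure:

```ts
import pino from 'pino';

// pino writes to stdout by default: no file path in config, no locks,
// no rotation; the process manager handles collection
const logger = pino();

logger.info('Service started');
```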

The code — Etrigan logging

We have open sourced our express logging middleware in Etrigan Logging. Etrigan is a collection of packages taking care of the essentials in all our services, like logging, config and feature toggling.

Take a look and let us know if it is useful to you. And if you have any other production-focused logging tips, let me know.
