Logging — let’s do it right!

Tal Suhareanu
HiredScore Engineering

--

Throughout my career I’ve been writing logs. What I didn’t know is that, all this time, I was doing it wrong.

So what changed?

Shortly after joining HiredScore, I found myself in a fast-moving startup. For a while, we were using the ELK stack for logging purposes. However, in order to get the most value, we needed to improve the way we utilize it. For example, our logging pipeline suffered from many issues due to its unstructured nature:

  • Logs were hard to parse in Logstash. Since we had no defined structure, we had to use multiple grok filters in our Logstash configuration. Here's just a taste (see the snippet after this list).
  • Not all required fields were searchable, so it was almost impossible to filter the logs you needed to see. You would have to go over unnecessary logs to arrive at the relevant ones. For example, you might have needed to search logs from a specific service or flow, but if there was no field designating these, it could be very hard to find what you needed.
  • This meant that developers had to search for free text in Kibana. In order to do so, you have to know what you’re looking for and write a complicated and inefficient query.
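
Here's an illustrative sketch of what those grok filters looked like (the patterns are simplified placeholders, not our actual configuration):

filter {
  grok {
    # pull a timestamp, a level and the free-text remainder out of every line
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:msg}" }
  }
  grok {
    # one more pattern just for transaction lines, e.g. "Bert paid Ernie 20 USD"
    match => { "msg" => "%{WORD:from} paid %{WORD:to} %{NUMBER:amount} %{WORD:currency}" }
    tag_on_failure => ["_not_a_transaction"]
  }
}

And a pattern like the second one has to be written (and maintained) for every new message format a developer comes up with.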

To Summarize:

  • logs are hard to parse
  • no way to filter logs by contextual fields
  • developers work ineffectively when searching for logs
  • it's hard to build visualizations/dashboards over distributed data

So our backend guild took it upon themselves to tackle the logging improvements task. As I began to explore the vast world of the internet, in the back of my mind I remembered: “The Twelve-Factor App” and specifically the part about logs (https://12factor.net/logs) -

Logs are the stream of aggregated, time-ordered events

For most server-based applications, things are happening all the time in the background. And by things, I mean events. Logging these events can provide visibility into the behavior of a running app. This is why centralized logging systems are used and why developers need to interact differently when using an event-based application versus a console application.

Logging is different from talking: you need to be concise and state what happened. Logging is our window into production.

The logs of yore

I'll start with a few examples (we're using Python and the standard logging library). Let's say you have a transaction app and Bert wants to buy something from Ernie for $20. You might write this log line using positional args, which will yield "Bert paid Ernie 20 USD":

logger.info('%s paid %s %d %s', record['from'], record['to'], record['amount'], record['currency'])

Yikes. So what's wrong here? Well, first of all, it's not easily searchable. You'll need to use a regex, or search for just part of the log, which can return results from logs you didn't want to see just because they share the keyword you're looking for.

How can you aggregate logs based on that one message field? It's easier to search by the intent of the event and then use sub-filters to drill down to the details.

For example, we have these transactions:

a sample of unstructured logs
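
For illustration, they might look something like this (these lines are made up, but in the same free-text spirit):

2022-05-14 10:01:32 INFO Bert paid Ernie 20 USD
2022-05-14 10:02:10 INFO Ernie paid Bert 3500 CAD
2022-05-14 10:03:45 INFO Bert paid Big Bird 3750 CAD
2022-05-14 10:04:02 INFO Bert paid Ernie 120 USD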

Now if I want to search for payments that were made only by “Bert” and are between 3000 and 4000 CAD… 🤔. Even with a nifty regex we would not be able to pull this data.

And if you use Logstash/Filebeat to collect data, you'll want to parse these kinds of logs… what a headache.


The logs of now

Let’s try to think of this as an event:

logger.info('New Transaction', **{'from': record['buyer']}, to=record['seller'], amount=record['amount'], currency=record['currency'])

(from is a reserved word in Python, so we pass it via dict unpacking.)

Why is it better?

We have a message indicating that there’s a new transaction.
And we have some tags for that event, like the amount, the seller, the buyer.

That way, in case of another event, let’s say from Ernie to Bert, the structure of the event stays the same, and only the values are different.

And since it's Python, you can also seamlessly pass kwargs into the event, which looks like this:

logger.info('New Transaction', **transaction_details)

This makes the Logstash work a lot easier: the Logstash configuration becomes a lot leaner (you don't have to use grok filters and such), or you can give up Logstash altogether (we did!).

Let’s revisit the payments example:

a sample of structured logs
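
For illustration, the same transactions rendered as structured (JSON) events might look like this (made-up values again):

{"event": "New Transaction", "from": "Bert", "to": "Ernie", "amount": 20, "currency": "USD", "timestamp": "2022-05-14T10:01:32Z"}
{"event": "New Transaction", "from": "Ernie", "to": "Bert", "amount": 3500, "currency": "CAD", "timestamp": "2022-05-14T10:02:10Z"}
{"event": "New Transaction", "from": "Bert", "to": "Big Bird", "amount": 3750, "currency": "CAD", "timestamp": "2022-05-14T10:03:45Z"}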

Now that we have the metadata attached as key-value pairs, it’s easier to query the data and even build dashboards. So if I write this query:

“New Transaction” AND from: Bert AND amount: [3000 TO 4000] AND currency: CAD

We can see that I can filter results much more easily and accurately, with less stress on Elasticsearch.

Legacy Code — The Challenge

The code is already out there, and log calls especially are everywhere. We needed a solution that fulfills these requirements:

  • It should be backward compatible with our code and Python's standard logging library.
  • It should require little to no changes to our existing code.
  • It should support structured logging out of the box.
  • It should have the ability to pass custom keywords as attributes.

There are a few ways to accomplish this: implement a new piece of code for developers to adopt, or change how the current usage works internally.

We started with the first approach and attempted to implement a new JsonLoggingService and replace all logging instantiations and usages, but it didn't work (too much work and too little awareness among developers).

The better way, it seemed, was the second one: take our legacy logging wrapper and replace its implementation internally.

Fortunately, our code uses a wrapper (called LoggingService) to instantiate loggers, and it's used almost exclusively across our codebase for logging purposes.

So how do we get from the logs of yore to the logs of now?

Structlog FTW

We decided to use structlog, a Python package (www.structlog.org) that integrates with Python's standard logging library. Normally this would still mean changes in a lot of repos where logging.getLogger(…) is called. Luckily, since we use a wrapper, all we had to do was replace the underlying implementation.

And, well, it offered a bunch of other things we needed, mostly:

  • ability to pass custom key-value pairs as tags — this lets us create a simple event and pass custom related data
  • processor pipeline — this powerful feature lets you set up middleware that can go over all the data and manipulate it. With this, you can filter sensitive data or act on any other information (like PII) passed so you control the data.
  • data-binding — the ability to set constant metadata on all messages that go through our logger. For example, at HiredScore we use internal context metadata, represented by tags, which can be bound to every log line (see the sketch after this list).
  • structuring/formatting — as mentioned before, we use the ELK stack for logs, so rendering our logs as JSON is a huge benefit: Elasticsearch will chew those up no problemo.
  • and most importantly — it integrates with Python's standard logging library, so it's backward compatible with our system
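
A quick sketch of what data-binding looks like in practice (the tag names here are made up for illustration):

import structlog

logger = structlog.get_logger()

# bind constant context once; every event logged through this logger carries it
logger = logger.bind(service='payments', env='production')

logger.info('New Transaction', amount=3500, currency='CAD')
# -> the emitted event also contains service='payments' and env='production'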

Code Implementation

Let’s have a look at an example of the legacy code:
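
Here's a stripped-down sketch of what such a wrapper might look like (illustrative; our real LoggingService has more going on):

import logging


class LoggingService(object):
    def __init__(self, name):
        # a plain stdlib logger; this is the line we're about to change
        self.logger = logging.getLogger(name)

    def info(self, msg, *args, **kwargs):
        self.logger.info(msg, *args, **kwargs)

    def error(self, msg, *args, **kwargs):
        self.logger.error(msg, *args, **kwargs)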

Changing the logger instantiation line to:

self.logger = structlog.get_logger()

is where the magic starts. But wait, that can't be it, right?
Right. We need to configure our logger:
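
A minimal configuration sketch (the processor list here is illustrative; pick what fits your pipeline):

import logging

import structlog

# stdlib logging still handles the actual output, so give it a bare format
logging.basicConfig(format='%(message)s', level=logging.INFO)

structlog.configure(
    processors=[
        structlog.stdlib.filter_by_level,                 # respect the stdlib log level
        structlog.contextvars.merge_contextvars,          # pull in context bound via contextvars (structlog >= 20.1)
        structlog.stdlib.add_logger_name,                 # add the logger name as a field
        structlog.stdlib.add_log_level,                   # add the level as a field
        structlog.stdlib.PositionalArgumentsFormatter(),  # keep legacy '%s'-style calls working
        structlog.processors.TimeStamper(fmt='iso'),      # ISO timestamp on every event
        structlog.processors.format_exc_info,             # render exc_info into the event
        structlog.processors.JSONRenderer(),              # finally, emit the event as JSON
    ],
    logger_factory=structlog.stdlib.LoggerFactory(),
    wrapper_class=structlog.stdlib.BoundLogger,
    cache_logger_on_first_use=True,
)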

Now that we've changed the implementation, we can start injecting our own processors and adding metadata to each log. If you're using frameworks like Celery or Flask, you can set up before/after task or request callbacks/signals in which you define and bind your metadata.
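
For example, here's a rough sketch with Flask and structlog's contextvars support (it assumes merge_contextvars is in the processor list, as in the configuration above; the bound fields are just examples):

import uuid

import structlog
from flask import Flask, request

app = Flask(__name__)


@app.before_request
def bind_request_metadata():
    # start each request with a clean context, then bind per-request metadata;
    # every log line emitted while handling this request will carry these tags
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=str(uuid.uuid4()),
        path=request.path,
    )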

As for the processor factory: all of structlog's built-in processors can be found here: https://www.structlog.org/en/stable/api.html#structlog.processors.format_exc_info
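
And a custom processor is just a callable. Here's a sketch of a hypothetical one that masks a sensitive field (not our actual processor factory):

def mask_credit_cards(logger, method_name, event_dict):
    # a processor receives (logger, method_name, event_dict) and returns the event_dict,
    # so it can inspect or rewrite any field before the renderer sees it
    if 'credit_card' in event_dict:
        event_dict['credit_card'] = '****'
    return event_dict

# add it anywhere before the renderer in the processors list:
# processors=[..., mask_credit_cards, structlog.processors.JSONRenderer()]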

Conclusion

  • start writing logs as events and not as prose
  • structlog can power up your logs with metadata and tags and make log shipping a breeze
  • structured logs are easily consumed by a machine, rather than being strings we need to parse (see https://www.kartar.net/2015/12/structured-logging/)
  • Personally, this one sentence from structlog’s docs sold me:

“You can stop writing prose and start thinking in terms of an event that happens…“

Extra benefits we get:

  • We're leveraging the colorful local dev format (using structlog's builtin ConsoleRenderer) for easier local debugging in the PyCharm console (see the sketch after this list)
  • Ability to query data from Grafana (we use Grafana also for metrics)
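
A small sketch of how the renderer can be swapped per environment (the IS_LOCAL flag is made up for illustration):

import os

import structlog

# pretty, colored output for local debugging; JSON everywhere else
if os.environ.get('IS_LOCAL'):
    renderer = structlog.dev.ConsoleRenderer(colors=True)
else:
    renderer = structlog.processors.JSONRenderer()

# the renderer then goes last in the processors list passed to structlog.configure(...)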

I hope you enjoyed reading this article and hopefully, it will help you be a better developer. ✌️&❤️
