We Need to Make Our Logs Useful

A Lesson in Observability

Saif Uddin Mahmud
Dabbler in Destress
6 min read · Sep 1, 2021


Photo by Radek Grzybowski on Unsplash

Trawling Through Crap

We get used to calling print() to debug code meant for one-off school projects and MVPs. We may appreciate the usefulness of dedicated Loggers and Debuggers with time, but the unstructured logging habits carry over to production. And someday it bites you really bad…

It bit me when I was investigating my first bug-in-production at Klinify. Something had gone wrong with a large request, triggering timeouts across the board, and the Frontend started spamming the same request. This overloaded the Backend and caused performance issues. Our production infrastructure is distributed across VMs, Azure-managed services, and a Kubernetes cluster. Azure Log Analytics (ALA) was the tool of choice for sifting through logs pouring in at a few hundred lines per second.

I was fighting ALA, trying to narrow down my query to find the problematic lines. There were a few problems making the process difficult, frustrating, slow, and imprecise:

  1. Logs from different services had different structures and nobody had documented this.
  2. It was difficult to discern the source service of a log entry. The only way to do that was to look at the ID given to the pod by AKS — which did not have the image name!
  3. Once a request hit the Backend and bounced around from service to service, understanding the causality became difficult. It was a wild game of checking timestamps and hoping the clock drift was minimal.
  4. There was a lot of noise. You had to tune every query to minimize the false positives and false negatives.

TL;DR: it was not a fun day to be trawling logs.

Why This Happened

You might be thinking, “What a bunch of scrubs. I’d fix the logging problem before it got this bad!” I doubt it. If logging is already in good shape at your organization, it’s probably because it bit them at some point too. Congratulations. Regardless, it’s worth exploring why this happens in the first place.

The scenario is a manifestation of technical debt. It creeps up on you because fixing it is never a top priority for the engineering team. Such problems don’t happen every other day, so I could have fixed the bug, moved on, and dreaded the day I had to trawl again. These are the kinds of not-so-sexy projects that have the largest ROI on your DevX. Now that our Backend team had extra firepower (aka me), we could afford to focus on this. The rest of the article documents the steps we took to achieve this goal.

Standardize Your Log Structure

Klinify’s log structure is documented in our internal wiki. We’re going to unpack parts of it below and follow the reasoning. What I want you to take away is that it’s crucial for all your logs, across the board, to look the same, so you know exactly which fields to extract and which to filter on in your query.
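
Schematically, an entry looks something like this (request_trace and pod_metadata are unpacked in the sections that follow; the remaining fields and the values shown are illustrative):

    {
      "timestamp": "2021-09-01T07:42:13.512Z",
      "level": "ERROR",
      "message": "document sync failed",
      "request_trace": {
        "request_id": "c0ffee42-0b5e-4a1d-9c3e-2f59d1b7a001"
      },
      "pod_metadata": {
        "service": "document-service",
        "pod": "document-service-7d9f6c-x2x1p",
        "version": "v2.3.1",
        "commit": "9fceb02"
      },
      "context": { "doc_id": "abc123" }
    }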

Define Logging Conventions

The developers need to know exactly what to log. Leaving this decision completely to the Devs does 2 things:

  1. It creates mental overhead because they need to decide what to include/exclude.
  2. It increases the chance of missing out on important lines and including unnecessary ones.

Have some rough rules of thumb in your internal docs. The following is an excerpt from our docs. Your exact conventions will depend on your context:
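
To give a flavour of such rules of thumb (illustrative wording, not a verbatim quote from our docs):

  1. Log every request at the service boundary: method, path, status code, and duration.
  2. Log every caught exception at ERROR, with enough context to reproduce it. Never swallow one silently.
  3. Log state transitions (job started/finished, retries, fallbacks), not every loop iteration.
  4. Never log secrets, tokens, or personally identifiable data; log opaque IDs instead.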

Another example of a decision you need to make is what each of your levels (INFO/WARN/ERROR/CRITICAL) should be used for.
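
A reasonable split, and a common one, is something like:

    INFO      routine operations you want a trail of (request served, job finished)
    WARN      something unexpected happened, but the code recovered (a retry succeeded, a fallback kicked in)
    ERROR     a request or job failed; someone should investigate during working hours
    CRITICAL  the service can no longer do its job; someone should be paged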

Once you specify and implement your conventions, the trawler can make certain assumptions and decrease time-to-spot-bug.

Tie Together Your Logs

You need to include a mechanism for this, especially if your architecture is service-oriented. All our requests — both from clients and internal services — are tagged with x-klinify-request-id and logged under request_trace.request_id. Now, whenever we need to investigate a certain request, we filter with that request_id and the chain of causality appears right in front of us.
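
A minimal sketch of the mechanism, here using Flask; the header name is ours, but the helper for outbound calls is made up for illustration:

    import uuid
    from flask import Flask, g, request

    app = Flask(__name__)

    @app.before_request
    def tag_request():
        # Reuse the caller's ID if one came in; otherwise mint a fresh one.
        g.request_id = request.headers.get("x-klinify-request-id", str(uuid.uuid4()))

    @app.after_request
    def echo_request_id(response):
        # Hand the ID back so clients and downstream services keep propagating it.
        response.headers["x-klinify-request-id"] = g.request_id
        return response

    def call_downstream(session, url, **kwargs):
        # Hypothetical helper: every internal call forwards the same ID, and the
        # logging library writes it out under request_trace.request_id.
        headers = kwargs.pop("headers", {})
        headers["x-klinify-request-id"] = g.request_id
        return session.get(url, headers=headers, **kwargs)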

Include Metadata for Quick Filters

The pod_metadata tells you exactly which service and pod the log is from, what version the pod is on, and the commit hash. If you have a bug, you know which version of the code to look at, which is useful during rolling releases. You will also know if a bug is manifesting in a specific pod (OOM and uneven load distribution can cause this).

To add these in, we mandated the use of git tags. You can then write shell scripts, or have your CI/CD pipeline inject the metadata into the pod. For example, this appears in our Dockerfiles:
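
The exact lines will differ, but the idea is a couple of build arguments baked into environment variables (the variable names here are illustrative):

    # The build script / CI pipeline passes these in, e.g.
    #   docker build --build-arg GIT_TAG="$(git describe --tags)" \
    #                --build-arg GIT_COMMIT="$(git rev-parse --short HEAD)" .
    ARG GIT_TAG
    ARG GIT_COMMIT
    ENV APP_VERSION=$GIT_TAG \
        APP_COMMIT=$GIT_COMMIT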

And this appears in our K8s deployment files:
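
Again illustrative rather than verbatim: the container spec hands the pod its own name via the Downward API, and the CI/CD pipeline substitutes the version tag:

    # Under the container spec of the Deployment:
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: APP_VERSION
        value: "__GIT_TAG__"   # placeholder substituted by the CI/CD pipeline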

Define a Logging Library

Instead of reinventing the wheel for every service, you should define a logging library. The library should enforce the logging requirements, and make it super easy to do the right thing. The following is how we use our logging library in the codebase:
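
The names below are stand-ins rather than our real API, but the call sites look roughly like this:

    # klinify_logging is a stand-in name for the internal library.
    from klinify_logging import get_logger

    log = get_logger(__name__)

    def sync_document(doc_id, push):
        log.info("document sync started", doc_id=doc_id)
        try:
            push(doc_id)
        except TimeoutError as exc:
            # Key-value pairs rather than a formatted string: the library turns
            # these into structured JSON fields and attaches the request and pod
            # metadata on its own.
            log.error("document sync failed", doc_id=doc_id, error=str(exc))
            raise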

That’s it! The library handles all the metadata, structuring, validation, etc.

Reduce The Noise

Most of your noise should be gone by now if your conventions are well defined, but it’s worth taking another look. Go through your logs and see which lines you don’t need at all. Drop them. For example, a simple fix we made was to stop logging uWSGI 2xx/3xx responses. This slashed our log volume and saved us quite a bit of money! All it took was the following 3 lines in our uwsgi.ini:
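
In uWSGI, that means switching off the default request log and switching it back on only for error responses; something along these lines:

    disable-logging = true
    log-4xx = true
    log-5xx = true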

Define Reusable Queries

You can store functions right where you query your logs, or in your wiki for easy copy-paste. Define the most common queries for reuse. Don’t reinvent the wheel every time you need to trawl logs.
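
In ALA you can save these as functions in the workspace itself. As an illustration (table and field names depend on how your logs are ingested), a "give me everything for this request" helper might look like:

    // Illustrative KQL; adjust the table/column names to your ingestion setup.
    let logs_for_request = (req_id: string) {
        ContainerLog
        | extend entry = parse_json(LogEntry)
        | where tostring(entry.request_trace.request_id) == req_id
        | project TimeGenerated,
                  service = tostring(entry.pod_metadata.service),
                  level = tostring(entry.level),
                  message = tostring(entry.message)
        | order by TimeGenerated asc
    };
    logs_for_request("c0ffee42-0b5e-4a1d-9c3e-2f59d1b7a001")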

Photo by ActionVance on Unsplash

Results

  1. I don’t dread trawling logs anymore. Since I don’t have to fight my tools, it’s actually become quite fun.
  2. You feel like Batman because you can acquire evidence quickly.
  3. Time-to-spot-problem has gone down dramatically. We’ve had the chance to stress-test the new logging structure during outages.

Now that you’re armed with this wisdom, I hope your log trawling journey is much smoother than mine. Go look at your codebase with a new lens. What can you improve?

Special thanks to Yao Wei for working on a massive chunk of this problem and making my work way easier!
