Inside The Black Box

Patrick Berard
Unpacking Trunk Club
Jul 11, 2017 · 11 min read

TL;DR Logging can save you untold amounts of stress when dealing with black box systems.

At Trunk Club, we use a microservice architecture. This allows us to let each service focus on one thing, organize our teams around what we do best, and try out new technologies on a smaller scale. We have services for everything from chatting with customers to getting product information. Managing a lot of services gets tricky, so we put a lot of effort into monitoring them. Ultimately, this allows us to abstract concepts away into services, making development easier.

Abstraction is a powerful tool because it lets us ignore complexity. If I want to send a nice shirt to a customer, I can get a product from one service and tell another service to pack that product into a trunk at the warehouse. As if by magic, it just happens. How do I know that everything happened exactly as I expected? Without looking inside the other services, I’m not certain that the correct product made it into the right trunk, or if it’s even going to the proper customer! We expose ourselves to the risk of ignorance by abstracting away complexity into black box services. While logging shouldn’t be the only monitoring tool you use, you should know how to do it well.

Logging effectively can reduce the risk of abstracting complexity.

Logging Service-Oriented Architectures

Microservices have a lot of benefits, but they are not without drawbacks. Among the drawbacks is reduced visibility. How do you know what goes on inside these microservices? What went into the black box? What came out? What didn’t make it through? Imagine if you already had the answers to these questions sitting in a file somewhere. It’s less exciting than imagining t-rexes trying to arm-wrestle, but more exciting than finding a needle in a haystack. Having a trail of breadcrumbs through your system increases visibility and helps you track down errors.

Sometimes we’re less engineers and more parents nagging a teenager.

Who does the logging? You. The what/when/where/why/how are the questions I set out to answer here, all in the context of microservices. This is our goal: to log the most useful information so that future you knows what happened in the code. Nothing more, nothing less.

What should you log?

What went into the black box

Seeing the actual input to your system (be it an HTTP request or a background job) can be valuable. Debugging is a lot easier when you have real data on hand. It’s also handy to know if someone is sending you bad requests so you can kindly ask them to stop.
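
Here’s a minimal sketch of what that can look like, assuming a Flask service (the logger name and fields are purely illustrative):

    import logging

    from flask import Flask, request  # assuming a Flask service; any framework works

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("inbound")

    app = Flask(__name__)

    @app.before_request
    def log_incoming_request():
        # Log the raw input before any handler touches it.
        logger.info(
            "incoming request: method=%s path=%s body=%s",
            request.method,
            request.path,
            request.get_json(silent=True),  # None when the body isn't JSON
        )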

What came out of the black box

Log that outgoing request!

Keep these questions in mind as you try to stay awake while coding (there’s a sketch right after this list):

  • Does this code send requests to other systems? Consider logging that payload.
  • Did an outgoing request return a response? Consider logging part or all of the response.
  • Does your function return a value? Logging the value can help reveal bugs that testing missed.
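
As promised, a sketch of logging around an outgoing call. It assumes plain HTTP via the requests library, and the warehouse URL here is made up:

    import logging

    import requests  # assuming service-to-service calls go over plain HTTP

    logger = logging.getLogger("outbound")

    def add_to_trunk(trunk_id, product_id):
        # Hypothetical warehouse endpoint; substitute your own service's URL.
        payload = {"trunk_id": trunk_id, "product_id": product_id}
        logger.info("calling warehouse service: payload=%s", payload)

        response = requests.post(
            "https://warehouse.example.com/trunks/items", json=payload, timeout=5
        )

        # Part of the response is usually enough: status plus a slice of the body.
        logger.info(
            "warehouse responded: status=%s body=%s",
            response.status_code,
            response.text[:500],
        )
        return response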

Do you need more information than what the method returned? Probably not. If you start logging more than just the output of a bit of code, then you should probably be returning that additional information as output. Sometimes we care about what happened along the way, like what data got created or what notifications went out.

This is an opportunity to create an object modeled after that output, and yes, it will probably be profoundly ugly. Ask yourself whether this information will be useful when you and your team are looking back at the history of events.
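
A rough sketch of such an output object, with made-up names:

    from dataclasses import dataclass, field

    @dataclass
    class PackTrunkResult:
        # The value the caller actually asked for...
        trunk_id: str
        # ...plus the "along the way" details worth logging.
        created_record_ids: list = field(default_factory=list)
        notifications_sent: list = field(default_factory=list)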

This is specifically directed at what was actually returned from the code. If your code exits early (unexpectedly or otherwise), that’s covered NEXT!

What didn’t come out & why it didn’t

The most important time to log is when an event doesn’t happen as expected.

Why didn’t anything come back? Is it something I said?

When things are working properly, input goes in and output comes out. When nothing happens, that’s when people start panic-asking “Did it work?” “Is it running?” “Is there a bug?”

Any time you filter messages out, log why it was filtered out.

If something unexpected happens along the way and errors begin to bubble up, you start leaving the realm of logging. So yes, log that an error happened, but then let your error handling framework do its job (we use Honeybadger for this).

Other times, we have valid reasons for output not to occur because we purposefully filter messages out. This can happen for so many reasons. I’m going to list a few for dramatic effect:

  • Data validation — can’t work with bad data!
  • We already performed the operation, so we won’t repeat ourselves
  • For some odd reason, we don’t do this operation after 7pm
  • We arbitrarily drop every third request because 2 out of 3 ain’t bad!

Filtering only becomes a problem when we do it silently. Sometimes these validations justify raising an error. For less vital validations, a log message alone should be enough. Now we have a reference for why that thing didn’t happen.

When you have a large amount of validations, it can be just as useful to know the reason a record did make it through as opposed to why it didn’t. Dramatic list powers, go!

  • The record was in our whitelist of acceptable values!
  • A temporary batch of records was allowed through a filter that usually blocks them.
  • The first dozen requests were missing required data, but this request had every required field!
  • This request was a picture of a war corgi and that’s so silly we let it through.

Make sure the reason accurately describes why a record was filtered out. Maybe you have a huge nested validation. If you are performing a number of different checks, log a separate reason for each individual check. It might seem like overkill, but this is actually one of the most valuable times to over-log. The best test: take any filter reason and trace it back to exactly one place in your code.
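
Here’s a sketch of that one-reason-per-check idea. The checks and field names are invented, but they mirror the list above:

    import logging

    logger = logging.getLogger("trunk_packer")

    def should_pack(request):
        # One distinct reason per check, so any log line traces back to
        # exactly one spot in the code.
        if request.get("product_id") is None:
            logger.info("filtered: missing product_id request=%s", request)
            return False
        if request.get("already_packed"):
            logger.info("filtered: already packed request=%s", request)
            return False
        if request.get("hour", 0) >= 19:
            logger.info("filtered: no packing after 7pm request=%s", request)
            return False
        return True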

What might have happened along the way that you care about

While the purpose of abstracting operations into a black box is to avoid having to know what goes on inside, there are times when we do want to know what happens.

Uh oh, did we launch the Yukon Golds or the russets?

What flows were taken through the box

Sometimes we add new versions of our code, or new paths through our logic. When you add that new version, how do you know it’s being used? Might not be a bad time to throw in a log message that says “Hey, this process went down the new path in the code!”

Let’s say I’m adding a new feature only for users in a given role. When I add the new flow, I’ll typically turn it on for myself and a few others first to make sure there aren’t any gremlins.

Then, when things seem to be working, I go check the log to make sure I am actually using the new code. This increases my confidence that the new code is executing correctly.
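
A minimal sketch of that kind of flow logging, with the role check and flow names invented for illustration:

    import logging

    logger = logging.getLogger("trunk_builder")

    def build_trunk_v1(items):  # the existing flow (stubbed out here)
        return {"flow": "v1", "items": items}

    def build_trunk_v2(items):  # the shiny new flow (stubbed out here)
        return {"flow": "v2", "items": items}

    def build_trunk(user_id, in_beta_role, items):
        # Log which path was taken so the logs can confirm the new code runs.
        if in_beta_role:
            logger.info("build_trunk: new flow user_id=%s", user_id)
            return build_trunk_v2(items)
        logger.info("build_trunk: legacy flow user_id=%s", user_id)
        return build_trunk_v1(items)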

When debugging code with a large number of paths, knowing the route that was taken narrows your search down considerably.

Connecting Threads

Let’s say your request needs to stop by a couple of other services for some information on its merry way. During one of its stops, something goes horribly wrong and your request dies.

Yes, logging can help you spot this tragedy. How do you tie this error to the request that came through? For that matter, how do you tie all your log events together?

One very successful pattern I’ve seen is to pass along an identifier for a given request and stick it in every log event along the way. Now your events are tied to your request! Ideally this isn’t something you have to pass along manually; rather, it gets done by whatever glue keeps your services together.
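
Here’s one way to sketch that in Python, using a contextvar and a logging filter; the header name and logger setup are assumptions, not a prescription:

    import logging
    import uuid
    from contextvars import ContextVar

    # Holds the current request's ID for the duration of that request.
    request_id_var = ContextVar("request_id", default="-")

    class RequestIdFilter(logging.Filter):
        def filter(self, record):
            # Stamp every log record with the active request ID.
            record.request_id = request_id_var.get()
            return True

    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s [%(request_id)s] %(message)s")
    logger = logging.getLogger("svc")
    logger.addFilter(RequestIdFilter())

    def handle_request(incoming_request_id=None):
        # Reuse the caller's ID (say, an X-Request-ID header) if one arrived,
        # otherwise mint a fresh one, and forward it on any outgoing calls.
        request_id_var.set(incoming_request_id or uuid.uuid4().hex)
        logger.info("doing work")  # every log line now carries the request ID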

What shouldn’t I log?

Don’t log too much

Not everything needs to be logged. Or at least not all the time. Consider tuning your log level in different environments. Logging every standard event in your flow as a high-level alert is a no-no.

Too much noise makes developers ignore the information altogether.

Too much noise hides valuable information. Try to minimize the number of fields logged; this will make your logs easier to read with the naked eye. You probably don’t need to know what the weather was like when your code ran, nor what was being served for lunch that day.
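
A tiny sketch of per-environment tuning, assuming an APP_ENV variable (yours may be called something else entirely):

    import logging
    import os

    # Chatty in development, quieter everywhere else; LOG_LEVEL overrides both.
    default_level = "DEBUG" if os.environ.get("APP_ENV") == "development" else "INFO"
    logging.basicConfig(level=os.environ.get("LOG_LEVEL", default_level))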

Personally Identifiable Information

It’s easy to go crazy with logging. Aside from avoiding log events for the sake of logging, there are other things not to log. Logging personally identifiable information puts you and your users at risk of identity fraud.

Do not log personally identifiable information.

Do not log personally identifiable information.

NEVER log personally identifiable information.

What is PII? Wikipedia’s article on personally identifiable information is a good primer.

  • Your country may have different laws regarding PII; make sure you are familiar with them.
  • Your company may have a guide regarding PII. Follow it.
  • Do not log credit card information. Included in PII, but worth calling out on its own. If you can avoid storing credit card information at all that’s even better.
  • Do not log passwords. Again, don’t even store them if you can avoid it.
  • Do not log secret keys. Don’t log emails. Don’t log phone numbers. Don’t log names or addresses.
  • Go read the examples section on that Wikipedia page. Seriously. Right now. It will only take a minute to scan that list.
  • If this is the first time you’ve heard about PII and your company 1) has this information and 2) doesn’t do anything to protect it, you should probably get on that. I’m neither a lawyer nor your mother, but it’s worth discussing with your team.
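
One cheap way to guard against accidental PII logging is to redact known-sensitive fields before anything hits the log. A sketch, with a purely illustrative key list:

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("signup")

    # Illustrative list; match it to whatever your company's PII guide names.
    SENSITIVE_KEYS = {"email", "phone", "name", "address", "password", "card_number"}

    def redact(payload):
        # Return a copy of the payload that is safe to log, with PII masked.
        return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v)
                for k, v in payload.items()}

    logger.info("new signup: %s", redact({"email": "a@b.com", "plan": "monthly"}))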

When should you log?

Much of this is going to sound similar to “what” you should log, so I won’t be going into much detail. These are all times when a little voice in your head should chirp “you should probably be logging this.”

  • New endpoints being added or flows changing, so you can tell when the new code is being used.
  • Validation logic that might filter out data.
  • Events that get generated based on your current flow. This will give you proof that they happened.
  • Extremely long blocks of code. Signposting can help you diagnose failure.

I can tell how much you loved that list. Here’s another, just in case you ran out of things to ask yourself in the shower:

  • Do you need full logging on in every environment?
  • How often are you tracing through failure in production?
  • How easy is it to duplicate in a test (or dev) environment?

How? Where?

What method do you use to write to your log files? Where do the logs live?

We write to static text files on a server. Our weapon of choice to search through them is Splunk. How should you do it? My tl;dr answer is “Whatever you can search through easily.”

The greatest value you can get from a log is being able to quickly determine what happened in the past.

If it takes 10 minutes to open your log file and another 5 to search through it, it might be time to invest in a tool.
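
One thing that makes searching much easier is emitting structured events. Here’s a sketch of a one-JSON-object-per-line formatter using Python’s standard logging (not what we run at Trunk Club, just an illustration):

    import json
    import logging

    class JsonFormatter(logging.Formatter):
        # One JSON object per line, so a tool like Splunk can index each field.
        def format(self, record):
            return json.dumps({
                "time": self.formatTime(record),
                "level": record.levelname,
                "message": record.getMessage(),
            })

    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    root = logging.getLogger()
    root.addHandler(handler)
    root.setLevel(logging.INFO)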

These “how” and “where” decisions are usually made at the architecting stage of an application. Many developers will find themselves joining a project long after the decision has been made. At the end of the day, it usually boils down to some logging library that dumps data into static files sitting on a server somewhere. I’ve seen database tables used instead of log files, and custom homebrew libraries instead of the enterprise’s latest logging package, but history just needs to be written. Typically to a directory labeled /log.

Storing the logs is a battle all its own. Depending on the size of your projects, the size of the logs might not be trivial. You have to rotate them, monitor their size, make sure they’re not corrupted — luckily you don’t have to feed them. Yes, big logs can also mean big storage costs. It’s worth it to have recent history readily available. Maybe go hire a log lumberjack.

Why?

In the end, what does all this logging get you?

Visibility

Primarily, visibility. It lets you see the actual state of your code when it was running. What code ran, who ran it, the number of times it ran, etc. You can use this information for metrics on traffic, performance, error rates — you name it. It gives you a better view of the past than you would get from archaeology or private investigation — but you don’t have to dig in a desert and you don’t get to dress up like Sherlock Holmes. Or maybe you do, I don’t know your dress code.

You can leave yourself the trail of breadcrumbs you would like to find when tracking down your next pesky bug.

Reassurance

When something breaks, it is reassuring to know that you can trace through what happened. Pinpointing an issue can take as long as, or longer than, fixing it. Good logging is the first step to reproducing an error and fixing it.

Did you hear that? That means finding and fixing bugs will cost less time and money.

Finding and fixing bugs will cost less time and money

Finding and fixing bugs will cost less time and money

Finding and fixing bugs will cost less time and money.

Ok, glad you got that.

Confidence

Confidence in your system is key. You can build reports or dashboards on your logs to monitor your systems live. This is one more safety measure you can add to your deployment checklist to increase your confidence of success. If a deploy goes out and your dashboards start looking like a seismograph, you might have a problem.

Maintaining those logs can cost some money. Paying for an app to make it easy to search through can cost a lot of money. Think about how expensive it is to have your developers search through logs the hard way. More importantly, think about how expensive it is to keep hiring new developers after yours burn out staring at log files in a terminal all day…scrolling…scrolling…scrolling…

Trunk Club gets untold value from effective logging. We use it every day to monitor new features and to pinpoint areas of improvement in our systems. As we move quickly and change a lot of things, logging improves our confidence that we can recover quickly from anything that goes wrong.

As our systems grow in size and complexity, we abstract away the nitty-gritty details because it’s easy. That’s a calculated risk, and often those nitty-gritty details bite us in the butt. Logging effectively reduces the risk of abstracting complexity.
