Logging For Production
Although logging can be useful during development, developers are not the primary audience for log messages. Software Operators, or System administrators, are the primary users of log messages and potentially, they’ll be using them in high-pressure situations — the humble log file can still have a huge role in incident issue diagnosis. In this post, I’ll go through some practices that have helped me and a few things that have done the opposite!
DO — Log for Operators first
Important information needs to be easy to identify. Once identified, information needs to be accurate and easy for a person, and that person may be under duress, to understand and quickly. Be clear. Be concise. Be specific.
For example: —
There was an error while searching
This message lacks clarity because what is being searched for is unclear. Tell me what you’re searching for, and some identifying information for it.
The message isn’t concise as it could be conveyed using fewer words, without losing any context or detail. Don’t get me wrong, I appreciate good prose (or a bad pun) but there’s a time and place for it, and a log message just isn’t it. There, was, an, while can all go, and still, you’ve got the same meaning.
The message also isn’t specific, as which error has occurred isn’t logged — it is far more interesting to know SQLException or a NullPointerException was thrown, allowing you to narrow down the problem to a database interaction or a code problem than the far more ambiguous ‘error’ has occurred.
DON’T — discard error information
Have you ever suffered from code like this?
try {
doATask();
} catch (Exception ex) {}
If you have, I know your pain; debugging live issues inside a block of code like this is hellish. When something goes wrong it is impossible to tell what it is, and the author knows something will go wrong because they’re actively ignoring it. However, in this case, by doing so they are ignoring a host of other potential problems too!
This type of code is the hallmark of a novice, but this is just the tip of the iceberg for this class of problem — discarding error information is actually an easy mistake to make. The next example is the same type of issue, but manifest in a slightly different way:
@GET @Path("/user/{id}")
public User getUserById(@PathParam String id){
try {
return ResponseEntity.ok(service.findById(id));
} catch (Exception ex) {
log.debug("User not found!");
return ResponseEntity.notFound().build();
}
}
In this example, the author is clearly intending to do the right thing. They know when attempting to retrieve a user that doesn’t exist, an error is thrown, so catching that error and responding as a REST API would be expected to, by returning 404, is completely sensible. The problem is this code is catching any error and, to make matters worse, throws away all information about it. When catching an error this way, be as specific as possible. If the service method throws a ContrivedUserNotFoundException
then catch it specifically (and is there any harm in putting that stack trace in a trace log?) However, if database connectivity was lost, it would be much more useful to spam a stack trace to the logs than a message like “User not found!”.
Going back to the example, a simple fix can have a profound effect when being used in production:
@GET @Path("/user/{id}")
public User getUserById(@PathParam String id){
try {
return ResponseEntity.ok(service.findById(id));
} catch (ContrivedUserNotFoundException ex) {
log.debug("User not found!", ex);
return ResponseEntity.notFound().build();
}
}
The service still behaves the same way, except in the case where a genuine runtime error occurs. The REST endpoint will return an HTTP error code of 500 (Internal Server Error), and the cause can be logged making it far easier to monitor for, track down the cause, and eventually fix.
DO — Log identifying information
Consider the following code:
public List<Product> getProducts(String orderId){
Order order = repository.get(orderId);
if (order != null) {
return order.getProducts();
} else {
log.debug("Order not found");
return Collections.emptyList();
}
}
Even though it’s a bit old-school, it’s easy to see what this code is doing, looking up products in a particular order and in the error case when the order can’t be found, log it. In a development context, you could imagine this log message might’ve come in handy whilst testing, maybe stepping through code seeing a message pop up at the correct time. However, in a log file this might come out as:
Order not found
Order not found
Order not found
Order not found
When viewed retrospectively, this is pretty much entirely useless. Logging the order ID would be a start, but perhaps still not exactly interesting — add which user was authenticated and maybe you’re getting somewhere. Log item IDs, transaction IDs, event IDs, query parameters… just anything that might help you search for or replicate a problem given a vague description from a user. Often you can set your log pattern/format to give you some useful information, what time it is, which class you’re in, what level log, which thread, even what user is authenticated.
N.B. Do be aware it is not always appropriate to log every piece of identifying information, it might be unwise to pump logged credit card numbers from a PCI environment to a less secure ELK stack for example!
DON’T — Log too much!
Often the more information the better. However, there is a balance between the availability of information and drowning out important messages. If you normally only need an ID, log the ID and log more information at lower severity levels that might well be turned off until it is needed. Avoid logging large objects, particularly collections. Remember that a collection that is small in development might be huge in production.
Another common way to log too much is to log exceptions multiple times. The typical anti-pattern is try-catch-log-rethrow. If an exception is dealt with multiple times this way, it will leave you with multiple stack traces for the same cause. Instead, you should either allow the exception to propagate, try-catch-rethrow, or handle the error, being careful not to lose any information, with a try-catch-log-handle.
There’s a cost to every log message, perhaps in disk space, a charge from a cloud provider, or even just that an operator has to sift through it. The more you log, the more it’ll cost and even if you throw the log data away, the more you log, the more frequently you’ll have to do it, limiting perhaps how far back you can look. So log judiciously, and when you do, make it useful.
DO — Log information at the correct level
When dealing with an outage, the first thing you’re likely to do is to look for errors and warnings. So, as eluded to earlier, to make errors easy to identify, they need to be logged at the error level. But what counts as an error? You could say entering incorrect credentials is an error on behalf of the user. Would you want a stream of “ERROR: User entered the incorrect password” whilst grepping for errors?
In the context of operating software, when an error occurs, it needs to be dealt with — a service might need to be restarted, a bug to be fixed, or a change to be rolled back. For example, being unable to access the network, losing connection to a database, unhandled exceptions, threads being exhausted, or bad HTTP requests to an external endpoint are all errors that should be dealt with.
Warnings are bad, but might not be something your application can do anything about, such as an HTTP endpoint returning a server error, or external calls timing out. These can give an operator a clue as to where a problem might lie. Further down the severity tree, importance is reduced and the distinctions between levels often blur. Info might be high-level information useful to an operator and debug for a developer diagnosing problems at lower severity levels.
Try to be consistent with level semantics throughout your app, this will allow the operators to confidently turn off excessive logging in mature, stable components, and use the extra detail in newer, less stable ones. Bear in mind that rogue misleading messages can cause hours of wasted time going down rabbit holes — accuracy is still very important at every level.
DO — Maintain your logs
In Clean Code, when writing on the subject of code comments, Robert Martin asserts
Code changes and evolves. Chunks of it move from here to there. Those chunks bifurcate and reproduce and come together again to form chimeras. Unfortunately, the comments don’t always follow them — can’t always follow them. And all too often the comments get separated from the code they describe and become orphaned blurbs of ever decreasing accuracy
He’s right of course, but it isn’t just code comments that this applies to — log messages can also suffer from the same fate. Uncle Bob’s solution to the problem is to eliminate code comments as much as possible by writing the code in such a fashion that they are unnecessary. The same doesn’t apply to log statements though, removing them just removes the information available. So take care when changing existing code, ask yourself “Does this message make sense? Should I move it? Is it still relevant? Can I remove it?”
If you only take away one thing from this article, when you’re next writing a log message, have some empathy for the poor guy who’ll be reviewing the logs at 3 am to solve a problem. After all, in the DevOps world, this could well be you!
We’re recruiting
Find out about the exciting opportunities at THG here: