
Every Error is an Opportunity

Engineering Insights

Talin
5 min read · Nov 25, 2020


Experience is what you get when you didn’t get what you wanted.

One of the ironic facts of life is that we generally learn more from failure than from success. While it may feel like a bitter pill to swallow at the time, one of the things that you start to appreciate as you get older is the chance to have what can be charitably called “a learning experience”.

This is especially true when it comes to handling software errors.

In my experience, error handling code in large apps is often in a state of neglect (a characteristic it shares with accessibility and localization, all of which have a tendency to be treated as afterthoughts). Most programmers focus on the “happy path”, that is, the code that handles the case where everything went right. Reading the code for a typical large application, one is often struck by how little effort has been spent on error handling. In many cases the author will simply log the error and then continue as if nothing happened.

Even in cases where the programmer has spent considerable effort to do proper error handling, that code is often poorly tested. There will of course be unit tests, but these alone are insufficient. Such tests are always going to be artificial in some sense. Errors are, by their very nature, somewhat unpredictable, so any scenario that attempts to simulate an error will contain assumptions about the frequency and format of errors that may not match reality.

End-to-end tests have a particular challenge when it comes to errors: many of the code paths for error handling are simply not accessible in an end-to-end test. If I am testing a database client, there is no way that I can trigger a “disk full” or “network offline” condition through the client UI. Such conditions can only be provoked by drilling down through the software stack, in a way that is very different from what a real customer would do.
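
To illustrate what “drilling down through the software stack” can look like, here is a minimal Python sketch that forces a “disk full” condition by patching the built-in `open` with the standard `unittest.mock` module. The `save_document` function and its messages are hypothetical, invented for this example; only `mock.patch`, `OSError`, and `errno.ENOSPC` are real standard-library names.

```python
import errno
from unittest import mock


def save_document(path: str, text: str) -> str:
    """Hypothetical high-level save routine; returns a user-facing status."""
    try:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return "Could not save: the disk is full."
        return f"Could not save: {exc.strerror}"
    return "Saved."


def simulate_disk_full() -> str:
    """Run save_document with open() patched to fail as if the disk were full."""
    disk_full = OSError(errno.ENOSPC, "No space left on device")
    # While the patch is active, every call to open() raises disk_full.
    with mock.patch("builtins.open", side_effect=disk_full):
        return save_document("report.txt", "hello")
```

Note how artificial this is: the test decides in advance exactly which call fails and what the failure looks like, which is precisely the kind of assumption a real error may not honor.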

This is why it can be incredibly valuable to “capture” an authentic error in the wild and examine it, especially if it is one that is reproducible.

While developing an application, one will often encounter actual errors: natural errors that arise from real problems, not artificial ones generated by a test script. In such cases, there is a strong temptation for the programmer to fix the problem and move on, to make the error “go away”.

But this is a wasted opportunity. This error may be exercising code that has never been run before, or that has only been lightly tested. You may not have many chances to witness how well or how poorly your error handler is working in a real error situation.

When you encounter such an error (assuming you’re not in a tearing hurry) what you might consider doing is not to try and fix it right away; instead, try and improve the way that the error is handled. Only when you are satisfied that the response to the error is of sufficient quality, do you then attempt to address the root cause of the error. Otherwise, you’ll lose your opportunity to see that code run.

What is meant by “quality” error handling? The answer to this question derives from the fundamental purpose of errors and error handling.

Errors are part of an interactive process involving the user — a feedback loop. The ultimate goal is for the user to take corrective action in response to an error. Thus, a “good” error is one that contains enough information and context for the user to repair the fault. A “bad” error is one that leaves the user stuck and confused.

So, when a coder encounters a “network down” error during development, rather than simply fixing the network problem, it is a good practice to spend some time ensuring that this “network down” condition is communicated to the user in a clear and effective way.

Unfortunately, error handling is not so simple because errors can happen anywhere in the software stack, including parts of the code that are far removed from the customer UX. An error which happens at the level of a disk or database may not be meaningful to the customer. The customer isn’t aware of the low-level details of the implementation; conversely the low-level code may not be aware of the high-level context of what the user is doing. A function to open a file may not know why the file is being opened — it could be saving a document, writing to a log file, storing preferences, or reading a configuration.

Errors typically happen at the bottom of the software stack and bubble upwards until they are presented to the user. Along the way, the error will pass through several layers. Each layer has an opportunity to transform the error or add additional context.
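
As a rough sketch of this layering, the Python below lets each layer attach what it knows to an error on its way up. The names `AppError`, `read_file`, `load_preferences`, and `describe` are hypothetical, chosen only for illustration; the one language feature relied on is exception chaining with `raise ... from`, which keeps the original low-level error attached.

```python
class AppError(Exception):
    """Hypothetical error type that accumulates context as it moves up the stack."""

    def __init__(self, message: str, **context: str) -> None:
        super().__init__(message)
        self.message = message
        self.context = dict(context)


def read_file(path: str) -> str:
    # Lowest layer: knows the path, but not why the file is wanted.
    try:
        with open(path, encoding="utf-8") as f:
            return f.read()
    except OSError as exc:
        raise AppError("cannot read file", path=path) from exc


def load_preferences(path: str) -> str:
    # Middle layer: knows the purpose, so it adds that to the error.
    try:
        return read_file(path)
    except AppError as err:
        err.context["operation"] = "loading preferences"
        raise


def describe(err: AppError) -> str:
    # Top layer: turns the accumulated context into one readable line.
    op = err.context.get("operation", "working")
    path = err.context.get("path", "?")
    return f"Problem while {op}: {err.message} ({path})"
```

The low-level function never needs to know about preferences, and the top layer never needs to parse an operating-system message; each contributes only the context it actually has.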

Ideally, by the time the error reaches the user, there will be enough additional information to be able to clearly express to the user what happened in terms they can understand, and perhaps even offer suggestions on how to recover. “Disk is full — to make more available space, try deleting some files.”

It’s also important to recognize that the user who is seeing this error might not be a native speaker of English; I’ve seen all too many server errors which had no discernible error code other than a non-localized text error message. Proper formatting and display of errors is the job of those layers of the stack that understand locales and internationalization; the job of the layers lower down is to provide sufficient context to be able to do this.
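
One way to divide that work, sketched here in Python under invented names, is for the lower layers to raise an error carrying only a stable code and parameters, and for the display layer to look up a message template for the user's locale. `CodedError`, `MESSAGES`, and `format_error` are all hypothetical, and the hard-coded catalog stands in for whatever translation files a real application would load.

```python
class CodedError(Exception):
    """Hypothetical error carrying a stable code plus parameters, no display text."""

    def __init__(self, code: str, **params: object) -> None:
        super().__init__(code)
        self.code = code
        self.params = params


# Hypothetical message catalog; a real application would load this from
# translation files rather than hard-coding it.
MESSAGES = {
    "en": {"disk_full": "Disk is full. Try deleting some files on {volume}."},
    "de": {"disk_full": "Das Laufwerk {volume} ist voll. Bitte entfernen Sie einige Dateien."},
}


def format_error(err: CodedError, locale: str) -> str:
    """Render the error for a locale, falling back to English, then to the bare code."""
    catalog = MESSAGES.get(locale, MESSAGES["en"])
    template = catalog.get(err.code) or MESSAGES["en"].get(err.code)
    if template is None:
        return f"Error {err.code}"
    return template.format(**err.params)
```

Because the code travels with the error, even an untranslated failure still reaches the user, and the support desk, with something that can be looked up.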

However, the end user is not the only consumer of errors. There are others: engineering and customer support for example. Engineering wants to know which errors are happening the most frequently, so that they can improve the design of the product and reduce the likelihood of such errors. A robust logging and telemetry system may be part of the solution. Customer support also wants to understand the details of an error, in order to assist customers with their problems. Having access to specialized internal knowledge of the product, they might glean more information from a carefully-worded error dialog than a customer might.
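
For the engineering audience, even a very small amount of bookkeeping helps. The sketch below uses Python's standard `logging` module and `collections.Counter` to count errors by code; the function names and the logger name are assumptions made for this example, and a real telemetry system would ship these counts somewhere durable rather than hold them in memory.

```python
import logging
from collections import Counter

logger = logging.getLogger("app.errors")  # hypothetical logger name
error_counts = Counter()  # in-memory tally, keyed by error code


def report_error(code: str, detail: str) -> None:
    """Record one occurrence of an error for engineering and support."""
    error_counts[code] += 1
    # The detail is for people reading logs; it is not shown to the user.
    logger.error("code=%s detail=%s", code, detail)


def most_frequent(n: int = 3) -> list:
    """Return up to n (code, count) pairs, most frequent first."""
    return error_counts.most_common(n)
```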

A “properly handled” error is one that addresses all these needs and considers all these potential audiences.

So the next time you see an error in your application, don’t get frustrated or mad. Instead, be grateful. You’re about to have a learning experience.
