A Few Good Errors

Engineering Insights

Talin
Published in Machine Words
7 min read, Jan 27, 2019

Three things in life are inevitable: death, taxes, and software errors. The challenge for a software engineer is how to handle the inevitable errors gracefully.

While you can (and should!) strive to eliminate errors in your code, perfection is an unreachable goal; you are only human, after all, and some errors are outside of your control. There’s not much your application can do if the hard drive crashes or the network is down.

Given that errors are a fact of life, the question is: how should they be handled?

Life of an Error

The process of recovering from an error is, in most cases, a human problem rather than a technical one. While there are such things as “fault-tolerant” systems that can act autonomously to recover from an error condition, in most cases the process of handling an error means communicating to the user somehow. In other words, we’re throwing the problem into the user’s lap and saying “I give up, tell me what to do next.”

How exactly that should happen depends on the nature of the user and the product. Error handling is a fundamental part of software product design; it should not be tacked on as an afterthought. And, as with any product design, you should start by understanding who your audience is. For example, if the user is a server administrator, errors will be presented very differently than they would be for, say, a musician working with a digital audio workstation.

Even so, there are some common elements that span the entire gamut of application domains. The key one is that error handling is an interactive process; users respond to errors by taking action. This abstract “user story” looks something like this:

  1. The program attempts to perform some operation, possibly triggered in response to user input.
  2. The operation fails, signaling an error condition.
  3. The user is notified of the error via an error message.
  4. The user reads the error message.
  5. Based on the information in the error, the user plans and executes some sort of corrective action to overcome the problem.

Key insight: a “good error” is one that facilitates this process, that helps the user successfully complete the last step: recovering from the error. Conversely, a bad error is one that leaves the user confused or stuck, unable to proceed.

Good errors communicate what happened in terms that the user can understand, and possibly suggest what they should do next. “That user name is already taken; try choosing a different one.”

Unfortunately, doing this is not so easy, and the reason is context.

Upward Propagation

Most errors are first detected at the bottom of the software stack, down at the level concerned with low-level operations such as writing files and doing basic calculations. Users typically aren’t familiar with operations at this level, if they are even aware that they exist.

What the average user really cares about are high-level operations happening at the level of the user interface. They want to know if their work got saved, their document got printed, their email got sent.

But just as the user has no knowledge of the low-level details of the program, often the low-level code has no idea what the user is trying to do! A function that is responsible for writing data to a file might not know that it is part of a “save” function (as opposed to a “set preferences”, “export” or “store in cache” function, or any other operation that involves writing files).

Thus, errors need to propagate upwards from the lowest layers of the software stack to the highest, and as they do so, the error object needs to be transformed: low-level errors get translated into high-level errors that the user can understand, often by adding context. This works because each layer of the software stack that the error rises through understands the context of the current operation, even if the layers below it do not.
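As a rough illustration of this kind of translation, here is a minimal Python sketch. The names SaveDocumentError, write_bytes and save_document are hypothetical, invented for this example. The low-level function knows only about files; the layer above it knows the write is part of a "save" and wraps the failure accordingly, keeping the original error attached for developers.

```python
import os
import tempfile


class SaveDocumentError(Exception):
    """High-level error: the user's document could not be saved."""

    def __init__(self, document_name: str):
        super().__init__(f"could not save document {document_name!r}")
        self.document_name = document_name


def write_bytes(path: str, data: bytes) -> None:
    # Lowest layer: it knows about files, not about why the file is written.
    with open(path, "wb") as f:
        f.write(data)


def save_document(name: str, path: str, data: bytes) -> None:
    # Higher layer: it knows this write is part of a "save" operation.
    try:
        write_bytes(path, data)
    except OSError as err:
        # "raise ... from" keeps the low-level error attached as __cause__.
        raise SaveDocumentError(name) from err


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # The "missing" subdirectory does not exist, so the write fails.
        bad_path = os.path.join(tmp, "missing", "report.txt")
        try:
            save_document("Quarterly Report", bad_path, b"draft")
        except SaveDocumentError as err:
            print(err)                  # context for the user-facing layer
            print(repr(err.__cause__))  # original low-level error, for developers
```

Each additional layer could repeat the same pattern, adding whatever context it alone knows about.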

The end of this process is a message displayed to the user. For a desktop or web application, this will often be in the form of an error dialog or alert.

Internationalization

Note that at some point in this process, we’ll need to account for the fact that the user might not be a native speaker of English (or whatever language the program author uses). So the error message will have to be translated into whatever language corresponds to the user’s current locale preference.

Again, the low-level subsystems might not have any knowledge of the user’s locale. For example, it’s common in single-page web applications that the user’s language choice is a browser preference setting, unknown to the server. In such cases, the language translation can only happen in the client.

A mistake that I often see is server programmers generating error messages that are English strings, expecting those strings to be displayed to the user verbatim. In my own work, we have a strict rule: no readable text comes from the server API unless that text was input by a user (because presumably the user is able to write in a language that they, or their chosen collaborators, can understand). Instead, API methods are required to return error codes, and the JavaScript front-end is responsible for translating those error codes into messages.

This does not mean that those error codes will never be read by humans, however. In addition to the end user, there is a second audience for errors: developers. When debugging an error condition, it’s much easier for a developer to grasp the meaning of “duplicate-username” than “Error 1459”. Thus, even though your error codes are not English sentences, it’s best if they are still comprehensible to a human reader.
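To make the code-to-message step concrete, here is a small sketch of a client-side lookup. It is written in Python purely for illustration (the front-end described above is JavaScript), and everything in it apart from the "duplicate-username" code is an assumption: the MESSAGES catalog, the "unknown-error" code, the fallback rules, and the shape of the API response.

```python
DEFAULT_LOCALE = "en"

# Message catalogs keyed by locale, then by error code (hypothetical contents).
MESSAGES = {
    "en": {
        "duplicate-username": "That user name is already taken; try choosing a different one.",
        "unknown-error": "Something went wrong. Please try again.",
    },
    "de": {
        "duplicate-username": "Dieser Benutzername ist bereits vergeben. Bitte wählen Sie einen anderen.",
    },
}


def message_for(code: str, locale: str) -> str:
    """Translate an error code into a message for the given locale."""
    # Try the user's locale first, then fall back to the default locale.
    for catalog in (MESSAGES.get(locale, {}), MESSAGES[DEFAULT_LOCALE]):
        if code in catalog:
            return catalog[code]
    # Unrecognized code: show a generic message rather than the raw code.
    return MESSAGES[DEFAULT_LOCALE]["unknown-error"]


if __name__ == "__main__":
    response = {"error": "duplicate-username"}  # the API returns a code, not prose
    print(message_for(response["error"], "de"))
```

The server never needs to know the user's locale; it only has to keep its codes stable and readable.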

Comprehensiveness

A critical problem with error handling is that it’s not possible to effectively handle all of the possible errors for an operation if you don’t know what all the possible errors are.

Modern programming languages and frameworks make it extremely easy for programmers to invent new types of errors at a moment’s notice. For example, in Python, all you have to do is derive a new subclass of “Exception”.

But what most programming languages and API specifications don’t provide is a way to automatically discover the list of all possible errors that could be produced by a given operation. (Java tried hard to make this happen with its “checked exceptions” concept, but in practice it was cumbersome enough that many programmers ended up working around it.)
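Both points can be seen in a few lines of Python. In this sketch, DuplicateUsernameError and register are hypothetical names chosen for the example: creating the new error type takes two lines, yet nothing in the function's signature tells a caller that the error can occur.

```python
import inspect


class DuplicateUsernameError(Exception):
    """Raised when a requested user name is already in use."""


def register(username: str, existing: set) -> None:
    if username in existing:
        raise DuplicateUsernameError(username)
    existing.add(username)


if __name__ == "__main__":
    # The signature lists parameters and the return type, but not the errors raised.
    print(inspect.signature(register))
```

A caller can only learn about DuplicateUsernameError by reading the source or the documentation, if any exists.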

Worse, many popular frameworks are designed such that low-level errors are propagated upwards without any translation or modification, despite the fact that the different layers generate errors that have different and incompatible formats. For example, one web server system I worked with would produce errors with completely different structures, depending on whether the error occurred at the database layer, the application layer, the data validation layer, or the request routing layer. That is, each of those four layers had a completely different idea of what an error looked like, and those errors would be transmitted directly to the client. This made the client-side error handling code extraordinarily complex.
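One way to spare the client this complexity is to convert every layer's errors into a single shape before they leave the server. The sketch below assumes four made-up error formats, one per layer; the ApiError class, the normalize function, and all of the field names are hypothetical and are not taken from the system described above.

```python
from dataclasses import dataclass, field


@dataclass
class ApiError:
    code: str    # stable, readable identifier such as "duplicate-username"
    layer: str   # which layer reported the problem
    detail: dict = field(default_factory=dict)


def normalize(layer: str, raw: dict) -> ApiError:
    """Convert one layer's native error shape into the shared ApiError shape."""
    if layer == "database":
        # Hypothetical shape: {"constraint": "users_username_key"}
        if raw.get("constraint") == "users_username_key":
            return ApiError("duplicate-username", layer)
        return ApiError("database-error", layer, raw)
    if layer == "validation":
        # Hypothetical shape: {"field": "email", "problem": "invalid"}
        return ApiError(f"{raw['problem']}-{raw['field']}", layer, {"field": raw["field"]})
    if layer == "routing":
        # Hypothetical shape: {"status": 404}
        code = "not-found" if raw.get("status") == 404 else "routing-error"
        return ApiError(code, layer, raw)
    # Application layer, assumed to already use readable codes: {"error": "..."}
    return ApiError(raw.get("error", "unknown-error"), layer)
```

With something like this in place, the client only ever has to understand one error structure.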

Solving this requires a great deal of diligence on the part of the developers responsible for generating the errors. Basically you have to document every possible error that can be generated by each individual operation. This is a lot easier if you plan up front how errors are going to be handled, establish some basic standards early on, and then stick to those standards.
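One possible way to keep such documentation honest is to make it part of the code. This is only a sketch under assumed names: the ErrorCode enum, the OPERATION_ERRORS table, the operation names, and every code other than "duplicate-username" are invented for the example.

```python
from enum import Enum


class ErrorCode(Enum):
    DUPLICATE_USERNAME = "duplicate-username"
    INVALID_EMAIL = "invalid-email"
    DISK_FULL = "disk-full"
    PERMISSION_DENIED = "permission-denied"


# The documented contract: every error each operation is allowed to produce.
OPERATION_ERRORS = {
    "create-account": {ErrorCode.DUPLICATE_USERNAME, ErrorCode.INVALID_EMAIL},
    "save-document": {ErrorCode.DISK_FULL, ErrorCode.PERMISSION_DENIED},
}


def error_response(operation: str, code: ErrorCode) -> dict:
    """Build an error payload, refusing codes the operation has not documented."""
    if code not in OPERATION_ERRORS[operation]:
        raise ValueError(f"{code.value!r} is not documented for {operation!r}")
    return {"operation": operation, "error": code.value}
```

If a developer tries to return an undocumented error, the mistake shows up during development rather than as a surprise in the client.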

Recovery

As mentioned, the final stage in error handling is for the user to take action in response to an error. Since most users aren’t technical experts, it is important to get the messaging right so that they can understand what happened, and more importantly, what they should do next.

It can be helpful if an error message includes suggestions for possible remedies. If a file cannot be saved because the disk is full, you can suggest that the user free up some space before trying again.

However, you have to be careful not to give the user bad advice. The remedy that you suggest may not work; whatever condition caused the error might only be a symptom of a deeper cause that you know nothing about.
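As a sketch of how a remedy might be attached only when the cause is reasonably clear, the function below maps a low-level OSError to an error code plus an optional remedy code. The function name and all of the code strings are hypothetical; errno.ENOSPC, errno.EACCES and errno.EPERM are standard constants from Python's errno module. When the cause is ambiguous, no remedy is offered, in keeping with the caution above.

```python
import errno
from typing import Optional


def describe_save_failure(err: OSError) -> dict:
    """Map a low-level OSError to an error code plus an optional remedy code."""
    remedy: Optional[str] = None
    if err.errno == errno.ENOSPC:
        # Disk full: suggesting that the user free up space is reasonably safe.
        code, remedy = "disk-full", "free-up-space"
    elif err.errno in (errno.EACCES, errno.EPERM):
        # Many possible causes here, so no specific remedy is suggested.
        code = "permission-denied"
    else:
        code = "save-failed"
    return {"code": code, "remedy": remedy}
```

Both values are codes rather than sentences, so the client can translate them into the user's language.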

You also don’t want to overwhelm the user with a lot of technical detail.

In some cases, the user may need to seek additional human assistance. This introduces yet a third audience for errors: customer support. End users will often include error details in their request for help; this may even be in the form of a screenshot. Ideally, there will be sufficient clues in the error message for the customer support agent to be able to deduce what has gone wrong, although those clues can be subtle so as not to distract the end user. For example, an error message may be phrased in a way that is distinct from other, similar errors.

Conclusion

Error messages are an important part of application design, and are part of a lifecycle process which results in the user being able to recover from the error. How errors are handled and presented can have a large impact on whether this process will be successful.
