Nobody likes errors.
Yet errors still exist. Segmentation faults happen, blue screens are still a thing, and APIs bounce requests with 5xx status codes. Nobody reasonable offers 100% uptime.
Most of the time, though, you are unlikely to encounter errors when testing new APIs. You might not even see them until you ship to production. Hence, it makes sense for service providers, including Pusher, to optimise the happy paths in their products.
But too much optimism can hurt — errors are not going away and ignoring them creates negative feedback loops.
Everything starts with one bad error. The next developer working on that module ignores it and makes a quick change without refactoring the error handling logic. The following developer does the same. The cycle continues for a while and the technical debt accumulates, eventually leading to…
Application encountered unexpected error: no error.
How can we break this vicious cycle? Can we approach software errors and happy paths with equal respect?
At Pusher, we believe errors are features. We think they deserve as much attention as happy scenarios.
To explain this philosophy, we’ve translated it into five principles of error design:
- Optimise for client simplicity.
- Design errors with handlers in mind.
- Abstractions should reduce the rate of errors.
- Fail early, fail safe.
- Provide actionable error details.
Not only do they reduce the frustration of dealing with errors, they also improve the reliability of our services.
1. Optimise for client simplicity
Pusher aims to encapsulate complex communication problems into services with much simpler, more manageable interfaces. This philosophy of shifting complexity towards servers reduces the amount of work for our customers, helping them achieve more with less code.
Following that rule, we must keep client error handling behaviour simple. We need to reduce the complexity in interactions between clients and servers to the minimum. Keeping protocols and API contracts simple is crucial.
This principle provides several benefits.
First, the service has one backend implementation, but it’s used by many client libraries. We can save plenty of engineering time by reducing the amount of logic on the client side.
Second, servers usually know more about the origin of failure. They can provide better guidance for clients, reducing the time to bring the system back to a functional state.
Last, troubleshooting complex network interactions is difficult. Encapsulating that logic into a component with a simple interface makes debugging easier for both client and server maintainers.
2. Design errors with handlers in mind
Most errors are developed within the layer that encounters the failure. There’s a class that extends StandardException, an if condition that handles the edge case, or a throw statement. Unit tests pass. Job done.
Developers forget that someone has to handle that error.
We must design errors with their receiver in mind. Determining proper abstractions to handle them is difficult, though.
Which components care about the error? Will they receive it directly or via other layers? Will they remediate the failure or escalate it? Will the application display the error to the end-user? How?
Finding and answering those questions is difficult, but it benefits everyone in the long term.
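As a rough illustration of answering those questions up front, the sketch below attaches handler-facing metadata to an error, so the receiving layer can decide whether to remediate, display, or escalate it without inspecting class names. Every name in it (`ServiceError`, `Audience`, `handle`) is hypothetical and chosen for this example; it is not taken from any Pusher library.

```python
from enum import Enum


class Audience(Enum):
    """Who is expected to act on the error (hypothetical categories)."""
    CLIENT_LIBRARY = "client_library"  # can be remediated automatically
    DEVELOPER = "developer"            # needs a code or configuration change
    END_USER = "end_user"              # should be shown in the application


class ServiceError(Exception):
    """An error that tells its receiver how it is meant to be handled."""

    def __init__(self, message: str, audience: Audience, retryable: bool):
        super().__init__(message)
        self.audience = audience
        self.retryable = retryable


def handle(error: ServiceError) -> str:
    """Decide what the receiving layer does, based only on the metadata."""
    if error.retryable:
        return "retry"
    if error.audience is Audience.END_USER:
        return f"display: {error}"
    return "escalate"
```

The point of the design is that the handler never needs to know which layer produced the error. It only reads the fields that were designed for it.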
3. Abstractions should reduce the rate of errors
Good abstractions make systems simpler. We don’t encapsulate components to make the resulting interface harder to understand. At least we try not to.
I have found error rates to be a great guideline for judging the value of abstractions. The best abstractions yield no errors, but distributed systems can’t operate that way.
For a single error category, the rate can:
- decrease, by filtering errors
- stay at the same level, by passing all errors further
- increase, by amplifying errors
Abstractions that decrease error rates are useful.
Abstractions that raise error rates signal problems.
Some abstractions don’t handle errors at all, and that’s ok. For example, transforming values between formats — assuming the mapping can’t fail — doesn’t change the error rate, but can still be useful.
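To make the filtering case concrete, here is a minimal sketch of an abstraction that lowers the error rate its callers observe: a retry wrapper around a flaky operation. The names (`TransientError`, `with_retries`, `make_flaky`, `error_rate`) are assumptions made for this example, and the flaky operation is simulated with a fixed sequence of outcomes so the behaviour is deterministic.

```python
import itertools
from typing import Callable, Iterator


class TransientError(Exception):
    """Hypothetical failure that may succeed when tried again."""


def with_retries(operation: Callable[[], str], attempts: int) -> str:
    """Filter transient errors: callers see one only if every attempt fails."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts, pass the error further
    raise AssertionError("unreachable")


def make_flaky(outcomes: Iterator[bool]) -> Callable[[], str]:
    """Build an operation that fails whenever the next outcome is False."""
    def operation() -> str:
        if not next(outcomes):
            raise TransientError("temporary failure")
        return "ok"
    return operation


def error_rate(call: Callable[[], str], calls: int) -> float:
    """Fraction of calls that surface a TransientError to the caller."""
    failures = 0
    for _ in range(calls):
        try:
            call()
        except TransientError:
            failures += 1
    return failures / calls


def one_in_four() -> Iterator[bool]:
    """Outcome sequence in which every fourth call fails."""
    return itertools.cycle([True, True, True, False])
```

With this simulated sequence, `error_rate(make_flaky(one_in_four()), 8)` is 0.25, while wrapping the same kind of operation in `with_retries(..., attempts=2)` brings the observed rate down to 0.0. An abstraction that added failure modes of its own on top of the operation would move the number in the opposite direction.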
4. Fail early, fail safe
Each request to Pusher goes through several layers of our stack. First, it hits external load balancers, then internal proxies, the service itself, eventually turning into a database operation.
In a well-designed system, each successive abstraction layer can dig deeper into the request. Edge proxies, such as AWS ELB, work at the TCP level: they don’t understand HTTP or WebSockets, let alone the service logic. Our intermediate proxies speak HTTP, but they are oblivious to service logic.
Abstraction layers might introduce new side-effects too. When a request fails before reaching an internal proxy, we know it hasn’t touched the service’s database. Conversely, when the service disappears between accepting the request and returning the response, we can’t guarantee the service hasn’t modified its database.
It’s important to terminate requests early to stop them from causing inconsistencies and incurring processing costs. Each error should come from the outermost layer that has enough context to handle the failure. For example, TCP load balancers can’t throttle request rates, as they don’t understand HTTP. However, our internal proxies support HTTP and know the appropriate rate limits, so they can throttle the number of processed requests.
This approach increases the number of safe-to-retry requests, letting clients recover without sacrificing consistency of data.
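The idea can be sketched with two layers: an HTTP-aware proxy that throttles, and a service that writes to a database. This is a simplified model under stated assumptions, not a description of our actual stack. `Proxy`, `Service`, `RequestRejected`, and the `safe_to_retry` flag are hypothetical names, and the rate limit is a plain counter with no time window.

```python
class RequestRejected(Exception):
    """Raised by a layer that refuses a request (hypothetical)."""

    def __init__(self, layer: str, safe_to_retry: bool):
        super().__init__(f"rejected by {layer}")
        self.layer = layer
        self.safe_to_retry = safe_to_retry


class Service:
    """Innermost layer: handling a request has a side effect."""

    def __init__(self) -> None:
        self.database: list[str] = []

    def handle(self, request: dict) -> str:
        self.database.append(request["body"])
        return "stored"


class Proxy:
    """Outer layer that understands requests well enough to throttle them."""

    def __init__(self, service: Service, limit: int):
        self.service = service
        self.limit = limit
        self.seen = 0

    def handle(self, request: dict) -> str:
        self.seen += 1
        if self.seen > self.limit:
            # The service never ran, so the database is untouched and
            # the client can retry without risking a duplicate write.
            raise RequestRejected("proxy", safe_to_retry=True)
        return self.service.handle(request)
```

Because the rejection happens before the service is called, the proxy can mark it as safe to retry with certainty. A failure raised after `Service.handle` has started could not make that promise.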
5. Provide actionable error details
Let’s be honest, we can’t handle all errors. Sometimes invalid data will make its way through the SDKs. Sometimes clients won’t be able to resolve inconsistencies without human intervention.
In those situations, it’s important to provide enough human-readable details about the error condition. Class names are not enough. People need error types, descriptions, parameters, or even better, steps to fix the problem.
Those details must be actionable. If a user submits invalid data, the service should provide steps to correct it. If a request is malformed, the error message should state that it’s most likely a client library bug and the developer should report it to the maintainer.
However, it’s crucial to avoid unnecessary details. If our databases are down, the user can only retry the request later. They don’t need to know there was a connection failure between an internal service and its database. They only need to know the error was temporary.
We need to design protocols, APIs, SDKs, and other tools to always provide that information. Failing in this area means developers will escalate problems to support, or even ditch the service out of frustration.
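One way to picture this is a translation step between internal failures and what the caller sees. The sketch below is illustrative only: the exception classes, the `to_public_error` function, and the payload fields (`type`, `message`, `action`, `retryable`) are assumptions made for this example rather than a documented Pusher error format.

```python
class DatabaseUnavailable(Exception):
    """Internal failure the caller cannot do anything about (hypothetical)."""


class ValidationFailed(Exception):
    """The caller submitted data that can be corrected (hypothetical)."""

    def __init__(self, field: str, reason: str):
        super().__init__(f"{field}: {reason}")
        self.field = field
        self.reason = reason


def to_public_error(error: Exception) -> dict:
    """Translate an internal failure into actionable, public details."""
    if isinstance(error, ValidationFailed):
        return {
            "type": "invalid_request",
            "message": f"Field '{error.field}' {error.reason}.",
            "action": "Correct the field and resend the request.",
            "retryable": False,
        }
    if isinstance(error, DatabaseUnavailable):
        # Hide the internal topology: the caller only needs to know
        # that the problem is temporary.
        return {
            "type": "temporarily_unavailable",
            "message": "The service is temporarily unavailable.",
            "action": "Retry the request later.",
            "retryable": True,
        }
    return {
        "type": "internal_error",
        "message": "An unexpected error occurred.",
        "action": "Contact support if the problem persists.",
        "retryable": False,
    }
```

Every branch returns the same fields, so a client can always find an `action` to show or follow, whatever went wrong on the server.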
Love thy errors
On paper, errors barely exist. In reality, they can define the developer experience as much as actual features.
The five principles outlined above define Pusher’s error philosophy, but they also match our approach to designing all features for our realtime platform:
- We optimise for client simplicity.
- We design features with end-users in mind.
- We add abstractions to reduce the amount of work downstream.
- We try to handle business logic as early as possible.
- We also provide actionable documentation for features.
APIs should work for their users. Even when they don’t work.