Handling Web Service Failures Gracefully
Web services fail. It’s a fact of life — even world-class teams have trouble with their systems from time to time. Given that failures are unavoidable, a well-engineered website should degrade gracefully in the presence of failures with minimal impact on the customer’s experience. Graceful degradation is particularly important for an e-commerce site like mancrates.com, where every second of downtime affects the financial bottom line.
A modern e-commerce site that ships a physical product (as Man Crates does) depends on several web services, such as:
- Payment processing
- Tax calculation
- Address verification
- Shipping quotes
- Social integrations (Twitter, Facebook, etc.)
This post describes our take on handling web service failures at Man Crates.
What is a failure, exactly?
Surprisingly, it’s not trivial to determine when a web service is down or has failed. It’s easy to assume that if a web service is down, it’s clearly down, i.e., a total failure. During a total failure, the web service returns an error for every request made.
However, that isn’t always the case for every failure. Here are some other common scenarios:
- Partial failure: The service responds with a mix of error and success. If the error rate is high enough to be noticeable by a customer (say, 20% or more), then the service has effectively failed.
- Performance failure: The service responds, but lethargically. If the response is slow enough to be noticeable by a customer, then the service has effectively failed. A huge problem with this type of failure is that it ties up the site’s resources (memory and/or threads) while the site server waits for a response from the slow service.
It’s important to consider these scenarios and decide which to include when analyzing your site’s failure tolerance. Doing so establishes the failure model for your site.
At Man Crates, total, partial, and performance failures are all part of our failure model.
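To make the three failure types concrete, here is a minimal Python sketch that classifies a batch of recent calls. The 20% error threshold comes from the example above; the `Call` record, the `classify` function, the two-second "slow" cutoff, and the 20% slow-call threshold are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Call:
    ok: bool        # True if the service returned a successful response
    seconds: float  # elapsed time for the call


def classify(calls, error_threshold=0.20, slow_seconds=2.0, slow_threshold=0.20):
    """Label a batch of recent calls as total, partial, performance, or healthy."""
    if not calls:
        return "healthy"
    error_rate = sum(1 for c in calls if not c.ok) / len(calls)
    if error_rate == 1.0:
        return "total"        # every request failed
    if error_rate >= error_threshold:
        return "partial"      # a noticeable share of requests failed
    slow_rate = sum(1 for c in calls if c.ok and c.seconds > slow_seconds) / len(calls)
    if slow_rate >= slow_threshold:
        return "performance"  # responses arrive, but too slowly
    return "healthy"
```

The order of the checks matters in this sketch: errors are considered before latency, so a service that is both slow and failing is reported as a partial or total failure.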
What should a site do when a service fails?
To begin, we looked at the components in our site that used web services and considered each usage, always keeping in mind how a failure would affect the customer’s experience.
We found that the best way to handle a failure often depends on the specific situation. Here’s a breakdown of web services we use at Man Crates, and what we do when each service fails:
- Credit card payment processor setup: Fail over to a secondary credit card payment processor.
- Credit card form submit: Display an error, fail over to a secondary credit card payment processor, and ask the customer to re-enter their credit card details.
- PayPal setup: Skip displaying the PayPal button.
- PayPal form submit: Display an error and suggest an alternative payment method to the customer.
- Address verification: Fall back to simple address verification, verify address later in a background job, and notify the customer via email if there are problems.
- Shipping quotes: Display shipping flat rates to the customer instead of realtime shipping quotes.
- Tax calculation: Skip displaying the estimated tax at checkout, and retry fetching taxes later in a background job.
- Social integrations: Skip displaying social integration actions or links.
Our analysis found that there are several different approaches to dealing with a failed service:
- Fail over to an alternative web service
- Retry later in a background job
- Display a message to the customer
- Skip entirely
- Fall back to a local method that doesn’t rely on web services
Given the diversity of approaches, a single site-wide system for dealing with failures wouldn’t work well. Instead, we focused on cross-cutting concerns: logging, monitoring, policy, and detecting faulty services.
Detecting failures with the circuit breaker pattern
A circuit breaker is a proxy object that intercepts calls to a web service and monitors its status by recording the error rate over time.
If the error rate is normal, then the circuit breaker is in the “on” (or “closed”) state, and the breaker allows requests made to the web service to execute as normal.
If the error rate rises above a configured threshold, the circuit breaker enters the “off” (or “open”) state and disables the web service. While the service is disabled, the circuit breaker immediately returns an error for every request made to the service.
After some time has elapsed since the last failure, the circuit breaker re-enters the “on” state, and resumes allowing requests through to the web service. If these “test” requests fail, then the circuit breaker turns the service “off” again.
A circuit breaker allows the app to skip, avoid, or fail over from services that fail intermittently.
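The state transitions described above can be sketched in a few lines of Python. This is an in-memory illustration of the pattern, not the implementation we run in production; the class name, the parameter names (`threshold`, `window`, `min_calls`, `cooldown`), and their default values are all assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is "off" and the call is skipped."""


class CircuitBreaker:
    def __init__(self, threshold=0.5, window=10, min_calls=3,
                 cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold  # error rate above which the breaker trips
        self.window = window        # number of recent calls to remember
        self.min_calls = min_calls  # calls needed before the rate is trusted
        self.cooldown = cooldown    # seconds to keep the service disabled
        self.clock = clock
        self.results = []           # recent outcomes, True means error
        self.opened_at = None       # time the breaker tripped, or None

    def is_open(self):
        if self.opened_at is None:
            return False
        if self.clock() - self.opened_at >= self.cooldown:
            # Cooldown elapsed: start fresh and let "test" requests through.
            self.opened_at = None
            self.results = []
            return False
        return True

    def call(self, fn):
        if self.is_open():
            raise CircuitOpenError("service is disabled")
        try:
            value = fn()
        except Exception:
            self._record(True)
            raise
        self._record(False)
        return value

    def _record(self, is_error):
        self.results = (self.results + [is_error])[-self.window:]
        rate = sum(self.results) / len(self.results)
        if len(self.results) >= self.min_calls and rate > self.threshold:
            self.opened_at = self.clock()  # trip: disable the service
```

Callers catch `CircuitOpenError` (or any error from the service itself) and apply whichever fallback fits the situation.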
Our take: Circuit breakers at Man Crates
At Man Crates, we apply the circuit breaker pattern whenever a call to a web service is made from our site. To help out, we have a small library, circuit, that implements the circuit breaker pattern.
circuit also logs web service calls, web service errors, circuit breaker state changes, and performance metrics (e.g., the elapsed time for each call).
Below is an example that normally uses EasyPost to get shipping quotes, but falls back to a flat shipping quote scheme when the circuit breaker determines that EasyPost has failed:
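A rough Python sketch of that fallback follows. It does not use circuit’s actual API: `shipping_quotes`, `FLAT_RATES`, and the two callables passed in (`fetch_live_quotes` for the EasyPost call, `breaker_is_open` for the breaker state) are hypothetical names, and the flat rates are made-up values.

```python
# Hypothetical flat rates shown when realtime quotes are unavailable.
FLAT_RATES = {"standard": 9.99, "expedited": 24.99}


def shipping_quotes(fetch_live_quotes, breaker_is_open):
    """Return (quotes, source), where source is "live" or "flat"."""
    if breaker_is_open():
        # The breaker has disabled the quote service: skip the call entirely.
        return dict(FLAT_RATES), "flat"
    try:
        return fetch_live_quotes(), "live"
    except Exception:
        # The call itself failed: fall back for this request.
        return dict(FLAT_RATES), "flat"
```

The customer sees a shipping price either way; only the source of the number changes.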
circuit records a fixed-size window of recent failures and successes in Redis, and uses that information to calculate the error rate. It also has an override for each service that allows us to enable and disable services manually.
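The fixed-size window and the manual override can be illustrated with a short Python sketch. It keeps the window in process memory with `collections.deque` rather than in Redis, and the `ErrorWindow` class, its method names, and the default threshold are assumptions, not circuit’s interface.

```python
from collections import deque


class ErrorWindow:
    def __init__(self, size=100):
        # deque(maxlen=...) silently drops the oldest outcome once it is full.
        self.outcomes = deque(maxlen=size)
        # None means automatic; True or False forces the service on or off.
        self.override = None

    def record(self, success):
        self.outcomes.append(bool(success))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def enabled(self, threshold=0.5):
        if self.override is not None:
            return self.override  # manual switch wins over the statistics
        return self.error_rate() <= threshold
```

Keeping the window in a shared store such as Redis lets every app server see the same error rate; an in-memory window like this one is only visible to a single process.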
Retrying a web service call only makes sense when the request is expected to succeed — sending a retry that’s expected to fail usually results in wasted resources. By that logic, if a service has a very low error rate, then it’s worthwhile to retry a failed call.
A site can use the same error rate statistics collected by a circuit breaker to determine when to send a retry. If the error rate is below a configured threshold, the site sends a retry. This is a minor, but useful, extension to the original circuit breaker pattern.
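As a sketch of that extension in Python: the helper below retries a failed call only while the observed error rate is under a threshold. The function name, the `error_rate` callable, and the default values for `retry_threshold` and `max_retries` are assumptions chosen for illustration.

```python
def call_with_retry(fn, error_rate, retry_threshold=0.05, max_retries=1):
    """Call fn, retrying on failure only if the service looks healthy.

    error_rate is a callable returning the current error rate (0.0 to 1.0),
    for example the statistic a circuit breaker already tracks.
    """
    attempts = 0
    while True:
        try:
            return fn()
        except Exception:
            # A high error rate means a retry is likely to fail too, so give up.
            if attempts >= max_retries or error_rate() >= retry_threshold:
                raise
            attempts += 1
```

Reading the error rate at the moment of failure, rather than once up front, means the decision reflects the most recent statistics.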
Explicitly configuring timeouts for your site mitigates the problems that slow web services cause. For example, say a lag of more than 5 seconds when fetching shipping quotes causes customers to abandon your site’s checkout flow. Setting the timeout for the shipping quote service to less than 5 seconds can keep you from losing customers, assuming your site is designed to handle failures.
Used in combination with a circuit breaker, properly configured timeouts cause the circuit breaker to trip and disable the slow service with minimal impact on the customer.
At Man Crates, we set the default timeout to two seconds, so that a customer waits no more than about two seconds for a web service before the fallback takes over.
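A minimal Python sketch of a timeout with a fallback, using the standard library’s `concurrent.futures`. The two-second default matches the value above; `call_with_timeout`, the shared executor, and its pool size are assumptions for illustration.

```python
import concurrent.futures

DEFAULT_TIMEOUT_SECONDS = 2.0

# Shared worker pool; the size is an arbitrary choice for this sketch.
_executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, fallback, timeout=DEFAULT_TIMEOUT_SECONDS):
    """Run fn, but hand over to fallback if it is too slow or raises."""
    future = _executor.submit(fn)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # The caller stops waiting here, but the worker thread keeps running
        # until fn returns, so it still occupies a slot in the pool.
        return fallback()
    except Exception:
        return fallback()
```

Because the abandoned call still ties up a worker, a wrapper like this only limits how long the customer waits. Setting a timeout on the HTTP client itself is what actually frees the resources held by a slow service.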
Bottom line: Actions Man Crates took to handle failures gracefully
- Minimize failure points. When possible, move web service calls to a background job, or eliminate them entirely.
- Avoid single points of failure. Use a fallback for each failure point that fits the specific use-case.
- Track error rates. Monitor error rates and use the circuit breaker pattern to determine if a web service is faulty.
- Configure timeouts correctly. Don’t force your customers to wait for slow web services.