Engineering For Failure

Boris Cherkasky
Riskified Tech
Sep 3, 2020 · 7 min read

Not so long ago, our systems were simple: we had one machine, with one process, probably no more than one external datastore, and the entire request lifecycle was processed and handled within this simple world.

Our users have also come to expect a certain SLA: a 2-second page load time may have been acceptable a few years ago, but waiting more than a second for an Instagram post is unthinkable today.

(Warning: buzzwords ahead)

As systems get more complex, with strict latency requirements and distributed infrastructure, an uninvited guest creeps into our systems: request failure.

With each additional request to an external service within the lifecycle of a user request, we’re adding another chance for failure. With every additional datastore, we’re open to an increased risk of failure. With every feature we add, we risk increasing our latency long-tail, resulting in a degraded user experience in some portion of the requests.

In this article, I’ll cover some of the basic ways we at Riskified handle failures in order to provide maximal uptime and optimal service to our customers.

Failure by example

Every external service, no matter how good and reliable, will fail at some point. We at Riskified learned this the hard way when we experienced short failures with a managed, highly available service that almost resulted in data loss. That incident taught us the hard lesson that request failures should be handled gracefully.

In Google’s superbly written Site Reliability Engineering book, the authors describe The Global Chubby Planned Outage: a service so reliable that its customers used it without accounting for the possibility of failure, and some even depended on it without any essential need, simply because it was always available.

As a result, Chubby, Google’s distributed lock service, was given a Service Level Objective (SLO) for uptime, and in any quarter where actual uptime exceeds that target, the team responsible for the service intentionally takes it down. The goal is to teach users that the service is not fail-safe and that they must account for external service failures in their products.

So how should engineers handle request failures? Let’s cover some common patterns:

Retrying

Retrying a failed request can, in many cases, solve the problem. It’s the obvious solution when failures are sporadic and transient. Set a reasonable timeout for each request you send to an external resource, choose the number of retries you want, and you’re done! Your system is now more reliable.

Something to consider, however, is that additional retries can cause additional load on the system you’re calling, and make an already failing system fail harder.
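To make this concrete, here’s a minimal retry helper in Python. It’s a sketch under assumptions (the URL, timeout, and retry counts are illustrative, not Riskified’s actual configuration): each attempt is bounded by a timeout, and retries are spaced with exponential backoff plus jitter so many clients don’t hammer a struggling service in lockstep.

```python
import random
import time

import requests


def get_with_retries(url, retries=3, timeout=1.0, base_delay=0.1):
    """GET a URL, retrying transient failures with backoff and jitter."""
    for attempt in range(retries + 1):
        try:
            # Bound every attempt so a hung connection can't stall the caller.
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.RequestException:
            if attempt == retries:
                raise  # out of retries; let the caller apply its own fallback
            # Exponential backoff plus jitter spreads retries out over time.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))
```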

Implementing and configuring a circuit-breaking mechanism, which stops sending requests to a dependency that keeps failing, might also be worth considering. You can read more about it in this interesting Shopify engineering blog post.
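The core of the pattern fits in a few lines. The sketch below is a toy illustration, not the Semian-style tooling the Shopify post describes: after enough consecutive failures the breaker “opens” and fails fast without calling the dependency, then lets a trial call through once a cooldown has elapsed.

```python
import time


class CircuitBreaker:
    """Fail fast after repeated errors instead of piling load on a sick service."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # time the circuit last opened

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: failing fast")
            self.opened_at = None  # cooldown over: allow a trial call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # a success closes the circuit again
        return result
```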

Prefetching — Fail outside of the main flow

One of the best ways to avoid failure when calling an external service is not to call it at all.

Let’s say we’re implementing an online store — we have a user service and an order service, and the order service needs the current user’s email address in order to send them an invoice for their last purchase.

The fact that we need the email address doesn’t mean we have to query the user service while the user is logged in and waiting for order confirmation. It just means that an email address should be available by then.

In cases of fairly static data, we can easily pre-fetch all (or some) user details from the user service in a background process. This way, the email is already available during order processing, and we don’t need to call the external service at all. If the background process fails to fetch user details, the failure stays outside the main processing flow and is “hidden” from the user.
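A minimal sketch of the idea, assuming a hypothetical user_service_client with a list_users() call and a plain in-memory dictionary as the cache (a real implementation would likely persist the prefetched data):

```python
import threading
import time

user_emails = {}  # user id -> email, refreshed in the background


def refresh_user_cache(user_service_client, interval=300):
    """Periodically prefetch user emails so the order flow never has to
    call the user service on the hot path."""
    while True:
        try:
            for user in user_service_client.list_users():
                user_emails[user["id"]] = user["email"]
        except Exception:
            # The failure is invisible to users; we simply keep serving the
            # slightly stale cache and try again next cycle.
            pass
        time.sleep(interval)


def start_prefetcher(user_service_client):
    threading.Thread(
        target=refresh_user_cache, args=(user_service_client,), daemon=True
    ).start()
```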

In his talk, Jimmy Bogard explains it better than I do (the link starts from his explanation about prefetching, although the whole talk is great!)

Best efforting

In some cases, we should simply embrace failure and continue processing without the data we were trying to get. You’re probably wondering: if we don’t need the data, why query it at all?

The best example we have for this at Riskified is a Redis-based distributed locking mechanism that we use to block concurrent transactions in some cases. Since we’re a low-latency oriented service, we didn’t want a latency surge in lock acquisition to push us past the SLA requirements of our customers. We set a very strict timeout on acquiring the lock, and when that timeout is reached, we continue unlocked, i.e., we prefer a rare race condition over increased latency for our customers. In other words, locking is a “nice to have” feature in our process.
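A sketch of that best-effort locking using the redis-py client; the key format, TTL, and timeout values are illustrative rather than our production configuration. The important property is that any error or timeout while acquiring the lock degrades to running unlocked instead of failing the transaction:

```python
import redis

# A very small socket timeout: if Redis is slow, we'd rather skip locking
# than blow our latency budget.
client = redis.Redis(host="localhost", socket_timeout=0.05)


def process_transaction(tx_id, handler):
    lock_key = f"lock:{tx_id}"
    acquired = False
    try:
        # SET with NX and a TTL is the standard single-instance Redis lock.
        acquired = client.set(lock_key, "1", nx=True, px=5000)
    except redis.RedisError:
        pass  # best effort: proceed unlocked, accepting a possible race
    try:
        return handler(tx_id)
    finally:
        if acquired:
            try:
                client.delete(lock_key)
            except redis.RedisError:
                pass  # the TTL will release the lock anyway
```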

Falling back to previous or estimated results

In some cases, you may be able to use previous results or sub-optimal estimations to handle a request while other services are unavailable.

Let’s say we’re implementing a navigation system, and one of the features we want is traffic jam predictions.

We’d probably have a JammingService (not to be confused with the Bob Marley song), that we’d call with our route to estimate the probability of traffic jams. When this service is failing, we might choose a sub-optimal course of action, while still serving the request:

  1. Using previous results: we might cache some “common” jam predictions and serve them; we might even pre-fetch jam estimations for the most commonly used routes of some of our users.
  2. Estimate a result: Our service can hold a mapping of mean jam estimation per region and serve that estimation for all requests for routes in the region.

In both examples, the solution is obviously not optimal, but it’s probably better than failing the request. The general idea is to produce a simple approximation of the result we were trying to get from the external resource.
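Both options might look roughly like this sketch, with a hypothetical jamming_service client and illustrative data structures:

```python
# Fallback data, refreshed out-of-band while the JammingService is healthy.
cached_predictions = {}  # route id -> last known jam prediction
regional_means = {}      # region name -> mean jam estimation


def jam_estimate(route, jamming_service):
    try:
        prediction = jamming_service.predict(route)
        cached_predictions[route.id] = prediction  # remember for later
        return prediction
    except Exception:
        # 1. Previous results: serve the last prediction we saw for this route.
        if route.id in cached_predictions:
            return cached_predictions[route.id]
        # 2. Estimated result: fall back to the mean estimation for the region.
        return regional_means.get(route.region, 0.5)  # neutral default
```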

Delaying a response

If the product’s business requirements allow it, it’s possible to delay the processing of a request until the problem with the external resource is resolved.

As an example, let’s take the JammingService from the previous solution: when it fails, we can queue incoming requests on an internal queue and return a response telling the user that the request cannot be processed at the moment, but that a result will be delivered as soon as possible, via a push notification to the user’s phone or via a webhook, for example.

This is possible mostly in asynchronous services, where the request and the response can be decoupled. (If you can design the service to be asynchronous to begin with, even better!)
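As a sketch, with Python’s standard queue module standing in for a real message broker, and a notify_user callback standing in for the push notification or webhook:

```python
import queue

pending_requests = queue.Queue()


def handle_route_request(route, jamming_service, notify_user):
    try:
        return {"status": "done", "jam": jamming_service.predict(route)}
    except Exception:
        # Park the request and tell the caller we'll answer asynchronously.
        pending_requests.put((route, notify_user))
        return {"status": "accepted", "detail": "result will be pushed later"}


def drain_pending(jamming_service):
    """Run periodically once the JammingService is healthy again."""
    while not pending_requests.empty():
        route, notify_user = pending_requests.get()
        notify_user(jamming_service.predict(route))  # push/webhook delivery
```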

Implement simplified fallback logic

For some mission-critical features, a more complex solution is needed. In some cases, the external service is so critical to our own that we’d otherwise have to fail the request whenever it fails.

One of the solutions we devised for such critical external resources is to use “simplified” in-process versions of them. In other words, we re-implement a simplified version of the external service as a fallback within our own service, so that when the external service fails, we still have some data to work with and can successfully process the request.

As an example, let’s go back to our navigation system. Traffic estimation might be such an important feature that we want every request to have a fairly good traffic jam estimation, even when our JammingService is down.

Our JammingService probably uses various complex machine learning algorithms and external data sources. In our simplified fallback version, we might choose, for example, to implement it as a simple greedy best-first algorithm with basic optimizations.

In this case, even if there’s a failure of the JammingService, some fairly good traffic jam estimation is available within our navigation system.
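In code, the pattern boils down to a wrapper that prefers the full service and falls back to the in-process approximation. Both implementations below are placeholders for illustration, not how we actually built it:

```python
def estimate_jam(route, jamming_service):
    """Prefer the full ML-backed JammingService; fall back in-process."""
    try:
        return jamming_service.predict(route)
    except Exception:
        return simplified_jam_estimate(route)


def simplified_jam_estimate(route):
    # A deliberately crude stand-in for the simplified algorithm: score
    # each road segment by its congestion at departure time and average.
    scores = [segment.congestion_factor(route.departure_time)
              for segment in route.segments]
    return sum(scores) / len(scores) if scores else 0.0
```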

This isn’t optimal, since we now need to maintain two versions of the same feature, but when the feature is critical enough and the dependency unstable enough, it can be worth it.

Closing thoughts — Failing as a way of life

At school, I was quite a bad student, so failing is not new to me. It taught me that, as an engineer, anything I lay my hands on might fail, and simply catching the exception is not enough: we need to do something meaningful when we catch it, because we still need to provide some level of service.

I encourage you to dedicate a significant part of your time to failure handling, and to make it a habit to declare your systems production-ready only once failures are handled in a safe, business-oriented way.

As always, you’re welcome to find me at my Twitter handle: @cherkaskyb
