A starting point for setting timeouts

Choosing timeout values for external service calls isn’t easy. There’s usually black magic involved and sometimes we’re just guessing. This post details the approach we used at Bluestem to choose better values and make our systems more resilient to performance issues and outages in our backend services.

Where we were

We used to choose timeout values in one of two ways:

  1. Copy/paste the nearest Hystrix settings from another service
  2. Set the timeouts to be super high so circuits never trip

Method 1 is bad because every service is different: getting inventory for a product will probably have a wildly different response time than submitting an order. Method 2 is bad because high timeouts and oversized thread pools completely negate the benefits of using Hystrix, and if we aren’t getting any benefits, the added complexity isn’t worth it.

In the past year we realized that external services were an Achilles heel in our systems, and that long timeouts in particular were a major cause of cascading failures in the web stack.
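
To make the cascading-failure point concrete, here is a rough back-of-the-envelope sketch using Little's Law (concurrency = arrival rate × time each request is held). The traffic figure and the two timeout values are made up for illustration and are not measurements from our systems.

```python
def threads_held(requests_per_second: float, timeout_seconds: float) -> float:
    """Little's Law: calls in flight = arrival rate * time each call is held.

    When a backend stops responding, every call is held for the full timeout,
    so this is roughly how many caller threads end up blocked.
    """
    return requests_per_second * timeout_seconds


# Hypothetical traffic: 50 calls per second to a backend that has stopped responding.
rate = 50.0
for timeout in (30.0, 1.0):
    blocked = threads_held(rate, timeout)
    print(f"timeout of {timeout:.0f}s -> about {blocked:.0f} threads blocked")
```

With these assumed numbers, a 30 second timeout ties up about 1,500 threads while a 1 second timeout ties up about 50, which is why a hung dependency with a long timeout can exhaust the caller's threads and spread the failure upstream.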

Tuning timeouts

  1. Load test to find the service’s behavior under maximum steady-state load (i.e. how much load can this service sustain indefinitely before it tips?).
  2. Gather empirical data on throughput, latency, and error rate from those load tests.
  3. Set baseline timeouts (and alerts) based on this data.
  4. Refine timeouts based on continued data-gathering to bring error rates to acceptable levels.
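
Step 2 above might look something like the following sketch. It assumes the load test tool can export one record per request with a latency and a success flag; the `Sample` type, the `summarize` helper, and the nearest-rank percentile method are illustrative choices, not the tooling we actually used.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One request from a load test run (hypothetical record shape)."""
    latency_ms: float
    ok: bool


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples, duration_seconds):
    """Reduce raw samples to throughput, latency percentiles, and error rate."""
    latencies = [s.latency_ms for s in samples]
    errors = sum(1 for s in samples if not s.ok)
    return {
        "throughput_rps": len(samples) / duration_seconds,
        "p50_ms": percentile(latencies, 50),
        "p99_ms": percentile(latencies, 99),
        "error_rate": errors / len(samples),
    }


# Made-up data: 100 requests over 10 seconds, latencies 1..100 ms, 2 failures.
samples = [Sample(latency_ms=float(i), ok=(i > 2)) for i in range(1, 101)]
print(summarize(samples, duration_seconds=10))
```

Running the summary once per interval of the load test (say, once a minute) gives a series of 99th percentile readings rather than a single number, which is what a baseline calculation needs.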

Baseline values

What is the 99th percentile response time?

What’s a standard deviation?

Why three standard deviations?

Big Caveat: This assumes that our 99th percentile response time data points have a normal distribution. They don’t, but it’s a close enough approximation to give us a starting point.
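
As a sketch of how the pieces in this section could fit together, the snippet below takes the baseline to be the mean of the observed 99th percentile response times plus three standard deviations. That exact formula is an assumption read from the headings and the caveat above, and the readings are made-up numbers. Under a normal distribution, roughly 99.7% of values fall within three standard deviations of the mean, which is the usual reason for picking three.

```python
import statistics


def baseline_timeout_ms(p99_samples_ms, sigmas=3.0):
    """Assumed baseline: mean of the p99 readings plus `sigmas` standard deviations."""
    mean = statistics.mean(p99_samples_ms)
    stdev = statistics.stdev(p99_samples_ms)  # sample standard deviation
    return mean + sigmas * stdev


# Hypothetical p99 readings in milliseconds, one per load-test interval.
p99_readings = [180, 200, 220, 190, 210]
print(f"baseline timeout: about {baseline_timeout_ms(p99_readings):.0f} ms")
```

For these readings the mean is 200 ms and the sample standard deviation is about 15.8 ms, so the baseline comes out at roughly 247 ms. Since the real data isn't normally distributed, treat the result as a starting point to refine, not a final value.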

Refinements

Results

Resiliency

Service Level Objectives

Alerts

The views expressed in this article are solely my own and do not necessarily reflect the views of Bluestem Brands, Inc.

Senior Software Development Engineer @ Amazon. Trumpet player, drum corps enthusiast.

Kevan Ahlquist