How to shield customers from upstream adversity with Varnish Cache

Tom S
John Lewis Partnership Software Engineering
4 min read · May 24, 2022

Hi, I’m Tom, a software engineer at John Lewis & Partners. One of my primary focuses is improving the performance and resilience of the johnlewis.com website.

Introduction

Varnish Cache is a popular reverse HTTP proxy and HTTP accelerator. It sits in front of our web/application services and uses a range of caching, prefetching and compression techniques to reduce load on upstream backend services and deliver improved client response times to our customers.

In addition to enhancing performance, a perhaps lesser-known feature of Varnish is that it can be configured to shield clients from upstream errors and outages. If an upstream service is temporarily unavailable or starts returning errors, Varnish can fall back to serving successful stale responses from its cache, so long as the grace time of the cached objects has not expired.

Controlling the Cache

Example cache-control HTTP header
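The screenshot of the header isn’t reproduced here; based on the values discussed below, it would look something like this (the max-age value matching s-maxage is an assumption):

```http
Cache-Control: max-age=600, s-maxage=600, stale-while-revalidate=172800
```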

The role of the cache-control header is to instruct downstream caches (browsers and proxy caches like Varnish) on how to cache responses. The value of s-maxage tells Varnish the number of seconds the response can be stored in cache, otherwise known as the time to live (TTL). The value of max-age is used by browser caches (and also by proxy caches if s-maxage is not specified). Internally, Varnish uses the value to set the beresp.ttl variable.

The stale-while-revalidate value informs the cache how long it’s acceptable to reuse a stale response for. In the above example, the response becomes stale after 10m (600s); the cache is then allowed to reuse it for any requests made within the following 2 days (172800s). This is also known as a grace period; internally, Varnish uses the value to set the beresp.grace variable. The first request served from a stale object during the grace period triggers an asynchronous background fetch, making the cached object fresh again without passing the latency cost of revalidation on to the client.
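For backends that can’t emit the header themselves, the same effect can be achieved directly in VCL. A minimal sketch mirroring the example values (illustrative only, not our actual configuration):

```vcl
sub vcl_backend_response {
    # Equivalent to max-age/s-maxage=600, stale-while-revalidate=172800
    set beresp.ttl = 600s;
    set beresp.grace = 2d;
}
```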

Additionally, if the backend service is unavailable or slow to respond, clients will be shielded from such responses until the grace period expires, hopefully providing the service with adequate time to recover, or engineers with adequate time to fix an issue, before it adversely affects the customer experience. It shouldn’t be overlooked that this is a trade-off: whilst it provides faster responses and greater resilience, it also increases the chances of serving stale responses. Setting a high stale-while-revalidate duration is a judgement call and may not be appropriate for responses containing highly dynamic server-side rendered data, where freshness is paramount. We’ve tended to maximise the use of the feature on responses containing relatively static data, such as our editorial content and category landing pages.
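One common way to soften this trade-off, following the pattern in the Varnish grace documentation, is to cap the effective grace while the backend is healthy, so the long grace only kicks in during an outage. A sketch (the 60-second cap is an illustrative assumption):

```vcl
import std;

sub vcl_recv {
    if (std.healthy(req.backend_hint)) {
        # Backend is up: limit how stale a served object may be
        set req.grace = 60s;
    }
    # When the backend is unhealthy, the full beresp.grace
    # (e.g. 2 days) applies.
}
```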

Varnish Configuration

By default, Varnish will fall back to serving a stale response during the grace period if the backend service can’t be connected to, or if the request times out. The behaviour can be extended to 5xx errors with the following code in sub vcl_backend_response:

sub vcl_backend_response {
    if (beresp.status >= 500 && bereq.is_bgfetch) {
        return (abandon);
    }
    # ...
}

beresp.status contains the status code returned from the backend service; bereq.is_bgfetch is true when the backend request was sent asynchronously after the client had already received a cached response; and return (abandon) instructs Varnish to discard the backend response, leaving the stale cached object in place.

Unit Testing

Caching logic can quickly grow in complexity over time. To have confidence that the logic remains valid, we think it’s critically important to maintain a comprehensive suite of unit tests around it, continuously verifying that the caching configuration aligns with expected behaviour. To test Varnish in isolation, we start it up within a Docker Compose network, using Wiremock to stub out the backend services.
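Such a network might be wired up roughly as follows (a sketch only; image tags, ports and file paths are assumptions, not our actual setup):

```yaml
services:
  varnish:
    image: varnish:stable
    volumes:
      - ./default.vcl:/etc/varnish/default.vcl:ro
    ports:
      - "8080:80"   # tests hit Varnish here
  wiremock:
    image: wiremock/wiremock:latest
    ports:
      - "9090:8080" # stubbed backend services
```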

@Test
fun `stale page should be returned when backend errors within stale while revalidate period`() {

    browseAppStub.stub(url, "bc page", 200, CacheControl(1, 1, 10))

    given()
        .header("X-Route", "browse-app")
        .get(url)
        .then()
        .assertThat()
        .header("Cache-Control",
            equalTo("max-age=1,s-maxage=1,stale-while-revalidate=10"))
        .body("html.body", containsString("bc page"))

    browseAppStub.stub(url, "error", 500)

    // Fetch within grace but outside TTL
    sleep(1000) // 1s

    given()
        .header("X-Route", "browse-app")
        .get(url)
        .then()
        .assertThat()
        .statusCode(200)
        .body("html.body", containsString("bc page"))

    // Fetch outside grace
    sleep(10000) // 10s

    given()
        .header("X-Route", "browse-app")
        .get(url)
        .then()
        .assertThat()
        .statusCode(500)
}

Concluding Thoughts

A stale cached response, if available, will more often than not provide a better customer experience than an error page. Where server-side rendered content is sufficiently static to allow for a high stale-while-revalidate duration, falling back on the cache can be a useful tool to have in your resilience back pocket.

At the John Lewis Partnership we value the creativity of our engineers to discover innovative solutions. We craft the future of two of Britain’s best loved brands (John Lewis & Waitrose).

We are currently recruiting across a range of software engineering specialisms. If you like what you have read and want to learn how to join us, take the first steps here.
