Managing Distributed Varnish Caches

Michael Grosser
Neighborhoods.com Engineering
Jul 26, 2018 · 5 min read

If you’re not familiar with Varnish, it’s caching software that works as an HTTP reverse proxy. Varnish can be an incredibly powerful tool for improving the speed at which you serve content to users as well as significantly reducing the traffic load driven to your web application. The rest of this article will assume you have a working familiarity with Varnish and focus on a method we’ve been using at Neighborhoods.com to manage Varnish instances in auto-scaling environments on AWS.

The Problem Space

One potential challenge with Varnish is scale. The free version of Varnish does not offer a solution for shared cache storage across multiple Varnish instances. This creates some interesting challenges for activities like pre-warming caches or clearing cache items. You should rarely need to clear cache items manually, since appropriate TTLs for different types of content should be managed via your expires headers. However, there are invariably going to be cases where you need something cleared immediately.

The typical method of dealing with this is to issue PURGE requests to each of your Varnish instances from your application. While this works, there are a few issues with this approach that I wanted to avoid. First, it’s challenging to make it resilient in your application. For example, if you’ve got five Varnish instances to clear, what happens when the network request to one of those instances fails? Now you’ve got to build some kind of retry or alerting mechanism into your application code. As your Varnish footprint grows, you’ve also got to make sure that your application is aware of every Varnish instance. You can do this with a service discovery system, but that brings the potential for race conditions as well as the need to manage a service discovery system. In our case, we typically favor DNS conventions over a service discovery system for most of our needs.
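To make the pain concrete, here’s a minimal sketch of that sequential approach in Python. This is illustrative rather than real code; the host list, timeout, and retry logic are all assumptions:

import requests

# Hypothetical list of Varnish instances the application has to know about.
VARNISH_HOSTS = ["varnish-1.internal", "varnish-2.internal", "varnish-3.internal"]

def purge_everywhere(path, retries=3):
    """Issue a PURGE for `path` against every Varnish instance, retrying on failure."""
    failed = []
    for host in VARNISH_HOSTS:
        for attempt in range(retries):
            try:
                resp = requests.request("PURGE", f"http://{host}{path}", timeout=2)
                resp.raise_for_status()
                break
            except requests.RequestException:
                if attempt == retries - 1:
                    failed.append(host)  # this instance is still serving stale content
    return failed  # the caller now has to alert on or re-queue these hosts somehow

Every new instance means another entry in that list, and every partial failure means more bookkeeping in application code, which is exactly what we wanted to avoid.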

Our Solution Requirements

Given the above pitfalls, we wanted a solution for Varnish management that satisfied the following:

  • Our application must be able to issue a single outbound request to perform Varnish management actions.
  • The solution must scale.
  • The solution must work without the need for a service discovery system.
  • The solution must be fast.
  • The solution must be fault tolerant.
  • Race conditions must be minimized.

Our Solution

The solution we decided on makes use of a few AWS tools to create a simple and resilient workflow: an SNS fanout system sends Varnish actions to instance-specific SQS queues, and a lightweight listener application on each Varnish instance consumes messages from its specific SQS queue.

Queue Management

Because our Varnish instances are in an autoscaling group, we made use of EC2 autoscaling lifecycle hooks in order to make managing this system autonomous. When a new EC2 instance comes up, it sends a lifecycle event to an SNS topic that we use to trigger a Lambda function. The Lambda function gets metadata about the instance and uses that to create an instance-specific SQS queue and subscribe that SQS queue to our SNS topic.
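A rough sketch of what such a Lambda handler can look like follows. This is a hedged illustration rather than our production function; the topic ARN, queue-name prefix, and access-policy details are assumptions:

import json
import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:varnish-actions"  # hypothetical
QUEUE_PREFIX = "varnish-actions-"                                  # hypothetical

def handler(event, context):
    # Lifecycle notifications delivered via SNS arrive wrapped in a Records list.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    instance_id = message["EC2InstanceId"]

    # Create an SQS queue named after the instance.
    queue_url = sqs.create_queue(QueueName=QUEUE_PREFIX + instance_id)["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Allow the SNS topic to deliver to the new queue, then subscribe the queue to it.
    sqs.set_queue_attributes(
        QueueUrl=queue_url,
        Attributes={"Policy": json.dumps({
            "Version": "2012-10-17",
            "Statement": [{
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                "Condition": {"ArnEquals": {"aws:SourceArn": TOPIC_ARN}},
            }],
        })},
    )
    sns.subscribe(TopicArn=TOPIC_ARN, Protocol="sqs", Endpoint=queue_arn)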

Here’s an abbreviated flowchart of what happens when a new Varnish EC2 instance comes online:

With this workflow, we ensure that if something goes wrong with setting up the SQS queue or SNS subscription, we do not allow the EC2 Varnish instance to proceed to InService. The Autoscaling Group will then launch a new instance and try again.
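In code, that gate is the lifecycle action result. A small sketch of how a launch handler like the one above might finish; the hook name, group name, and token all come from the lifecycle notification itself:

import boto3

autoscaling = boto3.client("autoscaling")

def complete_launch(message, succeeded):
    # CONTINUE lets the instance proceed toward InService; ABANDON terminates it
    # so the Autoscaling Group launches a replacement and the setup is retried.
    autoscaling.complete_lifecycle_action(
        LifecycleHookName=message["LifecycleHookName"],
        AutoScalingGroupName=message["AutoScalingGroupName"],
        LifecycleActionToken=message["LifecycleActionToken"],
        LifecycleActionResult="CONTINUE" if succeeded else "ABANDON",
    )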

We have a similar process when an EC2 instance is going to go offline. We use the Autoscaling Lifecycle to destroy an instance’s associated SQS queues and SNS subscriptions.
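The termination side might look roughly like the following sketch (again an illustration with assumed names), which finds the subscription pointing at the instance’s queue and then removes both:

import boto3

sqs = boto3.client("sqs")
sns = boto3.client("sns")

TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:varnish-actions"  # hypothetical
QUEUE_PREFIX = "varnish-actions-"                                  # hypothetical

def teardown(instance_id):
    queue_url = sqs.get_queue_url(QueueName=QUEUE_PREFIX + instance_id)["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]

    # Remove the SNS subscription that targets this instance's queue.
    paginator = sns.get_paginator("list_subscriptions_by_topic")
    for page in paginator.paginate(TopicArn=TOPIC_ARN):
        for sub in page["Subscriptions"]:
            if sub["Endpoint"] == queue_arn:
                sns.unsubscribe(SubscriptionArn=sub["SubscriptionArn"])

    # Finally, delete the instance-specific queue.
    sqs.delete_queue(QueueUrl=queue_url)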

Varnish Admin Actions

When we need to perform an admin action with Varnish, we now send a message to the SNS topic to which our SQS queues have been subscribed. That message fans out to the SQS queues, and a lightweight background worker we install on each Varnish instance long polls its SQS queue looking for work to do. Since the SQS queue is named for the instance, the application simply listens to a queue whose name is a prefix followed by the instance id. We can check the EC2 metadata endpoint to get the instance id, so we do not have to manage per-instance configuration to know which SQS queue to listen to. Easy, right?
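To make those mechanics concrete, here’s a hedged Python sketch of that discovery-and-poll loop; our actual worker is the PHP app described below, and the queue prefix and handle_action stub are placeholders:

import json
import boto3
import urllib.request

QUEUE_PREFIX = "varnish-actions-"  # hypothetical

def handle_action(action, content):
    # Placeholder: this is where the worker would talk to Varnish on the same instance.
    pass

# The instance discovers its own id from the EC2 metadata endpoint.
with urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/instance-id", timeout=2
) as resp:
    instance_id = resp.read().decode()

sqs = boto3.client("sqs")
queue_url = sqs.get_queue_url(QueueName=QUEUE_PREFIX + instance_id)["QueueUrl"]

while True:
    # Long poll: wait up to 20 seconds for a message instead of busy-looping.
    result = sqs.receive_message(
        QueueUrl=queue_url, WaitTimeSeconds=20, MaxNumberOfMessages=1
    )
    for message in result.get("Messages", []):
        # SNS wraps the original payload in its own JSON envelope (no raw delivery assumed).
        body = json.loads(json.loads(message["Body"])["Message"])
        handle_action(body["varnish_action"], body["content"])
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])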

The queue worker app is just a simple PHP application that is managed by Kōjō to ensure it’s always running and listening. The PHP app uses our modified version of the Varnish Admin Socket PHP library to communicate with Varnish on its own instance.
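For readers who haven’t poked at the admin socket before, the library is wrapping Varnish’s line-oriented CLI protocol, which includes a SHA-256 authentication challenge. Here’s a rough, illustrative Python sketch of issuing a ban over that socket; the address and secret-file path are common defaults and should be treated as assumptions, and the real work in our setup is done by the PHP library above:

import hashlib
import socket

VARNISH_ADMIN = ("127.0.0.1", 6082)   # typical -T admin address (assumption)
SECRET_FILE = "/etc/varnish/secret"    # typical secret location (assumption)

def read_response(rfile):
    # Each CLI response is "<status> <length>\n<body>\n".
    status, length = rfile.readline().split()
    body = rfile.read(int(length))
    rfile.read(1)  # trailing newline after the body
    return int(status), body.decode()

def varnish_ban(pattern):
    sock = socket.create_connection(VARNISH_ADMIN, timeout=5)
    rfile = sock.makefile("rb")

    status, body = read_response(rfile)
    if status == 107:  # authentication required; body starts with the challenge
        challenge = body.splitlines()[0]
        secret = open(SECRET_FILE, "rb").read()
        digest = hashlib.sha256(
            challenge.encode() + b"\n" + secret + challenge.encode() + b"\n"
        ).hexdigest()
        sock.sendall(f"auth {digest}\n".encode())
        status, body = read_response(rfile)

    sock.sendall(f"ban req.url ~ {pattern}\n".encode())
    status, body = read_response(rfile)
    return status == 200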

The message our management app transmits to SNS is a simple JSON object that looks like the following:

{
    "varnish_action": "ban",
    "content": "^some-url-pattern$"
}

We ensure that our content reflects whatever the Varnish admin socket understands for the given action type. This way we don’t have to worry about maintaining our own mapping of actions in our worker app to actual Varnish commands. In any case, if you follow this pattern you’ll need to make sure your app knows how to talk to Varnish.
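On the sending side, issuing an action really is a single outbound request. A minimal sketch of the publish call (topic ARN assumed, shown as a script rather than our PHP app):

import json
import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:varnish-actions"  # hypothetical

# One outbound request; SNS fans it out to every instance-specific SQS queue.
sns.publish(
    TopicArn=TOPIC_ARN,
    Message=json.dumps({
        "varnish_action": "ban",
        "content": "^some-url-pattern$",
    }),
)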

Here’s what our Varnish messaging workflow looks like:

The workflow of sending messages to Varnish

Results

So far this pattern has been tremendously successful for us. With long polling and Kōjō’s ability to ensure events are consumed quickly, it typically takes under 10ms from the moment we issue the SNS message to the time the Varnish action has completed on every Varnish instance. We’ve been using this pattern for about six months now in front of our 55places.com frontend and have never had any issues with orphaned queues, repeated instantiation failures, or scaling out our Varnish footprint. Whether or not this pattern is right for you is up to your dev and infrastructure teams, but given our success I highly recommend it over the typical sequential/round-robin method of managing Varnish.
