Clustering Varnish with a containerised backend on AWS: how to get it right

Kevin Baldwyn
Dec 21, 2017

Varnish is a great tool for speeding up cacheable page requests; however, getting it set up with HTTPS and Elastic Load Balancers, and making it redundant and highly available, is harder to get right.

We’ve been running Varnish with HTTPS in a simpler environment for quite some time, but our recent move to AWS and a Dockerised application, with requirements for greater redundancy, meant we had a number of problems to solve.

Problem 1: Caching HTTPS

Varnish by default does not cache secure content, that is, anything served over HTTPS. In a typical setup you place Varnish in front of your web server and configure it to listen on port 80 (rather than its default 6081), then proxy any cache misses to the backend web server which you configure to listen on another port.

Being modern, security-conscious engineers, we want to serve our applications over HTTPS. The default setup therefore won’t work.

In a previous set up we used Nginx in front of Varnish to reverse proxy HTTPS requests to Varnish.

In this setup Varnish is essentially acting as a load balancer to the backend, however one of the main goals here was to be able to cluster Varnish and build in some redundancy. This means we couldn’t put Varnish at the front of the stack because we need a way of load balancing the requests to the cluster of Varnish instances.

Luckily this is made easy with AWS. We simply set up an Application Load Balancer listening on port 443, and forward that to port 6081 (the Varnish default) on the Varnish instances. This way the load balancer does all the SSL/TLS termination and simply forwards to Varnish; no need for the Nginx proxy.
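As a sketch, the parameters for that 443 listener map directly onto boto3's `elbv2` `create_listener` call (this assumes the load balancer, the Varnish target group and an ACM certificate already exist; all ARNs below are placeholders):

```python
def https_listener_params(lb_arn, target_group_arn, cert_arn):
    """Parameters for elbv2 create_listener: terminate TLS on port 443
    and forward to the target group where the Varnish instances are
    registered (on port 6081)."""
    return {
        "LoadBalancerArn": lb_arn,
        "Protocol": "HTTPS",
        "Port": 443,
        "Certificates": [{"CertificateArn": cert_arn}],
        "DefaultActions": [
            {"Type": "forward", "TargetGroupArn": target_group_arn}
        ],
    }

# With credentials configured, this would be applied with:
# import boto3
# boto3.client("elbv2").create_listener(**https_listener_params(LB, TG, CERT))
```

The same can of course be done by hand in the AWS console; the point is only that TLS stops at the load balancer and plain HTTP reaches Varnish on 6081.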

Problem 2: redirecting HTTP to HTTPS

We obviously want to correctly redirect anyone who enters over HTTP rather than HTTPS. To do this we set up a new listener on the Application Load Balancer that listens on port 80 and forwards to a new target group on port 80 on our Varnish instances. The target of this listener is simply an Nginx server that takes any request to port 80 and issues a 301 redirect to the HTTPS version of the same URL.

server {
    listen 80 default_server;
    listen [::]:80 default_server;

    # force requests over https
    location / {
        # $request_uri already includes the query string
        return 301 https://$http_host$request_uri;
    }
}

Problem 3: ECS backend with Elastic Load Balancer

So now we are handling incoming requests and sending them to the backend. We are using ECS to run a cluster of Docker containers. Those containers all listen on port 80 and serve web traffic. The traffic to those containers is distributed with an application load balancer, which is what the backend for our Varnish servers needs to point to.

This presents our next problem. AWS ELBs are elastic in nature and do not have a fixed IP (or set of fixed IPs); you always refer to them by their CNAME. This means AWS can manage the IP addresses in the background and you don’t have to worry about them changing. This is great, the only issue being that if you set the backend for Varnish to the CNAME of your ELB, Varnish will cache the DNS resolution of that CNAME on startup. If your ELB changes IP at any point after Varnish has started, Varnish won’t be aware of it and won’t route traffic correctly.

We tried having Varnish send its backend traffic to a local Nginx proxy (listening on another port, for example 8080) and having that Nginx proxy pass requests to the ELB CNAME; however, Nginx caches DNS resolution in exactly the same way as Varnish.

To work around this you can use some Nginx trickery to avoid caching DNS lookups for too long:

server {
    listen 8080;

    location / {
        # this causes DNS resolution of the upstream every 10 minutes
        # (169.254.169.253 is the VPC DNS resolver; substitute your own)
        resolver 169.254.169.253 valid=600s;

        # setting the backend as a variable in proxy_pass ensures cache
        # invalidation of the DNS lookup, because Nginx will re-query the
        # name servers according to the resolver defined above
        # (replace the hostname below with your ELB's CNAME)
        set $backend_servers http://my-ecs-elb.eu-west-1.elb.amazonaws.com;

        proxy_pass $backend_servers;
        proxy_set_header Host $http_host;
    }
}

So, excluding how we actually do the clustering (which is coming up next), our final setup looks something like the diagram below. Here we show requests from Varnish being sent directly to Nginx on port 8080; this is a simplification that we will expand on next.

Great. We now have our requests coming in on port 443 and hitting our backend Docker cluster, and any requests to port 80 will be redirected back to 443.

Problem 4: Clustering Varnish with automatic re-routing

The main point of all this was to add some redundancy to our Varnish setup: previously we had one Varnish instance, and it was a single point of failure.

So the way we will configure our Varnish cluster is to have the cache sharded over multiple Varnish instances. In Varnish 4 this could be done with the `hash` director, but it didn’t handle failure or changes in cluster size (i.e. nodes being added or removed) very well. Varnish 5, however, introduced the `shard` director, which is designed for exactly this scenario. There is a great article on the Varnish blog about different ways of sharding a Varnish cluster.

Essentially what we do is define each Varnish instance as a backend for the sharded cluster, along with the actual backend each instance fetches from: `localhost:8080` (the proxy to the Docker cluster):

vcl 4.0;

import std;
import directors;

# list the varnish nodes in our cluster
# (the host names here are placeholders; use your instances' real names)
backend node1 {
    .host = "varnish-node1.internal";
    .port = "80";
}
backend node2 {
    .host = "varnish-node2.internal";
    .port = "80";
}

# the local instance backend proxy where we actually fetch the content from
backend content {
    .host = "127.0.0.1";
    .port = "8080";
}

# define the cluster
sub vcl_init {
    new cluster = directors.shard();
    cluster.add_backend(node1);
    cluster.add_backend(node2);
    cluster.reconfigure();
}

sub vcl_recv {
    # Figure out where the content is
    set req.backend_hint = cluster.backend();
}

With this setup, each Varnish instance is seen as a backend for all the other Varnish instances. So an uncached request is sharded by URL and then Varnish picks a backend from one of the nodes in the cluster. That node sends its request to localhost:8080 (the backend proxy) and then returns it to the requesting node which caches it. The next request for the same URL might come into another Varnish instance (via the ELB), the cluster director will know which instance has the cache for that request and will forward the request to that instance. If one of your instances fails then Varnish will handle that and treat it as an uncached request and simply re-route it to a new healthy instance. This is illustrated in this diagram:
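The routing property described above can be illustrated with a small sketch. This is not the shard director's actual implementation; the rendezvous hashing used here just demonstrates the same behaviour: every URL has a deterministic owner node, and removing a failed node only remaps the URLs that node owned, leaving the rest of the cluster's cache intact.

```python
import hashlib

def shard_owner(url, nodes):
    """Pick the node that owns the cache for a URL: score every node by
    hashing (url, node) together and take the highest score. Because each
    node's score for a URL is independent of the other nodes, removing a
    failed node only remaps the URLs it owned."""
    def score(node):
        return hashlib.sha256(f"{url}|{node}".encode()).hexdigest()
    return max(nodes, key=score)

nodes = ["node1", "node2", "node3"]
owner = shard_owner("/products/42", nodes)
# If the owner fails, re-running against the survivors re-routes its URLs:
survivors = [n for n in nodes if n != owner]
fallback = shard_owner("/products/42", survivors)
```

Any node receiving a request computes the same owner, which is why the ELB can spray requests across the cluster and cache hits still land on the instance that holds the object.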

The only problem with this is that every time you add a new instance to your Varnish cluster, you need to update the VCL config on every instance in the cluster so they all share the same list of nodes, including the new one. However, this is something that can be handled fairly easily with configuration management. We wrote a simple script that queries the EC2 instances in the cluster using the AWS API and writes the VCL configuration based on the host names of the machines in the cluster.
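Our script is specific to our setup, but a minimal sketch of the idea looks like this: fetch the cluster host names (in our case from the EC2 API, e.g. boto3's `describe_instances` filtered by a tag of your choosing) and render them into the VCL backend and director definitions:

```python
def render_cluster_vcl(node_hosts, content_host="127.0.0.1", content_port="8080"):
    """Render the cluster section of the VCL from a list of Varnish
    host names. node_hosts would typically come from the EC2 API."""
    parts = []
    for i, host in enumerate(node_hosts, start=1):
        parts.append(
            f'backend node{i} {{\n'
            f'    .host = "{host}";\n'
            f'    .port = "80";\n'
            f'}}'
        )
    parts.append(
        f'backend content {{\n'
        f'    .host = "{content_host}";\n'
        f'    .port = "{content_port}";\n'
        f'}}'
    )
    init = ["sub vcl_init {", "    new cluster = directors.shard();"]
    for i in range(1, len(node_hosts) + 1):
        init.append(f"    cluster.add_backend(node{i});")
    init.append("    cluster.reconfigure();")
    init.append("}")
    parts.append("\n".join(init))
    return "\n\n".join(parts)
```

Running this on each deploy (or on a schedule) and reloading Varnish keeps every instance's node list in sync as the cluster grows or shrinks.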

Funeral Zone Engineering
