Clustering Varnish with a containerised backend on AWS: how to get it right

Kevin Baldwyn
Funeral Zone Engineering
6 min read · Dec 21, 2017

Varnish is a great tool for speeding up cacheable page requests. However, getting it set up with HTTPS and Elastic Load Balancers, and making it redundant and highly available, is harder to get right.

We’ve been running Varnish with HTTPS in a simpler environment for quite some time, but our recent move to AWS and a dockerised application, together with requirements for greater redundancy, left us with a number of problems to solve.

Problem 1: Caching HTTPS

Varnish does not handle secure content out of the box: it has no TLS support, so it cannot serve or cache anything over HTTPS on its own. In a typical setup you place Varnish in front of your web server and configure it to listen on port 80 (rather than its default 6081), then proxy any cache misses to the backend web server, which you configure to listen on another port.

Being modern, security-conscious engineers, we want to serve our applications over HTTPS, so the default setup won’t work.

In a previous setup we used Nginx in front of Varnish to reverse-proxy HTTPS requests to Varnish.

In this setup Varnish essentially acts as a load balancer for the backend. However, one of the main goals here was to cluster Varnish and build in some redundancy, which means we couldn’t put Varnish at the front of the stack: we need a way of load balancing requests across the cluster of Varnish instances.

Luckily AWS makes this easy. We simply set up an Application Load Balancer listening on port 443 and forward it to port 6081 (the Varnish default) on the Varnish instances. This way the load balancer does all the SSL termination and forwards plain HTTP to Varnish, so there is no need for the Nginx proxy.
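For reference, here is a minimal boto3 sketch of that target group and listener. The VPC ID, load balancer ARN, certificate ARN and names are placeholders, and registering the Varnish instances with the target group is left out:

import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

# Target group pointing at port 6081 (Varnish) on the instances
varnish_tg = elbv2.create_target_group(
    Name="varnish-6081",            # hypothetical name
    Protocol="HTTP",
    Port=6081,
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC
)["TargetGroups"][0]["TargetGroupArn"]

# HTTPS listener on the ALB: TLS terminates here, plain HTTP goes to Varnish
elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...",    # placeholder ALB ARN
    Protocol="HTTPS",
    Port=443,
    Certificates=[{"CertificateArn": "arn:aws:acm:..."}],  # placeholder ACM cert
    DefaultActions=[{"Type": "forward", "TargetGroupArn": varnish_tg}],
)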

Problem 2: redirecting HTTP to HTTPS

We obviously want to correctly redirect anyone who comes in over HTTP rather than HTTPS. To do this we set up a new listener on the Application Load Balancer that listens on port 80 and forwards to a new target group pointing at port 80 on our Varnish instances. The target on port 80 is simply an Nginx server that takes any request and does a 301 redirect to the HTTPS version of the same request.

server {
    listen 80 default_server;
    listen [::]:80 default_server;

    # force requests over https ($request_uri already includes the query string)
    location / {
        return 301 https://$http_host$request_uri;
    }
}
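On the AWS side this is just a second target group and listener on the same load balancer; a short sketch, with the same placeholder ARNs as above:

import boto3

elbv2 = boto3.client("elbv2", region_name="eu-west-1")

# Target group pointing at port 80 (the Nginx redirect server) on the instances
redirect_tg = elbv2.create_target_group(
    Name="varnish-redirect-80",     # hypothetical name
    Protocol="HTTP",
    Port=80,
    VpcId="vpc-0123456789abcdef0",  # placeholder VPC
)["TargetGroups"][0]["TargetGroupArn"]

# Plain HTTP listener on port 80, forwarding to the redirect target group
elbv2.create_listener(
    LoadBalancerArn="arn:aws:elasticloadbalancing:...",  # placeholder ALB ARN
    Protocol="HTTP",
    Port=80,
    DefaultActions=[{"Type": "forward", "TargetGroupArn": redirect_tg}],
)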

Problem 3: ECS backend with Elastic Load Balancer

So now we are handling incoming requests and sending them to the backend. We are using ECS to run a cluster of Docker containers, which all listen on port 80 and serve web traffic. Traffic to those containers is distributed by an Application Load Balancer, and that load balancer is what the backend for our Varnish servers needs to point to.
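For context, a hypothetical sketch of how such a service might be wired to the internal ALB with boto3; the cluster, service, task definition, IAM role and target group ARN are all placeholders, not our actual configuration:

import boto3

ecs = boto3.client("ecs", region_name="eu-west-1")

# The service registers each container's port 80 with the internal ALB's
# target group, so the ALB always knows which containers are serving traffic.
ecs.create_service(
    cluster="web-cluster",          # placeholder ECS cluster
    serviceName="web",              # placeholder service name
    taskDefinition="web-app:1",     # placeholder task definition
    desiredCount=4,
    loadBalancers=[{
        "targetGroupArn": "arn:aws:elasticloadbalancing:...",  # internal ALB target group
        "containerName": "web",
        "containerPort": 80,
    }],
    role="ecsServiceRole",          # IAM role that lets ECS register targets
)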

This presents our next problem. AWS ELBs are elastic in nature and do not have a fixed IP (or set of fixed IPs). You always refer to them by their CNAME, e.g. http://internal-my-elb-1098129567.eu-west-1.elb.amazonaws.com. This means AWS can manage the IP addresses in the background and you don’t have to worry about them changing. This is great; the only issue is that if you set the Varnish backend to the CNAME of your ELB, Varnish resolves that name once when the VCL is loaded, so if the ELB’s IPs change at any point after Varnish has started it won’t notice and won’t route traffic correctly.

We tried having Varnish send its backend traffic to a local Nginx proxy (listening on another port, for example 8080) and having that proxy pass requests on to the ELB CNAME; however, by default Nginx does exactly the same thing as Varnish and resolves the upstream hostname only once, when its configuration is loaded.

To work around this you can use some Nginx trickery to force it to re-resolve the DNS lookup periodically rather than caching it forever, which is described well here.

server {
    listen 8080;

    location / {

        # re-resolve the upstream DNS name every 10 minutes
        resolver 8.8.8.8 valid=600s;

        # setting the backend as a variable in proxy_pass forces nginx to
        # re-query the name servers (according to the resolver above) instead
        # of caching the DNS lookup at startup
        set $backend_servers http://internal-my-elb-1098129567.eu-west-1.elb.amazonaws.com;

        proxy_pass $backend_servers;
        proxy_set_header Host $http_host;
    }
}

So, excluding how we actually do the clustering (which is coming up next), our final setup looks something like the diagram below. Here we show requests from Varnish being sent directly to Nginx on port 8080; this is a simplification that we will expand on next.

Great. We now have our requests coming in on port 443 and hitting our backend Docker cluster, and any requests to port 80 will be redirected back to 443.

Problem 4: Clustering Varnish with automatic re-routing

The main point of all this was to add some redundancy to our Varnish setup; previously we had one Varnish instance and it was a single point of failure.

So the way we configure our Varnish cluster is to shard the cache over multiple Varnish instances. In Varnish 4 this could be done with the `hash` director, but it didn’t handle failures, or the cluster changing size (i.e. nodes being added or removed), very well. Varnish 5, however, introduced the `shard` director, which is designed for exactly this scenario. There is a great article on the Varnish blog about different ways of sharding a Varnish cluster.

Essentially what we do is define each Varnish instance as a backend for the sharded cluster, along with the actual content backend for each instance, `localhost:8080` (the local Nginx proxy to the Docker cluster):

vcl 4.0;

import std;
import directors;

# list the varnish nodes in our cluster
# (6081 is the port Varnish itself listens on, matching the ALB target group;
# port 80 on these hosts is only the Nginx HTTP -> HTTPS redirect)
backend node1 {
    .host = "ip-172-12-34-56.eu-west-1.compute.internal";
    .port = "6081";
}
backend node2 {
    .host = "ip-172-34-56-78.eu-west-1.compute.internal";
    .port = "6081";
}

# the local instance backend proxy where we actually fetch the content from
backend content {
    .host = "127.0.0.1";
    .port = "8080";
}

# define the cluster
sub vcl_init {
    new cluster = directors.shard();
    cluster.add_backend(node1);
    cluster.add_backend(node2);
    cluster.add_backend(content);
    cluster.reconfigure();
}

sub vcl_recv {
    # figure out where the content is
    set req.backend_hint = cluster.backend();
}

With this setup, each Varnish instance is seen as a backend by all the other Varnish instances. An uncached request is sharded by URL: the receiving Varnish node asks the director for a backend, which may be one of the other nodes in the cluster. That node fetches the content from localhost:8080 (the backend proxy) and returns it to the requesting node, which caches it. The next request for the same URL might come into another Varnish instance (via the ELB); the shard director knows which instance holds the cache for that request and forwards the request to that instance. If one of your instances fails, Varnish handles that by treating the request as uncached and simply re-routing it to a healthy instance. This is illustrated in this diagram:

The only problem with this is that every time you add a new instance to your Varnish cluster you need to update the VCL config on every instance in the cluster so that they all share the same list of nodes, including the new one. However, this is something that can be handled fairly easily with configuration management. We wrote a simple script that queries the EC2 instances in the cluster using the AWS API and writes the VCL configuration based on the host names of the machines in the cluster.
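A minimal sketch of that idea in Python with boto3 might look like the following. The tag used to find the Varnish instances, the output path and the backend port are assumptions for illustration; a real version would also reload the VCL (for example via varnishadm) after writing it:

import boto3

ec2 = boto3.client("ec2", region_name="eu-west-1")

# Find the running Varnish instances (the Role=varnish tag is a placeholder
# for however you identify the cluster members).
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:Role", "Values": ["varnish"]},
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

hosts = sorted(
    i["PrivateDnsName"] for r in reservations for i in r["Instances"]
)

# Render the backend definitions and the shard director for every node.
lines = []
for n, host in enumerate(hosts, start=1):
    lines += [
        "backend node%d {" % n,
        '    .host = "%s";' % host,
        '    .port = "6081";',
        "}",
    ]
lines.append("sub vcl_init {")
lines.append("    new cluster = directors.shard();")
for n in range(1, len(hosts) + 1):
    lines.append("    cluster.add_backend(node%d);" % n)
lines += ["    cluster.add_backend(content);", "    cluster.reconfigure();", "}"]

# Placeholder path; in reality this would be included from the main VCL.
with open("/etc/varnish/cluster.vcl", "w") as f:
    f.write("\n".join(lines) + "\n")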
