Making Things Fast-ly: A Guide to Implementing Load Balancing and Failover Using Terraform and VCL

Azraf Ullah
USA TODAY NETWORK
5 min read · Dec 23, 2019
Shifting traffic automatically and programmatically can help reduce customer impact and human error

At the USA TODAY NETWORK, we serve millions of daily requests from our users. To keep up with this demand while providing a great experience for consumers, we use a unique system of load balancing as an effective way to distribute traffic and manage capacity on our servers. (Load balancing is efficiently distributing incoming network traffic across a group of backend servers.) This results in high-performing applications at lower costs, increased scalability, the ability to handle sudden traffic spikes, and protection from outages.

There are many ways to achieve traffic shifting and failover on servers, but we’ve decided to implement this at the edge using Fastly Load Balancing and our own custom VCL (Varnish Configuration Language) and Terraform solution. In situations where our systems are unavailable, we depend on a failover process with multiple server regions.

The Approach

Early on, we realized that shielding and Route53 load balancing caused issues. With Fastly shielding in place, the requests were all being sent to a single region until the cache expired.

This graph shows requests per minute to origin for the east and west regions, before we implemented Fastly load balancing.

Route53 and Fastly shielding sent all traffic to a single region until the cache expired

As a company managing more than 100 sites, we found that this made it hard to right-size our deployments. We needed to fix the issue and had some requirements in mind:

  • Requests to origin are load balanced, eliminating the thrashing of requests between regions
  • Site Reliability Engineering (SRE) can view the current weights of each site
  • Each site has its own health checks and can be moved between regions on its own
  • SRE can shift weights as desired
  • All health checks originate from Fastly

We considered two ways to approach this problem:

  • One dynamic server pool per site (the standard way of load balancing at Fastly). This would have worked for us, except that shifting traffic for a single site would require pushing a new version of our VCL. With more than 100 sites, managing all of that through service activations wasn’t optimal.
  • Two backends for each site, one per region (our custom solution), controlled in VCL by weight values in an edge dictionary. Edge dictionaries are versionless key-value stores, so changing a value requires no new service deployment on Fastly. This worked out better for us because a single API call to update the edge dictionary shifts the weights, and all the code is stored on GitHub. The VCL approach has some drawbacks, chiefly that we had to make significant changes for every backend, but it was nothing we couldn’t automate around: we used a templater to make it easier.

This graph shows how much more even the distribution is after implementing our custom solution.

Requests are balanced with Fastly LB distributing to both regions evenly

By following the steps below, you can see the impact for yourself.

Assumptions

  1. You have a Fastly account. If not, you can sign up here.
  2. You have a couple of sample sites/apps available to route traffic to. I’m using imtestingthings.com.
  3. You know how to create and work with a service in Fastly.
  4. You understand how Terraform works and can refer to the Fastly Terraform Provider during the steps.

Automatic Healthchecks Can Reduce Errors

By setting up a service.tf file, we automate the health checks that tell us whether a region is healthy. This lets us shift traffic programmatically without manual intervention, reducing human error and incident response times. It is all based on the Fastly Terraform Provider located here, from which we need the backend and healthcheck blocks defined in this file.

Automating Fastly Services With Terraform

Set up the service name and domain. Terraform acts upon the service ID (blued out)
  • Your domain and service interact with the service.tf file once Terraform is run, which automatically generates your hosts and health checks based on your VCL and how many sites you have defined. This is pretty sweet because it saves you the time and effort of deploying your services manually. Terraform takes care of it automatically.
  • To take advantage of this, make sure that you’re able to run terraform plan and terraform apply to enable the Fastly changes. Here is another blog post which touches on this setup process in Google Cloud Platform. The official Terraform plan and apply documentation is also located here.
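To make the "one backend and one health check per site and region" idea concrete, here is a minimal Python sketch of the kind of expansion a templater might perform before Terraform runs. It is not our actual templater: the region names, the naming scheme, the origin hostname pattern, and the /health path are all assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical region names; adjust to match your own origins.
REGIONS = ("east", "west")


@dataclass(frozen=True)
class Healthcheck:
    name: str
    host: str  # Host header sent with the health check request
    path: str  # path requested on the origin


@dataclass(frozen=True)
class Backend:
    name: str         # backend name that the VCL would reference
    address: str      # origin hostname for this site and region
    healthcheck: str  # name of the health check attached to this backend


def expand_site(site: str, health_path: str = "/health") -> tuple[list[Backend], list[Healthcheck]]:
    """Expand one site into a backend and a health check per region."""
    slug = site.replace(".", "_").replace("-", "_")
    backends: list[Backend] = []
    checks: list[Healthcheck] = []
    for region in REGIONS:
        check = Healthcheck(name=f"{slug}_{region}_health", host=site, path=health_path)
        # Assumed hostname pattern, e.g. east.origin.example.com
        backend = Backend(
            name=f"{slug}_{region}",
            address=f"{region}.origin.{site}",
            healthcheck=check.name,
        )
        checks.append(check)
        backends.append(backend)
    return backends, checks


if __name__ == "__main__":
    backends, checks = expand_site("imtestingthings.com")
    for backend in backends:
        print(backend)
    for check in checks:
        print(check)
```

Each resulting record would then be rendered into a backend or healthcheck entry in service.tf, so adding a site means adding one name to a list rather than hand-writing four definitions.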

Write Logic For Traffic Shifting

  • We will want to set up a subroutine in another VCL file to handle the actual programmatic traffic shifting. Let’s call it traffic.vcl. We’ll want to set up request headers and request backends and make use of the randombool function in Fastly, which returns true with a probability given by a numerator and a denominator (for example, 70 out of 100), so each request is sent to a region in proportion to its weight. You can visit here for more information.
  • The biggest takeaway is that the logic makes traffic shift automatically away from a region that is found to be unhealthy.
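The selection logic is easier to reason about outside of VCL, so here is a small Python simulation of it. This is a sketch of the idea rather than our traffic.vcl: the function names, the default weight of 50, and the fallback order are assumptions, and the local randombool only mimics the weighted coin flip that the Fastly function provides.

```python
import random


def randombool(numerator: int, denominator: int, rng: random.Random) -> bool:
    """Return True with probability numerator/denominator (a weighted coin flip)."""
    return rng.random() < numerator / denominator


def choose_region(site: str, weights: dict, healthy: dict, rng: random.Random,
                  default_weight: int = 50) -> str:
    """Pick 'east' or 'west' for one request to the given site.

    weights maps site -> share of traffic (0..100) that should go east,
    stored as strings, the way edge dictionary values are.
    healthy maps region -> True/False, standing in for health check results.
    """
    east_weight = int(weights.get(site, default_weight))
    preferred = "east" if randombool(east_weight, 100, rng) else "west"
    fallback = "west" if preferred == "east" else "east"

    if healthy.get(preferred, False):
        return preferred
    if healthy.get(fallback, False):
        return fallback  # automatic failover to the other region
    return preferred     # both regions unhealthy: nothing better to choose


if __name__ == "__main__":
    rng = random.Random(2019)
    weights = {"imtestingthings.com": "70"}
    healthy = {"east": True, "west": True}
    picks = [choose_region("imtestingthings.com", weights, healthy, rng) for _ in range(10_000)]
    print("share sent east:", picks.count("east") / len(picks))
```

With both regions healthy, the share sent east approaches the configured weight. Marking a region unhealthy sends every request to the other one regardless of weight, which is the failover behavior described above.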

Create Edge Dictionary to Shift Weights

  • The last step is to create an edge dictionary on Fastly to shift weights. An edge dictionary is an easy way to store data as key-value pairs on the edge and is defined here by Fastly. This is important because edge dictionaries are versionless: instead of deploying a new service version to Fastly every time we make a change, the service just reads the edge dictionary to find the current values. This saves a tremendous amount of time and effort.
  • In your active service configuration in the Fastly UI, scroll down the sidebar on the left and click Dictionaries under the Data category. Click Create Edge Dictionaries and set the name to the dictionary name you pass to table.lookup in your traffic.vcl file. Then set a key-value pair for each site to balance its traffic: 0 sends all traffic west, 100 sends all traffic east, and 50 keeps it balanced between both. You can refer to the image below for clarity.
A value of 100 puts max traffic on east, 50 would keep it balanced, 0 would default to west
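The same change can be made with a single API call instead of the UI. The Python sketch below only builds the request so it can be inspected safely. It assumes the Fastly dictionary item endpoint (PUT to /service/{service_id}/dictionary/{dictionary_id}/item/{key} with an item_value form field and a Fastly-Key header) as we understand it from the API documentation, so check the current docs before relying on it. The IDs and token shown are placeholders.

```python
import urllib.parse
import urllib.request

FASTLY_API = "https://api.fastly.com"


def build_weight_update(service_id: str, dictionary_id: str, site: str,
                        east_weight: int, api_token: str) -> urllib.request.Request:
    """Build (but do not send) a request that sets one site's east weight."""
    if not 0 <= east_weight <= 100:
        raise ValueError("east_weight must be between 0 and 100")

    # The site name is the dictionary key, so it goes into the URL path.
    key = urllib.parse.quote(site, safe="")
    url = f"{FASTLY_API}/service/{service_id}/dictionary/{dictionary_id}/item/{key}"
    body = urllib.parse.urlencode({"item_value": str(east_weight)}).encode()

    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={
            "Fastly-Key": api_token,
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/json",
        },
    )


if __name__ == "__main__":
    # Placeholder IDs and token; nothing is sent over the network here.
    request = build_weight_update("SERVICE_ID", "DICTIONARY_ID", "imtestingthings.com", 100, "API_TOKEN")
    print(request.get_method(), request.full_url)
    print(request.data.decode())
    # To actually apply the change: urllib.request.urlopen(request)
```

Because the dictionary is versionless, a call like this takes effect without activating a new service version, which is what makes per-site traffic shifting cheap enough to script.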

Why Does This Matter? What Are The Benefits?

By implementing this specific strategy on Fastly, you gain the following benefits as opposed to other methods:

  1. Ability to view the current weights of each site
  2. Ability to move each site between regions based on its own health checks
  3. Ability to shift weights as desired without deploying a new Fastly service version
  4. All health checks originate from Fastly rather than from application origins

Thanks for reading and hopefully this can help you on your load balancing and failover journey.
