Failing over with CloudFront Origin Groups
Improving availability with CloudFront origin failover using Terraform
You stream a boxing match on your smart TV. Or sign up on the mobile app. Or even browse the website on your laptop. The backend is a key component to enable all this and deliver the DAZN experience that fans expect. Yet, systems can — and will — fail at the most inconvenient of times. Ensuring high availability is a must at our scale because we are streaming live events.
Our traffic is very spiky, peaking moments before a live sports event kicks off. Much of the generic content we serve to millions of users on the platform can be cached. We use CloudFront as a CDN to cache long-lived data at the edge location closest to the user, which helps reduce the load on our backend services. But what happens if the backend itself is having issues? Errors would be returned to the caller, and that wouldn't be a great experience for the eager fans. Enter CloudFront origin failover 🙌
Simply put, if CloudFront receives an error response from the primary origin, it sends the same request to a failover origin.
You must have at least 2 origins set up before you can create an origin group. In the example above, CloudFront would first send an origin request to Custom-example1.com. If that origin returns an error status code listed in the failover criteria, CloudFront would fail over and send the same request to Custom-example2.com. The response from the failover origin is then returned to the caller.
You can also fail over with Lambda@Edge, but it doesn't suit the spiky traffic patterns that we experience. If we invoked a Lambda@Edge function for every request to CloudFront, we would hit the limit on how quickly Lambda can scale out. Currently, that is 500 new concurrent executions per minute.
Rollout of a new service using CloudFront origin failover and Terraform
We have recently been building a new service to replace a legacy one. The new service surfaces data to the frontend, such as the splash background image, titles, team images and other metadata.
As with any new service entering production, we accept that there will be issues during the rollout. We want to minimise the impact radius of any bugs we didn't catch during development, so we need to be able to roll out the replacement service to a small percentage of users at a time until we have high confidence in it. As a precaution, if the replacement service returns an error response, we also fall back to the original service.
The illustration below shows the architecture of the replacement service, with the focus on the CloudFront failover configuration. Sidenote: as you can see, we are fans of serverless architectures 👌
We use Terraform to stand up our environments and services. It's a great way to keep infrastructure as code, and it gives us consistent environments when collaborating as a team. The Terraform AWS provider is receiving an update this week that adds support for CloudFront origin groups.
Here’s the snippet on configuring a CloudFront distribution resource with an origin group.
Run terraform apply and you are good to go. You too can now fail over using CloudFront origin groups with little effort 👍
A note on monitoring failover
Currently, there isn't an out-of-the-box way to track the percentage of primary origin requests that fail. Metrics like TotalErrorRate are reported after taking the responses of failover origins into account. This means that a successful failover response is not counted towards the error rate. At the time of writing, there are no metrics in CloudFront that show you the failure rate of the primary origin. You could roll your own metrics by using Lambda@Edge and checking for the failover origin value under server, but that defeats the entire purpose of keeping things simple by using origin failover.
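As an illustration only, the wiring for such a roll-your-own approach might look like the cache behaviour fragment below. The function name origin_metrics is hypothetical, and the handler itself (which would inspect each response and record which origin served it) is not shown:

```hcl
# Hypothetical sketch: attach a Lambda@Edge function to the
# origin-response event so it runs for every origin response.
# "aws_lambda_function.origin_metrics" is an assumed resource
# defined elsewhere; this is not our production setup.
default_cache_behavior {
  target_origin_id       = "primary-with-failover"
  viewer_protocol_policy = "redirect-to-https"
  allowed_methods        = ["GET", "HEAD"]
  cached_methods         = ["GET", "HEAD"]

  forwarded_values {
    query_string = false

    cookies {
      forward = "none"
    }
  }

  lambda_function_association {
    event_type = "origin-response"
    lambda_arn = "${aws_lambda_function.origin_metrics.qualified_arn}"
  }
}
```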
CloudFront origin failover is a new feature, launched a few months ago at re:Invent 2018. We welcome this improvement to high availability for CloudFront.
It is a simple way to keep your endpoints up. There is also no extra cost for using CloudFront origin failover.
One improvement we would like to see is a CloudWatch metric to track rates of failover.