At Mesosphere we run two CloudFront distributions for our DC/OS Universe downloads service: one for downloads.mesosphere.com and one for downloads.mesosphere.io, where the latter is mostly still in use for historic reasons and only receives a single-digit percentage of total traffic. All requests used to get proxied to a single S3 bucket origin.
Over the last few months we noticed that our traffic on CloudFront grew continuously, climbing into the range of several hundred TB per month, while the total number of requests did not grow nearly as much. That imbalance stood out to engineers with previous experience running high-traffic web services, so we started investigating:
After dumping roughly two days' worth of CloudFront log files into an analyzer script to build a 'top 20 most downloaded files by size' list, together with the requesting User-Agents, as a first step toward optimizing our distribution patterns, I noticed that we should also categorize clients by source location. This, for example, is a feature that CloudFront's built-in reporting does not offer out of the box. We extended the analyzer script to look up each source IP's AWS region (which can also be 'UNKNOWN' for non-AWS clients) using the AWS-supplied ip-ranges.json.
Our origin bucket was already in the us-east-1 region, so we wanted to try to offload traffic by allowing direct downloads from the bucket instead of going through CloudFront, as intra-region traffic is free. Our suspicion was that most downloads in these regions are generated by our own DC/OS CI jobs and other test or integration clusters, as we spin up quite a lot of them during the day. While we brainstormed how best to keep traffic region-local, we noticed the new Lambda@Edge feature (which was still in closed beta back then) that would allow us to modify CloudFront's routing logic far beyond what the existing URI-path-based behaviors could offer.
Lambda@Edge, or L@E for short, is a variant of the AWS Lambda serverless compute feature. It uses the familiar Lambda APIs and environment (although it only allows the Node.js 6.10 runtime right now), and once a function is defined, it can be associated with an existing CloudFront distribution path pattern for one of four different events per request. To achieve the performance necessary to run on each incoming request, the functions are limited in maximum execution time, memory consumption, and allowances for external requests. The functions are then automatically replicated to all other AWS regions so they run closest to the CloudFront edge location receiving the request. For more on the technical details of L@E, please see the official documentation: http://docs.aws.amazon.com/lambda/latest/dg/lambda-edge.html
After getting enrolled in the beta program, we implemented a first function that inspects the source IP field in the Lambda event, checks it against a list of known AWS IP ranges, and then either forwards the request unmodified or answers with an HTTP 301 redirect. For each AWS region a redirection target (an S3 bucket URL) can be given. Traffic from regions with no dedicated direct-download origin does not get redirected.
This function ran on a dedicated CloudFront distribution for some days while we benchmarked and tested with various of our tools. Noteworthy warning: a simple `curl https://...` call does not follow redirects by default (you need the `-L` flag), so some of our CI jobs had to be changed.
When benchmarking download speeds from our origin bucket to other regions, it was obvious that we could not serve cross-region traffic from a single us-east-1 based bucket. Luckily S3 offers convenient cross-region replication, so we created a second, receiving bucket in us-west-2, the single largest region traffic-wise for us. After an initial sync of the most important files with
`aws s3 sync --exclude='*' --include='*.tar.*z*' s3://downloads.mesosphere.io/ s3://us-west-2-downloads.mesosphere.io/` (that's already a few TB of data that needed to be transferred and took some days), we were pretty hyped to roll out the new feature. Early in the morning the following Monday, we started to serve with our 'viewer_request_local_s3_bucket_redirector.js' function enabled.
Here is a view of our CloudWatch dashboard that shows built-in CloudFront metrics alongside custom metric filters for our Lambda function replicas in all regions. The launch of LambdaUniverse (that's what we call this project internally) is marked in purple below:
The overall download traffic decreased instantly, as most larger files were now served directly from S3 without going through CloudFront. Green vs. orange in the smaller graphs shows the outcome of the invoked Lambda calls: redirect or not. In blue are requests from sources outside of AWS, which wouldn't get redirected anyway. Please note that we activated the redirector only for a small set of potentially large files, based on their suffix. We estimate that there are significant additional savings to be had if we can come up with a smarter way to group large-ish binaries separately from files that are only a few kilobytes and therefore not worth redirecting.
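One way to express that suffix-based gating in code is a simple allow-list check, sketched below. The suffix list is purely illustrative; picking which artifact types belong on it is exactly the grouping problem mentioned above:

```javascript
// Only requests for likely-large artifacts are considered for redirection;
// everything else (index files, small metadata, etc.) stays on CloudFront.
const REDIRECT_SUFFIXES = ['.tar.gz', '.tar.xz', '.tgz', '.zip'];

function shouldRedirect(uri) {
  return REDIRECT_SUFFIXES.some(suffix => uri.endsWith(suffix));
}
```

The same effect can also be achieved without any code, by attaching the Lambda function only to matching CloudFront path-pattern behaviors.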
In terms of costs saved, here is a snapshot from cost explorer:
That's a 46% saving for a random day compared to the weeks before. Looks good, doesn't it? :)
So, back to the click-baity title… “Can Lambda@Edge make AWS CloudFront the most flexible CDN out there?”
I certainly think it can. The DIY possibilities L@E offers seem unmatched by other CDNs and services I've used before, and I'm very keen to see what other use cases people will solve with it. Please let me know!