On the Importance of Correct Headers and Metadata in S3 Origins for AWS CloudFront
By: Regis Wilson
A common pattern for serving static content using Amazon Web Services (AWS) is to use CloudFront to deliver content from Simple Storage Service (S3). This pattern is quick and convenient for getting static assets published and viewed on the internet. However, this naive approach hides some deeper issues and technical problems that need solving.
During a recent project, in which we tried to increase the performance of our landing pages, we discovered some strange problems with the way images are hosted for our site. Read on for more details about how we found the issue and subsequently solved it. This article will illustrate why using correct Hypertext Transfer Protocol (HTTP) headers and S3 metadata properties is so important.
When we first started looking into performance issues with our landing pages, we found references that suggested improvements to the static images being served from S3. We never expected to see suggestions around images and static content, as we always assumed our dynamic site was the main problem. For example, we got a report that stated:
Suggestion: Avoid extra requests by setting cache headers (cacheHeaders)
Details: The page has 120 request(s) that are missing a cache time. Configure a cache time so the browser doesn’t need to download them every time. It will save 2.8 MB the next access.
Suggestion: Avoid using incorrect mime types (mimeTypes)
Details: The page has 73 misconfigured mime type(s).
Suggestion: Don’t use private headers on static content (privateAssets)
Details: The page has 21 request(s) with private headers. Make sure that the assets really should be private and only used by one user. Otherwise, make it cacheable for everyone.
We scratched our heads at these statements; we were certain that we had set up our CloudFront distribution with caching enabled and that all the S3 objects were set to public. The suggestion to apply the correct settings for Multipurpose Internet Mail Extensions (MIME) was particularly worrying, because that meant that some parts of the page might not be set correctly and could possibly be displayed incorrectly, or not at all.
Diving into the Chrome waterfall charts to find out what was happening, we discovered that the reports were correct. Here we highlight a clear problem with the cache headers.
Some strange and inconsistent signals were further revealed by digging into the response headers. The screenshot below highlights how the Content-Type is incorrect (it should be “image/jpeg”) and even though the image is not cached, Amazon claims it is over 24 hours old!
Scrolling down through the response headers, we found that some headers were missing completely (which is hard to demonstrate, so you have to use your imagination). At least one item that was missing was a header that said something like “cache-control: public, max-age=84480.” Further, the Amazon headers seemed to indicate that the image was indeed cached with a “x-cache” header, showing a hit.
A rather disconcerting metric was also being emitted in the CloudFront CloudWatch Metrics graphs. Amazon was reporting that our static image distributions were suffering from miss rates of nearly 60%. In fact, the miss rate exceeded 70% at times.
A cache hit is defined as a request that is serviced directly by CloudFront from the local edge cache. A cache miss is defined as a request that is received by a CloudFront edge location, is not present in the cache, and is subsequently requested from the origin (in this case, an S3 bucket). The cache miss percentage is the number of requests (or sometimes the number of bytes) served from the origin divided by the total number of requests (or bytes) served.
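The arithmetic can be sketched with a small helper (a hypothetical function of our own, not part of any AWS tooling; the counts are illustrative — real numbers come from CloudWatch metrics):

```javascript
// Compute cache hit/miss percentages from raw request (or byte) counts.
function cacheStats(hits, misses) {
  const total = hits + misses;
  return {
    hitRate: (hits / total) * 100,
    missRate: (misses / total) * 100,
  };
}

// Roughly what we were observing: ~60% of requests going back to the origin.
const observed = cacheStats(4000, 6000);
console.log(observed.missRate); // 60
```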
Obviously, a cache hit at the edge should be faster than a request that goes to the origin. It should also be cheaper, because we do not incur transport costs to the origin, nor requests or transfer charges for retrieving objects from S3. A good, well-functioning cache should have a hit rate of 70% or better, which is what we would expect. However, we were observing miss rates of 70%, which is backward, and that was terrible.
Another confusing and mysterious symptom we encountered was a lot of HTTP 304 response codes in the CloudFront access logs and browser developer tools. An HTTP 304 response code means that the browser sent a request to CloudFront with an “If-Modified-Since” header and CloudFront responded with “304 Not Modified.” This piece of the mystery was even more confounding, since we could verify that CloudFront was registering a “hit,” as shown above, but the browser clients were constantly asking “Is it modified? Is it modified yet? Is it modified?” All of these unnecessary requests were potentially slowing down consumers’ experiences and possibly costing us transaction and transfer costs with AWS.
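The conditional-request dance behind those 304s can be sketched as a pure function (a simplified illustration of the server-side logic, not actual CloudFront code; real browsers also apply heuristic freshness rules):

```javascript
// A server (or CDN) compares the client's If-Modified-Since header with the
// object's Last-Modified time; if the object hasn't changed, it answers
// 304 Not Modified with no body, so the client re-uses its cached copy.
function respondToConditionalGet(ifModifiedSince, lastModified) {
  if (ifModifiedSince && Date.parse(lastModified) <= Date.parse(ifModifiedSince)) {
    return { status: 304, body: null }; // nothing to download again
  }
  return { status: 200, body: '<object bytes>' };
}

// Without a Cache-Control lifetime, the browser revalidates on every view,
// generating one of these round trips each time:
respondToConditionalGet('Wed, 01 May 2019 10:00:00 GMT',
                        'Tue, 30 Apr 2019 09:00:00 GMT'); // status 304
```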
Puzzled, we dug deeper into the problem.
The Art of Caching
To reach an optimal setup, we first went back to the drawing board to understand how the distribution, buckets, and objects should be set up. Then we would need to observe how the distribution, buckets, and objects actually were set up and measure the discrepancy. If we discovered any gaps between “should be” and “actually were,” we would need to analyze the impact of those differences on our performance. If the design goals of “should be” were wrong, we’d need to discover a better design. If, however, the design goals were correct, then we’d need to address the discrepancies in how the actual infrastructure was implemented. Finally, we’d need to change our process for creating and rolling out infrastructure and application data so we wouldn’t make a similar mistake again in the future.
We’ll start with the S3 bucket. The S3 bucket needs to have a bucket ACL and a public policy to support serving objects via a CloudFront distribution. We’re not aware of any way to keep buckets and objects private while also serving them without authentication on the internet, say via CloudFront. It is simply the case that an S3 object needs to be world readable in order to be served without authentication via CloudFront or anywhere else. Since we were dealing with public photographs and static assets needed by browsers, we had no problem with this restriction. We also knew that this was working because we could see the images from our browsers on the site.
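A minimal public-read bucket policy looks something like the following (the bucket name is a placeholder, not our actual bucket):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadForStaticAssets",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::example-static-assets/*"
    }
  ]
}
```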
Next we investigated the S3 objects (the images or assets) themselves. This was where we uncovered some discrepancies in the way the data were being written and the metadata properties that were needed to function at the highest level. In the screenshot below you’ll see that there is only one metadata property being set on the object, and it is the incorrect MIME type shown above in the header responses.
Getting the metadata properties such as Content-Type correct can be a big challenge, depending on the methods you use to upload your objects to S3. There are several different and competing ways to copy files with the AWS Command Line Interface (CLI), and there are several more different and competing ways to upload files using the AWS Software Development Kit (SDK) libraries. Depending on how the object is uploaded, the MIME type (via the Content-Type header) will be guessed; in other cases the MIME type will be set to some default; and in other cases the MIME type will not be set at all.
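The kind of extension-based guessing that some upload tools perform (and others skip entirely) can be sketched like this — the function and mapping are our own illustration, not any particular SDK's implementation:

```javascript
// Map file extensions to MIME types, the way many upload tools do.
const MIME_BY_EXTENSION = {
  '.jpg': 'image/jpeg',
  '.jpeg': 'image/jpeg',
  '.png': 'image/png',
  '.gif': 'image/gif',
  '.css': 'text/css',
  '.js': 'application/javascript',
};

function guessContentType(key, fallback = 'binary/octet-stream') {
  const dot = key.lastIndexOf('.');
  const ext = dot >= 0 ? key.slice(dot).toLowerCase() : '';
  // S3 falls back to a generic default when nothing is supplied.
  return MIME_BY_EXTENSION[ext] || fallback;
}

guessContentType('photos/car-123.JPG'); // 'image/jpeg'
guessContentType('photos/car-123');     // 'binary/octet-stream'
```

If the upload path never runs logic like this — or runs it against an incomplete mapping — the object lands in S3 with a generic or missing Content-Type.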
This gave us the first clue to why the cache miss percentage was so high. We determined the correct minimum amount of metadata for images, shown below.
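Expressed as code, that minimum metadata amounts to setting Content-Type and Cache-Control explicitly at upload time. The helper below is a sketch (names and the one-year TTL are our own choices); the resulting object would be passed to the AWS SDK's `putObject`:

```javascript
// Build S3 putObject parameters with the minimum metadata for an image.
function imageUploadParams(bucket, key, body) {
  return {
    Bucket: bucket,
    Key: key,
    Body: body,
    ContentType: 'image/jpeg',                 // correct MIME type
    CacheControl: 'public, max-age=31536000',  // cacheable for a year
  };
}

// Usage (requires the aws-sdk package and credentials):
//   new AWS.S3().putObject(imageUploadParams('my-bucket', 'car.jpg', buf)).promise();
```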
Examining our S3 buckets and objects definitely uncovered opportunities for improvement and so far our design goals were correct: we simply needed to update the way our assets are uploaded and have appropriate metadata set for each object.
To achieve good caching results in CloudFront, specific attributes need to be configured for the distribution. When CloudFront is configured to forward request headers to the origin, every forwarded header and its value becomes part of the cache key. Forwarding everything is arguably safer than restricting which headers and values to use, but it is extremely granular and inefficient for good cache hit rates. CloudFront allows you to select a whitelist of headers to be used in the cache key, so you should choose wisely when setting up a behavior inside a distribution. The screenshot below shows the default settings, which would be suitable for a well-behaved dynamic site.
Our design for static images and objects called for storing objects in S3. Public S3 objects do not need (or even want) any headers from the client to successfully deliver the content. Thus, we can strip all the headers (including cookies) in the static assets behavior. We can also bump up the minimum (and default) Time To Live (TTL) settings. We have learned over time that allowing query strings in the cache key has benefits for testing and for “busting the cache” if necessary: it is easy to add a unique query parameter to an image request to verify that the objects being served are bypassing the cache during testing.
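In CloudFormation terms, the behavior we describe might look something like this (a sketch with placeholder names and TTL values, not our actual template):

```yaml
# Static-assets cache behavior: no headers or cookies in the cache key,
# query strings forwarded (useful for cache busting), long TTLs.
CacheBehaviors:
  - PathPattern: "images/*"
    TargetOriginId: s3-static-assets
    ViewerProtocolPolicy: redirect-to-https
    MinTTL: 86400            # bump the minimum TTL for static objects
    DefaultTTL: 86400
    MaxTTL: 31536000
    ForwardedValues:
      QueryString: true      # keep query strings in the cache key
      Cookies:
        Forward: none        # strip cookies
      Headers: []            # strip all headers from the cache key
```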
Unfortunately, we spotted some issues from our design right off the bat, as shown in the following screenshot.
Can you figure out what the problem is in Fig. 8? We relied on the Origin Cache Headers being set correctly from S3. As we saw in the section above, the S3 object headers were not correct and so we could not rely on them being set properly. Further investigation uncovered more distributions with settings that were well-intentioned but ultimately misguided for other static assets. The next screenshot shows how someone might have simply set the “Customize” flag on the object caching parameter and left the minimum TTL setting alone.
The misleading theory behind this setting is that we don’t care what the minimum TTL is, because we’ll either set the value in the origin, or else we’ll use the default TTL value instead. The reason this is so insidious is that this setting fools you into believing that the default TTL will be used by the clients. In fact, the clients have no way of knowing the default TTL that CloudFront is using: they will only get a hint that the object is cacheable from the ETag and Age headers.
That is to say, CloudFront very well may cache the objects with the settings in Fig. 9, but the client has no knowledge of how to treat the object. This was confirmed by the number of MISS requests and MISS bytes, and the prevalence of HTTP 304 status codes. The correct settings for purely static assets that have a long expiration time are shown below.
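The client-side half of this can be sketched as follows — a deliberately simplified model of browser freshness (real browsers also apply heuristic freshness based on Last-Modified; the helper is ours, for illustration only):

```javascript
// A browser decides whether its cached copy is still fresh using only the
// response headers it can see; CloudFront's internal default TTL is invisible.
function browserCanReuseCachedCopy(headers, ageSecondsSinceFetch) {
  const cc = headers['cache-control'] || '';
  const match = cc.match(/max-age=(\d+)/);
  if (!match) {
    // No explicit lifetime: the browser revalidates (If-Modified-Since),
    // producing exactly the flood of 304s we were seeing.
    return false;
  }
  return ageSecondsSinceFetch < Number(match[1]);
}

browserCanReuseCachedCopy({}, 10);                                           // false
browserCanReuseCachedCopy({ 'cache-control': 'public, max-age=86400' }, 10); // true
```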
We had finally solved the puzzle presented to us by all the symptoms we had diagnosed above.
Knowing Is Only Half the Battle
G.I. Joe was famous for teaching kids that knowing something was half the battle. However, the unspoken problem with this saying is that the other half of the battle is what determines the outcome. We need to move from only knowing the problem to doing something about the problem.
We were now faced with figuring out how to fix the issues presented above. We knew how to fix an individual problem with, say, one behavior setting for one CloudFront distribution. Adjusting the settings on multiple behaviors for multiple CloudFront distributions would be relatively straightforward, but rolling out changes to several production distributions could be a risky and lengthy process.
The problem scale explodes drastically, however, because we needed to also fix all the Content-Type and Cache-Control metadata headers on millions or billions of S3 objects across several buckets. If we merely changed the CloudFront distribution settings, we would not be able to take full advantage of the client side cache, and so we would only partially solve the backend problem without fixing the frontend problem facing our consumers.
We needed a way to update the response headers coming from S3 responses for every request to CloudFront. We also needed a way to correctly pass any headers that were being sent correctly from S3 if they existed. The long-term goal would be to fix all of the metadata properties in the millions or billions of objects, but that would take a long time to execute and would involve changes across several teams with multiple pipelines across many buckets. We needed something more global that we would use to accomplish this quickly to fix the issues we were seeing.
We already had a Lambda@Edge process running on a lot of our CloudFront distributions for the origin response phase. We decided we could write some code to detect an S3 origin response, check for improper or missing headers, and then insert or change the headers for the client. Included below is a code snippet in NodeJS for the Origin Response phase of Lambda@Edge.
We performed a quick test with one image and peeked inside the developer tools for the browser. The results were massively better with the correct MIME type, cache-control headers set, and no warnings issued in the browser or in our test reports!
Confident that these fixes were working, we began rolling out the changes in Lambda@Edge. In the following image you can easily spot when the Lambda@Edge header fixes were deployed for one distribution. This distribution contains billions of car images that are static and need to be displayed for our consumers to see the vehicles they are shopping for. Cache miss requests dropped from 50–70% to less than 20%. Cache hit percentage increased from 40% to 50%.
In the following image, bytes transferred by CloudFront from misses decline as well. Overall, bytes from this distribution dropped from 12TB per week to 3TB per week. That is a 75% reduction in bytes transferred by misses, achieved just by changing the headers. We marked the deploy date with a red arrow to emphasize the drop. It is not as clear or dramatic as the image in Fig. 11.
In the following image, the drop-off for cache miss requests is significant, and even approaches the small error rate. The percentage of cache miss requests drops from 40% to below 10%. (The error rate percentage seems high, but that is because this distribution serves a lot of unrelated legacy content.)
Similar to the other use cases, the following image shows a sharp decline in bytes transferred by cache miss requests.
Furthermore, a longer-term view shows how cache misses decreased and cache hits increased for one of our larger distributions. The cache misses decreased after we fixed the settings in CloudFront and added headers via Lambda@Edge. The cache hits increased after we extended the minimum TTL to cache the long tail of images.