Keeping it Fresh on the Web: HTTP Content Revalidation

By Eric Friedrich
Published in disney-streaming, Sep 23, 2020

Delivering video to millions of Disney+ subscribers means moving massive amounts of video data across the Internet. The less data Disney+ needs to distribute, the faster video is made available to customers, and the better the viewing experience. Improving this experience is a key goal of the Content Delivery team, and the purpose of a content distribution pipeline. While there are many stages to this pipeline, one optimization applied universally across all stages is HTTP content revalidation. Content revalidation is a tool for reducing the amount of content transferred between Disney+’s servers and client devices.

A promise to do a fraction of the amount of work while expecting better results is best met with skepticism. Here though, a few key assumptions about the content and some simple technology combine to provide just about the closest thing in networking there is to a free lunch.

Content on the web moves fast and is always changing. News websites carry the latest headlines, and social media influencers upload dozens of posts daily. Although the top stories update frequently, the contents of an image or video stream don’t change once published. New photos may be added, but PictureOfCargoTruck.jpg is the same whether it is viewed today or two weeks from now.

Lots of content is static and changes infrequently if ever: photos and artwork, JavaScript, and of course video content. Inside of Disney+, published media is occasionally changed for various reasons, such as to update encoding or packaging.

Caching and CDNs

System Block Diagram of the Disney+ CDN
Figure 1: Disney+ Content Delivery Networks

Hypertext Transfer Protocol (HTTP) is the main communication protocol for transferring web and streaming video data between a client and a server. The term “client” in this context covers browsers and device-based apps as well as servers that function as web caches. Every HTTP operation uses a verb (formally, a request method) to describe what the client is doing. Typically this verb is a GET to request data from a server, or sometimes a POST to send data to the server. When a client issues a GET request for a specific object, the client is asking the server to send the entire contents of the object in response.

Requesting a complete copy of the content from the origin servers on every view is impractical, so the CDN spreads out the request load amongst many servers or caches that temporarily store the most popular content. When video content streams from Disney+ to customer devices, the content works its way through a series of servers designed to scale to the millions of global subscribers that use the service every day.

Multiple tiers of servers comprise Disney+’s CDN:

  • Origin: Dense storage servers that hold complete copies of the D+ catalog.
  • Origin Protection: Multiple data centers of shield caches to protect the origin from surges in traffic.
  • Commercial CDNs: Deliver the video down the last mile onto subscribers’ devices.

When a request for content arrives at a server, the cache checks its local storage for a copy of the object. If the object is present (cache hit) and fresh (more on this below), then it is delivered to the client. If the object is not present (cache miss), then it is fetched from the next server up the pipeline. For example, subscribers request content from the CDNs, CDNs from the origin protection layer, and those from the origin.
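The hit-or-miss walk up the pipeline can be sketched in a few lines of Python. This is only an illustration of the idea, not how the Disney+ caches are built: the `Tier` class, the tier names, and the object path are all hypothetical, and freshness is ignored here.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Tier:
    """One layer of the pipeline: a cache that falls back to the next tier up."""
    name: str
    upstream: Tier | None = None                      # None marks the origin
    store: dict[str, bytes] = field(default_factory=dict)

    def get(self, key: str) -> tuple[bytes, str]:
        """Return the object and the name of the tier that supplied it."""
        if key in self.store:                         # cache hit
            return self.store[key], self.name
        if self.upstream is None:                     # the origin has nobody left to ask
            raise KeyError(key)
        body, served_by = self.upstream.get(key)      # cache miss: ask one tier up
        self.store[key] = body                        # keep a copy for later requests
        return body, served_by


# Hypothetical three-tier chain mirroring the list above.
origin = Tier("origin", store={"/video/seg1.mp4": b"segment-bytes"})
shield = Tier("origin-protection", upstream=origin)
cdn = Tier("cdn", upstream=shield)

first = cdn.get("/video/seg1.mp4")    # misses at cdn and shield, supplied by origin
second = cdn.get("/video/seg1.mp4")   # now a hit at the cdn tier
```

After the first request, every tier on the path holds a copy, so the second request never leaves the outermost cache.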

Object Freshness

Transferring a new copy of an unmodified object for every request is unnecessary, even for cache hits. Often, it’s enough to keep a local copy in a cache on a server or a browser for some duration (known as the Time-To-Live, or TTL). Any accesses to that object within the TTL are unchecked as the object is considered fresh. Think of this as letting a cucumber sit in a refrigerator for the first few days after the purchase.

Once that TTL expires, the object becomes stale and needs to be refreshed or revalidated before it can be used again (similarly to how a cucumber gets a good squeeze every few days to make sure it hasn’t gone bad). Rather than transfer a full copy of the object, HTTP clients perform a revalidation to check if the object is unmodified. If so, the local copy is used as-is. Otherwise, if the object has changed, a new copy of the object is delivered and may be stored for later reuse.
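As a rough sketch of the freshness rule, a cached copy is fresh while its age is below its TTL. The `CachedObject` type and the one-hour TTL below are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CachedObject:
    body: bytes
    stored_at: float   # seconds: when the copy was stored or last revalidated
    ttl: float         # seconds the copy may be served without checking


def is_fresh(obj: CachedObject, now: float) -> bool:
    """A copy is fresh while its age is still below its TTL."""
    return (now - obj.stored_at) < obj.ttl


# Hypothetical object cached at t=0 with a one-hour TTL.
poster = CachedObject(body=b"image-bytes", stored_at=0.0, ttl=3600.0)
fresh_after_10_min = is_fresh(poster, now=600.0)     # inside the TTL
fresh_after_2_hours = is_fresh(poster, now=7200.0)   # past the TTL: stale, revalidate
```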

The mechanics of content revalidation are based around the ideas of a “conditional GET” request along with a freshness check or “validator” algorithm. For example, a command line HTTP client (curl) requests a logo file from the Disney+ webpage.

Figure 2: HTTP GET Request (unconditional)

The response includes metadata about the object in the HTTP Response Headers that is useful to the revalidation process. All of these headers are optional and will not be present in every response. There may also be conflicts between these headers, for which there is a detailed precedence to resolve.
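One well-known case of that precedence, taken from the HTTP caching specification (RFC 9111) rather than from this article: when a response carries both `Cache-Control: max-age` and `Expires`, `max-age` wins, and a shared cache prefers `s-maxage` over both. The helper below is a simplified sketch with a hypothetical function name; it assumes header names are spelled exactly as shown and skips the heuristics real caches apply.

```python
from __future__ import annotations

from email.utils import parsedate_to_datetime


def freshness_lifetime(headers: dict[str, str], shared_cache: bool = False) -> float | None:
    """Return the TTL in seconds implied by the response headers, or None."""
    directives: dict[str, str] = {}
    for part in headers.get("Cache-Control", "").split(","):
        name, _, value = part.strip().partition("=")
        if name:
            directives[name.lower()] = value.strip('"')

    if shared_cache and "s-maxage" in directives:     # shared caches only
        return float(directives["s-maxage"])
    if "max-age" in directives:                       # overrides Expires
        return float(directives["max-age"])
    if "Expires" in headers and "Date" in headers:    # fallback: Expires minus Date
        expires = parsedate_to_datetime(headers["Expires"])
        date = parsedate_to_datetime(headers["Date"])
        return (expires - date).total_seconds()
    return None                                       # no explicit freshness information


# Hypothetical response headers with conflicting lifetimes.
conflicting = {
    "Cache-Control": "public, max-age=604800",
    "Date": "Wed, 23 Sep 2020 00:00:00 GMT",
    "Expires": "Thu, 24 Sep 2020 00:00:00 GMT",
}
ttl_seconds = freshness_lifetime(conflicting)         # max-age wins: 604800.0 (7 days)
```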

Making It Conditional

If a browser returns to Disney+ at any time within the next week (the logo’s TTL in this example), the logo is still fresh and the browser can display the image without asking the server. Once those seven days expire, the logo becomes stale. The browser then uses a conditional GET to check whether the existing logo object has changed. There are two main forms of a conditional request, one based on the ETag checksum and another based on the Last-Modified timestamp.

Just as HTTP responses have header metadata, so do HTTP requests. Request headers can be used to communicate properties about the request, like its formatting, or to modify the server’s behavior in processing the request. It is the responsibility of the server to compare the request headers to the latest content and decide if the client should receive the complete object in the response or just a brief “carry on” without the object.

Conditional GET with 304 response
Figure 3: HTTP Conditional GET

When using the ETag validator, the client asks the server to return the updated object only if the content stored on the server has a different checksum. To compare ETags, the client sends a request with the current ETag in the “If-None-Match” request header. When using the Last-Modified validator, the client asks the server to return the updated object only if the content was modified since the cached Last-Modified time using a request header aptly named “If-Modified-Since”.
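The server-side decision might look something like the sketch below. It is not Disney+’s implementation: the function names are hypothetical, ETag lists are split naively on commas, and the rule that `If-None-Match` takes precedence when both headers are present comes from the HTTP specification (RFC 9110).

```python
from __future__ import annotations

from email.utils import parsedate_to_datetime


def _opaque(tag: str) -> str:
    """Drop the weak-validator prefix so W/"abc" compares equal to "abc"."""
    return tag[2:] if tag.startswith("W/") else tag


def is_not_modified(request_headers: dict[str, str], current_etag: str, last_modified: str) -> bool:
    """Return True when the server may answer 304 instead of sending the object."""
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When present, If-None-Match decides and If-Modified-Since is ignored.
        tags = [tag.strip() for tag in if_none_match.split(",")]
        return "*" in tags or any(_opaque(t) == _opaque(current_etag) for t in tags)

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        # Unmodified if the object has not changed after the client's timestamp.
        return parsedate_to_datetime(last_modified) <= parsedate_to_datetime(if_modified_since)

    return False                                      # unconditional GET: send the object


# Hypothetical validators for the object currently stored on the server.
ETAG = '"abc123"'
MODIFIED = "Wed, 23 Sep 2020 12:00:00 GMT"

same_etag = is_not_modified({"If-None-Match": '"abc123"'}, ETAG, MODIFIED)
old_etag = is_not_modified({"If-None-Match": '"old"'}, ETAG, MODIFIED)
same_time = is_not_modified({"If-Modified-Since": MODIFIED}, ETAG, MODIFIED)
```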

If the object is unmodified, the server responds with an HTTP 304 “Not Modified” status code and no response body. The client uses its stored copy of the object, marks the object as fresh, and resets the Time To Live.

Figure 4: HTTP GET Request (conditional)

If the object has changed on the server, the response is instead a standard HTTP 200 OK as in the original unconditional GET request. Finally, the client updates the content and metadata in its local store.
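On the client side, the two outcomes can be sketched as follows. The `Entry` type and both function names are hypothetical, and the sketch assumes the TTL stays the same across revalidations, which a real cache would instead recompute from the response headers.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Entry:
    body: bytes
    etag: str
    stored_at: float   # seconds: when the copy was stored or last revalidated
    ttl: float         # seconds the copy stays fresh


def conditional_headers(entry: Entry) -> dict[str, str]:
    """Request headers that turn a plain GET into a conditional GET."""
    return {"If-None-Match": entry.etag}


def apply_response(entry: Entry, status: int, etag: str | None, body: bytes, now: float) -> Entry:
    """Update the local store from the server's answer to a conditional GET."""
    if status == 304:
        entry.stored_at = now          # same bytes, freshness clock restarts
        return entry
    if status == 200:
        # The object changed: replace content and metadata.
        return Entry(body=body, etag=etag or "", stored_at=now, ttl=entry.ttl)
    raise ValueError(f"unexpected status {status}")


stale = Entry(body=b"old-logo", etag='"v1"', stored_at=0.0, ttl=604800.0)
request_headers = conditional_headers(stale)
revalidated = apply_response(stale, 304, None, b"", now=700000.0)
replaced = apply_response(stale, 200, '"v2"', b"new-logo", now=700000.0)
```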

Revalidation in the Disney+ CDN

While the logo.svg example looks at a relatively small 6 KB image file, the true benefit of revalidation appears when applied to streaming video. A typical 8-second video segment averages 10 MB in size, while a 304 Not Modified response is typically 0.5 to 1 KB. Taken across the entire Disney+ CDN, the time and bandwidth savings are immense.
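To put those sizes side by side, here is a back-of-the-envelope calculation. It assumes decimal units (1 MB = 1,000,000 bytes), which the figures above do not specify.

```python
SEGMENT_BYTES = 10 * 1_000_000        # typical 8-second segment: 10 MB (decimal units assumed)
NOT_MODIFIED_BYTES = (500, 1_000)     # 304 response: 0.5 KB to 1 KB


def size_ratio(full_bytes: int, revalidation_bytes: int) -> float:
    """How many times larger the full transfer is than the 304 response."""
    return full_bytes / revalidation_bytes


ratios = [size_ratio(SEGMENT_BYTES, n) for n in NOT_MODIFIED_BYTES]
bytes_saved_share = 1 - NOT_MODIFIED_BYTES[1] / SEGMENT_BYTES   # even with the larger 304
```

Under these assumptions a full segment is 10,000 to 20,000 times larger than the 304 that replaces it, so a successful revalidation avoids more than 99.9% of the bytes.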

Median time to serve, 304s are much faster than 200s
Figure 5: Delta in time to serve — 200 vs. 304 revalidation responses.

Owing to their small size, these 304 responses are much faster to serve. The median 200 response takes about 20 times as long to serve as the median 304 response. Many caches keep the metadata in memory, meaning some revalidations can be performed without a disk access to read or write content.

Daily requests per second, many more 304s than 200s

Some locations within the CDN serve 5 times as many 304 responses as 200 responses. This heavy use of revalidation, coupled with the speed improvement, highlights its benefits.

Cache efficiency is a key indicator of the performance of the CDN, measuring the ratio of egress bandwidth (hits+misses) to fetch bandwidth (misses only) in bytes. Efficiency shows how much traffic is served by the caches, rather than reaching back to the origin. Higher efficiency means more bytes are being served by the caches. The graph shows that long TTLs and content revalidation lead to a very efficient system.
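The ratio described above is simple to compute. In the sketch below the function names and byte counts are made up for illustration; the second function, the share of egress bytes that never touched the origin, is a derived figure and not a metric the article defines.

```python
def cache_efficiency(egress_bytes: float, fetch_bytes: float) -> float:
    """Egress bandwidth (hits + misses) divided by fetch bandwidth (misses only)."""
    return egress_bytes / fetch_bytes


def served_from_cache_share(egress_bytes: float, fetch_bytes: float) -> float:
    """Fraction of egress bytes that did not have to be fetched from upstream."""
    return 1 - fetch_bytes / egress_bytes


# Hypothetical day of traffic: 1,000 GB sent to clients, 50 GB fetched from upstream.
EGRESS_GB = 1_000
FETCH_GB = 50

efficiency = cache_efficiency(EGRESS_GB, FETCH_GB)          # 20 bytes out per byte fetched
cache_share = served_from_cache_share(EGRESS_GB, FETCH_GB)  # about 95% served by the cache
```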

Content revalidation is one of many technologies that enable Disney+ to offer great streaming experiences. With a very large, often static library, Disney+ content is highly cacheable and benefits greatly from longer TTLs. Subscribers experience these benefits as faster load times, less rebuffering, and higher quality viewing experiences.

Photo by Fahrul Azmi on Unsplash.com
