This is an overview of how the Financial Times serves requests to www.ft.com. It starts with our domains and goes all the way down to our Heroku applications, covering everything in between.
Table of Contents
- Domain Name System
- Content Delivery Network
- Service Registry
- The End Result
Domain Name System (DNS)
Two of our most important domains are www.ft.com and ft.com (which is also known as the apex domain). Both domains point to our content delivery network (CDN).
Typically the CNAME record for www.ft.com will resolve to the same four A records as ft.com.
;; ANSWER SECTION:
www.ft.com. 3205 IN CNAME f3.shared.global.fastly.net.
f3.shared.global.fastly.net. 14 IN A 22.214.171.124
f3.shared.global.fastly.net. 14 IN A 126.96.36.199
f3.shared.global.fastly.net. 14 IN A 188.8.131.52
f3.shared.global.fastly.net. 14 IN A 184.108.40.206

;; ANSWER SECTION:
ft.com. 13336 IN A 220.127.116.11
ft.com. 13336 IN A 18.104.22.168
ft.com. 13336 IN A 22.214.171.124
ft.com. 13336 IN A 126.96.36.199
Fastly maintain servers in over 50 locations around the world, but we only see 4 IP addresses in our DNS queries.
So how does our traffic end up talking with the closest available Fastly server?
Fastly manage traffic on their network using the Border Gateway Protocol (BGP) and Anycast routing, allowing them to send requests to the nearest point of presence while avoiding unplanned outages and locations that are down for maintenance.
Anycast is a network addressing and routing method in which datagrams from a single sender are routed to any one of several destination nodes, selected on the basis of which is the nearest, lowest cost, healthiest, with the least congested route, or some other distance measure.
Fastly route around outages in two ways. The first is at the DNS layer, where they update their DNS records to avoid the problematic location. The second is at the network layer, where they broadcast new routes using BGP, which alters the path that a request’s TCP packets will take between routers.
At the end of all this we eventually connect to a Fastly server, so what happens next?
Content Delivery Network (CDN)
We use a CDN to reduce the number of requests made to our applications running in Heroku.
Much of our content is the same for all users, typically only a little different if you are logged in or not. If we cache these different versions in the CDN we can serve requests without even bothering the Heroku applications.
Our setup allows us to cache ~94% of all requests, with a cache hit rate of ~90%. So if we see something like 9,000,000 requests during a morning peak, by using the CDN’s cache we only pass on ~900,000 requests to our Heroku applications.
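The arithmetic behind that figure can be checked with a short Python sketch. It assumes the ~90% hit rate applies to all incoming requests, which is what the numbers above imply; the function name is hypothetical and only here for illustration.

```python
def origin_requests(total_requests: int, hit_rate: float) -> int:
    """Estimate how many requests miss the CDN cache and reach the origin.

    Assumption: hit_rate is the share of ALL requests answered from cache.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return round(total_requests * (1.0 - hit_rate))


# Morning peak figures quoted in the text.
print(origin_requests(9_000_000, 0.90))
```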
Let’s take a look at the caching headers for our home page (add a
Fastly-Debug: 1 header to your request to see all these response headers).
GET / HTTP/1.1
Fastly-Debug: 1

HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
Date: Fri, 24 Nov 2017 09:24:39 GMT
Cache-Control: max-age=0, no-cache, no-store, must-revalidate
Surrogate-Control: max-age=86400, stale-while-revalidate=86400, stale-if-error=86400
Vary: Accept, Accept-Encoding
Here the main response headers we’re interested in are Age, Cache-Control, and Surrogate-Control.
Age states how long, in seconds, this response has been cached by Fastly. It also helps to indicate that this response was successfully served by the cache.
Cache-Control defines several directives, but in summary is saying this response should not be cached.
Surrogate-Control, however, states in its max-age directive that the response can be cached for 24 hours (the directive’s value is defined in seconds). Fastly respects this header over any Cache-Control header. This allows us to define different caching rules for browsers and the CDN, as browsers ignore the Surrogate-Control header.
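To make the precedence concrete, here is a minimal Python sketch of how a browser and a surrogate cache might each derive a time to live from the home page headers above. It is a simplification under stated assumptions: it only looks at max-age, ignores directives such as no-store and s-maxage, and the function names are hypothetical rather than anything Fastly exposes.

```python
def parse_max_age(header_value):
    """Return the max-age directive as an int, or None if absent or invalid."""
    for directive in header_value.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def cache_ttls(headers):
    """Return (browser_ttl, cdn_ttl) in seconds for a response.

    Browsers only read Cache-Control. The CDN prefers Surrogate-Control
    and falls back to Cache-Control when it is missing.
    """
    browser_ttl = parse_max_age(headers.get("Cache-Control", "")) or 0
    surrogate_ttl = parse_max_age(headers.get("Surrogate-Control", ""))
    cdn_ttl = surrogate_ttl if surrogate_ttl is not None else browser_ttl
    return browser_ttl, cdn_ttl


home_page = {
    "Cache-Control": "max-age=0, no-cache, no-store, must-revalidate",
    "Surrogate-Control": "max-age=86400, stale-while-revalidate=86400, stale-if-error=86400",
}
print(cache_ttls(home_page))
```

With the home page headers, the browser gets a TTL of zero while the CDN keeps the response for a day.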
We also define stale-while-revalidate and stale-if-error directives, which tell Fastly that we are happy to serve responses from the cache even if the cached object’s Age has exceeded what’s defined in max-age.
stale-while-revalidate allows us to respond using a stale response while grabbing a fresh copy in the background, ensuring we’re responding to requests as quickly as possible.
stale-if-error is critical to how we deal with outages and errors. It tells Fastly to serve from the cache, including stale responses, if the backend is responding with errors. This gives us time to fix issues while reducing the impact on our users when things go wrong.
A quirk of Fastly means you must specify a
max-age of over 61 minutes to ensure your response is cached on disk, and therefore available for a longer period of time in the CDN to serve stale. A cached object that’s only in memory can be removed for several reasons well before it’s deemed stale.
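The interplay of max-age, stale-while-revalidate and stale-if-error can be sketched as a small decision function. This is an illustrative Python model only, not Fastly's actual logic; the function names and the returned strings are made up for the example, and the memory versus disk quirk above is not modelled.

```python
def parse_directives(header_value):
    """Parse 'a=1, b=2' into {'a': 1, 'b': 2}, keeping numeric directives only."""
    directives = {}
    for part in header_value.split(","):
        name, _, value = part.strip().partition("=")
        if value.isdigit():
            directives[name.lower()] = int(value)
    return directives


def cache_decision(age, surrogate_control, backend_healthy=True):
    """Decide how a cache could answer, given the object's age in seconds."""
    directives = parse_directives(surrogate_control)
    max_age = directives.get("max-age", 0)
    if age < max_age:
        return "serve fresh"
    staleness = age - max_age  # seconds past the freshness lifetime
    if backend_healthy and staleness <= directives.get("stale-while-revalidate", 0):
        return "serve stale, revalidate in background"
    if not backend_healthy and staleness <= directives.get("stale-if-error", 0):
        return "serve stale, backend is erroring"
    return "fetch from backend"


SURROGATE_CONTROL = "max-age=86400, stale-while-revalidate=86400, stale-if-error=86400"
print(cache_decision(100, SURROGATE_CONTROL))
print(cache_decision(90_000, SURROGATE_CONTROL))
print(cache_decision(90_000, SURROGATE_CONTROL, backend_healthy=False))
```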
The stale-while-revalidate and stale-if-error directives are part of an extension to the original Caching specification and are also supported in modern browsers.
The Vary header in the response is another part of the Caching specification.
It allows us to store different versions of a response depending on headers in the request.
Consider the Accept-Encoding header in a request. Let’s say we make two requests to the same URL, the first with Accept-Encoding: gzip and the second with no such header.
We will actually serve two different responses. The first will come back with a header of Content-Encoding: gzip and will be compressed using gzip. The second will not contain a Content-Encoding header and will be uncompressed.
It would be pretty bad for us to serve a compressed version of the page if the client does not ask for it. For this reason we must cache these responses separately, and this is where the
Vary header comes in.
In this example we would respond with
Vary: Accept-Encoding. This indicates that caches should store a separate version of the response depending on the value of the
Accept-Encoding header in the request. Such caches include a client’s browser and our Fastly service.
For the website we actually take this a step further within Fastly and include several request headers that are decorated in preflight (as discussed later), so that when we serve different responses for A/B tests for example (see
Vary: FT-Flags) we are still able to cache them in the CDN.
Given we tell Fastly to cache our front page for a whole day, how are we able to serve the latest version of the page to all our users?
By using the Fastly API we are able to purge the cached content. We also have an event driven system (using AWS Kinesis) that knows when content has changed. We use this information to issue purge requests and serve the very latest news to our users.
Fastly supports several types of purging. The simplest method is to issue a hard purge by URL, but this may result in a slower response for a few users.
Our autonomous systems make heavy use of soft purging by surrogate key, as this should result in no end user impact, and ensure all related content is purged, even if it exists on multiple URLs (e.g.
How does soft purging result in no end user impact? It is very similar to what we discussed earlier in our use of
stale-while-revalidate. Soft purging in essence marks cached responses with the given surrogate key as stale, even if they are still fresh according to their
max-age value. This then allows Fastly to serve the stale response until they’ve fetched a fresh version in the background.
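The idea of soft purging by surrogate key can be sketched with a tiny in-memory cache. This Python model is purely illustrative: the class, the URLs and the key name are hypothetical, and it says nothing about how Fastly stores objects internally.

```python
from dataclasses import dataclass


@dataclass
class CachedResponse:
    body: str
    surrogate_keys: set
    stale: bool = False


class SoftPurgeCache:
    def __init__(self):
        self._store = {}

    def put(self, url, body, surrogate_keys):
        self._store[url] = CachedResponse(body, set(surrogate_keys))

    def soft_purge(self, surrogate_key):
        """Mark every response tagged with surrogate_key as stale; return the count."""
        purged = 0
        for response in self._store.values():
            if surrogate_key in response.surrogate_keys:
                response.stale = True
                purged += 1
        return purged

    def get(self, url):
        """Return (body, needs_refresh). Stale entries are still served."""
        response = self._store.get(url)
        if response is None:
            return None, True
        return response.body, response.stale


cdn = SoftPurgeCache()
cdn.put("/", "home page", {"article-123"})        # hypothetical key
cdn.put("/world", "world page", {"article-123"})
cdn.put("/about", "about page", set())
print(cdn.soft_purge("article-123"))  # both pages carrying the key are marked stale
print(cdn.get("/"))                   # still served, but flagged for a background refresh
```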
The Fastly Black Box
Fastly maintain their own fork of Varnish and have heavily modified it to suit their platform, so while this means we define our logic in Varnish Configuration Language (VCL) we must refer to the Fastly documentation more than Varnish’s.
For www.ft.com however we are not using the h2o part of the Fastly black box. In order to support TLS 1.0 and 1.1 for IE 10 support we are instead pointing at a different bit of their infrastructure to handle the TLS termination.
An important part of what we do to a request in Fastly is decorating it with a whole bunch of metadata (e.g. session state, A/B test groups, etc.). This is handled by our Preflight application.
There’s a complex bit of VCL that passes the request to Preflight, takes the response and enriches the original request, then restarts the Varnish state machine to either serve the request from cache or fetch a fresh response from our applications.
What makes us Platinum?
Simplifying what happens in Fastly, we allow their platform to do a bit of caching, and every now and then ask our applications for new content.
To be platinum we must be able to serve requests from two regions, to cater for an outage of a whole region. For us that means we run in Heroku’s EU and US regions.
When Fastly does talk to our applications it actually runs through a snippet of VCL that determines which region should serve the request. Ideally this is the closest region to our visitor (e.g. a request from New York should be served by the US Heroku region). However, if a region is unhealthy, which we continually monitor for, our Fastly service will fall back to the other, hopefully healthy, region.
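The region choice described here lives in VCL, but the decision itself is simple enough to sketch in Python. Everything below is an assumption for illustration: the region identifiers, the preference table and the health map are hypothetical, and the real health checks are performed by Fastly.

```python
# Order of preference for each "closest" region (hypothetical identifiers).
REGION_PREFERENCE = {
    "us": ["us", "eu"],
    "eu": ["eu", "us"],
}


def choose_region(closest_region, health):
    """Return the first healthy region in preference order, or None if all are down."""
    for region in REGION_PREFERENCE[closest_region]:
        if health.get(region, False):
            return region
    return None


# A visitor in New York is normally served by the US region...
print(choose_region("us", {"us": True, "eu": True}))
# ...but falls back to the EU region when the US region is unhealthy.
print(choose_region("us", {"us": False, "eu": True}))
```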
Preflight is a Heroku application, which lives at https://github.com/Financial-Times/next-preflight.
Preflight forwards the user’s request for a web page to several other FT APIs in order to decorate the request with various properties.
Preflight gathers test information from our Ammit service, vanity URLs from our URL management service, subscription information from membership’s Access service, barrier page information from our Barrier Guru service, and finally session information from membership’s Session service.
By doing this in Preflight, in combination with Fastly, we avoid having to do all this work in each of our applications. They can just make use of the decorated request.
The router is another Heroku application, but a little different from our typical Express.js applications.
It lives at https://github.com/Financial-Times/next-router.
The router is a simple streaming HTTP proxy that takes a request and passes it on to the correct application. We define where requests should be sent to in our service registry, for example requests to
^/search are directed to the search page Heroku application.
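Path based routing of this kind can be sketched in a few lines of Python. The registry shape below is an assumption made for the example (a list of services, each with a name and a list of path patterns); only the two patterns themselves come from this article, and the real router is a streaming Node.js proxy rather than a lookup function.

```python
import re

# Hypothetical registry shape; the patterns are the ones mentioned in the text.
REGISTRY = [
    {"name": "search", "paths": [r"^/search"]},
    {"name": "foo-bar", "paths": [r"^/__foo-bar"]},
]


def route(path, registry=REGISTRY):
    """Return the name of the first service whose pattern matches the path."""
    for service in registry:
        for pattern in service["paths"]:
            if re.search(pattern, path):
                return service["name"]
    return None


print(route("/search?q=markets"))
print(route("/__foo-bar/health"))
print(route("/unknown"))
```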
Our service registry is a basic JSON document that is hosted as a platinum service. It’s stored in S3 across two regions, and uses a similar setup to our ft.com Fastly service to serve from both regions.
Here’s a little example snippet with some extra details removed. You should be able to spot a path,
^/__foo-bar, and a Heroku app
"description": "An example service.",
Heroku and the Host Header
As an aside, it is worth discussing how Heroku knows where to send requests.
Heroku is a platform that only supports HTTP/1.1 requests, as it depends on the
Host header to know which application should receive a request.
This is why we have applications called
foo-bar-us.herokuapp.com so that we can set the
Host header in the router and send them requests accordingly.
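As a rough sketch of that idea, the Python below rewrites the Host header so that a proxied request targets a region-specific Heroku app. The "-eu" suffix and both function names are assumptions that mirror the "-us" naming shown above; the actual router may build its upstream requests differently.

```python
def heroku_host(app_name, region):
    """Build the Host header value for a region-specific Heroku app."""
    if region not in ("eu", "us"):
        raise ValueError(f"unknown region: {region}")
    return f"{app_name}-{region}.herokuapp.com"


def build_upstream_headers(incoming_headers, app_name, region):
    """Copy the incoming headers, replacing Host so Heroku picks the right app."""
    headers = {
        name: value
        for name, value in incoming_headers.items()
        if name.lower() != "host"
    }
    headers["Host"] = heroku_host(app_name, region)
    return headers


incoming = {"Host": "www.ft.com", "Accept": "text/html"}
print(build_upstream_headers(incoming, "foo-bar", "us"))
```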
While you can add custom domains, for the reasons above you cannot set the same custom domain on two different Heroku apps.
We use components to share common functionality between all our applications, some examples being
Typically the data sources for these applications will either be our Elasticsearch clusters, or the Next API.
Once our application has handled the request, the response travels all the way back through the stack, is hopefully cached by Fastly, and is then sent on to our browser 🙌.
With our microservice based setup, no two applications are the same. Because of this, while www.ft.com is a platinum service, we don’t offer 24/7 support for the whole site. Our range of service “metals” is either bronze or platinum, though you may see gold and silver mentioned around the rest of the company.
The main difference between bronze and platinum is that a bronze service only needs to run in a single region, while a platinum service, as discussed previously, must operate in two regions.
We run a platinum tier Elasticsearch endpoint, using two highly available clusters in two distinct AWS regions.
These clusters are our store of all content for www.ft.com, and are addressed using a single DNS record.
How does it work? We use a service provided by Dyn called Traffic Director, which allows us to achieve routing results similar to what we do in Fastly for www.ft.com.
The domain has two pools of addresses, one pointing at the US Elasticsearch cluster and the other at the EU cluster. If everything is healthy then Dyn advertises the closest pool to the request. If a pool is unhealthy then Dyn will not advertise it, falling back to the other healthy pool.
The difference between this and how we achieve platinum in Fastly is that this setup is entirely DNS based, and so when issues occur we will be advertising a different CNAME record (whereas in Fastly this all happens inside Varnish).
The End Result
What follows is a simplified overview of our stack.