Reducing time to first byte with the cache first–process later method

A faster website means happier users, more customers and higher profits. We all know that and let’s be clear: nobody likes to wait for a page to load.

Content on most websites hardly ever changes. Even if you served a new version only every minute or five, you’d be fine most of the time.

If this is the case, here’s how Cache First–Process Later could work for you:

  1. Server (application) receives the request and sends the previously cached response;
  2. Server (application) continues to process the actual request;
  3. Server (application) stores the result of processing in the cache.

This way the user sees the previously cached response immediately while the server processes the request in the background. The user is not kept waiting for a response that has most probably not changed a bit since the previous one.

The cache can be kept fresh without much hassle using this method.
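The three steps can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the PHP drop-in class mentioned later; the in-memory `cache` dictionary, the `render` function and the background thread are all assumptions made for the example.

```python
import threading

cache = {}  # hypothetical in-memory store: path -> last rendered body


def render(path):
    """Stand-in for the slow, dynamic page generation."""
    render.calls += 1
    return f"<h1>{path}</h1> version {render.calls}"


render.calls = 0


def refresh(path):
    """Steps 2 and 3: process the request and store the result."""
    cache[path] = render(path)


def handle(path):
    """Return (body, worker). worker is None when nothing ran in the background."""
    if path not in cache:
        refresh(path)  # cold cache: nothing to serve yet, so render once
        return cache[path], None
    body = cache[path]  # step 1: answer from the cache right away
    worker = threading.Thread(target=refresh, args=(path,))
    worker.start()  # steps 2 and 3 continue after the answer is chosen
    return body, worker
```

The second visitor gets the body rendered for the first one, and the third visitor gets the body rendered during the second visit, so every response is at most one request behind.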

The first reason behind the approach is this set of important limits for a site to feel responsive, according to NN (Nielsen Norman Group):

  1. 0.1 second: Limit for users feeling that they are directly manipulating objects in the UI;
  2. 1 second: Limit for users feeling that they are freely navigating the command space without having to unduly wait for the computer;
  3. 10 seconds: Limit for users keeping their attention on the task.

The second reason is cache invalidation, which I’ll come back to later.

Timing of a request

Here’s what a typical request to a shared hosting server looks like:

Typical timing of a request

At a time when users are used to navigating native apps with high-end performance and data delivery, 1 second is far too long to wait for any website to start sending data.

We’ll skip all the parts that are browser/network dependent. There’s not much we can do about any of these anyway. (See detailed explanation of resource loading to learn more.)

Let’s focus on what we can influence: the time to process a request.

Visualization of typical response timing.

Waiting = Time to First Byte (TTFB)

  1. Request sent is cheap.
  2. Waiting is expensive.
  3. Downloading is cheap.

To minimize waiting we want to send content as soon as possible.

Content generated dynamically by the server requires extra time to boot the code interpreter, compile the code, and set up and run the application. Only after going through all these steps does the server start to send the response data. A typical example is any WordPress-powered website.

On the other hand, there’s no waiting for the static content, e.g. images. Serving of static content starts almost immediately.

That’s how caching of content works too. You’re never sending fresh content, but an older yet reasonably fresh copy stored in a static file (or another fast-access store). Deciding when to serve content from the cache and when to regenerate a fresh version is the problem of cache invalidation.

Cache invalidation troubles

A few approaches to deal with cache invalidation:

  1. don’t use a cache at all, send fresh content on each request (makes no sense, but I list it anyway for later reference);
  2. use a time window (TTL, time to live) during which we can consider the cache still fresh;
  3. use advanced signals, e.g. a content update, to invalidate the cache.

Strategy #3 looks straightforward: delete the cached copy of a piece of content every time that content is updated. This approach grows in complexity as soon as the content appears in more than one place on the website.

The complexity of cache invalidation increases with the complexity of the website.
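To make that growth concrete, here is a small hypothetical sketch of approach #3. The `pages_showing` map and its URLs are invented for illustration; a real site would have to build and maintain such a map itself, which is exactly the hassle.

```python
# Hypothetical map: which cached pages embed which piece of content.
pages_showing = {
    "post:42": ["/", "/blog", "/blog/post-42", "/tag/performance", "/feed.xml"],
}


def invalidate(content_id, cache):
    """Delete every cached page that shows the updated content.

    Returns the list of URLs that were actually removed from the cache.
    """
    removed = []
    for url in pages_showing.get(content_id, []):
        if cache.pop(url, None) is not None:
            removed.append(url)
    return removed
```

Updating a single post touches five cache entries here, and any page missing from the map keeps serving the old version until someone notices.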

What if we could serve content from the cache with some reasonable TTL (approach #2) and still regenerate content on each load (approach #1), while skipping advanced cache invalidation techniques altogether?

Meet the response first, process later method

Let’s set a few conditions for how this process could work:

  1. serve GET requests only;
  2. always serve from cache first;
  3. send the content length and close the connection before the server finishes processing the request in the background;
  4. update the cache only if its age is more than the TTL: this way we skip regenerating the same content more often than needed and save some server CPU/RAM resources.
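Conditions 1, 2 and 4 boil down to a small decision that can be written as a pure function. The sketch below is an assumption about how one might express it, not the logic of the PHP class; the names `Entry`, `plan` and `TTL_SECONDS` are made up, and the 60-second TTL is only an example value.

```python
from dataclasses import dataclass
from typing import Optional

TTL_SECONDS = 60  # example value; pick whatever staleness your content tolerates


@dataclass
class Entry:
    body: bytes
    stored_at: float  # Unix timestamp of when the entry was cached


def plan(method: str, entry: Optional[Entry], now: float, ttl: float = TTL_SECONDS):
    """Return (serve_cached, regenerate) for one incoming request."""
    if method != "GET":
        return (False, True)  # condition 1: only GET requests use the cache
    if entry is None:
        return (False, True)  # cold cache: nothing to serve first
    stale = now - entry.stored_at > ttl
    return (True, stale)  # condition 2 always; condition 4 only when stale
```

For example, `plan("GET", Entry(b"...", now - 61), now)` gives `(True, True)`: serve the cached copy first, then regenerate in the background.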

Here’s how it all could look when put together:

The grey bar represents the actual request processing taking place on the server (#3). Notice how the blue bar (downloaded content) finishes way sooner. That’s the cache already delivered to the browser. Perceived performance is great even though the content of the response might be a little bit outdated. Serving content from cache is always a tradeoff.

To send content early you need to send two headers:

  1. Content-Length: {n}
  2. Connection: close

Together they tell the browser when to stop downloading (the known length) and to close the connection, without waiting for the process to finish on the server.
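The effect of the two headers can be tried out with Python’s standard `http.server` module. This is only a sketch of the idea under simple assumptions, not the PHP implementation: `CACHED_BODY` stands in for the cached page and `slow_regenerate` for the real request processing, which here just sleeps for a second.

```python
import http.client
import threading
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

CACHED_BODY = b"<h1>cached page</h1>"  # hypothetical cached response
regenerated = threading.Event()  # set once the slow work is done


def slow_regenerate():
    """Stand-in for the real processing that refreshes the cache."""
    time.sleep(1.0)
    regenerated.set()


class EarlyResponseHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(CACHED_BODY)))
        self.send_header("Connection", "close")
        self.end_headers()
        self.wfile.write(CACHED_BODY)
        self.wfile.flush()
        # The client already has the whole body; the handler keeps working.
        slow_regenerate()

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(port):
    """Request the page; report whether the body arrived before the slow work ended."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/")
    response = conn.getresponse()
    body = response.read()  # returns after Content-Length bytes
    arrived_early = not regenerated.is_set()
    conn.close()
    return body, response.getheader("Connection"), arrived_early
```

To try it, start `ThreadingHTTPServer(("127.0.0.1", 8000), EarlyResponseHandler).serve_forever()` in one thread and call `fetch(8000)` from another: the body should come back well before the one-second sleep finishes.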

Note: If the browser supports Gzip compression, compress the content in the script and send the compressed content together with its length. In my experience with the Apache server, the server otherwise takes over and tries to compress your output itself, but it waits for the script to exit and keeps the browser waiting in the meantime, negating all of our efforts.
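A rough sketch of that note: compress in the script and compute the length from the compressed bytes. The `build_response` helper is hypothetical, and its `Accept-Encoding` parsing is deliberately simplified (it ignores quality values such as `gzip;q=0`).

```python
import gzip


def build_response(body: bytes, accept_encoding: str):
    """Return (headers, payload) ready to be sent before processing continues."""
    headers = {"Connection": "close", "Vary": "Accept-Encoding"}
    # Simplified parsing: take the token before any ";q=..." parameter.
    offered = [part.split(";")[0].strip().lower() for part in accept_encoding.split(",")]
    if "gzip" in offered:
        body = gzip.compress(body)
        headers["Content-Encoding"] = "gzip"
    # The length must describe the bytes actually sent, compressed or not.
    headers["Content-Length"] = str(len(body))
    return headers, body
```

Because the payload is already compressed and labeled with `Content-Encoding`, the web server should have no reason to buffer the output in order to compress it again; whether a given server setup respects that is something to verify on your own stack.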

A drop-in implementation in PHP

I’ve created a repository with a drop-in class so you can try the approach. It’s not yet polished, but it works as a proof of concept. Go ahead, try it and test your page speed using the Google PageSpeed tool before and after enabling the cache.

This method will not suit every case. But I believe it could be a game changer for many websites right away. Ideas are always welcome, so if you have some, please open an issue on GitHub or leave a comment.


Thanks for reading.