This one line of Javascript made FT.com 10 times slower

A journey into a performance degradation…

Arjun
Arjun
Jul 16, 2018 · 5 min read

The Discovery

This had a real impact on our users, with thousands of error pages being shown (somewhat offset by caching in our CDN).

Image for post
Image for post
Error pages seen by users

Looking at the error logs for the app showed that the application didn’t have any data for the top stories.

Image for post
Image for post

The Diagnosis

FetchError: response timeout at https://….&source=next-front-page over limit: 5000
Image for post
Image for post

Curiously, the response times for the API seemed well below the 5 second timeout that the front page was setting. This led us to believe that the issue was somewhere in the connection between the front page and the app. We tried a few things — using a keepAlive connection between the two, scattering the requests so they won’t all happen at the same time. None of these seemed to make any impact.

Image for post
Image for post

What added to the mystery was the response times shown on Heroku. The 95th percentile was around 2–3 seconds, with a max sometimes reaching 10–15 seconds. Since the front page is heavily cached by Fastly, including stale-while-revalidate headers, this likely wouldn’t have been noticed by many users. But it was strange because the front page really shouldn’t have been doing a lot of work to render a page. All the data was in kept in memory.

Image for post
Image for post
Heroku response times for the front page

So we then decided to do some profiling against a locally running copy of the app. We would replicate some load by using Apache Bench to make 1000 requests, with 10 requests a second.

ab -n 1000 -c 10 http://local.ft.com:3002/

Using node-clinic and nsolid, we could gain some insights into what’s going on with the memory, CPU and application code. Running this confirmed that we could reproduce the issue locally. The front page took between 200–300s to complete the test, with over 800 requests unsuccessful. In comparison, running the same test on the article page took around 50s.

Time taken for tests: 305.629 seconds
Complete requests: 1000
Failed requests: 876

And lo and behold, the graphs from n-solid showed us that the event loop had a lag of over 100ms.

Image for post
Image for post
Event Loop lag whilst running load test

Using n-solid’s CPU profiler allowed us to pinpoint the exact line of code blocking the event loop.

Image for post
Image for post
Flame chart showing the function causing the lag

The Fix

return JSON.parse(JSON.stringify(this._data));

For every request, we were using JSON.parse/stringify to create a deep clone of the data. This method isn’t bad in itself — and is probably one of the faster ways of deep cloning. But they are synchronous methods, and so will block the event loop whilst being executed.

In our case, this was being called several times on each render of a page (for every section being rendered), with a large amount of data (the data required for the entire page on each execution), and we get several concurrent requests. Since Javascript is single-threaded this would have a knock-on effect on everything else the app was trying to do.

The reason we were deep cloning the data was because there were some things that were mutating the object, based on some information from the request (for example, if a particular feature toggle was on).

To fix this problem - and alleviate the need to clone everything - we applied a deep freeze to the object on retrieval, and then where it was being mutated we could clone the specific bits that needed to be mutated. This still performs a synchronous clone — but only on a much smaller subset of data.

The Results

Time taken for tests: 37.476 seconds
Complete requests: 1000
Failed requests: 0

Once we released the fix to production, we saw an instant decrease in response time and errors (🤞🏼), and hopefully some happy users!

Image for post
Image for post
Front Page response times after the fix
Image for post
Image for post

The Future

  • Can we get more visibility of our event loop?
  • Our front page should have stale-on-error — so why were still thousands of error pages seen? Is that number good or bad?

FT Product & Technology

A blog by the Financial Times Product & Technology…

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch

Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore

Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store