A couple of weeks before Christmas we were ready to run a load test ahead of the expected Christmas peak for our online webshop. For the past month, we’d been hard at work building our new content pages. They looked great and we were excited to release them. The question on our minds: Would our app withstand the load?
Our application is an isomorphic React application running on NodeJS with full server-side rendering. Other parts of the main application run as separate applications on different clusters. Each application is deployed into Kubernetes and has a single CPU/core available by default. Our stress test would push our app to its limit in order to find our maximum throughput: successful requests per second (TPS).
The application itself consists of two Docker containers, an API container and a WEB container. Static assets are served from our CDN.
This article is about our WEB container, where our React client and server-side rendering code lives. This code is blocking. Our API containers are non-blocking and can be optimized differently. So we fired up our first test…
What, really? The Java developers in the office, who are normally very quiet, started to laugh at us. Secretly I hoped the server was accidentally running in development mode. But that was not the case.
This is bad, really bad. I quickly got visions of having to call home and explain that we’re not going on holiday any time soon. That’s not really an option, but seeing these benchmarks wasn’t helping. Stress! Normally the application could process 20 to 40 TPS per container without caching. So what happened?
The rest of this article is about what followed the next few weeks and how we got our app ready for the Christmas period. There’s no silver bullet for performance optimizations. But I hope our story will give you some insight into how you can optimize your own applications. Let’s go, let’s make things better!
What happened is that we had made changes to our application that hurt server performance, and nothing flagged the regression while we were making them.
Make snapshots and measurements
Let’s fire up node clinic (insights into CPU, event loop, flame graph, and bubbleprof) with rakyll/hey (to put load on the app) and generate a flame graph locally. Be aware of CPU consumption by other applications running on your computer. Close anything that doesn’t need to run (Spotify, Chrome) and measure multiple times for the best result. And make your changes one step at a time, measuring after each.
The first culprit was easy to spot: the jsesc module was taking half of the time needed to process a request, and blocking the thread while running! We had added it as a frontend optimization after reading this.
Our code looked like:
Although it significantly improves frontend performance during render-time, it has a big negative impact on the event loop with large initial states and SSR (more on this later). So we removed it.
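One way to embed the initial state without the extra escaping pass is to rely on `JSON.stringify` alone. This is only a sketch under assumptions: the helper names and the `window.__INITIAL_STATE__` global are hypothetical. It is not necessarily what we shipped.

```javascript
// Hypothetical helper: serialize the initial state as plain JSON.
// Escaping "<" keeps a "</script>" inside the data from closing the tag early.
function serializeState(state) {
  return JSON.stringify(state).replace(/</g, '\\u003c');
}

// Hypothetical helper: wrap the serialized state in an inline script tag.
// The global name is an assumption; use whatever your client code reads.
function renderStateScript(state) {
  return `<script>window.__INITIAL_STATE__ = ${serializeState(state)};</script>`;
}

const example = renderStateScript({ title: 'Christmas </script> deals' });
console.log(example);
```

The serialization still runs synchronously, so its cost keeps growing with the size of the state. A smaller initial state remains the bigger win.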
Locally the measured performance increase was huge. Once deployed, the measurement improved a lot too, but not by the same figures. So another lesson learned: our application was running faster on a consumer CPU than on the ones used in our cluster. In addition, our cluster by default allocates one CPU/core, while locally we benchmarked on a 4/8 core machine. For example, our first local test showed we could reach 400 TPS, while on the server the number was much lower. After this, we benchmarked each change not only locally but also deployed in our cluster.
Some other issues were also spotted easily. And again, a feature added to improve client-side performance was impacting our event loop negatively on the server:
After building our HTML on the server, we used a tree-walker to see which SVG icons we actually used, and inlined them in our HTML before serving it up. But this was blocking our event loop significantly as well. So I removed it and added it to our ‘FIXME’ list, to solve this a different way, preferably during build time.
We also noticed that some interfaces had grown out of proportion and this impacted the amount of data we were serving. Sometimes it’s nice to load more data, so you have it available before you navigate to the next route but for our main content pages, we didn’t need the additional data at all. We fixed this by splitting our data normalizers into a light variant that we could use, and a heavier one that other apps could continue to use. This again improved our performance: less initial state, higher throughput.
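The split between a light and a heavy normalizer can be sketched as below. The item shape and all field names are hypothetical; the point is only that the light variant keeps the fields a content page actually renders.

```javascript
// Hypothetical shape of an item as returned by the API.
const apiItem = {
  id: 42,
  title: 'Winter jacket',
  url: '/p/42',
  image: '/img/42.jpg',
  reviews: [{ rating: 5, text: 'Warm and light' }],
  related: [{ id: 43, title: 'Scarf' }, { id: 44, title: 'Gloves' }],
};

// Heavy variant: keeps everything, for apps that need the full item.
function normalizeFull(item) {
  return { ...item };
}

// Light variant: only what the content page renders (an assumed field list).
function normalizeLight(item) {
  const { id, title, url, image } = item;
  return { id, title, url, image };
}

console.log(JSON.stringify(normalizeFull(apiItem)).length);
console.log(JSON.stringify(normalizeLight(apiItem)).length);
```

The smaller object means less initial state to serialize on the server and fewer bytes in the HTML.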
After some time our throughput was back to normal and hovering around 36 TPS per container. But we wanted more, much more. So what to do?
We couldn’t use NodeJS workers in our cluster, since they would require more CPU/cores. In that case we were better off adding another container (horizontal scaling), which would give even higher throughput: it’s better to place two containers than to enlarge one. Also, React render to stream did not show any significant change.
We don’t have autoscaling yet in our cluster, so the only thing left to do was add some foul play. This is the part I want to share with you, because it gave us the confidence that we would handle the expected load, and even more load if it happened. I’m sharing the code below, so you can run it yourself and see how it works.
The need for a bouncer and caching middleware
In our stress test on the build server, using Gatling, Grafana, and Perfana, we achieved a healthy 36 TPS for our React SSR part with response times below 1 second and normal CPU usage.
So I subtracted 4 from this number to get a maximum of 32 TPS that keeps the container healthy, and added a rate limiter as middleware. This would act as a bouncer in our web container that served the React application. The limiter adds some blocking to the event loop but also keeps it healthy by preventing it from getting overloaded.
That gave us some guaranteed buffer in CPU usage, which we could use to add a caching middleware. Anonymous users would simply pass the bouncer middleware and receive a cached page, while other users would go through the rate limiter. The rationale is that I can then serve 32 recognized users per second totally personalized responses, while the rest will pass to the next middleware (our caching middleware) and get the non-personalized, highly cacheable version. When that happens a lot I can scale up, but the app will not break.
The rate-limiting middleware:
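A minimal sketch of such a bouncer, assuming an Express-style `(req, res, next)` signature and a fixed one-second window. The names `createBouncer`, `req.isRecognized`, and `req.serveAnonymous` are hypothetical, and the code in the playground repo may differ.

```javascript
// Sketch of a bouncer: at most `limit` personalized renders per time window.
// `now` is injectable so the window logic can be tested with a fake clock.
function createBouncer({ limit = 32, windowMs = 1000, now = Date.now } = {}) {
  let windowStart = now();
  let count = 0;

  return function bouncer(req, res, next) {
    // Assumption: an earlier middleware sets req.isRecognized for known visitors.
    if (!req.isRecognized) {
      req.serveAnonymous = true; // anonymous visitors always get the cacheable page
      return next();
    }

    const t = now();
    if (t - windowStart >= windowMs) {
      // A new window has started: reset the counter.
      windowStart = t;
      count = 0;
    }
    count += 1;

    // Over the limit we do not reject the request. We only mark it so that
    // later middleware serves the anonymous (cacheable) version instead.
    req.serveAnonymous = count > limit;
    next();
  };
}

const bouncer = createBouncer({ limit: 32 });
const req = { isRecognized: true };
bouncer(req, {}, () => console.log('serve anonymous version:', req.serveAnonymous));
```

A fixed window is the simplest choice and allows short bursts at window boundaries. A sliding window or token bucket would smooth that out at the cost of a little more bookkeeping.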
Our visitors are either anonymous or recognized. Anonymous responses are highly cacheable (5 to 30 minutes). We had originally placed caching in our API container, but we moved it to the WEB server part. Here we added a rate limiter and a cache middleware.
The caching middleware:
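A minimal sketch of an in-memory page cache, again assuming an Express-style signature. It assumes an earlier middleware marks requests that may receive the anonymous version with a hypothetical `req.serveAnonymous` flag, and that the renderer respects that flag. `createPageCache` and the 5-minute default are illustrative choices.

```javascript
// Sketch of a page cache for anonymous responses, keyed by URL with a TTL.
// `now` is injectable so expiry can be tested with a fake clock.
function createPageCache({ ttlMs = 5 * 60 * 1000, now = Date.now } = {}) {
  const cache = new Map(); // url -> { body, expires }

  return function pageCache(req, res, next) {
    // Personalized renders are never read from or written to the cache.
    if (!req.serveAnonymous) return next();

    const hit = cache.get(req.url);
    if (hit && hit.expires > now()) {
      return res.send(hit.body); // cache hit: skip rendering entirely
    }

    // Cache miss: wrap res.send so the rendered page is stored on its way out.
    const send = res.send.bind(res);
    res.send = (body) => {
      cache.set(req.url, { body, expires: now() + ttlMs });
      return send(body);
    };
    next();
  };
}

const pageCache = createPageCache();
const res = { send: (body) => console.log('sent:', body) };
pageCache({ url: '/', serveAnonymous: true }, res, () => res.send('<html>home</html>'));
pageCache({ url: '/', serveAnonymous: true }, res, () => res.send('never rendered'));
```

The `Map` here grows without bound and lives inside one container. With many distinct URLs you would want a size limit or an LRU policy.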
When the rate limiter kicks in, the excess requests are served from cache, so even recognized visitors will get the anonymous responses once we go past these 32 TPS.
Our single container was now capable of serving 150 TPS. And our backends were not being overloaded and kept healthy.
So the puzzle was solved, right? Yes, you could say so: the app was now able to handle much more peak load without autoscaling. But it must be possible to make it even better, no? And not with small tweaks that only shave nanoseconds off the response time.
Optimizing it even further
Our pages were becoming bigger and bigger but the user doesn’t see all of it at once, so why load all of it? Less rendering server-side will speed up the application and we can use our client-side code to get the remaining data and make it more performant for our end-user!
So I started with maapteh/react-no-ssr (the old no-ssr component was 4 years old and untouched since then). It was nice for parts that needed neither dynamic data nor event handlers. But most of our components are dynamic, and parts will eventually be rewritten into GraphQL, where we can easily control what to do with SSR. For static items like our footer, though, this was perfect. Less HTML, higher throughput. With this simple tweak, we almost doubled our throughput.
But what to do with dynamic data that we don’t initially need? That’s why maapteh/react-in-view came to life. It uses the intersection-observer. With this component we load components when their container comes into view, and show a placeholder component while it's not in view to prevent a total browser reflow or a change of scroll position. An example of such a placeholder is danilowoz/react-content-loader.
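The underlying idea can be sketched without React, using the browser's `IntersectionObserver` API directly. `loadWhenVisible`, its options, and the `#reviews` selector are hypothetical names for illustration; this is not the implementation of react-in-view.

```javascript
// Sketch: call `load` once, the first time `element` scrolls into view.
// `Observer` is injectable so the logic can be exercised outside a browser.
function loadWhenVisible(
  element,
  load,
  { Observer = globalThis.IntersectionObserver, rootMargin = '200px' } = {}
) {
  const observer = new Observer(
    (entries) => {
      if (entries.some((entry) => entry.isIntersecting)) {
        observer.disconnect(); // make sure we only load once
        load();
      }
    },
    { rootMargin } // start loading slightly before the element is visible
  );
  observer.observe(element);

  // Return a cleanup function, e.g. for when the component unmounts.
  return () => observer.disconnect();
}

// Browser usage (assumed markup): replace a placeholder once it is in view.
if (typeof document !== 'undefined') {
  const placeholder = document.querySelector('#reviews');
  if (placeholder) {
    loadWhenVisible(placeholder, () => {
      placeholder.textContent = 'Loaded!';
    });
  }
}
```

Giving the placeholder the same height as the real content is what prevents the reflow and scroll jump mentioned above.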
First measurements show us that we can reach 60 TPS without foul play (caching and bouncing) when we fine-tune the rendering more. This means we can handle 56 TPS without caching anything and still reach 220 TPS during peak loads with foul play on a single container.
I know it feels like cheating. But this was a nice solution for the short term when you don’t have autoscaling in place. I hope it inspires you to look at your own server performance and find places to make it better!
There is always a tradeoff between doing something server-side and doing it client-side. Make it as fast as possible for your end-user without endangering your own server performance.
- https://github.com/maapteh/playground-throughput repo to play with your own generated HTML and see the middleware described above in action.
- https://github.com/maapteh/react-in-view module to load components when they come into the viewport with an example on Heroku (our best improvement without foul play)
- https://github.com/maapteh/react-no-ssr for the most simple tweak
Now I’m able to go on holiday without worrying and wish you all a very good one! I ended up with more TODOs but I’m confident they will be squashed soon. Also, we will get autoscaling soon :)
Thank you Kian Khosh for reading and tweaking my article and Okke Garling for your big help with setting up our Perfana load tests.
Please drop me a message if you have any questions about this article.