How we reduced our application response times by 80% without touching a single line of code

Maciej Brencz
Published in Bethink
6 min read · Jan 12, 2021

As I mentioned in my previous blog post, our e-learning platform's traffic varies heavily over time. It reaches its peak just before the final exams, in February and September.

As traffic grows, CPU and memory usage grow too; the application needs to scale well to avoid consuming all available server resources.

The story behind

That became our case last week, when gradually increasing traffic caused the available resources of our Kubernetes cluster to shrink rather rapidly.

CPU usage (green line) vs requested resources as defined by k8s pods

We started 2021 with almost double the CPU usage of the Christmas period. Then, on 7th January, our site stopped responding! The only quick way to fix it was to increase the requested resources at the Kubernetes level, which is surely a dead-end street. That's why we decided to form a small, three-person task force to take a closer look at the root cause the following day.

Elastic APM gives a hint

We use Elastic APM to take a deeper look into the performance of specific aspects of our application. With APM set up on production, we are able to dive into a single request and see what kinds of operations are performed as part of it: database queries, Redis operations, HTTP requests.

That's exactly what we did. We noticed that a great part of the request time was not accounted for by any of these three aforementioned operations. It seemed that either there was a sleep() call somewhere in the code, or that something was consuming quite a lot of CPU or I/O (cryptocurrency mining?). Something that is not visible when looking at requests through Elastic APM.

We observed a similar thing on a local instance of Elastic APM, which made debugging the issue much easier.

Low-level tools to the rescue

It quickly became clear that we needed to dig deeper into the issue. Together we identified two paths to follow: install a PHP profiling tool, or analyse at a lower level what PHP processes were actually spending so much time on. I followed the latter path.

Because we all love acronyms… Source: https://dilbert.com/strip/1993-01-12

In the past (when "PHP 5.4 running as the Apache module" was the state of the art), I used strace several times to debug performance issues. I said to myself: why not give it a chance again? It would give me a better, deeper look into what PHP actually does when handling an end-user request.

We use Docker containers in our local development environments, so the first thing needed was to turn on sharing of the PID namespace between the container and the host OS (that is, "my laptop" in plain English).
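A minimal sketch of how that can be done when starting the container with plain docker run (the image name is a placeholder; in docker-compose, the equivalent is pid: "host"):

$ docker run --pid=host our-php-fpm-image

With --pid=host the container shares the host's PID namespace, so its php-fpm processes become visible to tools like pgrep and strace running directly on the laptop.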

$ pgrep php-fpm | paste -s
1264900 1265153 1265294 1265295 1677937 1678513 1679069

Now I was able to inspect locally running php-fpm processes from my laptop using strace.

$ sudo strace -t -s 1024 -f $(pgrep php-fpm | paste -s | sed -e 's/\([0-9]\+\)/-p \1/g' -e 's/\t/ /g')
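For reference: -t prefixes each output line with the time of day, -s 1024 raises the limit on printed string arguments, -f follows child processes, and the command substitution turns the pgrep output into a series of -p arguments, so strace attaches to all php-fpm processes at once.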

After a few minutes of going through the waterfall of lines and grepping the quite noisy strace output, I managed to get something easier to read and understand. And what I saw was actually a bit hard to believe:

[pid 1679069] 11:39:59 open("/code/bethink/wnl-platform/app/Models/Order.php", O_RDONLY) = 7
[pid 1679069] 11:39:59 open("/code/bethink/wnl-platform/app/Http/Controllers/Api/Transformers/OrderTransformer.php", O_RDONLY) = 7
[pid 1679069] 11:39:59 open("/code/bethink/wnl-platform/app/Models/Order.php", O_RDONLY) = 7
[pid 1679069] 11:39:59 open("/code/bethink/wnl-platform/app/Http/Controllers/Api/Transformers/OrderTransformer.php", O_RDONLY) = 7
[pid 1679069] 11:39:59 open("/code/bethink/wnl-platform/app/Models/Order.php", O_RDONLY) = 7
[pid 1679069] 11:39:59 open("/code/bethink/wnl-platform/app/Http/Controllers/Api/Transformers/OrderTransformer.php", O_RDONLY) = 7
[pid 1679069] 11:39:59 open("/code/bethink/wnl-platform/app/Http/Controllers/Api/Transformers/OrderTransformer.php", O_RDONLY) = 7

Yes, it seemed that PHP opened and read the same file several times in a row! A quick bit of "sort, uniq and grep" magic showed that some files were opened as many as 18 times (!) within the same request. Overall, a simple request performed over 880 file open operations!
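The "magic" went roughly like this (a sketch from memory, assuming the strace output was saved to strace.log); it extracts the path from every open() call and counts how many times each file shows up:

$ grep 'open(' strace.log | sed 's/.*open("\([^"]*\)".*/\1/' | sort | uniq -c | sort -rn | head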

Something was obviously going wrong here. The OS does have a file cache, but seriously, PHP, reading the same files over and over again?!

And then a few seconds later I recalled one PHP extension…

opcache

Zend’s opcache documentation says:

The Zend OPcache provides faster PHP execution through opcode caching and optimization. It improves PHP performance by storing precompiled script bytecode in the shared memory. This eliminates the stages of reading code from the disk and compiling it on future access. In addition, it applies a few bytecode optimization patterns that make code execution faster.

"Eliminates the stages of reading code from the disk and compiling it on future access" sounded exactly like what we were looking for. Thanks, opcache.

I quickly modified the Dockerfile used to build the local, "for development" application container. With opcache enabled, I reloaded the request I had been investigating with strace. I started getting responses in ~160 ms instead of over 300 ms. File open operations dropped from over 880 to around 70.
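A quick way to confirm the extension is actually loaded inside the container (the container name here is hypothetical):

$ docker exec -it app-php php -m | grep -i opcache
Zend OPcache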

Staging tests

This was quite promising. I modified the opcache config a bit to make it better suited for production:

RUN docker-php-ext-enable opcache \
    && echo $'zend_extension=opcache\n\
[opcache]\n\
opcache.enable=1\n\
opcache.enable_cli=1\n\
opcache.validate_timestamps=0\n\
opcache.max_accelerated_files=10000\n\
opcache.memory_consumption=128\n\
opcache.max_wasted_percentage=10\n\
opcache.interned_strings_buffer=16\n\
opcache.fast_shutdown=1' > /usr/local/etc/php/conf.d/docker-php-ext-opcache.ini
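The key production-oriented setting here is opcache.validate_timestamps=0: PHP will never re-check cached files for changes, so new code is only picked up when php-fpm restarts. That is exactly right for immutable container deployments, but it would be wrong for the development image, where files change all the time.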

I built the image and deployed it to our staging environment, which is set up (almost) the same way production is. Most importantly, it runs on a Kubernetes cluster too. This should give me more realistic numbers regarding possible performance gains.

With this handy little browser console "script" I ran my benchmark:

async function measure(id) {
    console.time('request #' + id);
    await axios.get('/the/resource/path');
    console.timeEnd('request #' + id);
}

for (let n = 1; n < 50; n++) await measure(n);
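One assumption here: axios is already loaded on the page (our application bundles it); in a console without it, the built-in fetch() would work just as well. The median is then read off the logged timings.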

Without opcache, the median response time was around 450 ms. With opcache enabled, it was 150 ms. That's roughly one third of the initial value.

Production results

Despite it being Friday afternoon, we decided to deploy this change to production. And the results surprised us even further. Just take a look…

Backend duration as reported by Traefik
APM response times — average value dropped by over 50%.

The median response time on production used to vary between 200 and 600 ms. Now, for several days so far, it has stabilised at around 60 ms!

The number of virtual CPUs used by our application pods dropped from 5 to less than 1.8. The same applies to the memory consumed.

A "simple", two-line change in our Dockerfile (enabling opcache) resulted in a great improvement in application response times and resource consumption. This surely gives us a lot of breathing space for the future.

I put the word "simple" in quotation marks because the change was indeed simple, but only once I knew what was going on.

The PHP profiling tool proved very useful too. By going deeper into the code, we managed to greatly improve one of our slowest and most frequent requests: its average response time dropped from around 2 seconds to 350 ms. But that's a topic for another story.

Thanks to the php-spx profiler, we managed to reduce the number of calls to an internal Laravel method from 430k to 22k during a single request, which had a great impact on the response time.
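If you are curious about php-spx, getting a first profile is cheap; a minimal sketch for a CLI run, assuming the extension is installed (the artisan command is just a placeholder):

$ SPX_ENABLED=1 php artisan route:list

With SPX_ENABLED=1, the extension prints a flat profile of the run once the script finishes.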

Takeaways

A few things came to mind while I was putting together this blog post:

  • UNIX tools may look so '90s (or even '80s), but even nowadays they can provide extremely valuable insights and data.
  • Try to set up your development environment to be as similar to production as possible. Being able to reproduce the same "strangeness" in my local environment helped a lot here.
  • Code is not everything. It's built on top of a framework, in a specific language, and it runs on your laptop, in a VM, or as a Kubernetes pod. Being able to understand all these layers and their impact is the true meaning of being a "full stack developer".
  • Know your tools. Does your framework provide any sort of caching or profiling? Does the language you code in provide extensions similar to opcache?
  • Monitor your application. Know when it's bad and why. And know when you improve things.
  • Sometimes you need to guess. At least make your guesses educated ones.

Maciej Brencz
A Poznanian born and bred, passionate about his home city and the Far North / Enjoys investigating how software works under the hood