Our journey towards better application performance

Traffic on Bethink’s e-learning platform varies heavily over time, reaching its peak just before the final exams. This is a really stressful time for our users, and we certainly do not want to add to it with our platform being slow or, worst of all, being down. After all, our mission is to help them in their learning process.

Maciej Brencz
Bethink
Jul 9, 2020


Easier said than done. Things break, code has bugs, systems become slower and slower. And at some point they reach the melting point.

We’ve all been there, and Bethink is no exception. We learned from our mistakes, and in this post we’d like to share our findings, tooling and approach to making our application more reliable, performant and easier to monitor. And, last but not least, to making our PagerDuty shifts less stressful ;)

The Meltdown

It all started one late September evening last year. Our MySQL server became saturated with connections (and eventually reached 100% CPU usage), which caused our platform to respond slower and slower until it reached the point of The Meltdown.

Our site was down. Just a few days before the final exam our users were preparing for.

We immediately started debugging the situation and turning off heavy, non-critical features. This gave our platform some breathing space, and we were back to normal after around 24 hours.

So, why did it happen?

Our e-learning platform is a single page application (SPA). In such applications the heaviest operations take place on the initial page load: we need to fetch the current user information and the course structure, establish WebSocket connections, etc. Subsequent user actions usually require less work on the backend side, since only small bits of information need to be exchanged between the frontend and the backend.

And here is where the problem lies. When the backend becomes slower (for any reason), users do not get feedback fast enough. They assume something is broken, so they hit the refresh button. Another set of heavy requests hits the backend while it is still processing the previous ones. New database connections are established while the previous ones are still open. The loop tightens…
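For readers who want to watch this kind of pile-up happen on their own MySQL server, a few built-in statements tell most of the story. This is a generic diagnostic sketch, not tied to our schema or tooling:

    -- How many client connections are open right now, and what is the configured ceiling?
    SHOW GLOBAL STATUS LIKE 'Threads_connected';
    SHOW VARIABLES LIKE 'max_connections';

    -- What is each connection doing? Long-running entries in the Time column
    -- are the heavy requests still being processed while new ones keep arriving.
    SHOW FULL PROCESSLIST;

Once Threads_connected reaches max_connections, new connection attempts are rejected outright.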

What did happen?

One of the main features of our platform is the question bank. It contains a few thousand questions that help our users check their knowledge before they approach the exam.

In our database we have a dedicated table that stores all answers to all questions from all users. As we scale, this table grows too, up to millions of rows. It became the bottleneck: hanging UPDATE queries to this table were one of the main causes of The Meltdown. Its indexes were not optimal and we were not removing old entries. It was basically a ticking time bomb, and it exploded at the worst possible moment.
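The fixes hinted at here are the classic ones: give the hot queries an index to work with and prune rows that nobody reads any more. The sketch below uses a hypothetical user_answers table with made-up column names; it illustrates the kind of change, not our actual migration:

    -- Hypothetical schema: user_answers(id, user_id, question_id, answer, created_at)

    -- Let the hot lookup/update path hit an index instead of scanning millions of rows.
    ALTER TABLE user_answers
        ADD INDEX idx_user_question (user_id, question_id);

    -- Prune stale entries in small batches, so the purge itself does not hold locks for long.
    DELETE FROM user_answers
    WHERE created_at < NOW() - INTERVAL 1 YEAR
    LIMIT 10000;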

This obviously was not the only reason. We use Laravel Echo to broadcast instant notifications to our users. As users kept refreshing the page, they kept reconnecting to Echo, and the Echo server was then reestablishing its own connections to the platform backend. It all added up.

Quick fixes and solid improvements

As mentioned earlier, we managed to reanimate our platform in several hours. And we filed tickets to improve things later and “make the September never repeat itself” (as we called it internally).

Unfortunately, we had important projects in progress, and our small team was simply not able to take a closer look at the root causes back then. By the time we did have time, the Grafana plots for September were already empty. On the other hand, thanks to that we took a broader look at our platform and stack. We’ll describe that in detail in the next blog post. Stay tuned!

(Image: from http://www.ranzey.com/generators/bart/index.html)

Takeaways

  • Never underestimate the importance of maintenance work and performance improvements. These problems will hit you, most likely sooner rather than later.
  • Always take notes, make screenshots of your monitoring tools and keep logs when a production issue takes place. They will be gone a few weeks from now. A MySQL node CPU usage plot is way better than guessing what really happened from the Slack message history.

In the next episode

We’ll take a closer look at the performance sprint we did once everything settled down. We’ll describe the tools we used, what we discovered and what we fixed.
