
Profiling in production to detect server bottlenecks


I work as a tech lead on the System team, which is responsible for service performance and stability. From March to November 2020 Miro grew sevenfold, reaching more than 600,000 unique users per day. Currently, our monolith runs on 350 servers, and we store user data across about 150 instances.

The more users interact with the service, the harder it is to find and to eliminate bottlenecks in its servers. Here is how we solved this problem.

Part one: introduction and task definition

The way I see it, you can represent any application as a model consisting of tasks and handlers. Tasks are queued and executed sequentially, as shown in the figure below:

Not everyone agrees with this statement of the problem: some would say that RESTful servers have no queues, they only have handlers — methods to process requests.

I see it differently: not all requests are processed at the same time. Some are waiting for their turn in the web engine’s memory, in sockets, or somewhere else. There is a queue somewhere, anyway.

At Miro, we use a WebSocket connection to support collaboration on a board, and the server itself consists of many task queues: there is a separate queue for receiving data, for processing it, for writing it back to sockets, and for writing it to persistent storage. Accordingly, there are queues, and there are handlers.

In our experience, there are always queues; everything that follows in this article builds on that foundation.
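To make the model concrete, here is a minimal sketch in Java (an illustration of the idea, not Miro's code): tasks wait in a queue and a handler thread executes them sequentially.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// A minimal illustration of the "queue + handler" model:
// tasks wait in a queue and a single worker executes them one by one.
public class QueueModel {
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();

    public void submit(Runnable task) {
        queue.add(task); // the task waits here until the handler picks it up
    }

    public void startHandler() {
        Thread handler = new Thread(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Runnable task = queue.take(); // blocks until a task is available
                    task.run();                   // tasks are executed sequentially
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        handler.setDaemon(true);
        handler.start();
    }
}
```

A real server has many such queues and pools of handlers, but the shape of the model is the same.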

What we looked for and what we found

After finding the queues, we started looking for metrics to monitor the execution time of processes in the queues. Before we found the right metrics, we made three mistakes.

The first mistake: we looked at queue size. It seems that a large queue is bad and a small one is good. In fact, this is not always the case: queue size by itself doesn't tell you whether there is a problem. A typical situation: a queue stays empty because each task depends on the result of another, so tasks arrive one at a time. But this doesn't mean that everything is in order, since the tasks themselves are still slow.

The second mistake: we looked for the average task completion time. We already knew that there was no point in measuring queue size, so we calculated the average execution time of the tasks. But as a result, we got figures that didn’t convey any useful information. Here’s why.

Error analysis showed us that the arithmetic mean was the wrong metric: a handful of extreme outliers drag the mean around, so it describes neither the typical task nor the genuinely slow ones. What we needed instead was the median time to complete a task: it discards outliers and highlights the tasks that deserve attention.

In our case, a percentile value was even more suitable. It helped us break down all processes into two groups: anomalies (1%) and tasks that fit the general picture (99%).
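As a hedged illustration of the idea (not Miro's monitoring code), here is how a percentile can be computed from a batch of recorded task durations:

```java
import java.util.Arrays;

// Illustrative percentile calculation over a batch of task durations (in ms).
public final class Percentiles {

    // Returns the value below which `percentile` percent of the samples fall,
    // e.g. percentile = 99.0 gives the p99 duration.
    public static long percentile(long[] durationsMs, double percentile) {
        long[] sorted = durationsMs.clone();
        Arrays.sort(sorted);
        int index = (int) Math.ceil(percentile / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(index, sorted.length - 1))];
    }

    public static void main(String[] args) {
        long[] samples = {12, 15, 11, 14, 13, 980, 12, 16, 13, 12};
        System.out.println("p50 = " + percentile(samples, 50) + " ms"); // typical task
        System.out.println("p99 = " + percentile(samples, 99) + " ms"); // the anomaly
    }
}
```

In a live system you would usually feed the samples into a streaming histogram (for example, a Prometheus histogram) instead of sorting raw arrays, but the idea is the same.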

The percentile also helped us realize the third mistake: you don't have to try to solve every single problem affecting queues; you only need to focus on those that directly affect user actions.

Let me explain: discarding anomalies (1%) helped us focus on optimizing the processes affecting the absolute majority of our users. As a result, we improved UX — and boosted product, not technical, metrics. This is a very important point.

Part two: problem resolution

In the previous section, we defined a method to detect bottlenecks. Now let's pin down which part of a slow task takes the most time.

Based on our experience, we found that the total time it takes to complete a task can be broken down into two parts: the time when the processor is actually computing something, and the time when it is waiting for an input/output (IO) operation to complete.

Errors can occur in the first part of the process as well, but that’s outside the scope of this article. Let’s focus on the second part: the time to wait for a database response. This includes, for example, sending data over the network, and waiting for SQL queries to be executed in the database.

At this stage, we need to be able to hook into the abstraction layer (the data access layer, or DAL) and run a piece of code before and after each operation. In other words, the operation must be observable.

Let's consider an example: at Miro, we use jOOQ to work with SQL. This library has listeners that let you run code before and after each SQL query. For Redis, we use a third-party library that doesn't allow adding listeners. In that case, you can write your own DAL for access: instead of using the library directly in the code, you hide it behind your own interface, whose implementation calls whatever handlers you need.
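To illustrate the jOOQ side, here is a sketch of a timing listener (not Miro's actual listener; in a real setup you would send the measurement to your monitoring client instead of printing it):

```java
import org.jooq.ExecuteContext;
import org.jooq.impl.DefaultExecuteListener;

// A sketch of a jOOQ listener that measures how long each SQL query takes.
public class SqlTimingListener extends DefaultExecuteListener {

    private static final String START_KEY = "sql.start.nanos";

    @Override
    public void executeStart(ExecuteContext ctx) {
        // Remember when the query started; ctx.data() is a per-execution scratchpad.
        ctx.data(START_KEY, System.nanoTime());
    }

    @Override
    public void executeEnd(ExecuteContext ctx) {
        Object start = ctx.data(START_KEY);
        if (start != null) {
            long elapsedMs = (System.nanoTime() - (long) start) / 1_000_000;
            // In production, report elapsedMs to the monitoring system; here we just log it.
            System.out.printf("SQL took %d ms: %s%n", elapsedMs, ctx.sql());
        }
    }
}
```

The listener is attached to the jOOQ Configuration through an ExecuteListenerProvider, so it runs around every query the application executes.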

The same pattern works well for a RESTful application, where a functional or business method is wrapped in interceptors: we record a timestamp in the incoming interceptor, compute the elapsed time in the outgoing interceptor, and send this value to the monitoring system.
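A hedged sketch of such an interceptor as a servlet filter (javax.servlet here; newer containers use jakarta.servlet), reporting the elapsed time to a Prometheus histogram from the simpleclient library; the metric name is illustrative:

```java
import io.prometheus.client.Histogram;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import java.io.IOException;

// A sketch of the incoming/outgoing interceptor idea as a servlet filter:
// note the start time before the business method runs, compute the difference
// afterwards, and send the value to the monitoring system.
public class RequestTimingFilter implements Filter {

    private static final Histogram REQUEST_TIME = Histogram.build()
            .name("http_request_duration_seconds") // illustrative metric name
            .help("Time spent handling HTTP requests")
            .register();

    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        long start = System.nanoTime(); // "incoming interceptor": remember when we started
        try {
            chain.doFilter(request, response); // the actual request handler runs here
        } finally {
            // "outgoing interceptor": compute the difference and report it
            double seconds = (System.nanoTime() - start) / 1_000_000_000.0;
            REQUEST_TIME.observe(seconds);
        }
    }

    @Override
    public void init(FilterConfig filterConfig) { }

    @Override
    public void destroy() { }
}
```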

Let's illustrate this process using our dashboard as an example. We profile a specific task and receive both the total time of its execution and the time spent on SQL and Redis queries. We can also add more fine-grained timers; for example, per Redis command.

Information on the execution of each request is duplicated in Prometheus and Jaeger. Why do we need two systems? The graphs are more illustrative, and the logs are detailed. The systems complement each other.

Let’s look at the example: there’s a command to open a Miro board; its execution time directly depends on the size of the board. Technically, we can’t show on the chart that small boards open quickly and large ones open slowly. But Prometheus shows real-time anomalies that you can promptly react to.

We get more detailed information in logs — for example, we can make a request to see how long it takes to open a small board. At the same time, in Jaeger the data appears with a delay of about one minute. Using both tools allows you to see the full picture — there is enough information to find bottlenecks.

Stack trace for special cases

The approach described above is suitable for optimizing slow tasks when everything works normally. Our next step was to create a methodology for special cases: when there is a spike in requests, or when we need to find and optimize long queues.

Adding data access layers made it possible to snapshot stack traces. This lets you trace database requests for a specific endpoint or Redis command.

We can snapshot a stack trace of every tenth or hundredth operation and write logs, thereby reducing the load. Processing reports selectively is not hard — the difficult task is to visualize the received information in a form that is understandable and usable.

At Miro, we send these stack traces to Grafana. We exclude all third-party library frames from the dump and build the metric name by concatenating a shortened form of the remaining frames. It looks like this: after the transformation, the frame projects.pt.server.RepositoryImpl.findUser (RepositoryImpl.java:171) becomes RepositoryImpl.findUser:171.
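A hedged sketch of both ideas together, sampling every hundredth operation and shortening the frames (the sampling rate and package prefix mirror the examples above; the logging call stands in for the real metrics pipeline):

```java
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;

// A sketch of sampled stack trace snapshotting: capture every N-th operation,
// drop third-party frames, and shorten each frame to "Class.method:line".
public final class StackTraceSampler {

    private static final AtomicLong COUNTER = new AtomicLong();
    private static final int SAMPLE_EVERY = 100;               // sample 1 call in 100
    private static final String OWN_PACKAGE = "projects.pt.";  // keep only our own frames

    public static void maybeSnapshot(String operation) {
        if (COUNTER.incrementAndGet() % SAMPLE_EVERY != 0) {
            return; // skip most calls to keep the overhead low
        }
        String trace = Arrays.stream(Thread.currentThread().getStackTrace())
                .filter(frame -> frame.getClassName().startsWith(OWN_PACKAGE))
                .map(StackTraceSampler::shorten)
                .collect(Collectors.joining(" <- "));
        System.out.println(operation + ": " + trace); // in production, ship this to the metrics store
    }

    // "projects.pt.server.RepositoryImpl.findUser(RepositoryImpl.java:171)"
    //   -> "RepositoryImpl.findUser:171"
    private static String shorten(StackTraceElement frame) {
        String className = frame.getClassName();
        String simpleName = className.substring(className.lastIndexOf('.') + 1);
        return simpleName + "." + frame.getMethodName() + ":" + frame.getLineNumber();
    }
}
```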

WatchDog for continuous monitoring

Stack trace snapshotting is an expensive way to do research, so keeping it active all the time is not a viable option. These reports also generate too much information to process comfortably.

A more universal method for tracking anomalies is WatchDog. This is our library, and it tracks slow-running or stuck tasks. The tool registers the task before its execution, and it unregisters the task after it has completed.

Sometimes tasks freeze: a task that should take 100 milliseconds takes five seconds instead. To catch this, WatchDog runs its own thread that periodically checks the status of registered tasks and snapshots the stack trace of any task that has been running for too long.

In practice, it looks like this: if a task is still hanging after five seconds, we see it in the stack trace log. In addition, the system sends an alert if the task hangs for too long — for example, because of a deadlock in the server.
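To make the mechanism concrete, here is a minimal sketch of the WatchDog idea (this is not Miro's library; the five-second threshold mirrors the example above, and the logging stands in for the real alerting):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// A minimal sketch of the WatchDog idea: register a task before execution,
// unregister it afterwards, and let a background thread snapshot the stack
// trace of any task that has been running for too long.
public final class WatchDog {

    private static final long THRESHOLD_MS = 5_000; // "still hanging after five seconds"

    private static final class RunningTask {
        final Thread thread;
        final long startedAtMs;
        RunningTask(Thread thread, long startedAtMs) {
            this.thread = thread;
            this.startedAtMs = startedAtMs;
        }
    }

    private final Map<String, RunningTask> running = new ConcurrentHashMap<>();
    private final ScheduledExecutorService checker = Executors.newSingleThreadScheduledExecutor();

    public WatchDog() {
        checker.scheduleAtFixedRate(this::checkTasks, 1, 1, TimeUnit.SECONDS);
    }

    public void register(String taskId) {
        running.put(taskId, new RunningTask(Thread.currentThread(), System.currentTimeMillis()));
    }

    public void unregister(String taskId) {
        running.remove(taskId);
    }

    private void checkTasks() {
        long now = System.currentTimeMillis();
        running.forEach((taskId, task) -> {
            if (now - task.startedAtMs > THRESHOLD_MS) {
                // The task is stuck: log where its thread currently is.
                StringBuilder sb = new StringBuilder("Task " + taskId + " is stuck:\n");
                for (StackTraceElement frame : task.thread.getStackTrace()) {
                    sb.append("  at ").append(frame).append('\n');
                }
                System.err.println(sb); // in production, also send an alert here
            }
        });
    }
}
```

A handler would call register() right before running a task on the current thread and unregister() in a finally block once the task completes.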

Conclusion

In March 2020, when self-isolation began, the number of Miro users was growing by 20% a day. We built all the features described in this article in just a few days.

To be honest, we reinvented the wheel: there are commercial products that solve this problem, but they are all expensive. This article shows how to quickly put together a simple tool that can be as good as larger and more expensive solutions, and that can help small projects survive fast scaling.

Join our team!

Would you like to be an Engineer at Miro? Check out opportunities to join the Engineering team.
