‘And There Was Thunder’: Escalations, Performance Issues, and How We Solved Them

WrikeDEV
Published in Wrike TechClub
Oct 12, 2021 · 12 min read

This article will be of interest to anyone who develops complex apps with a computing-heavy front end while fighting for their customers and better performance. Wrike’s front-end team leaders, Igor Zubov and Alexey Sharov, share their first-hand experience in the matter.

At some point, one of our key product components transformed into a clumsy monster, accumulating a whole bunch of high-priority performance complaints from disgruntled customers. What steps did we take to rectify the situation without rewriting almost everything from scratch?

We work in different teams within one product unit — Wrike for Professional Services. One of the most complex parts of the product our unit is responsible for is Workload. It is a chart used by managers at different levels to monitor the project workload, including days when employees are overloaded with tasks and days when they are underutilized. A manager can view all project tasks, plan work, swap tasks, or reassign them to a different person.

On the left are the users. In the calendar grid, you can see how many hours a user is busy on any given day.

Workload is not merely a diary planner, it is also a chart for project and task navigation:

The blue horizontal bar is a task. It is scheduled for four days and is in the “Pyrix” project of one of the users.

We developed Workload three years ago as a simple spreadsheet. Now it is a complex component: over 100,000 lines of code and a huge amount of calculations executed on the front-end side (many of these are optimistic: recalculating the schedule, moving the task etc.).

At first glance, it seems simple enough. Users, however, may have their own bespoke schedule: one day they work 8 hours, the next they work 3, and on the next they take a day off. Several users with different schedules can be assigned to a single task at once. Initially, we decided to implement all the logic on the front end, so a lot of calculations occur there. As it developed, the component turned out to be complex.

At the beginning of 2021, we realized that everything was bad.

How did we realize that everything was bad?

When you have spent years working with test environments and test accounts, your focus might not be what it used to be. There may not be enough data or the data might be wrong. We noticed that the component started to slow down a little but, overall, everything was still fine. It seemed to us that this reduced speed was not a critical fault: that this was just what usually happens when you rapidly saturate a component with features.

The component architecture that we created in 2018 was not scalable; it suited only mid-segment clients. Back then, we didn't have to process large amounts of data. As we added new features, Wrike grew tremendously as a product and a company. We now have a lot of Enterprise customers, and these companies have even more teams, more people, and more data.

As a result, in January 2021, we received our first complaints about loading speed in Workload. We had about 10 escalations. In total, the company has 20,000+ customers, and 10 out of 20,000 is a very small number. But these escalations all arrived within a short period of time — 10 in just a month.

It became clear there was a problem, and it needed to be solved. We took it really seriously: we went to the product owner and made it clear that we needed to understand why we were having such issues with performance. We asked to postpone all releases for a month and use the time instead to try to solve the problem.

Possible causes of performance issues

We started by identifying reasons that could theoretically explain why performance was affected.

The number of DOM elements on the page. The table is built on the DOM, so lots of data means lots of DOM elements on a page. We assumed we were running into the limits of Angular's rendering performance.

Workload has a compact mode that doubles the number of cells in the table:

There are 49K DOM Nodes on this page in compact mode, and this isn’t even the largest number we’ve seen. Some users use 4K monitors, and they have 10 times more information in the table.

Each small cell in the table is a div and, in some cases, their styles are recalculated on the fly.

There can be a lot of them:

There are almost 50K here. Sometimes there have been 70K.
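(As a side note, a quick way to get this number for any page is to count every element from the DevTools console; it's a rough figure, since it includes nodes outside the chart as well.)

    // Run in the browser DevTools console: rough count of all DOM elements on the page.
    document.querySelectorAll('*').length;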

We assumed the component was starting to slow down because it’s difficult to render such a high number of elements.

Advanced features get slower as the amount of data increases. We tried to make the product as user-friendly as possible and introduced some fairly complex features: Infinite Scroll, optimistic calculations, online updates. As the amount of data increased, these features slowed down even further.

There was a lot of computation on the front-end side. With the increase in the amount of user data, it became too much to render and process.

One of the most popular scenarios is recalculating the duration of a task according to the user’s calendar. When a user moves a task from one day to another, we draw a special placeholder with the estimated duration of the task and indicate its start and end date. In this case, the user’s calendar is arbitrary: two work days/two days off, shortened days, part-time working schedules and half-days. All this must be taken into account when “dragging” a task through Workload, and adding or removing groupings is also done on the front-end side.
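To give a sense of this kind of front-end calculation, here is a minimal sketch (not our production code; the calendar shape and function names are made up for illustration) of deriving a task's end date from its start date, its effort in hours, and a per-user calendar:

    // Hypothetical per-user calendar: working hours for each ISO date, e.g. {"2021-05-03": 8}.
    // Days off are simply absent (treated as 0 hours).
    type UserCalendar = Record<string, number>;

    // Walks forward from `start`, consuming `effortHours` against the user's calendar,
    // and returns the date on which the task would finish.
    function estimateEndDate(start: Date, effortHours: number, calendar: UserCalendar): Date {
      const current = new Date(start);
      let remaining = effortHours;
      // The guard caps the walk at one year so an empty calendar can't loop forever.
      for (let guard = 0; remaining > 0 && guard < 365; guard++) {
        const isoDay = current.toISOString().slice(0, 10); // "YYYY-MM-DD"
        remaining -= calendar[isoDay] ?? 0;                 // 0 hours on days off
        if (remaining > 0) {
          current.setDate(current.getDate() + 1);           // move to the next calendar day
        }
      }
      return current;
    }

When a task is dragged onto another user, a calculation like this runs optimistically in the browser, so the placeholder can show the new start and end dates immediately, before the server confirms anything.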

These were just our assumptions. We still didn’t really know what was going on with the users. Everything had worked more or less fine on test accounts, if only a bit slower.

Action plan for solving the problem

To understand what was really happening, we drew up an action plan for the month.

It looked like this:

  1. Understand what’s happening with clients.
  2. Improve their lives within a month with a series of quick fixes.
  3. Consider the architecture changes that will be implemented in the medium to long term. Include them in the plan.
  4. Change the architecture so that the increase in the amount of user data doesn’t affect the performance of the component.

How did we know that everything was slowing down on the client’s side? Support tickets are usually quite brief: they contain a minimal description of the problem, a few screenshots and a video, if you’re lucky.

For every escalation, the customer got a call from a product manager, and we started digging into the details. It was necessary to understand what exactly was happening for each client: what data they had in the table, how much of it there was, and how it was distributed between the main table and the backlog.

On our test accounts, it’s hard to guess what a real user might think of in their table. For example, one customer with a car service breaks up a car inspection into small subtasks: change the oil, unscrew the left plug, unscrew the right plug, and so on. He has about ten MOTs every day, and each checkup is made up of a hundred tasks.

To get an objective assessment of how much our application actually slows down, we started using Apdex — Application Performance Index system.

Performance tracking and the Apdex system

Many readers are probably familiar with tracking. This is what analysts do all the time. For example, a user clicks on the A button and the tracking event goes to the analytics system. Analysts see that the user clicks on the A button often, but does not click on the B button at all. We used this type of tracking to see how long custom actions take, effectively creating our own performance tracking.

As long as the user’s picture is static, everything is fine. But as soon as the user starts doing something in the table, they expect the action to complete within a reasonable amount of time. If the action takes longer, it frustrates the user.

We decided to measure these actions and get the time intervals for each: the initial page load, project opening, task movement etc.

To achieve this, we sent two tracking events. One for when the user performs the action:

{"event":"row__expand","datetime":"2021-05-01T17:40:43.758","group":"performance","value":{"workload_id":"81451","members_cnt":1,"zoom_level":"dayDense","tasks":6,"grouping":"jobRoleGrouping/user/project","performance_uuid":"972438e0-aa93-11eb-a8f9-557874958202","performance_timestamp":1619883643758,"performance_event_type":"start"},"version":"2021-03-12"}

And one for when it has finished rendering:

{"event":"row__expand","fraction":"1/1","datetime":"2021-05-01T17:40:43.953","group":"performance","value":{"performance_uuid":"972438e0-aa93-11eb-a8f9-557874958202","performance_timestamp":1619883643953,"performance_event_type":"finish"}}

From the delay between the two events, we were able to understand how long each action takes. In our example, it turned out to be 1619883643953 − 1619883643758 = 195 ms.
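Conceptually, the instrumentation is just a thin wrapper around the measured action. Here is a simplified sketch (the transport, endpoint, and helper names are illustrative, not our actual tracking API):

    // Illustrative transport: in reality, events go to the analytics system.
    function sendEvent(payload: Record<string, unknown>): void {
      navigator.sendBeacon('/analytics', JSON.stringify(payload)); // endpoint is made up
    }

    // Emits a "start" event, runs the measured action, then emits a "finish" event
    // with the same uuid so the two can be paired up and the delta computed later.
    // In practice, the "finish" event fires only after rendering has completed.
    async function measure<T>(event: string, action: () => Promise<T>): Promise<T> {
      const uuid = crypto.randomUUID();
      const track = (type: 'start' | 'finish') =>
        sendEvent({
          event,
          group: 'performance',
          value: {
            performance_uuid: uuid,
            performance_timestamp: Date.now(),
            performance_event_type: type,
          },
        });
      track('start');
      const result = await action();
      track('finish');
      return result;
    }

    // Usage (expandRow is a hypothetical action): await measure('row__expand', () => expandRow(rowId));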

We then asked ourselves the question: is 195 milliseconds a lot or normal? The Apdex system helped us to answer it.

Apdex methodology is an open international standard designed to form an objective assessment of the performance indicators of information systems. The performance indicator is a number between 0 and 1. A value of 1 means that the application is working perfectly, and 0 means that the application does not work at all.

To get the performance indicator, you need to:

  • Prepare a list of operations to be monitored: we opened Workload and started thinking about what operations users usually perform — scroll, click and display a tooltip with a task, open the backlog.
  • Set a target time for each operation: what interval counts as good, acceptable, and unacceptable? The actions differ, so each needs its own interval. The initial load may take a couple of seconds: this may surprise a user, but once the table is open they will forget they had to wait for 2 seconds. Slowing down the movement of a task, however, is bad, because it's an action the user performs constantly. For the initial load, a good or acceptable time is 1–2 seconds; for moving a task, it's 200–500 milliseconds (see the sketch after this list).
  • Sort events by priority: high, mid, low. The initial load probably shouldn't be given a high priority because it runs only once, but frequently repeated actions should.
  • Enable the counter for measuring performance for all operations on the list and accumulate statistics.
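To illustrate the second step, the target times can live in a small config keyed by operation. The shape below is a hypothetical sketch; the numbers come from the examples above:

    // Hypothetical per-operation targets, in milliseconds.
    interface OperationTarget {
      good: number;        // duration <= good       -> "good"
      acceptable: number;  // duration <= acceptable -> "acceptable"
      priority: 'high' | 'mid' | 'low';
    }

    const targets: Record<string, OperationTarget> = {
      initial_load: { good: 1000, acceptable: 2000, priority: 'mid' },  // runs once per session
      task_move:    { good: 200,  acceptable: 500,  priority: 'high' }, // performed constantly
    };

    // Buckets a single measurement against its operation's target.
    function classify(operation: string, durationMs: number): 'good' | 'acceptable' | 'unacceptable' {
      const target = targets[operation];
      if (durationMs <= target.good) return 'good';
      if (durationMs <= target.acceptable) return 'acceptable';
      return 'unacceptable';
    }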

We obtained the execution time for each operation, which falls within a certain interval:

Execution timing

Apdex is calculated using the following formula:

Add the number of events that fall within the “good” category to half the number of events that fall within the “acceptable” category, then divide the resulting sum by the total number of events.
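In symbols, with a made-up example purely for illustration:

    Apdex = (good + acceptable / 2) / total

For instance, with 700 “good”, 200 “acceptable”, and 100 “unacceptable” measurements out of 1,000, the index would be (700 + 200 / 2) / 1000 = 0.8.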

The resulting index value is estimated according to the table:

We collected statistics on our events in Tableau:

For several months, the index value didn’t change

We found out that the index depends on the number of rows in Workload — the number of users that are displayed on the page:

If the number of users is 1–4, everything works quite quickly. Once there are over 10 users, Workload starts slowing down.

Our hypothesis was confirmed: an increase in the amount of data negatively affects loading speed.

How did we solve the performance problem: quick fixes and long-term plans

Once we figured out the reasons behind the problems and found a tool for evaluating performance, we moved on to the second point of the plan — improving the lives of users within a month with quick fixes. At the same time, we began to think about mid-term and long-term solutions, which we would use to change the architecture.

We classified the following actions as quick fixes.

We optimized the number of front-end calculations, of which there are many for different user actions. Initially, from a calculations point of view, everything was written quite competently: we used caching selectors everywhere, and nothing was recalculated unless necessary. There wasn't much room for improvement, but execution times still improved by about 10% on average. That, however, wasn't enough.

We changed the logic of getting tasks into the backlog as, during calls with clients, we noticed that the backlog box freezes the browser when filled with thousands of tasks. The backlog is the bottom bar in Workload. It displays tasks that aren’t assigned to performers:

Some users with large accounts had two or three thousand tasks in their backlog, which even they found confusing. This happened because, during the development of the component, the product owners decided that everything should go into the backlog.

We then decided to limit the backlog to tasks within the scope of the project. This solution was successful: there was much less unnecessary information, and customers were no longer confused by thousands of tasks. We received good feedback on this change.

We disabled the display of information in the table while scrolling (unsuccessfully). With several dozen users and hundreds of tasks, horizontal scrolling began to severely slow down. We don’t use browser scrolling but our own, because we track whether the current scroll has gone beyond the loaded timeframe.

We decided to conduct an experiment and turn off the display of information on the chart the moment the user starts scrolling the table. We completely cleared the DOM of all elements, so users didn't see any days, schedules, or tasks while scrolling; only a header with weekdays.

We enabled this new functionality for several clients. The loading speed improved significantly, but the customers didn't like it: it turned out it was important for them to see the tasks while scrolling. The experiment was unsuccessful.

We disabled background updates (also with little success). We assumed that the overall performance on large accounts was affected by constant background updates, which cause redrawing. We use Wrike ourselves and have one of the largest accounts among our customers, so we ran an experiment on our own account and confirmed the hypothesis.

We didn't roll out disabling online updates to clients, however, because of the unsuccessful experiment with horizontal scrolling. Online updates are very important for maintaining data consistency. Since Workload is a core part of the product, users usually open it once at the beginning of the working day and rarely close it at all. This means the information can become outdated if it isn't updated on the fly: the actions of all the users displayed in Workload change the overall picture of tasks. Disabling online updates was out of the question, since they are a key feature of Workload.

We began to process and draw only what the user sees. We realized that the main problem lay in the number of elements in the DOM that needed to be recalculated and redrawn. So we implemented full virtualization and started rendering and computing only what the user actually sees.
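The idea behind the virtualization, in a stripped-down form (real Workload rows have groupings and variable heights; the sketch below assumes a fixed row height purely for illustration):

    const ROW_HEIGHT = 32; // hypothetical fixed row height, in px
    const OVERSCAN = 3;    // render a few extra rows above/below the viewport to avoid flicker

    // Given the scroll position and viewport height, compute which rows are visible.
    // Only these rows are rendered and recalculated; everything else is skipped entirely.
    function visibleRange(scrollTop: number, viewportHeight: number, totalRows: number) {
      const first = Math.max(0, Math.floor(scrollTop / ROW_HEIGHT) - OVERSCAN);
      const last = Math.min(
        totalRows - 1,
        Math.ceil((scrollTop + viewportHeight) / ROW_HEIGHT) + OVERSCAN,
      );
      return { first, last };
    }

    // E.g. with 500 users, scrollTop = 960 and a 640 px viewport,
    // only rows 27..53 are rendered instead of all 500.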

We managed to significantly reduce the number of elements on the page. Everything began to load several times faster. The speed of some operations increased 100-fold.

For medium and long-term solutions, we came up with two plans of action:

  • We prototyped and planned a move of the component from the DOM to Canvas. The prototype showed that everything works much faster on Canvas.
  • We planned a departure from the horizontal Infinite Scroll. We decided to use regular browser scrolling and load data upon request. When the user hits the border of the loaded time frame, they will be able to load new data. Many similar products on the market use the same mechanism. Apparently, they learned their lesson before us.

These two solutions should allow us to solve all rendering problems over the next year or two.
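To give a feel for the Canvas direction, here is a minimal sketch of drawing task bars on a single canvas instead of thousands of divs (the cell size and data shape are assumptions for illustration, not the prototype's actual code):

    interface TaskBar {
      row: number;          // index of the user's row
      startDay: number;     // offset from the left edge of the loaded time frame, in days
      durationDays: number;
    }

    const CELL_WIDTH = 40;  // hypothetical day-cell width, in px
    const ROW_HEIGHT = 32;  // hypothetical row height, in px

    // One canvas replaces thousands of DOM nodes: the chart is repainted with plain
    // drawing calls, which avoids layout and style recalculation entirely.
    function drawTasks(canvas: HTMLCanvasElement, tasks: TaskBar[]): void {
      const ctx = canvas.getContext('2d')!;
      ctx.clearRect(0, 0, canvas.width, canvas.height);
      ctx.fillStyle = '#4a90d9';
      for (const task of tasks) {
        ctx.fillRect(
          task.startDay * CELL_WIDTH,
          task.row * ROW_HEIGHT + 4,
          task.durationDays * CELL_WIDTH,
          ROW_HEIGHT - 8,
        );
      }
    }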

Results and conclusion

After the fixes, the Apdex value changed:

We started rolling out the first experiments in mid-March. By the end of April/beginning of May, our Apdex rose to 0.80. This is not quite “good”, but it is very close to “acceptable”.

Graph of the index depending on the number of users:

The dependence remains: the more users, the more the table slows down. But where Apdex was 0.60 in accounts with 20+ users before, it is 0.75 now.

The flow of escalations from customers has stopped: we haven’t received a single complaint since mid-March. The business is happy, the releases are unlocked. We are now working to implement medium- and long-term solutions.

The main conclusion we drew is that performance comes first. It is a simple truth: an engineer must make sure that no corners are cut. If there is any suspicion that the product will be slow, it is the engineer's responsibility to flag it and raise it with the business.

We hope that our case will help some readers avoid such problems or quickly find ways to solve them. Share your stories in the comments about how things got bad for you and how you dealt with it.
