How CQRS improved the loading time of a very slow web page

Published in Jamf Engineering · 6 min read · May 4, 2023

Jamf Now has been growing very quickly, not only in the number of customers but also in the size of the customers using it. Recently this success led to a classic scaling challenge: some of our biggest customers started experiencing slow load times on a page in the console called the device summary page. Customer satisfaction is our highest priority, so addressing the problem moved to the top of our list.

Understanding the issue

To better understand the issue, we gathered data about the device summary page loading time. We split tenants into three categories by number of devices: small (fewer than 10 devices), medium (10 to 499 devices), and large (500 or more devices).
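The bucketing above can be written as a small helper. This is only a sketch: the function name `tenant_size` is hypothetical, and the thresholds are the ones quoted in the text.

```python
def tenant_size(device_count: int) -> str:
    """Bucket a tenant by device count, using the thresholds from the text."""
    if device_count < 0:
        raise ValueError("device_count must be non-negative")
    if device_count < 10:
        return "small"   # fewer than 10 devices
    if device_count < 500:
        return "medium"  # 10 to 499 devices
    return "large"       # 500 or more devices
```

Keeping the boundaries in one place makes it easier to slice the same load-time metrics per category later on.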

For the biggest customers, our analytics showed that loading the device summary page could take a minute or more. Not good.

The unusually long loading time was caused by the number of tables queried when fetching the data for the page, including the joins between tables needed to calculate device status. The problem was compounded by the fact that we do not paginate the data. This was a deliberate decision: our UI always fetches the state of all devices from the server, which lets us support quick device filtering on the UI side. Once all the data has reached the client, filtering is fast and provides a great UX, but fetching everything in one request also contributed directly to the performance issues on the server side.

Our solution

Our main goal was to improve the device summary page loading time while retaining the same slick ‘quick device filtering’ UX. Since the product was already moving toward a microservices architecture, we decided to create a new microservice responsible for storing device-related status data: service-device-status.

We decided to use the CQRS (Command Query Responsibility Segregation) pattern and store aggregated device summary documents in Amazon DocumentDB. With CQRS, reads are served from a model that is kept separate from the one handling writes, so the page can be built from pre-aggregated documents. This removes the need for slow joins when fetching the data and so should dramatically speed up the queries.

The aggregated document is created and modified on the basis of events coming from different parts of our system. Many of them were already there, but we had also identified a lot of additional events needed to complete our goal.
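To make the idea concrete, here is a minimal sketch of an event-driven read model. It is not the actual service-device-status code: the event types (`DeviceEnrolled`, `DeviceUpdated`, `DeviceRemoved`), the field names, and the in-memory dictionary standing in for the document database are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class DeviceSummaryStore:
    """In-memory stand-in for a document collection keyed by (tenant_id, device_id)."""

    docs: dict[tuple[str, str], dict[str, Any]] = field(default_factory=dict)

    def apply(self, event: dict[str, Any]) -> None:
        """Write side: fold one event into the aggregated document."""
        key = (event["tenant_id"], event["device_id"])
        if event["type"] == "DeviceRemoved":  # hypothetical event type
            self.docs.pop(key, None)
            return
        # Create the document on first sight, then merge the changed fields.
        doc = self.docs.setdefault(
            key, {"tenant_id": event["tenant_id"], "device_id": event["device_id"]}
        )
        doc.update(event.get("changes", {}))

    def summaries_for(self, tenant_id: str) -> list[dict[str, Any]]:
        """Read side: return ready-made documents, with no joins or calculations."""
        return [doc for (tenant, _), doc in self.docs.items() if tenant == tenant_id]
```

The point of the sketch is the split: all the work happens in `apply` when something changes, so `summaries_for` only has to return what is already stored.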

The new events required storing more data in the main application database, so that the application has a reference for deciding when a specific event should be sent. Filling in this missing data became one of the steps in the migration pipeline.

To understand if our microservice was actually improving performance for customers, we defined a Service Level Objective (SLO) — 99% of customers should be able to load the device summary page in less than one second.
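A check against such an SLO could look like the sketch below. The function names are hypothetical, and the post does not describe how measurements are aggregated per customer, so the sketch simply assumes one load-time sample (in seconds) per customer.

```python
from typing import Sequence


def slo_attainment(load_times_s: Sequence[float], threshold_s: float = 1.0) -> float:
    """Return the fraction of samples that loaded in less than threshold_s seconds."""
    if not load_times_s:
        raise ValueError("at least one sample is required")
    within = sum(1 for t in load_times_s if t < threshold_s)
    return within / len(load_times_s)


def meets_slo(
    load_times_s: Sequence[float], target: float = 0.99, threshold_s: float = 1.0
) -> bool:
    """True when at least `target` of the samples are under the threshold."""
    return slo_attainment(load_times_s, threshold_s) >= target
```

For example, 99 samples of 0.2 s plus one of 2.0 s give an attainment of 0.99, which just meets the target.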

No-downtime migration

Because the new service has its own storage with a different data schema, we had to fill its database with the data of existing tenants before we could switch them to the new solution. Events carry information about data changes, so we needed to generate a baseline of each tenant’s device data before we started handling the events. We achieved this by migrating data from the existing tables and services to the new database.
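The reason a baseline is needed can be shown with a small sketch. The row and event shapes here are hypothetical, as are the function names; the only point is that a change event carries deltas, so it cannot produce a complete document on its own.

```python
from typing import Any

Doc = dict[str, Any]


def backfill(legacy_rows: list[Doc]) -> dict[str, Doc]:
    """Build one baseline document per device from rows read out of the old tables."""
    return {row["device_id"]: dict(row) for row in legacy_rows}


def apply_change(docs: dict[str, Doc], event: Doc) -> None:
    """Apply a change event; it only carries deltas, so a baseline must already exist."""
    device_id = event["device_id"]
    if device_id not in docs:
        raise LookupError(f"no baseline document for device {device_id!r}")
    docs[device_id].update(event["changes"])
```

With the baseline in place, an event that only mentions a changed field still leaves a full document behind, because the untouched fields come from the backfill.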

To avoid Jamf Now downtime and keep the migration smooth and unnoticeable for users, we put additional mechanisms in place. In particular, we used feature flags to dynamically change which endpoint (the old one from the main application or the new one from service-device-status) is used for a specific user.
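A flag-based switch of this kind might be sketched as follows. The endpoint paths, the `FeatureFlags` class, and the per-tenant granularity are assumptions for illustration; the post does not describe the actual flag system.

```python
OLD_ENDPOINT = "/legacy/device-summary"    # hypothetical path in the main application
NEW_ENDPOINT = "/device-status/summaries"  # hypothetical path in service-device-status


class FeatureFlags:
    """Tiny in-memory flag store; a real system would use a flag service."""

    def __init__(self) -> None:
        self._enabled: set[str] = set()

    def enable(self, tenant_id: str) -> None:
        self._enabled.add(tenant_id)

    def disable(self, tenant_id: str) -> None:
        # Disabling is the rollback path: the tenant goes back to the old endpoint.
        self._enabled.discard(tenant_id)

    def is_enabled(self, tenant_id: str) -> bool:
        return tenant_id in self._enabled


def summary_endpoint(flags: FeatureFlags, tenant_id: str) -> str:
    """Pick the endpoint for a tenant at request time, so a flip takes effect immediately."""
    return NEW_ENDPOINT if flags.is_enabled(tenant_id) else OLD_ENDPOINT
```

Because the choice is made per request, tenants can be moved to the new service one at a time and moved back without a deployment.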

Before migrating all the tenants, we engaged with the customers that had experienced the worst device summary page performance. During that time we monitored metrics and logs so that we could handle issues as they appeared. As a result, we got feedback on whether the new solution improved the situation.

Results

In the screenshot below, the statistics for the old application are shown on the left in the (mostly red) table and gauge charts.
The statistics for the new application are shown on the right in the (mostly green) table and gauge charts.

The old application does not achieve our SLO: only 74.58% of customers can load the device list endpoint in less than one second, and the 99th-percentile request took over 28 seconds.
The new application achieves the SLO: 99.98% of customers can load the device list endpoint in less than one second, and the 99th-percentile request now takes only 153 milliseconds.

The line graphs in the middle show how these values change over a 4-day period. device-status is the new microservice and smb-frontend is the old application.

In the next screenshot, you can see the same statistics but only for large customers (500 or more devices). These large customers saw the biggest benefit from the new microservice. Again, statistics for the old application are on the left and statistics for the new microservice are on the right.

The difference between the old and new solutions is huge. With the old application, no large customers were able to load the device list in less than one second, and only 2.37% were able to load it within 3 seconds. The median load time for these large tenants was 25.8 seconds.

With the new microservice, 99.6% of large customers are able to load the device list in less than one second, and 100% of large customers can load it within 3 seconds. The median load time is now 145 ms.

For large tenants, the service improves device summary page loading by two orders of magnitude, and in some cases even more.
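The orders-of-magnitude figure can be checked against the median load times quoted above for large tenants (25.8 seconds before, 145 ms after). The variable names are only illustrative.

```python
import math

old_median_s = 25.8   # median load time for large tenants, old application
new_median_s = 0.145  # median load time for large tenants, new microservice

speedup = old_median_s / new_median_s        # roughly 178 times faster
orders_of_magnitude = math.log10(speedup)    # a little over 2

print(f"speedup: {speedup:.0f}x, orders of magnitude: {orders_of_magnitude:.2f}")
```

The median improves by a factor of roughly 178, which is a little over two orders of magnitude.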

This is a result of the architecture change: device status is no longer calculated on demand but at the moment it changes. The status is stored in aggregated form, so no calculation is needed when the service is asked for it.

Summary

The problem turned out to be more complex than it seemed at the beginning. The new service has been running in production for all customers for a few months now, and its metrics remain very good and within the SLO we defined. The difference between the previous and current versions of the feature is huge. The journey with the summary page is closed, but we still have many opportunities to use service-device-status to improve performance in other parts of the application.

If you would like to follow in our footsteps, here are some key points to consider:

Establish your SLOs: your goal should be clear.
Identify your bottlenecks and find ways to eliminate them. In our case, we switched from calculating the status on each HTTP request to updating the status when it actually changes.
When you find your solution, check which resources fit it best. For example, we decided that a document database would do the job of storing device status.