How We Improved the Performance of Our Reports

Mikk Tammark · Published in motive-eng · Mar 1, 2024 · 16 min read

At Motive, we collect gigabytes of data daily from our vehicle gateways, asset gateways and camera offerings. There are a multitude of ways in which we use that data to power services that make a difference in our customers’ lives. One of the basic requirements for most of our customers is a comprehensive set of reports. Over time we’ve created dozens of different reports to provide insights into fleet usage, locations, histories, geofences, fuel usage and more. Reports were created in an ad hoc manner, each one nearly independent of the others. Creating a new report was a combined effort between the frontend and backend teams, with lots of back and forth establishing contracts and functionality, resulting in a longer process than desired. On top of slow development, as we’ve gained more of a foothold with larger enterprise-size fleets, a report that performed adequately for a fleet of under 100 vehicles could struggle with one that had over 500.

With the goal of easing the development cost of new reports as well as improving the overall performance of all of our reports, we began taking steps towards a faster future.

Initial State

Quick tech stack information: we are primarily a Ruby on Rails shop, with some microservices written in Golang, and we use PostgreSQL as our primary data store. Our frontend consists mainly of Angular. We also offer custom reporting using Snowflake.

Each report was treated as an individual page in the frontend and an individual API endpoint that contained its own code and logic for returning the requested data. As we added more and more reports to our product, developing and maintaining them became quite cumbersome, and classes were ballooning with custom code everywhere. The first step we took towards improving development time was to drive some of the frontend via data stored in the backend. This would allow the frontend to build reusable components that work across all reports, eliminating duplicate frontend code and creating consistent contracts across all reports.

Report Definitions

Reporting Architecture Diagram

We created a wrapper object around each report which we call a Report Definition. It contains some necessary information about the report, the main pieces being a unique key to identify the report to the frontend and a route telling the frontend which API endpoint to call to fetch data. With those two things alone we could serve up a list of all reports to the frontend, since they were now all defined and stored in the database.

We could also use this object to start adding other vital functionality. Authorization used to be handled in the frontend, which loaded the user’s role and company settings to determine whether a particular report could be accessed. This worked, but it could run into issues if we needed to access reports outside the frontend, since the logic would need to exist in multiple places, for example if we wanted subscription support. A report subscription, or scheduled report, lets us set a cadence at which we’d like to generate the report and send it (PDF or CSV) via email to a list of stakeholders. To enable report subscriptions we created another layer on top of report definitions. Since the report is now defined in the backend, we can trigger it with a cron job and know which function or endpoint to call to render the data, while respecting the authorization settings for that report. We could also store user-specific state for a report, such as which filters and sorting they had set, so they can come back to a report without setting it all up again. Neither subscriptions nor saved states were possible without storing the report definition in the database. It also let us start putting rendering information into the definition as well, allowing reports to be almost completely controlled from the backend, meaning new reports could be added without any work required from the frontend team.

Sequence diagram for driving reports from the backend.
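As a rough illustration (the model, fields and values below are hypothetical, not our exact schema), a report definition boils down to a small record like this:

# Hypothetical sketch of a report definition record; names are illustrative.
class ReportDefinition < ApplicationRecord
  # key:            unique identifier the frontend uses to reference the report
  # api_route:      which API endpoint the frontend calls to fetch report data
  # required_roles: lets the backend authorize access instead of the frontend
  validates :key, presence: true, uniqueness: true
  validates :api_route, presence: true

  has_many :report_subscriptions # scheduled pdf/csv delivery via cron
  has_many :report_saved_states  # per-user filters and sorting
end

ReportDefinition.create!(
  key: "vehicle_inspection_detail",
  api_route: "/api/v1/reports/vehicle_inspection_detail",
  required_roles: ["fleet_admin", "fleet_manager"]
)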

Augment with filter and table definitions

Most of our reports are tabular in nature; that is, they’re mostly columns and rows. They also tend to share similar filters, usually dates and entities (vehicles/assets/groups). We decided on a JSON schema that moves all configuration for a report into the report definition model. Filters, summaries (aggregated counts) and the table itself can all be defined via JSON. The end result is that the frontend has a set of common components used across all reports, whose behavior is loaded via the JSON config pulled from the backend API. Now, adding a new report doesn’t require any frontend changes; you simply create a new report definition. This has reduced the amount of code in the frontend and lets us quickly create new reports.

Report Header Example
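To give a concrete feel for the approach, here is a hedged sketch of the kind of config a report definition might carry; the schema and field names below are illustrative only, not our production format.

# Illustrative config only; the real JSON schema differs in its details.
report_config = {
  filters: [
    { key: "date_range", type: "date_range", default: "last_2_weeks" },
    { key: "vehicle_ids", type: "multi_select", options_route: "/api/v1/vehicles" }
  ],
  summaries: [
    { key: "total_inspections", label: "Total Inspections", type: "count" }
  ],
  table: {
    columns: [
      { key: "vehicle_number", label: "Vehicle", sortable: true },
      { key: "inspection_count", label: "Inspections", sortable: true }
    ],
    default_sort: { column: "vehicle_number", direction: "asc" }
  }
}

# Stored on the report definition and served to the frontend as JSON.
report_definition.update!(config: report_config.to_json)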

As we’ve been migrating our reports to this new framework, we also started looking more deeply into the performance of these reports. First, however, we needed to investigate what the main performance issues were across all our reports.

Getting Informed

The first step in looking at report performance was knowing how our users were consuming their reports. We use Datadog to monitor API traces, metrics, and errors, and to create dashboards and monitors; it’s our go-to resource for identifying APIs with high latency or error rates. Using the data we collected, we were able to get performance numbers for our report APIs, as well as some other interesting tidbits about how our users are using them.

Report Usage Metrics

These insights helped drive the initial ideas on improvement possibilities. We explored caching extensively, that is, storing the dataset in memory so it can be fetched quickly without hitting the database. In our case this wouldn’t have helped us much, as very few people went to the second page of a report. In addition, with filter options and date ranges, caching would only really work for a report’s default values; unless we cached all possible filter and sorting combinations, which would be a lot of cache entries, we’d have to pull in new data as soon as a user added a filter or modified the date range.

Investigation

With the help of our Datadog dashboard, we prioritized our worst performers and began a deeper dive into individual reports to see what was impacting performance. There was a common subset of problems that were impacting the majority of the poorly performing reports.

  1. Pulling data from different sources
  2. Grabbing all entities even though they have no data for the selected time range
  3. Default filter settings are causing us to suffer
  4. Sorting on columns that have been transformed or translated in the frontend
  5. Unoptimized queries
  6. Summary calculation for each page
  7. Filter spam calls

Pulling data from different sources

We have reports where we pull data from multiple sources (e.g., Postgres and DynamoDB) and combine them in memory to create the final dataset. This can be slow, since you generally have to merge the datasets in memory to produce the desired results. If you need to filter and sort on a merged dataset, there’s no easy way to do it at the database layer; instead, we have to pull in and merge the entire dataset before we can sort, filter and return the proper result set.

Grabbing all entities

In many of our reports, we pull in all entities (drivers, vehicles, etc.) even if they have no pertinent data for the report. For example, we had a Vehicle Inspection Report that lists all users and the inspection reports they made in the given time range. However, even if a user had made no inspection reports in that time range, they were still fetched, returned and displayed in the report. It could be argued that this provides some value depending on the use case of the report, for example if I wanted to see who didn’t have any trips for a week or which vehicles had no fuel usage. But then we could argue that the report should be split into two: one detailing the entities that have data for the time range and one for the entities that do not. Often, the query to grab all the entities that have no data takes up the majority of the time, as there are more entities without data for the requested time range than with it.
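A minimal sketch of the kind of query change that addresses this, with hypothetical models and associations: only return users that actually have inspection reports in the selected window.

# Before: every user in the company comes back, even with zero inspections.
users = company.users

# After: only users with at least one inspection report in the time range.
users = company.users
               .joins(:inspection_reports)
               .where(inspection_reports: { created_at: start_time..end_time })
               .distinct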

Default filter settings

This is mainly an issue for filtering on entities, such as vehicles/drivers, and not as big a deal for report-specific filters like an entity’s status (e.g., active, deactivated). Currently there are no default filter settings, so we end up asking for all entities.

The default time ranges vary (they’re usually tied to HOS (hours of service) cycle settings) and most reports have no restriction on how large your date range can be. This results in some clients asking for data for the past year which, especially for an export, can slow the database to a crawl.

Sorting on columns that have been transformed or translated in the frontend

If we display translated column values in the frontend, this poses challenges when sorting, as the sort value will not be the value stored in the database; we would need a custom mapping between the translated value and the value stored in the database.

Example of translated Jurisdictions

Here the database stores English jurisdictions that we pass to the frontend (sometimes the backend translates them and sometimes the frontend does), and the frontend displays the translated jurisdictions. You can see that if you sort on the translated Spanish values, they come out in a different order than the English values we store in the database, resulting in an incorrect sort order.

Possible solutions are described in the sorting & filtering section below.

Unoptimized database queries

There were at least a few reports that had unoptimized database queries where some N+1 queries were happening. Those needed to be fixed on a case-by-case basis.
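As a typical example (the attribute names are hypothetical), the ActiveRecord N+1 pattern and its fix with eager loading look roughly like this:

range = Date.parse("2022-01-01")..Date.parse("2022-05-01")

# N+1: one query for the driving periods, then one vehicles query per row.
DrivingPeriod.where(start_time: range).each do |period|
  puts period.vehicle.number # hits the database for every period
end

# Fixed: eager load the association so vehicles load in a single extra query.
DrivingPeriod.includes(:vehicle).where(start_time: range).each do |period|
  puts period.vehicle.number # no additional queries
end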

Summary Calculations

In many reports, we have a summary table that usually contains counts of certain values we’re looking at. For each page of the report, the same summary queries are fired repeatedly even though the counts are identical for every page. However, this only affects < 5% of requests (since users rarely look past the first page), so it isn’t a big issue; time is better spent optimizing elsewhere.

Filter spam calls

The filter controls for multi-select make an API call for report data after every selection. This resulted in some confusing logs in Datadog where the same user would hit the report API multiple times within a few seconds, with each subsequent hit performing far worse, sometimes even resulting in server errors. This happens when the query is already expensive: the database is essentially executing the same query again before it has even finished the first execution. No partial results are cached, and a large amount of memory is consumed holding these large datasets in the database’s memory, causing major degradation in database performance.

Multi-select filter

Less, is better

One of Motive’s core values is Less, but better. In the world of report performance, less data is better. The less data you need to query, manipulate, return and display the quicker your report is going to load. There were some immediate quick wins we took here.

  1. We reduced the default time range across all reports to 2 weeks
  2. We reduced the maximum allowed time range to a quarter

These two changes immediately and drastically improved our latency numbers. Many reports had initial time ranges of a quarter, which queried far more data than could even fit on a single page. By limiting the maximum time range to a quarter, we prevented some users from asking for years’ worth of data, which had detrimental effects on our database.
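A minimal sketch of how such guardrails might be enforced server-side (the constants and method name are hypothetical, shown only to illustrate the idea):

# Hypothetical guardrails; constants and method name are illustrative.
DEFAULT_RANGE = 2.weeks
MAX_RANGE     = 3.months # roughly a quarter

def resolve_time_range(start_time, end_time)
  end_time   ||= Time.current
  start_time ||= end_time - DEFAULT_RANGE # default to the last 2 weeks
  earliest     = end_time - MAX_RANGE     # clamp the window to at most a quarter
  start_time   = earliest if start_time < earliest
  [start_time, end_time]
end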

Paging

Most of our reports are paged; that is, we don’t return the full dataset in a single API call. Instead, we break it down into pages and return only a single page of information at a time, which is then rendered in the frontend. This is common practice in almost all APIs. There were some reports where we weren’t doing this but could have been.

We have a geofence report (see here for information about our geofences) that tells you which entities were in a particular geofence and for how long. This report was not paged, likely because most of our clients don’t even have 50 geofences (50 is our default page size). However, each geofence can contain hundreds of entities, so even though a client might only have 25 geofences, we’d be loading information for potentially thousands of entities. We added paging with a default size of 5 geofences for this report and it immediately improved performance.

Another example was our distance summary report, which is useful for IFTA reporting purposes. It displays a list of each jurisdiction a vehicle traveled in for the selected period of time, along with how much distance was covered and fuel purchased within it. Initially this report was paged not in the database but in memory, because we wanted to be able to sort on some of the values across the entire dataset. However, we also group the jurisdictions within a vehicle object, so the sorting only happens within a vehicle. This meant we could actually page at the database level and pull out only a page’s worth of vehicles and their associated IFTA information instead of the entire fleet at once.

Unless you are required to pull all data into memory for some reason, paging should be implemented at the DB level on every report.
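A hedged sketch of what DB-level paging looks like for the distance summary case above, with hypothetical model and column names: page over vehicles first, then load the per-jurisdiction rows only for that page.

# Hypothetical sketch: page on the primary entity at the database level.
page      = (params[:page] || 1).to_i
page_size = (params[:page_size] || 50).to_i

# Only one page of vehicles is pulled from the database...
vehicles = company.vehicles
                  .order(:number)
                  .limit(page_size)
                  .offset((page - 1) * page_size)

# ...and the jurisdiction/IFTA rows are loaded only for those vehicles.
rows = JurisdictionDistance.where(vehicle_id: vehicles.select(:id),
                                  date: start_time..end_time)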

Sorting and Filtering

Nearly every report has the ability to filter on some of the data, as well as to sort on the columns being reported on. This doesn’t have to be a performance issue, but it certainly can be. There are a few things to be aware of.

We restrict the columns that can be sorted and filtered, since filtering and sorting on columns that are not indexed in the database can have performance implications. Sorting has its own issues: certain values may be translated into different locales in the frontend while being stored as a different value (the default language, English) in the database. The easiest solution is to sort the entire dataset in memory and then return the sorted page; however, this is far from performant. A better option was creating a SQL helper function that sorts with a custom column ordering, which allows us to sort on translated columns at the DB level.
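As one way of expressing that at the database layer (not necessarily our exact helper, and the column and values are assumed for illustration), Postgres’s array_position lets you order rows by an explicit list, where the list holds the stored English values arranged in the translated sort order:

# Stored English values, arranged in the order their Spanish translations sort.
# The list is illustrative; in practice it would be derived from the locale.
locale_order = ["Baja California", "California", "Nevada"]

quoted = locale_order.map { |j| ActiveRecord::Base.connection.quote(j) }.join(", ")
DrivingPeriod.order(
  Arel.sql("array_position(ARRAY[#{quoted}]::text[], driving_periods.jurisdiction)")
).limit(50)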

Restrict API Calls

We restrict the number of public API calls allowed from a single IP with a bottleneck service, which helps prevent malicious (DDoS) or accidental degradation of our system from spammed API calls. From within our own application the same thing was happening: we were spamming our internal APIs. The solution we implemented here is debouncing. Debouncing means removing unwanted input noise from buttons, switches or other user input. So as the user clicks the filter values they want, we don’t fire the request for data until a certain amount of time has passed, avoiding repeated calls that bring back data the user doesn’t want. Ideally the debounce value is customized per report, as each report’s query can vary in how painful it is to execute.

Multiple Data sources

The approach we take when dealing with data from multiple sources is to treat one datasource as the primary, on which we filter, sort and page, and then augment that data with the additional data sources. This at least allows you to perform some of the operations at the database layer. If this approach is unavailable, the options described below may be of use.
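A rough sketch of that pattern (the secondary store and its lookup are hypothetical):

# Filter, sort and page on the primary datasource (Postgres)...
periods = DrivingPeriod.joins(:vehicle)
                       .where(vehicles: { company_id: company_id })
                       .where(start_time: start_time..end_time)
                       .order(:distance)
                       .limit(page_size)
                       .offset((page - 1) * page_size)

# ...then augment only this page's rows with data from the secondary source.
extra = secondary_store.fetch_metrics(periods.map(&:id)) # e.g. a DynamoDB lookup, hypothetical
rows  = periods.map { |p| p.attributes.merge(extra[p.id] || {}) }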

Further Optimization Options

If you have done everything you can to reduce the amount of data being fetched and are still having some performance issues, there are some other options that are available to you.

Upstream processing

At Motive, we have a great team of data platform engineers who manage our data ingestion and exposure. They have built a great tool called TableAPI, described here. The basic premise is to pre-compute your data before you report on it, so you don’t need to pull from multiple data sources or do intensive computations in memory. There is a slight delay because pipelines run at a predetermined cadence, e.g. every 6 hours, so any data collected since the last run will not be present. If that isn’t a deal breaker, this is the recommended option when you need good performance.

Skip the ORM

Most of our reports are built within our Ruby on Rails application, in which we use ActiveRecord to make our day-to-day lives easier. ActiveRecord, like any ORM (Object Relational Mapper), makes the mapping between tables in your database and the models in your application nearly seamless. There is a cost involved in hydrating the models (the act of creating objects out of the data in the database) in order to interact with them. When dealing with writes, it can make your work a breeze: simply modify properties and save the object to persist changes to the database. However, since we’re looking purely at reporting, we are only reading these objects and, in most cases, simply asking them for their underlying data, and usually not all of it. The benefit of hydrating these objects isn’t there anymore.

Let’s test out this theory and see if it holds.

Pure object loading

Benchmarking was done with the benchmark gem.

require "benchmark"

def benchmark_loading_with_orm
  time = Benchmark.measure do
    # .to_a forces the lazy ActiveRecord relation to actually run the query
    driving_periods = DrivingPeriod.joins(:vehicle)
                                   .where(vehicles: { company_id: 160 })
                                   .starting_between(Date.parse("2022-01-01"), Date.parse("2022-05-01"))
                                   .order(:distance)
                                   .limit(50)
                                   .offset(50)
                                   .to_a
  end
  puts time
end

def benchmark_loading_via_sql
  time = Benchmark.measure do
    # Same filter, sort and page as the ORM version, bypassing model hydration
    driving_periods = ActiveRecord::Base.connection.execute("
      SELECT driving_periods.* FROM driving_periods
      INNER JOIN vehicles ON vehicles.id = driving_periods.vehicle_id
      WHERE vehicles.company_id = 160
        AND (driving_periods.start_time BETWEEN '2022-01-01' AND '2022-05-01')
      ORDER BY driving_periods.distance ASC LIMIT 50 OFFSET 50")
  end
  puts time
end

To eliminate the effect of caching, we ran the following commands before each run of these functions:

  • arion restart — restarts the database container to remove any Postgres-level caches
  • bin/spring stop — stops Spring caching

Benchmarks

Query execution speed

So using the ORM is about twice as slow as using pure SQL from a time perspective.

We also ran the above using the memory_profiler gem.
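For reference, profiling a query block with memory_profiler looks roughly like this (the query mirrors the ORM benchmark above):

require "memory_profiler"

report = MemoryProfiler.report do
  DrivingPeriod.joins(:vehicle)
               .where(vehicles: { company_id: 160 })
               .order(:distance)
               .limit(50)
               .to_a # force the relation to load so allocations are captured
end

report.pretty_print # prints allocated vs retained memory and object counts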

Query execution memory consumption

Retained: long-lived memory use and object count retained after the code block executes.

Allocated: all memory and object allocations made during the code block.

In terms of memory consumption, the ORM takes about 2x as much memory to execute the query and about 1.4x as much memory just to keep the DrivingPeriod objects in memory.

Report without ORM

We picked the Driving Event Details by Group report, as it suffered from some SQL performance issues. For testing we created two versions: one sorting and paging in memory, and the other sorting and paging at the database layer. We tested both variations, once using the ORM and again bypassing it and using SQL, for a total of 4 test cases.

We hit the endpoints with 2 different queries, one for 6 months and the other for a year. These are both larger time ranges than we will permit going forward, but they help illustrate the memory and time differences.

We sorted on distance ascending, which is an unindexed column on the driving_periods table.

Report execution time benchmarks

While SQL is faster, as long as we sort and page in the DB layer it’s still fairly performant using ActiveRecord, and probably not worth switching to pure SQL.

For reports that require loading the entire dataset into memory, SQL is the better option for both speed and memory consumption.

Metrics

Before beginning this work we investigated a few of our worst-performing reports and took a snapshot of their loading times prior to the optimization work. We use latency percentiles to analyze our reports’ performance. Here’s a before-and-after look at the performance increase; pre-optimization values were taken with a one-week time window and post-optimization with a two-week time window:

Vehicle Inspection Detail

Vehicle Inspection Detail Report Metrics

Geofence Inventory

Geofence Inventory Report Metrics

Trips Detail

Trips Detail Report Metrics

Form & Manner Error Details

Form & Manner Error Details Report Metrics

Logs without DVIR

Logs Without DVIR Report Metrics

Distance Report by Vehicle

Distance Report by Vehicle Metrics

These results speak for themselves: anywhere from a 5x to 20x improvement in load times.

Summary

Reporting performance is always challenging. No matter how much you optimize, it seems there are always requests for which it could be faster. Many of the issues here were dealt with through minor fixes that drastically improved the performance of most of our reports quickly and fairly effortlessly. Your experience may vary. To keep us from repeating the same mistakes, we created a reporting checklist:

  • Report only on entities that have data
  • Keep default time range low (e.g. 2 weeks)
  • Maximum time range is a quarter
  • Sorting is done in database layer and on indexed columns
  • All translations (e.g. language translations for strings) done in backend
  • Paging done in database layer
  • No N+1 queries
  • Set default values for filters if possible

When pulling data from two or more sources:

  • Have a primary source and use it to filter, sort and page. Then grab additional data from the other sources for only the page_size number of entities.
  • Create your data upstream using TableAPI or roll ups
  • Use SQL directly instead of the ORM to fetch your data
