A Scalable Reporting Infrastructure at Motive

Henry Phuong
Published in motive-eng
12 min read · Apr 14, 2022

Motive processes data from many sources, among them the Vehicle Gateways used to monitor vehicle location, driver safety, and other activity, and provides customers with reporting features for their fleets. Our backend is a Ruby on Rails (RoR) monolith, which hosts not only our reporting service but also most of the processing for the many features we provide to our customers. As we extract business capabilities from the monolith, we are beginning with the reports feature. Read on to learn how we migrated reporting from OLTP to OLAP, reduced the computational load on the backend, and gave our customers the ability to customize their reports.

Background

Motive provides essential reporting features to our customers so they can get a clear picture of their fleet through the data we collect from vehicle gateways, dual-facing dashcams, or Asset Gateways. We host our reporting service on our monolithic Rails backend, where we also process API requests, scheduled tasks, and asynchronous jobs.

Currently, the great majority of our reports are served via API and on demand. Although our Rails backend is very scalable, not every report needs to burden the underlying relational database; many can be processed in a separate data pipeline instead. In fact, queries that join just a few thousand rows can result in tens of seconds of latency.

We are in the process of extracting business capabilities from the monolith. Splitting up the monolith into separate services has several benefits including faster deployment, better fault tolerance, and greater autonomy for teams.

The reports feature is an obvious candidate for extraction from the Rails monolith for a few reasons tied to its product requirements. The larger data volumes over longer time periods, and the need to serve historical data, are strong motivators for using our data warehouse rather than the Rails backend.

We aim to do the following:

  1. Migrate from a transactional OLTP database to an OLAP database
  2. Reduce the computational load on the Rails backend
  3. Unlock the capability for customers to customize their reports

OLTP to OLAP

Our Rails backend uses a PostgreSQL database. PostgreSQL is conventionally classified as an OLTP (online transaction processing) database. OLTP queries typically touch small amounts of data, often just a few records, and OLTP databases are designed to serve the most up-to-date application data.

Our data warehouse uses Snowflake, an OLAP (online analytical processing) database. OLAP workloads are characterized by aggregations over many records, often including historical events or less recent data. This access pattern is well suited to business analytics.

Our reports, which aggregate historical data, are better suited to our data warehouse than to our Rails backend because the data freshness requirements of our historical reports are less strict.

Reducing Computational Load on Our Rails Backend

Keeping reports in the Rails backend created compute contention with existing features that have stricter latency requirements, such as Fleet View, a dashboard that provides a geographic overview of a company’s vehicles and assets.

Breaking reports off the monolith and into our dedicated data platform gives us more control over compute allocation and analytics, leading to improved scalability.

Custom Reports

To let our customers self-serve their own reports, we selected Tableau as the best candidate. It features a simple drag-and-drop interface for building reports, beautiful visualizations, a strong community, and customer support services.

We licensed our Tableau Server cluster for 16 cores, which we provisioned across 3 nodes (1 initial node and 2 additional nodes), because clusters with fewer than 3 nodes don’t support failover or high availability.

We implemented a ticket broker service to extend our Rails backend authentication to Tableau: the broker obtains an authentication ticket that the client then redeems against the Tableau Server, as sketched below.
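
A minimal sketch of such a broker, assuming Tableau Server’s trusted authentication flow; the host name, Flask wiring, and endpoint path are illustrative, not our actual service:

    # Hypothetical trusted-ticket broker; Tableau Server's /trusted endpoint
    # issues a short-lived ticket for a username it trusts us to vouch for.
    import requests
    from flask import Flask, abort, jsonify

    app = Flask(__name__)
    TABLEAU_SERVER = "https://tableau.example.internal"  # illustrative host

    @app.route("/tableau_ticket")
    def tableau_ticket():
        # In the real service, the username comes from the authenticated
        # Rails session; it is hard-coded here for brevity.
        username = "driver_report_viewer"
        resp = requests.post(f"{TABLEAU_SERVER}/trusted", data={"username": username})
        ticket = resp.text.strip()
        if ticket == "-1":  # Tableau returns -1 when it refuses to issue a ticket
            abort(403)
        # The frontend redeems the ticket by loading
        # {TABLEAU_SERVER}/trusted/{ticket}/views/<workbook>/<view>
        return jsonify({"ticket": ticket})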

Summary

We set off on our journey with an architecture in mind:

  • Adopt Snowflake, our data platform’s OLAP data warehouse solution.
  • Process our data using TableAPI, our data platform’s homegrown ETL tool, which supports SQL, PySpark, and pandas.
  • Surface this data to our users with Tableau, which we will embed into our own web interface.

In addition, we were warned that if Snowflake couldn’t service queries within our latency requirements, we might need to drop a PostgreSQL instance in between the BI tool and Snowflake. This turned out to be a valuable warning; thankfully, we acted on it early.

Our first attempt at such a data pipeline report was the “Historical Dispatch Summary” report. There are far more historical dispatch records than active dispatches, and the data freshness requirements of this report are more lenient than those of our more responsive features.

Dispatch Summary Pipeline

Dispatch and Dispatch Summary Report Background

A dispatch is an abstraction describing a vehicle delivering freight from a shipper (start) to a consignee (finish). There can be multiple enumerated dispatch stops per dispatch. The shipper, consignee, and dispatch stops all have associated dispatch locations (geographic information). Fleet administrators or dispatchers create dispatches and assign them to drivers and vehicles. The status of a dispatch is indicated by a dispatch form object: an incomplete dispatch form indicates that the dispatch stop is still in progress, while a completed form indicates that the stop is done.

The dispatch summary report aggregates dispatch information into actionable information for fleet administrators.

Examples include:

  • What percentage of dispatches are successfully completed?
  • Who are the most punctual drivers?
  • Who are the least punctual drivers?

Illustration of the dispatch summary report

Data Pipeline Overview

The data pipeline starts when a dispatcher creates a new dispatch. This dispatch data is persisted in our PostgreSQL database. Then a Fivetran connector (a managed data integration service) syncs the data from PostgreSQL to Snowflake. From there, periodic TableAPI (PySpark, pandas) ETL jobs transform the data as necessary (for example, averaging, or splitting data into buckets) and populate another Snowflake table.
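
As a rough illustration, the transformation step might look like the following pandas job; the connection parameters, table names, and columns are hypothetical, and we use the Snowflake Python connector directly where TableAPI would normally do the plumbing:

    import snowflake.connector
    from snowflake.connector.pandas_tools import write_pandas

    # Illustrative credentials; TableAPI manages these for real jobs.
    conn = snowflake.connector.connect(
        account="motive", user="etl_user", password="...",
        warehouse="REPORTING_WH", database="ANALYTICS", schema="DISPATCH",
    )

    # Pull the raw dispatch rows that Fivetran replicated from PostgreSQL.
    df = conn.cursor().execute(
        "SELECT driver_id, status, scheduled_at FROM dispatches"
    ).fetch_pandas_all()

    # Example transformation: per-driver completion rate, bucketed by month.
    df["MONTH"] = df["SCHEDULED_AT"].dt.to_period("M").astype(str)
    summary = (
        df.groupby(["DRIVER_ID", "MONTH"])["STATUS"]
        .apply(lambda s: (s == "completed").mean())
        .reset_index(name="COMPLETION_RATE")
    )

    # Populate the downstream Snowflake summary table.
    write_pandas(conn, summary, "DISPATCH_SUMMARY", auto_create_table=True)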

In our case, the data takes one more downstream step into a PostgreSQL table, because queries against PostgreSQL have faster response times than queries issued directly against Snowflake. The Tableau Server then queries Postgres and performs additional transformations. Finally, our front end consumes from Tableau Server and populates the report pictured above.

The two figures below illustrate the transformation from our previous reporting workflow to our new dispatch summary workflow.

Existing reports pipeline
Dispatch summary report data pipeline

We encountered multiple challenges during our implementation. These challenges are categorized broadly as (1) data model challenges, (2) query latency challenges, and finally (3) challenges embedding Tableau into our web interface.

Data Model Challenges

Although data pipelines have clear design benefits, once your data resides in a data warehouse, it loses the power of the web framework’s ORM (object-relational mapping). In our case, that was ActiveRecord.

Typically, a web backend developer takes for granted the tools that enrich their tables with associations and methods, while a data pipeline developer sees their world as flat tables. The implementation has to bridge this gap. We needed to restructure two outstanding data models for analytical purposes: the dispatch model and the group model.

Dispatch Model

The Rails data model for dispatches is highly normalized and contains many methods that determine a dispatch object’s attributes or state. For example, whether a dispatch is started, in progress, or completed depends on the state of a dispatch form entry object. This is distinct from the progress of a dispatch stop, which has its own series of forms. Because it would be very onerous to reimplement this logic off flat tables in PySpark, we simply persist these computed attributes in PostgreSQL, even though the Rails application doesn’t make use of them. The data pipeline then picks up the persisted values and performs its transformations without the complex logic of joining all those tables together.
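
To make the difference concrete, here is a hedged PySpark sketch; the table and column names are invented for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Assume the replicated tables are registered in the catalog.
    dispatches = spark.table("dispatches")
    form_entries = spark.table("dispatch_form_entries")

    # Without the persisted attribute, the pipeline would have to rebuild the
    # Rails status logic by joining dispatches to their form entries:
    in_progress = (
        dispatches.join(form_entries, dispatches.id == form_entries.dispatch_id)
        .where(form_entries.completed_at.isNull())
        .select(dispatches.id)
        .distinct()
    )

    # With the status persisted by Rails at write time, the same question
    # becomes a single-column filter on the replicated table:
    in_progress = dispatches.where(dispatches.status == "in_progress")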

This solution has an important property: the historical dispatch data can be processed without additional filters. In the next section, we address the challenges of filtering by groups.

Groups

Groups are abstract entities that polymorphically group together other entities in our data model. Groups implement authorization: users in a group have access to the other entities in that group. For example, a group can classify zero, one, or many Users, Vehicles, etc. Because groups are polymorphic, they can contain any entity (any table), as defined by an entity’s primary key and entity type. The typical query is something along the lines of: “return rows from the dispatch summary filtered by these groups.”

In Rails, having the full power of the ORM, along with ActiveRecord model associations and scopes, enables us to navigate groups and their associated features reasonably smoothly. The data pipeline must reimplement the groups filter to maintain feature parity.

To start, we need to establish that the groups parameter can be zero, one, or many groups. Supporting one or many is straightforward: parse the groups parameter, resolve the groups into a list of drivers and vehicles, and join against the dispatch summary table.

However, if no groups are passed, the implicit inner join returns no rows. We therefore need to reimplement the join and short-circuit it for the case where no groups are passed in. Below is an example of this filter:

Dynamically filtering by groups
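
A minimal PySpark sketch of that dynamic filter; the membership table, entity types, and column names are illustrative:

    from pyspark.sql import DataFrame, SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    def filter_by_groups(summary: DataFrame, group_ids: list) -> DataFrame:
        # Short-circuit: an empty groups parameter means "no group filter",
        # not "no rows", so we must skip the inner join entirely.
        if not group_ids:
            return summary
        # Resolve the polymorphic memberships (entity type + entity ID) into
        # the drivers contained in the requested groups.
        drivers = (
            spark.table("group_memberships")
            .where(F.col("group_id").isin(group_ids)
                   & (F.col("entity_type") == "Driver"))
            .select(F.col("entity_id").alias("driver_id"))
            .distinct()
        )
        # The inner join keeps only summary rows for drivers in those groups.
        return summary.join(drivers, "driver_id")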

Query Latency Challenges

We always expected Snowflake to introduce latencies we had not experienced with PostgreSQL, given the nature of their roles (OLAP versus OLTP). However, optimizing the query latency was more effort than we originally anticipated.

Snowflake provides some very nice analytical tools for examining query latency: the query profile interface and the information schema. We used the former to address per-query latency, and the latter to aggregate latencies and verify that the changes we were making were having the desired impact. We categorized query latency into queued latency, compilation latency, and execution latency.
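
For the aggregate view, a query along the following lines breaks total latency into its components; this is a sketch against Snowflake’s INFORMATION_SCHEMA.QUERY_HISTORY table function, with an illustrative warehouse name:

    # Latency components (in milliseconds) over recent queries, run through
    # the Snowflake Python connector.
    import snowflake.connector

    conn = snowflake.connector.connect(account="motive", user="analyst", password="...")
    rows = conn.cursor().execute("""
        SELECT warehouse_name,
               AVG(queued_overload_time) AS avg_queued_ms,
               AVG(compilation_time)     AS avg_compile_ms,
               AVG(execution_time)       AS avg_exec_ms
        FROM TABLE(information_schema.query_history(result_limit => 10000))
        WHERE warehouse_name = 'REPORTING_WH'  -- illustrative warehouse
        GROUP BY warehouse_name
    """).fetchall()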

Snowflake Queued Latency

Our Snowflake platform was initially configured to process dispatch summary queries alongside other ETL queries. As a result, there were long queuing delays, sometimes adding minutes to the query latency.

Snowflake is designed to scale compute and storage separately. Provisioning a new virtual warehouse for reporting queries lowered the queued latency. Pictured below are the “before” and “after” graphs of the various delays (queuing, compilation, execution) in comparison. The queuing delay evidently makes up a much smaller share of the overall delay.

Queuing delay (pictured in green) BEFORE
Queuing delay (pictured in green) AFTER
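
Provisioning the dedicated warehouse is a one-time setup along these lines; the name and sizing are illustrative:

    # Snowflake DDL for a dedicated reporting warehouse, executed through the
    # Python connector; sizing and suspend policy are illustrative.
    import snowflake.connector

    conn = snowflake.connector.connect(account="motive", user="admin", password="...")
    cur = conn.cursor()
    cur.execute("""
        CREATE WAREHOUSE IF NOT EXISTS reporting_wh
          WAREHOUSE_SIZE = 'MEDIUM'
          AUTO_SUSPEND = 60  -- seconds of inactivity before suspending
          AUTO_RESUME = TRUE
    """)
    # Point report queries at the new warehouse so they no longer queue
    # behind ETL jobs on the shared one.
    cur.execute("USE WAREHOUSE reporting_wh")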

Snowflake Compilation and Execution Latency

Using the query profiler, we identified the compilation and execution bottlenecks that were contributing, in some cases, tens of seconds to our query latency.

One example of execution delay was a table scan against the users table, which was subsequently joined against our dispatch summary table. In Rails, our users table is associated with our companies table via a join table. As a result, the foreign keys linking companies and users reside in neither table (they reside in the join table instead). The solution was to use clustering keys.

Our investigation into the benefits of clustering started with a new users table that included the company ID foreign key, populated by joining the users table with the users-companies join table. Snowflake organizes data into micro-partitions as it is inserted into the table, which can result in an inefficient set of micro-partitions for a given query pattern. To force good clustering, we sort the rows before insertion; Snowflake then partitions the table automatically. Sure enough, we reduced the partitions for the new users table to about 16, in contrast to ~60 (a ~75% reduction).
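
A sketch of that restructuring, run through the Python connector; the table and key names are illustrative:

    import snowflake.connector

    conn = snowflake.connector.connect(account="motive", user="etl_user", password="...")
    cur = conn.cursor()

    # Build a users table that carries the company_id foreign key, sorting on
    # it so rows for the same company land in the same micro-partitions.
    cur.execute("""
        CREATE OR REPLACE TABLE users_by_company AS
        SELECT u.*, uc.company_id
        FROM users u
        JOIN users_companies uc ON uc.user_id = u.id
        ORDER BY uc.company_id
    """)

    # Declare the clustering key so Snowflake maintains the layout over time.
    cur.execute("ALTER TABLE users_by_company CLUSTER BY (company_id)")

    # Inspect how well the table is clustered on that key.
    cur.execute("SELECT SYSTEM$CLUSTERING_INFORMATION('users_by_company', '(company_id)')")
    print(cur.fetchone()[0])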

Although the table scan was no longer a problem, the queries still took longer than 3 seconds. Our data was still normalized across a couple of data models in Snowflake, requiring multiple joins. We couldn’t bring the compilation delay down any further without remodeling and flattening the data, and changing the data model further was at odds with the business requirements of our dashboards.

At this point, we had hit the lower bound of compilation and execution latency achievable with Snowflake. This delay was unacceptable for powering our user-facing dashboard.

Fortunately, the entire time we were conducting these experiments on Snowflake and working with Snowflake support representatives, we were also investigating a downstream PostgreSQL database.

Using PostgreSQL to Serve the Data Instead of Snowflake

The benefit of aggregating our data in Snowflake and then serving the report from Postgres is that we get the big-data processing capability of Snowflake along with the low query latency of Postgres. Instead of relying on a single Snowflake clustering key, we can use multiple Postgres indices. We simply need to propagate the aggregated data into Postgres and have Tableau use Postgres, rather than Snowflake, as its data source.
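
For example, the serving table can carry one index per common filter pattern, which a single clustering key cannot offer; a sketch with illustrative table, column, and index names:

    # Postgres DDL for the downstream serving table, via psycopg2.
    import psycopg2

    conn = psycopg2.connect("dbname=reporting")
    with conn, conn.cursor() as cur:
        # One index per frequent access path of the dashboard queries.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_summary_company_month
            ON dispatch_summary (company_id, month)
        """)
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_summary_driver
            ON dispatch_summary (driver_id)
        """)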

Snowflake View

The first step in the interface between the Snowflake table and the downstream Postgres table was to have the pandas ETL job consume from a Snowflake view rather than from the normalized tables replicated by Fivetran. This reduces the complexity of the resulting Postgres query from a multi-way join over normalized tables, roughly like this (table names are illustrative):
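
    # Before: the Postgres query has to reassemble the normalized model
    # (an illustrative reconstruction; names are representative only).
    QUERY_BEFORE = """
    SELECT d.id, dr.name AS driver, c.name AS company,
           COUNT(ds.id) AS stop_count
    FROM dispatches d
    JOIN dispatch_stops ds  ON ds.dispatch_id = d.id
    JOIN drivers dr         ON dr.id = d.driver_id
    JOIN users_companies uc ON uc.user_id = dr.user_id
    JOIN companies c        ON c.id = uc.company_id
    GROUP BY d.id, dr.name, c.name
    """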

To something like this:
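
    # After: the Snowflake view has already flattened the joins upstream
    # (again, an illustrative reconstruction).
    QUERY_AFTER = """
    SELECT id, driver, company, stop_count
    FROM dispatch_summary
    """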

In addition, this reduces the number of tables propagated to just the essentials. Anything that is dynamically queried, such as groups, still needs its associated tables propagated, because those tables are volatile and can change at any time.

Optimizing Query Performance

Our initial tests on our staging environment were positive. The PostgreSQL database was serving the Tableau queries and powering our frontend at acceptable speeds (about 1 second on PostgreSQL, as opposed to ~3 seconds on Snowflake).

However, our project soon experienced some scope expansion, and we used Postgres EXPLAIN to identify a query built around a very expensive join. We solved this by having the frontend compose this data from the Rails server via API calls, instead of generating it in the pipeline query. Here, the benefit of processing groups via our Rails API outweighs the cost of coupling back to the Rails server. The update brought our delay from tens of seconds down to around one second.
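
EXPLAIN ANALYZE is what surfaces this kind of problem; a sketch of the diagnostic step, with an illustrative query:

    # Print the query plan with actual timings to spot the expensive join.
    import psycopg2

    conn = psycopg2.connect("dbname=reporting")
    with conn.cursor() as cur:
        cur.execute("""
            EXPLAIN (ANALYZE, BUFFERS)
            SELECT s.*
            FROM dispatch_summary s
            JOIN group_memberships g ON g.entity_id = s.driver_id
        """)
        for (line,) in cur.fetchall():
            print(line)  # look for nested loops or seq scans with huge row counts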

Embedding Tableau Challenges

Our initial investigations into integrating Tableau reports into our web interface presented us with a spectrum of options. At one end, we could directly embed the Tableau report. At the other, we could use Tableau’s APIs to retrieve the report data and implement our own formatting.

The trade-off for the simplicity of embedding the reports directly into the web interface is that the Tableau report styling conflicts with our existing UI design (the styling customizability offered by Tableau didn’t suffice for us). Using Tableau’s APIs offered increased flexibility for data and visualization formatting, but at the cost of having to build our own custom UI components.

In the end, we took a hybrid approach. Where we needed the most flexibility, we custom-built the components in Angular and D3.js. Otherwise, we used what Tableau had available. In both cases, we used the Tableau JS API to fetch and filter the report data in the web frontend.

Conclusion and Future Work

Despite all of the challenges we encountered in our journey to separate from the monolith, we succeeded in developing a more scalable pipeline to process our historical dispatch summary report. Our pipeline processes ~100k rows every 6 hours. At the lowest granularity of one year, our report aggregates ~750k rows’ worth of data into a few views. By contrast, joining even a few thousand rows on our Rails backend requires tens of seconds of API response time. However, our ambitions do not end here.

We’re looking forward to experimenting with alternative implementations, such as using Tableau data extracts as a substitute for the entire downstream Postgres database and pipeline. Tableau data extracts periodically query and store a copy of their data source, and user-initiated queries read from the copy instead of going to the data source on demand. This trades freshness (because the copy is only periodically updated) for speed, as the entire denormalized table is available. A drawback of this solution is that the Tableau cluster needs to be scaled up, and although Tableau Server is horizontally scalable, we didn’t find that to be very cost-effective for our team.

Another implementation to explore is denormalizing the Snowflake data earlier in the pipeline. Instead of denormalizing at the Postgres table, we could join together the dispatch summary data as part of the main dispatch summary ETL. Our initial report pipeline ambition was to keep the data normalized so that end users could join tables of their own choosing to produce custom reports in Tableau, which has a powerful drag-and-drop interface for visualizing table joins. Joining the data as part of the ETL would likely mitigate the Snowflake execution delay; in future work, we plan to measure the improvement from statically joining the report data. In addition, Snowflake is constantly improving query latency with better caching, and our next efforts will benefit from these improvements.

Come Join Us!

Check out the latest Motive opportunities on our Careers page to join our rad engineering team.
