How We Reduced DB Load With Our Language Agnostic Point Query Cache

Gautam Roy · motive-eng · Aug 8, 2022 · 14 min read

Motive collects millions of rows of data about vehicles, drivers, departures, arrivals, and driving events — to name just a few categories. We store most of this data in a heavily accessed PostgreSQL database, which supports many types of queries and is a key bottleneck in a system that continues to scale as our company grows. In this post, we talk about how we arrived at a caching solution that reduced the daily query load on our hardest hit tables from about 100 million calls down to 5 million calls.

A key part of scaling a complex system is identifying high-load workloads that can be handled independently. When we looked closely at our database queries, we observed tens of thousands of point queries (SELECT queries with a single equality condition in the WHERE clause) per second hitting our key tables. Such queries were growing rapidly from both our Ruby on Rails system and our Golang services. We designed a cache that offloads these queries from the database.

Our Constraints

Our main PostgreSQL database runs on AWS x1.32xlarge instances with 128 vCPUs and about 2 TB of memory. This is a significant configuration for a database, and it has been growing quickly in both size and query load. One of the techniques we used to scale it is the caching system described below.

Our caching solution was bound by five known architectural constraints:

  • Much of our current system runs on Ruby on Rails (RoR). We had to work within the constraints of this system, such as the model representation in ActiveRecord.
  • In addition to RoR and ActiveRecord, we needed a solution that works with Backburner. We use Backburner heavily for asynchronous job processing within our RoR system.
  • We have a fast growing Golang ecosystem. Many of these systems need access to some of the models in the main Postgres database.
  • We had to take traffic growth into account, so we wanted to build a system that could handle a 10x increase in load over the next few years.
  • Performance: Queries hitting the cache needed to be at least as fast as hitting the DB.

We are always evaluating new technologies and wanted the ability to support more programming languages; therefore, we needed a language agnostic solution. For this reason we had to reject existing open-source options like Shopify’s IdentityCache, a cache for Ruby on Rails ActiveRecord.

Our Results

The results were indisputable: our Datadog dashboard showed the elimination of many expensive, high-frequency queries. We saw similar results across all cached tables; the biggest impact was on the Users table, which bears the highest load. Our cache reduced its daily query load from about 100 million calls down to 5 million calls:

Dashboard showing that our cache reduced the daily query load from ~ 100M calls to ~5M calls

Our query latency also improved dramatically. The table below illustrates the difference between queries hitting a cache and queries hitting the database. (Remote and local caches are defined further in the document).

Difference between queries hitting a cache and queries hitting the database

Our Cache Hit Ratio Averages 98.7%

Our cache hit ratio averages 98.7%

Our AWS cache.r5.large Machine Uses an Average of 1.58 GiB of Memory

Our AWS cache.r5.large machine uses an average of 1.58 GiB of memory

Our Approach

When we analyzed the query load on our DB, we saw that about 30% of our queries were point queries to some key tables. We designed our cache to offload the DB from such queries. This keeps the DB less busy and frees it up for other work, enabling the system to scale.

A quick note: our system is a two-level cache (local and remote). While our design goals are to be language agnostic, we expect that each programming language will have its own local cache implementation tightly integrated with its ORM system. In our case this is ActiveRecord in RoR and GORM in Golang. Future language integrations will follow the same template. We go into details of the RoR caching system below to illustrate the key design elements.

Cache Design

Refer to our cache design diagram below. On the far right is our PostgreSQL database. On the far left are our servers (called “K2web”; they include both Puma API servers and Backburner background workers), which instantiate ActiveRecord objects. Between them sits a two-level cache: first, a local cache that lasts the duration of an HTTP request or Backburner job, and second, a remote cache in Redis. Because ActiveRecord uses a local cache, we wanted to mimic that behavior. Looking at the cache hit ratio graph above, about 30% of our total cache hits went to the local cache, saving us unnecessary network trips.

“K2web” Puma API server and Backburner background workers hitting our local and remote caches to reduce load on our PostgreSQL DB

ActiveRecord Integration

Our object models are in RoR and we use ActiveRecord as the object-relational mapping (ORM) framework. This meant that our caching system had to integrate tightly with ActiveRecord.

Our cache is available to an ActiveRecord model as an ActiveSupport Concern (a tool for including Ruby modules in classes, creating Ruby mixins). Cache instances are initialized just the way regular ActiveRecord objects are, and eviction is tied to the ActiveRecord callbacks like before_save, before_destroy, and after_commit.

The code snippet below showcases how we add caching capability to our models and define the lookup keys:

Adding caching capability to our models and defining the lookup keys
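A minimal sketch of what this looks like in the model layer (module and method names such as PointQueryCache and cache_lookup_keys are illustrative, not our exact internal API):

    # Hypothetical sketch: include the caching concern in a model and declare
    # which unique columns can be used as cache lookup keys.
    class User < ApplicationRecord
      include PointQueryCache

      # The primary key is always cached; additional unique columns become
      # extra lookup keys that point back to the primary-key entry.
      cache_lookup_keys :email, :auth_token
    end

    class Company < ApplicationRecord
      include PointQueryCache

      cache_lookup_keys :ein
    end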

The code below sets up the ActiveRecord callbacks:

Setting up the ActiveRecord callbacks
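A simplified sketch of how such a concern can wire eviction into the ActiveRecord life cycle (CacheStore is an illustrative stand-in for the caching framework’s entry point):

    # Hypothetical sketch: tie cache eviction to ActiveRecord callbacks.
    # Invalidating on after_commit ensures the eviction happens only after the
    # database transaction has actually committed.
    module PointQueryCache
      extend ActiveSupport::Concern

      included do
        before_save    :invalidate_cache_entry
        before_destroy :invalidate_cache_entry
        after_commit   :invalidate_cache_entry
      end

      def invalidate_cache_entry
        # Remove this row from both the request-local cache and Redis.
        CacheStore.invalidate(self.class, id)
      end
    end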

Caching API

Depicted below is our caching API’s mapping between RoR and the cache. These methods are dynamically generated by the ActiveSupport Concern, which constitutes the entry point for the caching framework.

Our caching API’s mapping between RoR and the cache
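Roughly, the generated entry points might look like this (a sketch; the find_by_<key>_cache naming is an assumption based on the declared lookup keys):

    # Hypothetical sketch: for each declared lookup key, the concern generates a
    # cache-aware class method, alongside cache-aware counterparts of the usual
    # ActiveRecord lookups, e.g.
    #   User.find(id)            -> User.find_cache(id)   (raises if not found)
    #   User.find_by(id: id)     -> User.try_cache(id)    (returns nil instead)
    #   User.find_by(email: ...) -> User.find_by_email_cache(...)
    module PointQueryCache
      extend ActiveSupport::Concern

      class_methods do
        def cache_lookup_keys(*columns)
          columns.each do |column|
            define_singleton_method("find_by_#{column}_cache") do |value|
              CacheStore.fetch(self, column => value)
            end
          end
        end

        def find_cache(id)
          CacheStore.fetch(self, id: id) || raise(ActiveRecord::RecordNotFound)
        end

        def try_cache(id)
          CacheStore.fetch(self, id: id)
        end
      end
    end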

Two Layers of Cache

Our K2web server’s requests first hit the local cache in the server’s own memory — inside the RoR system. If the rows in question are not found in the local cache, the request hits the remote cache next. If the rows are found in the remote cache, they are then copied into the local cache as well. When data isn’t found in the remote cache, the underlying database will be queried for it; upon successful return, the data is stored in both caches.
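In simplified Ruby, the read path described above could look like this (a sketch; LocalCache and RemoteCache stand in for the two layers):

    # Hypothetical sketch of the two-level read-through lookup:
    # request-local cache first, then Redis, then PostgreSQL.
    module CacheStore
      def self.fetch(model_class, conditions)
        key = "#{model_class.name}:#{conditions.to_a.flatten.join(':')}"

        # 1. Request-scoped local cache: no network hop at all.
        if (record = LocalCache.read(key))
          return record
        end

        # 2. Remote cache in Redis.
        if (record = RemoteCache.read(model_class, conditions))
          LocalCache.write(key, record) # promote into the local cache
          return record
        end

        # 3. Fall back to the database, then populate both caches.
        record = model_class.find_by(conditions)
        if record
          RemoteCache.write(record)
          LocalCache.write(key, record)
        end
        record
      end
    end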

Cache Invalidation

Consider that one API request can consist of many repeated calls; that is, the same database queries are often repeated multiple times within one API request. Mimicking ActiveRecord’s internal caching mechanism, the time to live (TTL) of each locally cached record is tied to the life cycle of the request. Moreover, the local cache is thread-local and hence isolated to the individual request being served. The first time a table row is accessed, it is copied into the local and remote caches. Once the request is complete, the local cache, implemented using the RequestStore RubyGem, is cleared. If the record was updated during request execution, we invalidate that table row (the record) in the remote and local caches. The row in question will not be repopulated until it is requested again, but because we receive thousands of requests per second, this typically does not take long!

If the record was updated during request execution, we invalidate that table row (the record) in the remote and local caches
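The local cache itself can be little more than a per-request hash; a sketch built on the RequestStore gem (the gem’s Rack middleware clears the store at the end of every request, and a similar hook can clear it when a Backburner job finishes):

    # Hypothetical sketch of the request-scoped local cache. RequestStore gives
    # us thread-local storage that is wiped automatically when the request ends,
    # so locally cached rows never outlive the request that created them.
    require "request_store"

    module LocalCache
      def self.store
        RequestStore.store[:point_query_cache] ||= {}
      end

      def self.read(key)
        store[key]
      end

      def self.write(key, record)
        store[key] = record
      end

      def self.delete(key)
        store.delete(key)
      end
    end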

Redis as Our Cache Datastore

We chose Redis, a key-value datastore provided by AWS ElastiCache, as our cache. Why?

  • We were already running multiple Redis clusters in our company, and had good operational expertise with it.
  • Redis had the right set of advanced features we wanted. Redis’ support for scripting allowed us to design multi-key lookups and batch fetching using single network fetches from the cache datastore.

Serialization With MessagePack

We chose a language agnostic serialization format that worked with both Golang and RoR (and any upcoming language choices). A key consideration was that the cache should be on par with or better than the DB in lookup latency; we didn’t want system responsiveness to degrade due to the introduction of caching. Thus CPU efficiency was important.

We chose MessagePack, which gave us the right tradeoffs on each of these dimensions:

MessagePack gave us the right tradeoffs on these dimensions
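As a rough illustration of the Ruby side (using the msgpack gem; the Golang services use a MessagePack library in the same way):

    # Hypothetical sketch: a row is cached as its attributes hash serialized with
    # MessagePack, which any language with a MessagePack library can decode.
    require "msgpack"

    user = User.find(42)

    # In practice, values such as timestamps need converting to
    # MessagePack-friendly types (strings or integers) before packing.
    payload    = user.attributes.to_msgpack   # compact binary blob stored in Redis
    attributes = MessagePack.unpack(payload)  # plain hash of column => value

    # One way to rehydrate an ActiveRecord object from the cached attributes.
    cached_user = User.instantiate(attributes)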

Design Highlight: Table Schema Hash to Facilitate DB Migrations

When a table schema changes (let’s say a new column is added), it necessitates a database schema migration from the old schema to the new schema. What should now happen in the cache?

One approach is to go through all the entries in the cache for that table, and edit the values to have the new column. The downside is that there would be a significant period where the cache is inconsistent while this update is happening.

The approach we used is to add a hash of the table schema to the row key. The server processes initiated after the database migration will start populating and looking up in a different part of the key space, whereas the server processes yet to be restarted since the migration will continue to see the data in the old key space. Eventually all the old values will stop being read and get evicted from the cache via the LRU mechanism.
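A sketch of how such a schema hash might be derived (the exact inputs to the hash are an implementation detail; any deterministic digest of the column definitions works):

    # Hypothetical sketch: derive a short, deterministic hash from the table's
    # column definitions. Any migration that changes the columns changes the
    # hash, which moves reads and writes into a fresh part of the key space.
    require "digest"

    module SchemaHash
      def self.for(model_class)
        schema = model_class.columns.map { |c| [c.name, c.sql_type] }.sort.to_s
        Digest::SHA1.hexdigest(schema)[0, 12]
      end
    end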

Design Highlight: Multi-Key Lookup

We built multi-key lookup capabilities into our cache. To understand the benefits of our design, let’s first discuss its poorer alternative, single-key lookup.

Single-Key Lookup

Our cache key is calculated from the row ID. Using our Company table as an example, the key is derived from the company ID. This key maps to a Redis hash whose field is the table-schema hash (described above) and whose value is the hundred or so columns that make up the company object, stored in serialized MessagePack format.

Thus, with a single-key lookup, a query must always look up company row data by the company ID (row ID).

Single-key lookup
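In Redis terms, the single-key layout might look like this (a sketch using redis-rb; key names are illustrative and SchemaHash comes from the sketch above):

    # Hypothetical sketch: one Redis hash per row, keyed by company ID. The
    # hash field is the table-schema hash and the value is the serialized row.
    require "redis"
    require "msgpack"

    redis       = Redis.new
    company     = Company.find(42)
    schema_hash = SchemaHash.for(Company)
    key         = "company:#{company.id}"

    # Populate the cache entry.
    redis.hset(key, schema_hash, company.attributes.to_msgpack)

    # Look up: always by company ID (row ID).
    row        = redis.hget(key, schema_hash)
    attributes = MessagePack.unpack(row) if row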

However, our Company table has more than 100 columns, and our business use cases include querying by unique fields other than company ID, such as its EIN (federal tax ID). To accommodate these use cases in a single-lookup design, we would have to replicate this enormous row in a new cache entry keyed by each of these additional lookup fields.

Multi-Key Lookup

The multi-key lookup approach provides multiple ways to query the same record in the cache. Our cache has multiple keys that point to the same value. As an example, let’s consider our User table, which is keyed on user ID, but which we also want to be able to query by the user’s email or by their authentication token or any column that has a unique constraint.

Multi-key lookup
  • User ID: Entries keyed by user ID work just as they would in a single-key lookup design: the key maps to a Redis hash whose field is the table-schema hash and whose value is the full row. So a row in the User table keyed by user ID has the serialized MessagePack row for that user as its value.
  • Email: We added the user’s email as an additional cache key. Entries keyed by email store only a small serialized hash containing the user ID, linking the email to the user record. To get the full table row you still look up by user ID, but this indirection means you can reach the entire row without knowing the user ID up front.
  • Authentication token: Same as with email, we added the user’s auth token as a key. A lookup by auth token returns the user ID, which then gives you access to the entire table row for that user.

This design leads to a compact cache layout and saves storage space: the entries for the additional lookup keys store only a tiny hash containing the user ID, which is much smaller than storing the full serialized row multiple times.
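Continuing the earlier sketches, the multi-key layout for a user might look like this (key names are illustrative; error handling omitted):

    # Hypothetical sketch: the primary entry (keyed by user ID) stores the full
    # serialized row; the secondary entries (keyed by email or auth token) store
    # only the user ID, acting as pointers back to the primary entry.
    schema_hash = SchemaHash.for(User)

    redis.hset("user:#{user.id}",               schema_hash, user.attributes.to_msgpack)
    redis.hset("user:email:#{user.email}",      schema_hash, { "id" => user.id }.to_msgpack)
    redis.hset("user:token:#{user.auth_token}", schema_hash, { "id" => user.id }.to_msgpack)

    # A naive lookup by email therefore takes two steps:
    pointer = MessagePack.unpack(redis.hget("user:email:#{user.email}", schema_hash))
    row     = redis.hget("user:#{pointer['id']}", schema_hash)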

Redis Scripts

Redis provides powerful scripting capability in Lua. You may have noticed that our additional key lookups provide a flexible way to access user data without passing the user ID. However, if implemented naively, they would still require a second round trip from our server to Redis to retrieve the actual data. We use Redis scripting to avoid that extra network round trip: the client passes us a user email, the RoR server makes a single call to Redis, and the script resolves the rest of the lookup inside Redis.
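A sketch of what such a lookup script could look like, embedded in Ruby and evaluated inside Redis (assuming the key layout from the earlier sketches, with redis, schema_hash, and email in scope; this is not our production script):

    # Hypothetical sketch: resolve email -> user ID -> serialized row in a single
    # round trip by letting Redis perform the second HGET itself.
    LOOKUP_BY_EMAIL = <<~LUA
      -- KEYS[1]: the email pointer key, e.g. "user:email:alice@example.com"
      -- ARGV[1]: the table-schema hash
      -- ARGV[2]: the key prefix for primary entries, e.g. "user:"
      local pointer = redis.call('HGET', KEYS[1], ARGV[1])
      if not pointer then return nil end
      local id = cmsgpack.unpack(pointer)['id']
      return redis.call('HGET', ARGV[2] .. id, ARGV[1])
    LUA

    row = redis.eval(LOOKUP_BY_EMAIL, keys: ["user:email:#{email}"], argv: [schema_hash, "user:"])
    user_attributes = MessagePack.unpack(row) if row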

We also use Redis scripting to populate the cache, and to update the schema when we change it by, say, adding a new lookup key:

Redis script to update the schema
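For illustration, a population script along these lines writes the primary entry and all of its lookup-key pointers in one atomic step, which is also how a newly added lookup key gets written going forward (again a sketch, not our production script):

    # Hypothetical sketch: write the primary row and its pointer entries
    # atomically so readers never see a half-written set of keys.
    POPULATE = <<~LUA
      -- KEYS[1]:    primary key, e.g. "user:42"
      -- KEYS[2..N]: pointer keys, e.g. "user:email:...", "user:token:..."
      -- ARGV[1]:    table-schema hash
      -- ARGV[2]:    serialized row
      -- ARGV[3]:    serialized pointer ({ id = ... })
      redis.call('HSET', KEYS[1], ARGV[1], ARGV[2])
      for i = 2, #KEYS do
        redis.call('HSET', KEYS[i], ARGV[1], ARGV[3])
      end
      return #KEYS
    LUA

    redis.eval(
      POPULATE,
      keys: ["user:#{user.id}", "user:email:#{user.email}", "user:token:#{user.auth_token}"],
      argv: [schema_hash, user.attributes.to_msgpack, { "id" => user.id }.to_msgpack]
    )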

Cache Consistency = Reliability

We have discussed what happens when database content is changed by requests originating from our RoR system. The cache records are immediately invalidated and repopulated from the database only when next requested. This ensures that the cache content is always up to date — as long as the changes are initiated from the RoR server side.

But what if database content is changed on the PostgreSQL end? If an engineer logs directly into our database and changes a row during some operational firefighting, it won’t trigger our RoR system to invalidate anything in the Redis caches. While such a scenario should be extremely rare, our system must be robust against it. We therefore implemented a configurable, scheduled external checker that runs multiple times a day, compares the cache contents to those of the PostgreSQL database, and invalidates any rows in the cache that don’t match the original. We also alert on such corrections so that the team can investigate any systemic issues.
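Conceptually, the checker is a scheduled job along these lines (a simplified sketch; RemoteCache, CacheStore, and Alerting are illustrative stand-ins, and the real job samples and batches rather than scanning whole tables):

    # Hypothetical sketch of the scheduled consistency checker: compare a random
    # sample of rows against PostgreSQL and invalidate any cached rows that differ.
    class CacheConsistencyChecker
      def self.run(model_class, sample_size: 1_000)
        corrections = 0

        model_class.order("RANDOM()").limit(sample_size).each do |record|
          cached = RemoteCache.read(model_class, id: record.id)
          next if cached.nil? || cached.attributes == record.attributes

          CacheStore.invalidate(model_class, record.id) # repopulates on next read
          corrections += 1
        end

        # Alert on corrections so the team can investigate systemic issues.
        Alerting.notify("cache_corrections", model: model_class.name, count: corrections) if corrections.positive?
      end
    end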

Language Agnostic Solution

Many of the design decisions we discussed above contribute to our system’s language agnosticism. Here we reiterate some of the key design choices:

  • Redis Publish Subscribe (PubSub): The RoR system “owns” the DB models in the cache, and can directly regenerate a new table schema hash when we run a database migration. We also have the convenience of being able to deploy the Rails servers whenever DB migrations are run. In contrast, our components written in Golang (or any other language) and our cache are independent systems, deployed individually. These components still read the same cache, so how can we communicate to a non-RoR component that an underlying table schema has changed?
    We leveraged the Redis PubSub mechanism, which is conceptually similar to Kafka but has the advantage of coming built into Redis. Components written in other languages listen for a signal that a table schema hash has changed and then reinitialize, using the new table schema hash (see the sketch after this list).
  • Redis as our cache datastore: While RoR owns our models and can change them at any time, another system must be able to understand these changes. Storing our cache in Redis gave us access to the features (scripting, PubSub) that made this flexibility possible.
  • Serialization with MessagePack: MessagePack is a language agnostic serialization format. There are of course plenty of other language-agnostic formats, such as JSON, XML, and Protocol Buffers, but MessagePack gave us the right tradeoffs in terms of latency and CPU usage. While an ActiveRecord representation of a database row can’t be read in any other language, our cache can talk to both ActiveRecord and any other component, regardless of its language.
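A sketch of the PubSub flow mentioned above (the channel name, payload shape, and SchemaRegistry are illustrative; the subscriber is shown in Ruby here, though in practice it runs in the Golang services):

    # Hypothetical sketch: the RoR side announces a new schema hash after a
    # migration; other components subscribe and switch to the new key space.
    require "json"

    # Publisher (RoR side, after running a migration):
    redis.publish(
      "point_query_cache.schema_changed",
      { "table" => "users", "schema_hash" => SchemaHash.for(User) }.to_json
    )

    # Subscriber (any component with a Redis client):
    Redis.new.subscribe("point_query_cache.schema_changed") do |on|
      on.message do |_channel, payload|
        change = JSON.parse(payload)
        SchemaRegistry.update(change["table"], change["schema_hash"])
      end
    end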

Rollout

Production stability is always our top priority. With this release, we were changing something as fundamental as the source of truth for our application, with years of code on top of the ORM layer we were modifying. We therefore took two high-level precautions:

  • We were selective in our rollout.
  • We devised tools to help us in various aspects of the rollout.

Selective Rollout

We were selective in the rollout of our feature to production. We first rolled out and tested on non-production environments.

We used feature flags to have granular control over the various features in the cache. As an example, we had a flag to turn the local caching mechanism on and off, so that if we found a bug in the local cache, we could turn it off without a full rollback. Similarly, we had granular flags controlling each table (Company, User, etc.) that we cache in our system. That way, if we found a problem specific to caching Companies, we could turn just that table off. Major enhancements like the includes_cache function also had their own flags.
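As an illustration, a cache-aware lookup can consult its flags before touching the cache at all (FeatureFlag and the flag names are made up for this sketch):

    # Hypothetical sketch of flag-gated cache reads: if caching is disabled for
    # this table, fall straight through to the existing ActiveRecord path.
    def find_cache(id)
      return find(id) unless FeatureFlag.enabled?("cache_#{table_name}")

      # A separate flag (e.g. "cache_local_layer") similarly gates the local
      # cache layer inside CacheStore.
      CacheStore.fetch(self, id: id) || raise(ActiveRecord::RecordNotFound)
    end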

An important aspect of our work was to identify the key callsites that generated DB load and move them over to using the cache. We did this iteratively.

Tools for the Rollout

We focused not only on monitoring and alerts, but on building tools inside and around our cache solution in anticipation of a playbook that, at that point, had never been exercised. This meant devising tools to help us understand and validate cached data, ensuring that we had the means to invalidate a portion of cached data, or all of it, if needed.

All this preparation allowed us to enable the cache with confidence, because we had the means to roll back without requiring a full release. As the cache system gained flight hours in production, we removed some of these settings, leaving only a subset for the future.

Learnings

These were our takeaways from our language agnostic cache design journey:

  • Tools: We learned to build tools that help validate data integrity in production. These include Redis scripts that populate, examine, and delete the cache, as well as a verification script to compare data in the database with that in the cache and invalidate incorrect data.
  • Using APM to identify key callsites: APM is a great tool for understanding a complex system. It led us to the most important DB query callsites that our cache could offload from the database. As you might expect, the query load was quite top-heavy, with a few key callsites accounting for the majority of the load.
    We then added metrics in Datadog to monitor cache performance, so we could see what happened to the query rate after we pushed our changes.
  • Feature flags: We learned that building switches (or feature flags) was a good way to keep production safe. It allowed us to turn off features in case of problems, to test different approaches, and to debug. For example, we built in a flag to “turn off local caching” and a flag to “turn off vehicle caching.” We recommend having these in place prior to launch.
  • ActiveRecord: We learned that ActiveRecord is complicated! We learned to treat our integration with ActiveRecord APIs as we would a critical library, and to benchmark functions like find_cache, try_cache, and includes_cache to ensure speed.
  • Driving adoption: Once available, the new methods needed to be adopted going forward by the larger engineering organization. In a globally distributed engineering team, it was a challenge to get everyone to use method cache_x instead of method x consistently, and new callsites in target flows that bypassed the cache hurt cache efficiency and DB performance. This led us to create a few custom RuboCop cops that fail any build introducing new usages of the old methods, making the cache-aware methods mandatory and keeping everyone at Motive consistent with the new API (a sketch is shown below).
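A sketch of such a cop (class and model names are illustrative, not our actual cop):

    # Hypothetical sketch of a custom RuboCop cop that flags direct ActiveRecord
    # lookups on cached models and points engineers at the cache-aware methods.
    module RuboCop
      module Cop
        module Motive
          class UseCacheMethods < Base
            MSG = "Use the cache-aware lookup (e.g. `find_cache`) instead of `%<method>s` on cached models."

            # In reality this list would be generated from the cached models.
            CACHED_MODELS = %i[User Company Vehicle].freeze

            def_node_matcher :uncached_lookup?, <<~PATTERN
              (send (const nil? {#{CACHED_MODELS.map(&:inspect).join(' ')}}) {:find :find_by} ...)
            PATTERN

            def on_send(node)
              return unless uncached_lookup?(node)

              add_offense(node, message: format(MSG, method: node.method_name))
            end
          end
        end
      end
    end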

Acknowledgments

This project and this blog post were a joint effort by Daniel Baez, Dominic Hayes, and Gautam Roy. We thank Daniel Desmueles and Harsha Lokavarapu for their major contributions to the project.

Come join us!

Check out our latest Motive opportunities on our Careers page to learn more about our rad engineering team.
