How to Deal with API Quota Limits While Minimizing Risk: A Caching Approach

Arthur Swieckowski
Ordergroove Engineering
5 min read · Aug 8, 2022

Problem Context

Ordergroove has an integration with Shopify, one of our core partners in eCommerce. This integration requires syncing data models between the two systems. Since there is often no clear one-to-one relationship between the two systems’ entities, this syncing requires careful translation between domain models, which I’ll refer to from here on as sync orchestration.

Ordergroove takes responsibility for sync orchestration in both directions, that is, translating operations from Shopify to Ordergroove and from Ordergroove to Shopify. However, performing queries and mutations through the Shopify API is not free; each call consumes a limited but recharging quota. As the Shopify merchants using Ordergroove continue to grow, so have the number and density of operations requiring sync orchestration. We have inevitably run into API throttling that could negatively impact our merchants and their users.
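For a sense of what that quota looks like in practice, here is a rough sketch of inspecting the throttle state that Shopify’s GraphQL Admin API returns in its response extensions (the shop domain, API version, and access token below are placeholders, not real values):

```python
import requests

# Placeholders: substitute your shop domain, API version, and access token.
ENDPOINT = "https://example.myshopify.com/admin/api/2022-07/graphql.json"
HEADERS = {
    "X-Shopify-Access-Token": "<access-token>",
    "Content-Type": "application/json",
}

# Any query will do; the cost extension comes back alongside the data.
resp = requests.post(ENDPOINT, json={"query": "{ shop { name } }"}, headers=HEADERS)
throttle = resp.json()["extensions"]["cost"]["throttleStatus"]

print(
    f"{throttle['currentlyAvailable']} of {throttle['maximumAvailable']} "
    f"cost points available, restoring at {throttle['restoreRate']}/second"
)
```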

The question, then, is: how can Ordergroove optimize its use of Shopify API queries and mutations?

[Diagram: the numerous differing Ordergroove operations and Shopify webhooks]

Ordergroove subscribes to the relevant Shopify operations through webhooks, and proper sync orchestration almost always necessitates querying the Shopify API. When an operation is triggered from the Ordergroove side, both Shopify API queries and mutations must occur.

Shopify quota consumption is therefore mostly query-heavy. And though some mutations could potentially be optimized through clever reordering and grouping, the benefit would be marginal and the implementation would differ wildly on a case-by-case basis.

Simply put, optimizing mutations is relatively low impact and high in complexity.

On the other hand, optimizing querying is lower in complexity and higher in impact (the same process occurs for syncing in both directions and can be easily generalized to many different operations that rely on the same entities).

Specific Problem Statement

Ordergroove needs to somehow optimize its Shopify API querying.

But how? Caching!

In the most naive sense of the term, caching is saving something for future use.

Simple, right? But our specific problem occurs at scale, and there are already many existing dependent systems and operations, so our cached data absolutely cannot go stale without risking serious negative impacts on the merchants and customers who rely on us. However, those numerous mutations I mentioned before are all vying to make our cache out of date.

We need to make sure our cache is either always busted or constantly updated. But how can we be sure that there isn’t some operation in our large and robust system that we’re missing when we make that switch? We can pore over every line of code, but human error is always a risk, especially when trying to comprehend and account for many fine details in a system. There may be some key operations that are simple enough to list, and those might even comprise 80% of all the operations. But searching for that last 20% is high effort and prone to error.
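To be clear about terms, “busting” here just means invalidating the cached entry whenever a mutation touches the underlying entity, so the next read repopulates it from Shopify. A minimal sketch of that pattern, with hypothetical names standing in for our cache client and mutation helper:

```python
def update_entity(entity_id: str, changes: dict, cache, shopify_mutate) -> None:
    """Apply a mutation through the Shopify API, then bust the cached copy.

    `cache` (a get/set/delete key-value client) and `shopify_mutate` are
    stand-ins for illustration, not our actual implementation.
    """
    shopify_mutate(entity_id, changes)
    # Invalidate instead of updating in place: the next read will repopulate
    # the cache with whatever Shopify actually persisted.
    cache.delete(entity_id)
```

The hard part, as described above, is being confident that every mutation path in the system actually goes through a hook like this.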

Therefore, our real problem becomes: Ordergroove needs to optimize Shopify querying with a cache while minimizing the risk of utilizing that cached data in its sync orchestration.

Solution

Ok. One straightforward approach to minimizing the risk is to tackle just a portion of the problem at a time, perhaps focusing on a single entity to cache. But we also want that entity to be important enough (queried very often) to alleviate our quota problem. Ah, a Catch-22 of risk management: balancing the risks of current problems against the risks of their solutions.

How did we manage that Catch-22? We embraced that our initial attempt would have failures and edge cases that we didn’t account for, but removed the impact that those failures could have.

How? We implemented the cache and prevented staleness where we knew it would occur, but we didn’t actually use the cached data in our operations. Instead, we compared the results of every read from the cache to the result from the Shopify API query and logged whenever there was a discrepancy.
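As a rough sketch of that comparison step, with a hypothetical cache client and Shopify query helper standing in for our actual code:

```python
import json
import logging

logger = logging.getLogger("shopify_cache_shadow")

def get_entity(entity_id: str, cache, shopify_query) -> dict:
    """Serve the live Shopify result, but shadow-compare it against the cache.

    `cache` is any get/set key-value client and `shopify_query` is a callable
    that fetches the entity from the Shopify API; both are stand-ins.
    """
    live = shopify_query(entity_id)      # Shopify remains the source of truth
    cached = cache.get(entity_id)

    if cached is None:
        cache.set(entity_id, live)       # warm the cache for future reads
    elif cached != live:
        # Never fail the operation; just record the mismatch so we can find
        # whichever mutation path forgot to bust or update the cache.
        logger.warning(
            "cache discrepancy",
            extra={
                "entity_id": entity_id,
                "cached": json.dumps(cached, default=str, sort_keys=True),
                "live": json.dumps(live, default=str, sort_keys=True),
            },
        )

    return live                          # the cached value is never acted on (yet)
```

In a sketch like this, switching over later is just a matter of returning the cached value when it is present instead of the live one.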

Solution Process

This approach is not only a pattern but also an iterative process. We had to beef up our logs so that we could easily investigate which operations were causing the discrepancies and resolve them one by one.
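Concretely, that meant attaching enough context to each discrepancy to trace it back to the operation that caused it. A hedged sketch of what that might look like (the helpers and field names are illustrative, not our actual logging):

```python
def diff_fields(cached: dict, live: dict) -> dict:
    """Return only the fields that differ, so logs point straight at the problem."""
    return {
        key: {"cached": cached.get(key), "live": live.get(key)}
        for key in set(cached) | set(live)
        if cached.get(key) != live.get(key)
    }

def log_discrepancy(logger, entity_id: str, operation: str, cached: dict, live: dict) -> None:
    """Record a discrepancy along with the operation that triggered the read."""
    logger.warning(
        "cache discrepancy",
        extra={
            "entity_id": entity_id,
            "operation": operation,   # e.g. a webhook topic or internal task name
            "diff": diff_fields(cached, live),
        },
    )
```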

We transformed what was originally a high-risk change that would have had to be perfect out of the gate into a data-driven, iterative one that improved the system over time. The data-driven aspect makes a huge difference: where before risk and impact were black boxes, now our production environment (not just some test suite we wrote ourselves) was telling us where, when, and how we failed.

From there, the process took care of the rest. Watch for discrepancy alerts, investigate and categorize issues, and resolve them. Rinse and repeat until you don’t have any more discrepancies, or until you’ve reached a number and type of discrepancies that you’re comfortable with; you have the data, so you can accurately do cost/benefit analyses on these decisions now!

Results

We were right about our initial pass not being perfect. There were a bunch of small operations we had failed to account for that could still have had an impact on our users. However, there were also oversights in places we thought we had covered. For example, we made sure to bust our cache on a key calculation that mutates a core entity, but this operation was asynchronous and could take quite a while. Initially the bust would create a cache miss and force a read from Shopify; however, that read would re-cache the result, and if another operation managed to slip in at just the right time (love race conditions), it would be working off the stale cached data instead of the updated result from our async calculation.
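To make that race concrete, here is a small self-contained sketch of the interleaving, with invented entity names, timings, and stand-ins for the cache and Shopify’s state:

```python
import threading
import time

cache: dict[str, str] = {}
shopify_state = {"product:123": "old value"}   # stand-in for Shopify's data

def async_recalculation(key: str) -> None:
    """Stand-in for the slow asynchronous calculation that mutates the entity."""
    time.sleep(0.1)                     # the calculation takes a while
    shopify_state[key] = "new value"    # Shopify is eventually updated...
    # ...but nothing busts the cache entry that was re-populated in the meantime.

# T0: a mutation busts the cache and kicks off the async recalculation.
cache.pop("product:123", None)
threading.Thread(target=async_recalculation, args=("product:123",)).start()

# T1: another operation slips in, misses the cache, reads the *old* value
# from Shopify, and re-populates the cache with soon-to-be-stale data.
if "product:123" not in cache:
    cache["product:123"] = shopify_state["product:123"]

# T2: the recalculation finishes; the cache still holds the stale value.
time.sleep(0.2)
print(cache["product:123"], "!=", shopify_state["product:123"])   # old value != new value
```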

Our approach allowed us to identify and resolve problems such as this one without ever introducing any risk or impact to our system. Once we were confident the implementation was ready, we removed the comparison system. The difference was staggering, for example, the merchant with the greatest quota consumption issues went from a minimum available quota of 0% to one of 90.31%. Very similar improvements were seen across the board without a single regression for our merchants and customers.
