Ac(count)ing for Scale

Anjali Merchant
Published in strava-engineering
Apr 26, 2023 · 9 min read

Tens of thousands of API applications interact with Strava’s publicly available API, from small projects whose only users are the developers who created them to larger partners like Garmin, Zwift, Wahoo, or Peloton, who upload thousands of activities to Strava daily on our athletes’ behalf. Recently the API & Platform team undertook a project to redefine the way that Strava supports API applications and allows them to interact with Strava data (read more about the updated Developer program here).

As part of the effort to redefine the management of API applications, a key requirement was that we be able to assign and enforce a limit on the number of athletes allowed to authorize an application to act on their behalf, what we called an application’s athleteCapacity. On creation of a new application, the default athleteCapacity is a single athlete: the application owner. Developers can subsequently apply to have their limits increased.

In order to enforce the assigned athleteCapacity, we needed to know how many athletes had authorized any given API application to read and write data to their Strava account, what we termed an application’s connectedAthletesCount. When a new athlete attempts to authorize an application (as pictured below), we wanted to ascertain whether the authorization would cause the application’s connectedAthletesCount to exceed its athleteCapacity.

The prompt shown to an athlete to request their permission to authorize an external application to connect to Strava on their behalf. We planned to perform the check of an application’s `connectedAthletesCount` against its `athleteCapacity` on click of “Authorize”.

If authorization would cause the application to exceed its athleteCapacity, we would prevent the athlete from authorizing. Otherwise, we would allow authorization to proceed. Our use of the OAuth 2.0 standard for external access to our API gave us an easy proxy for determining the connectedAthletesCount. As part of the OAuth flow, an external application receives an access token and a refresh token after an athlete has authorized it to connect to Strava. The application must use its short-lived access token to obtain and modify Strava resources, and its refresh token to obtain a new access token when the current one expires. Since each application only ever has a single refresh token per athlete, we intended to compute the connectedAthletesCount for an application by counting the number of unique athlete IDs with refresh tokens associated with that application.
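In practice, that count amounts to a single aggregate query over the refresh tokens table. Below is a minimal sketch of what such a lookup might look like over a JDBC connection; the table and column names are assumptions for illustration, not Strava’s actual schema:

import java.sql.Connection

// A minimal sketch of the distinct-athlete count; table and column names
// are illustrative, not Strava's actual schema.
def connectedAthletesCount(conn: Connection, applicationId: Long): Long = {
  val sql =
    """SELECT COUNT(DISTINCT athlete_id)
      |FROM refresh_tokens
      |WHERE application_id = ?""".stripMargin
  val stmt = conn.prepareStatement(sql)
  try {
    stmt.setLong(1, applicationId)
    val rs = stmt.executeQuery()
    rs.next()
    rs.getLong(1)
  } finally stmt.close()
}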

The Challenge

While we often think of counting as elementary, providing accurate counts at scale can be technically challenging. We store refresh tokens in a table in a SQL database. Counting N rows from a SQL table is at least as expensive as reading N rows with an index and becomes increasingly untenable as the number of rows to read and count grows. Some of Strava’s largest API partners have connectedAthletesCounts that are in the millions, so computing the count on the fly in response to every potential athlete OAuth event would be a no-go.

Possible Solutions

We knew that we would need to store the denormalized connectedAthletesCount for each application to keep reads cheap. We explored two possible approaches for what this would look like from a technical perspective, both of which would leverage existing Kafka pub-sub architecture in Strava’s backend and rely on a new consumer to asynchronously respond to refresh token creation or deletion events.

Approach One: Maintain our own denormalized counter
In this approach, a consumer of refresh token creation or deletion events would increment or decrement a counter in a data store, keeping it up to date. Kafka’s processing guarantees would be important here; because the counter would be “live”, processing a message multiple times or failing to process it at all would introduce inaccuracies that compound over time. “Exactly once” processing is hard to achieve, however, so we’d have to process our messages with “at least once” semantics rather than “at most once” semantics, the latter of which does not guarantee delivery. To avoid double counting if messages were ever replayed (as they can be with “at least once” processing), we could partition the stream of token events by application. Since order is guaranteed within a partition, we could store the offset of the last message processed for an application next to its count in our data store, compare the offset of each incoming message with the stored offset, and only increment or decrement the count if the incoming offset were greater.
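To make the offset guard concrete, here is a minimal sketch of what the consumer’s update step could look like, assuming a simple per-application record in a key-value store; all of the types and names are illustrative rather than Strava’s actual code:

// Illustrative offset-guarded counter update for Approach One; every type
// here is an assumption, not Strava's actual implementation.
case class CounterRecord(count: Long, lastOffset: Long)
case class TokenEvent(applicationId: Long, offset: Long, delta: Int) // +1 on create, -1 on delete

trait CounterStore {
  def get(applicationId: Long): Option[CounterRecord]
  def put(applicationId: Long, record: CounterRecord): Unit
}

def applyEvent(store: CounterStore, event: TokenEvent): Unit = {
  val current = store.get(event.applicationId).getOrElse(CounterRecord(0L, -1L))
  // Only apply the event if its offset is newer than the last one processed;
  // replayed messages (possible with "at least once" delivery) become no-ops.
  if (event.offset > current.lastOffset) {
    store.put(event.applicationId, CounterRecord(current.count + event.delta, event.offset))
  }
}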

Approach Two: Recompute the count probabilistically based on the known connectedAthletesCount
Processing guarantees would matter less in the alternative approach we considered. Instead of incrementing or decrementing a counter, the consumer of token events would probabilistically rerun our expensive count query over refresh tokens: once the connectedAthletesCount for an application passed a defined threshold, a recount would become less likely the larger the count grew. Any time we re-ran the count, we’d replace the value of the connectedAthletesCount in a data store.

Probabilistic counting would mitigate the increasing cost of count queries as the number of rows to count grew. Fewer expensive queries, however, would come at the expense of accuracy, as the count would be more likely to be stale for applications with a larger connectedAthletesCount. For our use case this was not too critical; we determined that we cared more about maintaining a fresher count for smaller applications so that we could more accurately enforce their athleteCapacity. Applications with the largest connectedAthletesCount, on the other hand, would have the most permissive limits, making stale values less important.

Decision Time

As we compared the two approaches, we also considered what backfilling a baseline count for all of Strava’s API applications would entail. Maintaining a denormalized counter would necessitate a more complex backfill, as we’d need to snapshot the count at a known point in time, write that total to a database, and then replay subsequent events against it to capture any token events that occurred between the computation and the write. The probabilistic approach, on the other hand, would require much less work; if the deployed consumer received an event for an application that lacked a connectedAthletesCount, it could compute the count irrespective of the application’s size.

Given the relative simplicity of implementing the probabilistic approach over maintaining our own denormalized counter, we decided to give it a try, so long as we were able to optimize the performance of the count query for our largest applications.

Implementation and Rollout

Our first order of business, then, was to ensure that our count query would be performant; otherwise we would have to abandon the probabilistic approach in favor of maintaining our own counter.

To assess query performance, we made a copy of the database that holds the refresh tokens table, as we wanted to avoid running tests of long-running queries on a critically important live production database. To our copy we added new covering indexes that we hoped the count query, which the consumer would initiate probabilistically in response to refresh token creation or deletion events, would leverage for better performance. Running some test queries against a few of our largest applications revealed a query execution time of ~16 seconds with the new indexes.
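For illustration, the kind of covering index we had in mind pairs the application and athlete columns so that the count query can be answered from the index alone; the DDL below uses assumed names, not Strava’s actual schema:

// Hypothetical covering index; names are assumptions. Indexing
// (application_id, athlete_id) together lets COUNT(DISTINCT athlete_id)
// for a single application be served by an index scan rather than by
// reading the underlying table rows.
val createCoveringIndexSql: String =
  """CREATE INDEX idx_refresh_tokens_app_athlete
    |ON refresh_tokens (application_id, athlete_id)""".stripMargin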

Although 16 seconds was at least an order of magnitude slower than most SQL queries performed at Strava, we had several reasons to think that the probabilistic approach would still be tenable. The fact that the potentially slow count query would run asynchronously to the refresh token creation or deletion processes themselves was critical; as these events would be initiated by an athlete end user, a long-running query would make for a poor UX were the count computation synchronous. Also important was the fact that under the default MySQL transaction isolation level, “repeatable read”, our count query would not lock the queried rows for other database sessions, a key requirement on a table so central to Strava’s functioning. Finally, while we were unsure of the consequences of running a count query like this while serving other authentication traffic, we planned to leverage instrumentation and tuning parameters (detailed below) that together would allow us to modulate the frequency of counts as we sought to balance database load against the freshness of the connectedAthletesCount.

With increased confidence in our approach, we then added a Kafka producer to emit events on refresh token creation and deletion, as well as a Kafka consumer to read the stream of events. During event processing, the consumer fetches the application to which the event pertains, probabilistically performs the count query, and finally updates the application’s connectedAthletesCount. The code snippet below illustrates the logic we put in place to determine whether to perform the recount; the configurable protections described in the list that follows gave us greater control over the rollout (a sketch of how a handler might tie these pieces together appears after the list):

def shouldCount(application: Application): Boolean = {
  if (enabledApplicationIds.contains(application.id)) {
    application.connectedAthletesCount match {
      case None => true
      case Some(count) if count < largeApplicationSize => true
      case Some(count) =>
        random.nextInt(count / connectedCountUpdateRatio) == 0
    }
  } else {
    false
  }
}
  • The use of a parameter called enabledApplicationIds let us initially gate recomputation to only the applications that we specified, allowing us to test our code on a variety of application sizes in production before turning it on for all applications.
  • For any of the enabled applications, if the connectedAthletesCount was not set, we automatically computed the count so that we’d have a value going forward (as described above).
  • We added a configurable variable called largeApplicationSize; any application with a connectedAthletesCount smaller than this value always has its count recomputed. Counts at this size and below are inexpensive enough to perform every time a token event occurs, and they also ensure that the connectedAthletesCount for the applications whose limits we care most about enforcing stays up to date.
  • connectedCountUpdateRatio, another configurable variable, allowed us to fine-tune the probability of recounting. With a connectedCountUpdateRatio of 10 and a connectedAthletesCount of 10,000, for example, calling nextInt(10000 / 10) generates a random number between 0 (inclusive) and 1,000 (exclusive). A 1/1,000 chance of obtaining a 0 translates to a 1/1,000 chance that we recompute the count. To make recomputing the count more or less likely, we could increase or decrease the connectedCountUpdateRatio.
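For context, here is a minimal sketch of how a consumer handler might tie these pieces together; the repository, count function, and event type are assumptions for illustration, not Strava’s actual code:

// Illustrative consumer handler; every type and helper here is assumed.
case class Application(id: Long, connectedAthletesCount: Option[Int])
case class RefreshTokenEvent(applicationId: Long)

trait ApplicationRepository {
  def find(applicationId: Long): Option[Application]
  def updateConnectedAthletesCount(applicationId: Long, count: Int): Unit
}

class TokenEventHandler(
    repository: ApplicationRepository,
    countConnectedAthletes: Long => Int,  // the expensive SQL count sketched earlier
    shouldCount: Application => Boolean   // the probabilistic gate shown above
) {
  def handle(event: RefreshTokenEvent): Unit =
    repository.find(event.applicationId).foreach { app =>
      if (shouldCount(app)) {
        // Rerun the count and replace the stored value.
        val fresh = countConnectedAthletes(app.id)
        repository.updateConnectedAthletesCount(app.id, fresh)
      }
      // Otherwise the event is a no-op for the stored count.
    }
}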

We also added instrumentation to our code so that we’d have visibility into the latency of our handler and the rates of computes, no-ops, and errors, among other metrics. Monitoring of our metrics and logs revealed that the initial deployment to production and tests using the enabledApplicationIds parameter failed with timeouts in the server’s connection to the database, an unsurprising result in retrospect, given that we expected the longest count query to take ~16 seconds, much longer than the default timeout threshold. After bumping the timeout, we were excited to see the connectedAthletesCount property populated on our internal applications admin UI for the applications that we had specified in enabledApplicationIds; our consumer was working!

Display of the `connectedAthletesCount` on an internal applications management UI.

When we removed the enabledApplicationIds parameter to allow computation for all applications, we turned to fine-tuning the connectedCountUpdateRatio. Out of an abundance of caution we had set the initial ratio value to be small, resulting in an absurdly low frequency of recomputes and prolonged data staleness for applications with a connectedAthletesCount greater than the largeApplicationSize. To derive a more appropriate connectedCountUpdateRatio, we estimated how often we wanted these more expensive recomputes to happen for an application of a particular size, looked at approximately how often token events were received for such applications in a given period of time, worked backwards to arrive at a suitable ratio, and made the corresponding changes.
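As a worked example of that back-calculation (the traffic numbers below are invented for illustration, not Strava’s actual figures): for a large application, the probability of a recount per token event is roughly connectedCountUpdateRatio / connectedAthletesCount, so a target recount frequency implies a suitable ratio.

// Illustrative back-calculation of connectedCountUpdateRatio.
// P(recount per event) ≈ connectedCountUpdateRatio / connectedAthletesCount,
// so ratio ≈ desiredRecountsPerDay * connectedAthletesCount / tokenEventsPerDay.
def suggestedRatio(
    connectedAthletesCount: Long,
    tokenEventsPerDay: Long,
    desiredRecountsPerDay: Double
): Long =
  math.max(1L, math.round(desiredRecountsPerDay * connectedAthletesCount / tokenEventsPerDay))

// e.g. ~2,000,000 connected athletes, ~50,000 token events per day, and a
// target of ~2 recounts per day yields a ratio of 80:
// suggestedRatio(2000000L, 50000L, 2.0) == 80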

Continued monitoring of database metrics throughout the rollout and tuning process allayed our concerns about our count query having a negative impact on load and on our ability to serve other traffic; we didn’t see any significant changes to server latency or database CPU. We then moved on to the final steps of the task, which entailed building out full-stack functionality for checking an application’s connectedAthletesCount against its athleteCapacity limit and enforcing that limit during authorization events.

The page rendered during the authorization process if the authorization event will cause an application’s `connectedAthletesCount` to exceed its limit.

Conclusion

Denormalization is one way to address the inefficiency of performing on-the-fly counts at scale in a SQL database. To compute and store a denormalized connectedAthletesCount on API applications at Strava, the API & Platform team weighed two different approaches that captured trade-offs between accuracy and implementation complexity. In our use case, the fact that larger applications tended to have the most permissive athleteCapacity limits meant that stale counts for our largest applications were less important. The use of tuning parameters enabled us to strike a balance between data freshness and database performance. Engineers facing similar counting challenges will have to weigh the importance of these trade-offs for their own use case.

Huge thanks and kudos to Jeff Pollard, Yudi Fu, and Collin Neuhaus for their contributions to solution design and implementation.
