Sometimes, we work on shiny new features that (we hope) athletes will notice and love. Other times, we need to work on back-end improvements that are unlikely to be noticed at all, but are usually just as important as the other things we work on. Some recent work I did for our club leaderboards fell into the latter category — unlikely to be noticed, but really important for the scalability and reliability of strava.com.
The Scalability Problem
Our club leaderboards were originally built circa 2013, and have remained largely unchanged since then. In some sense, this is a good thing — it’s a testament to the quality of the original implementation that our code was able to withstand more than eight years of user growth with few code changes. But more recently, problems have begun to emerge. Back in 2013, we wrote code to treat clubs with 10 athletes, 100 athletes, and 1,000 athletes differently. As this implies, clubs were originally designed for small communities and groups of friends with tens or hundreds of members. Today, our largest club has over two million athletes. The way the product is used has evolved, but the way our code handles this scale hasn’t (at least, not yet). Clubs in 2021 sometimes have many more members than we envisioned when we originally built the club leaderboard back-end in 2013, and we recently began to see scalability problems with the original implementation that could affect its performance and in rare cases even affect site-wide performance.
Because club leaderboards weren’t designed for huge clubs, the way our back-end systems calculate club leaderboards was rather simplistic. If you’re familiar with clubs on Strava, you’ll know that clubs have leaderboards for this week and last week, sorted by total distance or total time. The app shows the top 10 athletes plus the viewing athlete, and the web interface simply shows the top 100 athletes. To calculate these leaderboards, our back-end would run an SQL query for every activity done by every member of the club this week (which would then be grouped by athlete, summed, and sorted). This was an Active Record query in our Rails app, and the generated SQL was essentially this:
SELECT * FROM club_members cm
JOIN activities a ON cm.athlete_id = a.athlete_id
WHERE cm.club_id = ?
AND a.start_date_local >= ? AND a.start_date_local < ?
With proper indexes, that query is reasonably performant for relatively small clubs (under about 100 members, say). But with larger clubs, the performance degrades quickly. For every additional athlete in the club, the query has to examine many more rows (all of the additional athlete’s activities) to filter based on the start date. And there’s a fundamental problem here — every time any athlete in the club uploads or modifies an activity, the entire leaderboard becomes invalid. And when it’s loaded again, the existing code will use the above query to rebuild the leaderboard — recalculating the data for all athletes in the club even though most athletes’ data remains unchanged.
Around 2015, we must have begun noticing some scalability problems with that query, because our git history shows a commit where we added a very simple caching mechanism. If the club had more than 200 members, we would simply cache the results of that query for up to 6 hours at a time. This seems like a quick-fix solution that was probably very beneficial for improving the loading time of the leaderboard (at least in the average case), but didn’t necessarily fix the root problem — it just covered up the symptoms with a cache. Clubs with thousands of members would of course load the leaderboard much faster with the cache mechanism (though the trade-off for this was that the data might be up to 6 hours out of date). Unfortunately, the cache didn’t really fix the real root problem — the database query that didn’t scale well with club size was still being used.
Today, Strava’s largest clubs have millions of members. For clubs that large, the query above can take up to about 20 minutes to run (at which point the results will be cached for 6 hours, so the leaderboard would still appear to load quickly). User experience aside, the worst problem from an engineering perspective is the load that this query puts on our database. Other, seemingly unrelated parts of our website and app could experience degraded performance while the database was busy trying to fetch the activities for a club leaderboard. We had a couple incidents within the last year or so where we briefly experienced high error rates (HTTP 500s) across the site because too many of these queries were running simultaneously.
We developed a plan to replace the leaderboard back-end implementation (fixing the performance problems and eliminating the risk of a site-wide incident) while leaving the front-end mostly unchanged. We wanted to tackle the performance problems quickly to eliminate the risk of further incidents related to this problem. To do so, we needed to develop a way to build and store the leaderboard that would scale better than the existing query for large clubs. The scalability problems were primarily caused by the fact that the old implementation queried for activities from all the athletes in the club at once and recalculated the entire leaderboard cache any time there was a single change. To avoid this problem, our new back-end implementation allows club leaderboards to be updated on a per-athlete basis. That is, we maintain the state of the leaderboard, and when an athlete uploads (or edits or deletes) a new activity, we only have to modify that athlete’s score — we don’t need to rebuild the entire leaderboard because a single activity changed.
Of course, this means we need to persist every club’s sorted leaderboard for both weeks (this week and last week). We chose Redis as the storage platform to use here. It’s a good fit because Redis sorted sets provide a way to maintain sorted data (like a leaderboard) and retrieve ranks quickly, and because Redis provides a way to expire the data when it’s no longer needed (after two weeks have passed). When an athlete uploads (or edits or deletes) a new activity, a background job is triggered to update that athlete’s club leaderboard data. We query once for that athlete’s activities in the past two weeks. Even in the worst case (athletes who work out multiple times per day), this query is very fast. Once we have the relevant activity data, we update the athlete’s score in our leaderboard data, and Redis keeps the set sorted (so the athlete’s position is always correct). In Redis, the members of our sorted set are athlete IDs and the score is the athlete’s weekly distance (or another stat, depending on the type of leaderboard).
When you change the back-end of a live system, data migration is usually a major concern. How do you get data into the new system while keeping the old system running? In this case, we were lucky because we were able to mostly avoid the problem with a dark launch. We created two new feature switches. One switch turns on writes to the new club leaderboard back-end, and the other switches reads from the old system to the new one. By enabling writes to the new system (first for employees, then for 10% of users, then for 100%), we were able to load test writes on the new system in prod and populate it with data while users continued seeing data from the old system. Below, you can see our Redis storage increase as we rolled out writes to the new system.
After writing leaderboard data to (but not yet reading it from) the new system for two weeks, we switched on reads to the new system (which by then had complete data). This was exciting because we were able to see our response time metrics improve as we rolled out the new code. Because the old implementation often hit the cache, but occasionally missed the cache and recalculated the leaderboard during the request, this was most visible in our P99 request metrics. We saw minimal improvements or no change in average response time, but saw a 4x improvement in P99 response time.
When you’re developing software, there’s often tension between competing interests. There’s always more work to be done than can possibly be completed. There are bugs to be fixed, new features to be implemented, and scalability or infrastructure problems to be addressed. All of this, essentially, is a question of prioritization. When we planned the club leaderboard work, we decided up front that it was worth investing six weeks of developer time (with a small team of one developer and one part-time code reviewer) to tackle this problem, but we didn’t want to take more resources than that away from other priorities. Planning this time commitment at the beginning of the project helped us architect a solution that was achievable within that time frame. Six weeks was long enough for us to implement a really solid solution to the core problem (by completely redesigning the club leaderboard back-end) rather than just putting more band-aids over it.
But because we knew our time was limited, we cut some scope to ensure we could finish in the allotted time. We would have liked to build our new club leaderboard solution as an isolated back-end service in Scala, but we weren’t confident that the additional amount of work required could be completed within our six week timeline. So, instead, we built the solution in our existing club leaderboard codebase in Rails, and we ensured that it was well-encapsulated so it would be easy for us to extract the logic to a service in the future. We also made a few other small compromises in our designs. For example, we thought about implementing a relatively complicated backup and restore solution for the leaderboard data in Redis, but ultimately decided that would be overkill (since the data expires after two weeks anyway and regenerates itself as activities are uploaded), and a relatively standard Redis cluster with read replicas would be reasonable given our time constraints. By cutting some scope from the project, we were able to complete the work in a way that didn’t divert too much developer effort from our other priorities, but still addressed the root scalability problem in a timely manner.
Overall, the project was a big success. We achieved our primary objective — to reduce the risk of a future incident because of database load caused by the expensive club leaderboard query, and to do so in less than six weeks of work. In addition, we improved the athlete experience by ensuring that our club leaderboards work quickly and correctly — even when the club is a large club. We swapped out the old, bad back-end implementation for a better one, and we did so with no downtime and little changes to the existing front-ends. Hopefully, our new system will be easier to work with and modify, and will last at least as long into the future as the old one did!