Iterating on Club Leaderboards
Iteration is an important part of my development workflow, and it’s an important part of the way we work at Strava. Over the course of my own career, I’ve learned to really value the process of incremental development. By shipping relatively small changes quickly, we can gather feedback, observe important metrics, and continue the cycle with targeted improvements. Ultimately, this helps us continually deliver athlete value on a rapid timeline.
Earlier this year, we improved our club leaderboard backend infrastructure to scale for millions of athletes. At the time, we were laser-focused on shipping a meaningful scalability improvement quickly because the old code put us at risk of overloading our database with its slow query patterns. That project was a big success: it made our large club leaderboards four times faster and delivered a comparable reduction in database load. Of course, as you might guess from the title of this article, that was just the beginning. We recently completed more development work on clubs that builds on the scalability project from earlier this year.
From a technology perspective, we knew we took a shortcut in our previous project by implementing the optimized leaderboard code in our monolithic Rails codebase (a platform we’re working to migrate away from). To iterate on that, we re-implemented a similar club leaderboard solution in a Scala backend service and leveraged an event-driven architecture (design patterns we’re moving towards).
From a product perspective, we’d developed a strong desire to support additional types of clubs. For a long time, Strava had supported only running, cycling, and triathlon clubs. That was a design decision from very long ago, and until recently we hadn’t done much work on clubs. Our top feature request for clubs was support for additional club types.
Jon, we hear you.
Backend Service Designs
When we initially scoped this project, we had two goals, one rooted in modernizing the technology that backs club leaderboards and the other in adding value to the athlete experience:
- Encapsulate the club leaderboard logic we wrote earlier this year into a Scala service.
- Support additional sport types for clubs, beginning with walking clubs.
At Strava, we have an engineering design review process where we share documents about upcoming projects with other engineers to get feedback on the approach before jumping straight into code. One of the outcomes of this design review process was an additional goal:
- Use our newer event-driven activity and stats services rather than relying on SQL database queries to update the leaderboard data in response to new activities and updates, thereby improving fault-tolerance and scalability.
We designed a system that meets all three goals. The first goal, encapsulating the club leaderboard logic into a service, was relatively straightforward. We originally imagined this as porting some code from our Rails monolith to Scala. Because we anticipated that port when we wrote the code in Rails, it was already well encapsulated behind a clean interface. As it turned out, we weren’t just porting code — we also had to write some new logic in support of our third goal, making better use of our event-driven architecture. But the logic for storing leaderboard data in Redis didn’t change much, so we saved some effort by not redesigning anything from our previous optimizations.
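To give a rough sense of the shape, here’s a minimal Scala sketch of what such an encapsulated interface might look like. The names and types are hypothetical illustrations, not our actual API:

```scala
// Hypothetical sketch of an encapsulated club leaderboard interface.
case class LeaderboardEntry(athleteId: Long, movingTimeSeconds: Long, distanceMeters: Double)

trait ClubLeaderboardService {
  // Returns the top `limit` athletes on a club's current weekly leaderboard.
  def topEntries(clubId: Long, limit: Int): Seq[LeaderboardEntry]

  // Applies an athlete's updated weekly totals to a club's leaderboard.
  def updateEntry(clubId: Long, entry: LeaderboardEntry): Unit
}
```

A narrow interface like this is part of what makes such a port tractable: callers never see the storage details, so the logic behind the interface can move between codebases without changing its consumers.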
Our second goal was to support additional activity types in clubs. From a technical perspective, it actually isn’t too hard to support arbitrary activity types on the backend — the implementation is nearly identical no matter what activity types we support since our stats service can already calculate stats for any activity type. Of course, the frontend implementation is much more complex. Should we allow groupings of multiple sport types? Should we suggest groupings of sports that make sense together? What columns should we show on the leaderboard for each sport type? How do those columns change if multiple sport types are selected? The list goes on and on, and gets more complicated as you think about the different user interface designs (on mobile and web) that might be involved with various options. To some extent, we put most of those questions aside for now. This project was focused primarily on improving our backend implementation in preparation for additional frontend changes to come at a later date. Still, it was important for us to ship some tangible improvements and to begin experimenting with what future improvements might look like. So in this iteration, we decided to build a new, more flexible backend, and make some minor frontend changes to support a few additional club types — starting with walking clubs. In future iterations, we’ll leverage research and analytics to gather feedback about walking clubs so we can continue making targeted, iterative improvements to the clubs experience.
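To make the backend half of that concrete: when stats arrive already computed per sport, supporting a new sport type can be as small as adding another value to an enumeration and another segment to a storage key. The sketch below is an illustrative assumption, not our production key scheme:

```scala
// Hypothetical sketch: sport type as just another dimension of the leaderboard key.
sealed trait SportType
case object Ride extends SportType
case object Run extends SportType
case object Walk extends SportType // the club type added in this iteration

object LeaderboardKeys {
  // One leaderboard per club, sport type, and ISO week; a new sport type is a new
  // key value, not a structural change to the service.
  def leaderboardKey(clubId: Long, sport: SportType, isoWeek: String): String =
    s"club-leaderboard:$clubId:$sport:$isoWeek"
}
```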
Our third goal was to use our newer, event-driven activity stats service to replace the SQL queries we were running in our Rails monolith to update club leaderboards. In the Rails monolith, new activity uploads are processed in a background job. Our early-2021 optimizations to club leaderboard performance used that background job to query the activity database for an athlete’s activities and populate the leaderboard from them. There are some risks associated with that approach. In particular, if Redis (our leaderboard store) isn’t available when an activity is uploaded, we don’t want the whole upload to fail, so we allow the leaderboard write to fail silently. That meant our leaderboard data could miss updates and become incorrect if Redis had a temporary outage. Beyond the potential for missed updates, we could also reduce the load on our activity database by eliminating the queries that fetch an athlete’s activity data to rebuild club leaderboards on every upload. Using our stats service to track an athlete’s weekly stats addresses both of these concerns.
At a high level, here’s how our new backend architecture fits together. When an athlete uploads an activity, we store data values for that activity (time, distance, etc.) in our activity data service. This service publishes an event via Kafka to let downstream services know that activity data was created (or modified). Our stats service listens to those events and publishes its own events to let downstream services know when stat values change. Our activity data service and stats service have both been in use for a couple of years already. The new club leaderboard service listens to stat events and keeps the club leaderboards in Redis up to date with the stat values carried in those events. A big advantage of this design is the event-driven nature of the updates. The event-driven system reduces our database load (because most data is included in the event). It’s also more fault-tolerant — if, for example, our Redis leaderboard store were temporarily unavailable, we wouldn’t miss any leaderboard updates. Our club leaderboard service would fail to process any Kafka events, but it would keep retrying until Redis was available, and its Kafka consumption would resume exactly where it left off.
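Here’s a minimal sketch of that consume-and-update loop. The event shape, store trait, and offset-commit callback are simplified stand-ins for our internal Kafka and Redis tooling, but they show why a Redis outage can’t lose updates:

```scala
// Hypothetical event and store types standing in for our internal tooling.
case class StatEvent(athleteId: Long, clubIds: Seq[Long], sport: String,
                     isoWeek: String, movingTimeSeconds: Long, distanceMeters: Double)

trait LeaderboardStore {
  // Overwrites one athlete's weekly totals on a club leaderboard (an idempotent set,
  // so retrying a batch is safe); throws if Redis is unavailable.
  def write(clubId: Long, event: StatEvent): Unit
}

object LeaderboardConsumer {
  // The event already carries the stat values, so no activity-database query is needed.
  // If any write throws, the offsets are never committed; the batch is retried and
  // consumption resumes exactly where it left off once Redis recovers.
  def processBatch(events: Seq[StatEvent], store: LeaderboardStore,
                   commitOffsets: () => Unit): Unit = {
    events.foreach { event =>
      event.clubIds.foreach(clubId => store.write(clubId, event))
    }
    commitOffsets() // only reached after every write in the batch succeeds
  }
}
```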
Rollout with Dual Writes
In this project, we implemented a new backend service from scratch, and we needed to make a plan to roll out the new service safely. As with most new services, our biggest concerns were around load testing and data accuracy. Fortunately, the foundation engineering team at Strava provides great tooling for creating new services. We can observe our service with many useful metrics in Grafana, and we have robust feature switch tooling to help us control the rollout. All of this creates a positive, supportive environment to build and release new software.
We first rolled the club leaderboard service out to production in a “dark launch”. During this phase, the club leaderboards service was listening to stat events and writing leaderboard data to Redis in response to those events, but nothing was reading from our new Redis leaderboard store yet. In other words, we used a dual write approach where we had two separate Redis stores (one managed by the Rails monolith and one managed by the new service), and we continued reading from the old one while testing the new one. This dual write approach simplified testing for us because we were able to see our new system in action and under load without fearing negative impact if it didn’t work properly. We began load testing the new system by writing to the new data store for 1% of all athletes and observing the effects. Then, feeling more confident about the load and the data accuracy, we rolled out writes to 100%. In this case, we preferred a single big jump over a slow increase so that any problems would be immediately obvious (and we could quickly go back to 0% if any problems arose). We didn’t run into any load problems when we rolled out dual writes — our systems were able to handle everything just fine!
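In sketch form, the dual-write gate might look roughly like this; the switch name and helper functions are hypothetical, and the real service uses our internal feature switch tooling:

```scala
// Hypothetical feature-switch client.
trait FeatureSwitches {
  // True if athleteId falls inside the current rollout percentage for this switch.
  def enabled(switchName: String, athleteId: Long): Boolean
}

object DualWrites {
  import scala.util.control.NonFatal

  def onActivityUpload(athleteId: Long, switches: FeatureSwitches,
                       writeOldStore: () => Unit, writeNewStore: () => Unit): Unit = {
    writeOldStore() // reads still come from here during the dark launch
    if (switches.enabled("club-leaderboard-dual-writes", athleteId)) {
      // A failure in the dark-launched path must not affect the athlete-facing write.
      try writeNewStore()
      catch { case NonFatal(_) => () } // the real service would log and emit a metric here
    }
  }
}
```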
Lurking Issues
While the initial rollout of dual writes was very smooth, we did encounter some unexpected issues a couple weeks later. The club leaderboard service uses the stat service to track club leaderboard values, and stats have a start date and end date. The club leaderboard service runs a scheduled job to create stats for upcoming weeks and delete stats for past weeks. As it turns out, the deletion of stale stats caused some unexpected load on one of our systems.
For every stat it tracks, the stat service stores each athlete’s value in its own database. This allows it to quickly look up any athlete’s value for any stat later, if another service requests it. While that’s useful in general, the club leaderboard service doesn’t really need that access pattern — it relies primarily on data from stat events received via Kafka. And in this case, supporting the lookup access pattern caused unnecessary load because the stat service had to delete all the stale stat value rows from its database when it cleaned up the stale stats (which happens every week). When we deleted our first week of stale club leaderboard stats, the stat service deleted nearly 40 million (!) stat value rows from its own database, causing our stat cleanup job to take nearly 48 hours. Fortunately, this didn’t have any major negative impact on our systems, but we did want to avoid repeating all that unnecessary work every week. We implemented a relatively quick and simple fix: a configuration option that tells the stat service not to store athlete values for club leaderboard stats in its database (though it still calculates them and publishes the updates via Kafka). Since the club leaderboard service never used the lookup access pattern anyway, this eliminated the unnecessary database operations and made our cleanup job fast.
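In sketch form, the fix amounts to a per-stat configuration flag. The names below are hypothetical; our real stat definitions look different, but the idea is the same:

```scala
import java.time.LocalDate

// Hypothetical stat registration with per-athlete storage disabled.
case class StatDefinition(
  name: String,
  startDate: LocalDate,
  endDate: LocalDate,
  storeAthleteValues: Boolean // compute and publish values via Kafka, but skip the lookup rows
)

object StatRegistration {
  // With storage disabled, there are no per-athlete rows to bulk-delete when the
  // weekly cleanup job removes the stale stat.
  val nextWeek: StatDefinition = StatDefinition(
    name = "club-leaderboard-weekly",
    startDate = LocalDate.now(),
    endDate = LocalDate.now().plusWeeks(1),
    storeAthleteValues = false
  )
}
```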
Full Rollout
We rolled out several code changes and feature switches over the course of several weeks to release all the changes necessary for the new clubs functionality. Overall, the rollout went smoothly, but it wasn’t perfect. Even the best engineers are human, and some mistakes and bugs are inevitable. We briefly introduced a bug in our club creation code due to a race condition in the new service, but we were able to quickly address the problem and released a fix the same day. All things considered, rolling out a big backend change with only one minor, quickly fixed bug counts as a success.
One of the last steps in our rollout was to begin reading from the new club leaderboard service in production. Once dual writes had been enabled for 100% of athletes for at least 2 weeks (to populate a complete set of leaderboard data), we used a feature switch to make our code read from the new service instead of the old club leaderboard store (sketched below). This change was transparent to athletes — they didn’t notice any difference because the data was identical, despite being served from the new system. But the fact that we were using the new system meant that we were now running code that could support walking clubs, and a few days later we enabled another feature switch that allows athletes to create walking clubs on Strava. Mission complete!
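That read cutover can be sketched as a single switch check (again with hypothetical names), which is also what made it so easy to reverse if anything had looked wrong:

```scala
// Hypothetical read-path gate between the old store and the new service.
case class Entry(athleteId: Long, distanceMeters: Double)

trait Switches { def globallyEnabled(name: String): Boolean }

object LeaderboardReads {
  def readLeaderboard(clubId: Long, switches: Switches,
                      readOldStore: Long => Seq[Entry],
                      readNewService: Long => Seq[Entry]): Seq[Entry] =
    if (switches.globallyEnabled("club-leaderboard-read-from-service")) readNewService(clubId)
    else readOldStore(clubId)
}
```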
This round of development work built on the club leaderboard optimizations we made earlier this year. But we’re not done iterating. The work we just completed to move club leaderboards into their own service provides better support for more activity types and unlocks many new possibilities for us in the future. We plan to continue gathering feedback, learning, and iterating, making incremental improvements as we go. Intentionally practicing iteration as part of our development process has helped us ship new things quickly while continuously striving to improve, and we hope it can do the same for you. Keep iterating!