Rebuilding the Segment Leaderboards Infrastructure — Part 1: Background
Over the past year, the Strava platform team has been working to rebuild the segment leaderboards system. This is the first in a series of four blog posts detailing that process. This first post aims to give some context and background on the segment leaderboards system, analyze previous systems for their design characteristics, and detail some themes and other high level technical challenges the leaderboards system presents.
Since the Beginning
One of the earliest Strava concepts was that of a Segment: member-created and edited portions of road or trail where users can compete for time. When a Strava user rides, runs, swims, or in any way traverses a segment as part of an activity, during activity upload processing we calculate the time it took for them to traverse that segment. This segment “effort” is then placed on a “segment leaderboard,” ranked against all other efforts by all other athletes who have traversed that segment.
Aside from the “overall” leaderboard, which contains all athletes who have attempted the segment, Strava also provides leaderboards on various other dimensions. These include:
- temporal features (effort occurred today, this week, this month, this year)
- athlete age group (i.e. 25–35 years old)
- athlete weight class (i.e. 125lb — 149lb)
- athlete gender (male, female, unknown/unspecified).
Beyond leaderboards based on these static features, users can further filter efforts to make “dynamic” leaderboards containing people they currently follow, or people who are members of a particular club. Users can also see a leaderboard of just their own efforts ranked against each other as well.
Segment leaderboards are one of Strava’s most popular features, and something we and our community finds large value in. They provide our users a rich set of comparisons on various characteristics, allowing Strava users to quantify their athletic performance, track it over time, and compare their performance against others.
Leaderboards Infrastructure History
Being such an old, core, and popular feature of the Strava product, the segment leaderboards infrastructure itself is also somewhat archaic. Over the years, as the number of activity uploads increased, architectural deficiencies with the leaderboards system emerged and highlighted problems with its design. Generally speaking, the pace of problems provided enough work for one engineer to work on them continuously since their creation. This engineering work has produced many system updates, or even whole new versions, in the service of improving the system performance for the future. The remainder of this section serves as a survey of this work.
The very first segment leaderboards system wasn’t really a “system” at all. Back then we had to support far fewer leaderboards and filters, and had to operate on a much, much smaller set of efforts. So a purely SQL query-backed solution, which produced leaderboards by running SQL queries against the table holding all segment efforts, worked well enough.
As the number of efforts grew, and the desire to support more leaderboards and filters increased, around 2012 a decision was made to ditch the SQL query-backed approach of the initial system, and instead denormalize the effort data into some sort of leaderboard-specific data store. The business logic for managing that denormalization would still be contained in our Rails web app, we were just looking for a more scalable way to manage and store leaderboard data.
Redis was chosen as the data store and the new leaderboards system was built, and deployed. At the time, the choice of Redis made a lot of sense. Redis is fast, provided a way to have synchronized access to leaderboard-useful data structures (i.e. sorted sets), and supported replication for failure recovery. We also had plenty of operational experience with it.
A past blog post, which detailed how we powered KOMs with Redis elaborated, on our happiness of the decision:
Did I mention yet that we love Redis? From a systems POV it has been awesome: replication is easy, memory footprint is small (we do optimize for this), and the CPU is largely bored. Today all of our leaderboards fit on a single replicated Redis box using under 16GB of memory. As our data-set grows we feel comfortable sharding it as our access patterns are already so straightforward (thankfully Redis forces you to keep things simple).
In 2014 we decided to enhance the Redis-backed solution further by extracting the leaderboard managing application code from our Ruby on Rails web application and putting it behind a Scala-powered RPC service. The actual application logic was the same, we just rewrote it in Scala, and changed it to run in a standalone service, separate from the Rails app. As efforts were added or removed, the Rails app would now issue a remote RPC call to the Scala service asking it to update leaderboards with that effort. The move to a Scala service allowed for leaderboard updates to happen in a faster, more concurrent runtime, and allowed us to easily iterate on the service code independent of the development cycle of the Rails application.
The Redis storage of leaderboard data scaled up to be quite large. At the time of its death just this past month, we were averaging 1.4 million activities uploaded per day, with 75–100 efforts added per second. To support that, we were running a 60 node Redis deployment holding 1.8TB of available memory. The Scala application server deployment also grew to between 40 and 60 server r3.large EC2 instances.
Not everything ran smoothly, though. Over time the Scala-powered Redis-backed service began to experience chronic failures, highlighting problems with the design of the system. A few strava.com outages were due to problems with the leaderboards service. By 2015, the general consensus was that work should focus on making it more stable, scalable, and easier to manage for the future.
We reviewed past decisions and problems with the leaderboards system, trying to determine from a more systemic sense where we were failing. Our findings of the research determined two main themes of problems:
Problem #1: Redis was a problematic choice for storage
With Redis, all data has to fit into memory. However, memory is quite expensive to scale. As our data set grew, it became obvious that scaling an infrastructure to hold the entire data set in memory was untenable. If we were to upgrade our machines to the next instance size up, we’d be paying north of $20,000/month for just data storage, even with a one year prepay of EC2 reserved instances.
Since memory is expensive, the system tried to be frugal with its choice, and number of, Redis data structures. Consequently, we only maintained two denormalized leaderboards as sorted sets (overall and female), with efforts for all other leaderboards held in a big Redis hash per segment. The hash was keyed by athlete (who had attempted the segment), and the values were a custom Ruby binary format (remember, this came from Rails) holding all efforts necessary to produce all other leaderboards.
Since Redis hashes are unordered, to build any other leaderboard, besides the overall or female one, we had to transfer the entire hash structure to the Scala application server, decode it, and then build and sort the leaderboard in memory. As the number of athletes who had attempted each segment grew, the expense of transferring and processing the hash also grew. For some large leaderboards, simply the transfer of the hash data across the network took 50ms, a very long time in the world of Redis.
Compounding this problem was the decision to shard leaderboard data by segment. This meant all Redis commands and data pertaining to a particular segment went to the one Redis instance which owned that shard. This often created hot spots, as certain operations produced a high volume of expensive Redis commands scoped to one particular segment. When experiencing one of these hotspots, we would often see contention or command timeout, as the single Redis instance could not handle the sudden surge of traffic.
We continued to not do ourselves any favors by making the unfortunate architectural choice to shard the dataset into the relatively small number of 100 shards. Since a shard is 1% of the dataset, it makes the shard size really chunky and awkward, making it hard to pack shards efficiently onto a single instance. Moreso, because Redis is single threaded, within a single shard there is only one network pipe available to handle all request and response operations. As the number of commands and size of responses increases, so does contention.
Finally, Redis has a generally poor high availability ops story. If a Redis instance went down, we’d have to promote a standby replica — something that would happen more often than we would like given our cluster size was so large. It’s obviously possible to automate this recovery, and the more recent existence of Redis cluster makes sharding data easier, but we questioned the utility in improving the operations of a data system we were not sure we wanted to continue with in the first place.
Problem #2: Synchronizing leaderboard writes via RPC is hard
The leaderboard system is a distributed system, made up of many server and data instances, all handling requests in parallel. Requests which require writes generally followed the pattern of first reading leaderboard data, modifying the structure in memory, then writing it back to Redis. To prevent concurrent and/or conflicting updates, the Scala service employed very aggressive locking around most write operations. This locking obviously created write latency and computational overhead, but when combined with chunky shards and hot spots, it also created large amounts of lock contention. Sometimes, during high load, many requests would time out waiting for a lock.
To combat these failures, the Scala service also had very aggressive retry behavior on lock timeouts or failures of other commands. If a request did not succeed on the first attempt, it would be retried several more times (with exponential backoff). Retries aimed to limit total request failures in hopes of keeping data consistent, but these aggressive retries and lock acquisiton requests often amplified problems during overload scenarios, as retrying a failed request on an already overloaded system made the overloaded situation worse.
Despite all our retries, if we determined that we had read inconsistent data, the leaderboards service would kick off repair jobs to fix that data in the background. These background tasks often generated additional non-trivial load on the system. If the system was already overloaded, this caused increased write failures and data inconsistency. That inconsistent data would then triggering more async repair operations, which added more load, etc. You get the idea.
Certainly using locks to maintain consistency is a not a bad idea (noting they harbor their own challenges), but the choice of 100 shards, Redis, and the method of locking created technical obstacles to leveraging that locking effectively. No amount of retries, async background workers or other data repair techniques added to the system could fix those systemic issues.
Summary of Technical Challenges
With a better grasp of the history and issues of segment leaderboard system at Strava, we identified several technical challenge themes:
- The growth rate and size of data is quite large. Strava is growing, and many segments have now seen nearly 100,000 athletes attempt them. Popular segments have well north of half a million efforts. Redis node memory consumption grew 43% in the past 12 months.
- Data accuracy and consistency is of the utmost importance. Our users care deeply about their leaderboard results. They often work hard and challenge themselves to get a better time on a particular segment, and if we fail to update a leaderboard with that new effort, we’ve disappointed them.
- New efforts must be visible on leaderboards with low latency. The normal use case for our user is to upload an activity, and then go see how they fared on leaderboards within a few seconds of their upload completing.
- Leaderboards have a strong reliance of synchronizing writes. For any effort write occurring, the system needs to A) retrieve all existing best efforts on the same segment, B) determine if the incoming effort is better than any of the existing efforts, and C) if so update the data store with the incoming new effort. The leaderboard system needs to be able to do all this atomically, without concern for other threads updating data at the same time.
- The variety of leaderboards presents a challenge to the data store. We want users to be able to view complex multifaceted leaderboards — i.e, see who was the fastest female this year who is 55–64 years old, or which of my male, club members had the fastest time on the climb they all just rode today. This often means needing to return ordered, ranked, and sliceable (for pagination) sequences of efforts, sometimes made up of arbitrary sets of athletes determined at query time.
Previous leaderboard system implementations often tried to solve for one or all of these problems, but usually one solution came at the expense of the others. For example, the Scala-backed approach placed a high value on trying to maintain accuracy with low latency updates and trying to maintain consistency after a write, but at the expense of poor support for synchronizing writes or crafting systems that didn’t scale (at least horizontally) with the growth of data.
Our future leaderboards system implementation would aim as best as it could to try and balance all of these requirements in a way that meets the needs of the product the best.
In understanding the design of any new large system, it’s important to have a strong understanding of the historical context from which the system is being built, along with a survey and an understanding of the themes and common problems the system intends to solve.
With these goals in mind, for the future installment of this series we’ll be digging deeper into the technical challenges of re-architecting a leaderboards system, determining the first principals of what a new system would eventually look like.