Rebuilding the Segment Leaderboards Infrastructure — Part 3: Design of the New System

Over the past year, the Strava platform team has been working to rebuild the segment leaderboards system. This is the third in a series of four blog posts detailing that process. This post describes the implementation details of the new leaderboards system, showing how it achieves the design goals set out in earlier posts, and fixes some common problems seen in the previous leaderboards system. For some added context, please read part one, which details the background of the leaderboards system, and part two, which distilled the problems of the previous leaderboards systems down to a set of principles a new system should solve. For the next article in the “ Rebuilding the Segment Leaderboards Infrastructure” series, check out Part 4: Accessory Systems.

Goals Review

Reflecting back on part two of this series, we have two systems we need to build out for the new leaderboards service:

  1. A stream processing system, which is strongly ordered, partitionable, and idempotent.
  2. A data storage system which is distributed, able to store all leaderboards in a homogenous way, and supports fast writes.

Stream Processing System

If you need a strongly ordered, partitioned backing store for messages in a stream processing system, Kafka is the obvious choice. Out of the box, it offers replication, strong ordering, high availability, and partitioning — exactly the system properties we need. There are certainly other messaging solutions which could be considered; those run the gamut of everything from Redis (via PubSub), to more mature and pure messaging solutions such as ActiveMQ or RabbitMQ. These others messaging systems generally do not have the same durability, availability, or ordering guarantees as Kafka, and we already used Kafka in production and liked it, which made it a relatively easy choice from there.

If you’re not familiar with Kafka, I’d suggest spending some time reading the Kafka documentation, specifically the section on its design. This will make the rest of this post easier to follow.

Design

With Kafka in hand, let’s figure out how to go from efforts being created, updated, or deleted in our Rails app, to an ordered, partitioned, stream of add and remove leaderboard update messages.

A naive solution may be to have the Rails application log the new effort state to a Kafka topic each time the effort is mutated. A downstream consumer could then simply consume those effort mutation messages, and apply updates to the data store as needed.

This seems like it would work, but there are consistency problems. While effort mutations are strongly ordered within the Kafka partition, we’re still at the mercy of the message publishers to log them to that partition in the correct order. As part two of this series points out, efforts are mutated in our Rails application in parallel (occurring in web transactions, background jobs, etc) without any sort of synchronization or ordering. Consequently, effort mutations can (and will) be published out of order, which renders the order of logged mutations more-or-less useless.

There is hope, though. One important property of our system is that for the entire lifecycle of an effort, two important fields are immutable: the segment the effort traversed, and the user whose activity the effort belongs to. Knowing this, we can instead use the effort mutation messages logged by active only as a “notification” that something about the effort changed. The message doesn’t describe what about that effort specifically changed, we only know something changed and the leaderboard store may need to be updated.

Under that model, we can push the ordering requirement further down into the leaderboards system, where we have more control. If the worker consumes the partition of effort mutation in order, as it processes each message it can query our canonical effort storage for the latest effort data, and then make a decision on what kind of update to apply to leaderboards at that point.

No matter what order the effort mutation messages were logged in, the worker will eventually update leaderboard storage to the correct data.

The key takeaway here is that this architecture is eventually consistent, and handles all race conditions and out of order logging of effort mutations from the Rails app. This is due to a couple important properties:

  1. The actual transaction mutating the row in the database for the effort commits before the effort mutation event is logged.
  2. A correct leaderboard update can be determined based on the state found in canonical effort storage, for any possible mutation event.

Even if the worker reads an effort row from canonical storage, and then another actor in the system immediately updates the effort, rendering the worker’s read stale and invalid, the worker can still safely apply its now stale update. This is safe because the worker will eventually consume the mutation logged by the actor due to its effort update. On consumption, the worker will then read the final correct effort state, and update leaderboards accordingly.

Immutability

Before declaring victory with the stream processing system, it’s also important to think about our other stream processing goal of idempotency. What would happen if we replayed a message more than once? Or, since we will be implementing our worker as a Kafka consumer, what happens if it crashes, and then restarts its consumption from an earlier offset in the log? Can leaderboard consistency be maintained?

Thankfully, yes. If we consider each effort mutation as a stateless notification the effort mutated in some way, then being re-notified of a mutation will simply query the latest state again. Our consumer will still attempt to issue the leaderboard update again, but update is based on the most recently queried state from our canonical effort storage. In the even more complicated example of our consumer crashing and restarting its processing from an earlier point in the log, the same property still holds. The worker will attempt to issue the leaderboard update again based on the latest data.

Set Up for Success

The end result of this design is that we have a system which turns a potentially out-of-order series of effort mutations into an ordered sequence of leaderboard updates. As long as sequential processing of a partition occurs, effort updates are idempotent, and rewindable, due to the strong ordering guarantees of a Kafka partition and the causal relationship of updates in the system.

Understanding the properties of messages in your system often times can reveal characteristics you are able to leverage for efficient and accurate processing. In particular, the leaderboards architecture sets us up nicely to use event sourcing and command query responsibility segregation (CQRS) in building later components of the segments leaderboard system. We will discuss those in part four of this series.

Data Store

With our stream processing architecture figured out, it’s time to turn our attention to the data store. To review: in part two of this series we noted the desire for a store that was distributed and highly available, as well as being able to achieve a high write load, and store all leaderboards in a homogenous way.

Influencing our data store decision was the preexisting usage of Cassandra at Strava. We use Cassandra in a variety of other systems already, and it was a common opinion that it could serve as a good candidate for the leaderboards store. On the surface, Cassandra certainly met a lot of the project’s needs — it’s distributed, highly available, able to support a high level of writes, and had a strong emphases on ordering. Consequently, a lot of our work in choosing the leaderboards data store went into determining if Cassandra would be an ideal fit.

Cassandra

Our confidence with Cassandra mostly mostly hinged on two main questions:

  1. Is it possible to construct a suitable data model/table schema using Cassandra?
  2. Will Cassandra’s eventual consistency work for our use case?

Table Schema

Modern Cassandra schemas are generally defined in CQL, a SQL-like abstraction over native Cassandra data structures. So it’s best to think of the Cassandra schema in terms of more traditional SQL-like rows and columns.

Given our leaderboards use case, we crafted a schema we felt would work:

CREATE TABLE leaderboards (
segment_id int,
leaderboard_type int, // overall, male, female, etc.
elapsed_time int,
effort_id bigint,
effort_data blob,
user_id int
)

Each row in a Cassandra table is keyed by a unique PRIMARY KEY. Choosing a PRIMARY KEY, along with Cassandra data modeling in general, is a complicated topic, but some distilled guidance is best summed up via a blog post by Datastax as:

Rule 1: Spread Data Evenly Around the Cluster
You want every node in the cluster to have roughly the same amount of data. […] Rows are spread around the cluster based on a hash of the partition key, which is the first element of the PRIMARY KEY.
Rule 2: Minimize the Number of Partitions Read
Partitions are groups of rows that share the same partition key. When you issue a read query, you want to read rows from as few partitions as possible.

One final note: within a partition, rows are ordered by the remainder of columns in the PRIMARY KEY, known as the clustering key.

Knowing this we might chose our PRIMARY KEY as:

(segment_id, leaderboard_type, elapsed_time, timestamp, effort_id)

Here, segment_id is the partition key, as it is the first element in the primary key. The remaining columns — leaderboard_type, elapsed_time, timestamp, effort_id— make up the clustering key, which defines the ordering of rows within the partition.

This seems great — with segment_id first in the PRIMARY KEY, all rows for a segment are partitioned onto the same node. In aggregate, this should spread all effort rows evenly around the cluster, with any one leaderboard request contained to a single partition. Within a partition, rows will be ordered by the leaderboard_type, and then by elapsed_time — just like we want for an actual leaderboard. This schema also allows us to store all leaderboards in a homogenous way, another one of the properties we wanted in our data store.

Good for Reads, Bad for Writes

While this schema works well for storing ordered sequences of efforts in a leaderboard, it is actually pretty awful at retrieving the effort for one particular user. This is a problem for our aforementioned worker consuming effort mutations. For any given add or remove message processed, our leaderboard service needs to query the leaderboard store for the user’s current best effort to compare it against the incoming effort. Due to the PRIMARY KEY we chose, we can’t efficiently issue that query. Cassandra only allows contiguous sequence of rows to be returned, and a particular user’s rows, for a segment, are interspersed amongst the ordered rows for all users for the segment.

We could maybe adjust the PRIMARY KEY to include the user_id:

segment_id, leaderboard_type, user_id, elapsed_time, timestamp, effort_id

This would make it trivial to query for an user’s best effort for that leaderboard_type, but now rows on disk are ordered by user_id, and not by elapsed_time like we want.

So what do we do? We looked at this and evaluated two options.

First, we considered having two separate tables. One table would have user_id in its PRIMARY KEY to allow efficient querying by user. Then, once that table had been updated, we would write the results to a second table, which had its PRIMARY KEY ordered for a leaderboard. This certainly would work, but keeping two tables in sync like this is tricky, and opens us up to a whole host of processing complications, edge cases, and failures you need to reason about and account for.

Second, we looked at using a Cassandra feature called secondary indexing. Secondary indexes in Cassandra are a way to add an index on, and thus filter by, a column that is not part of the clustering key. There are many people online who discourage the use of secondary indexing. Their objections are based mostly on the fact that while secondary indexes have plenty of use cases where they seem to solve a problem, due to their implementation they do not actually work that well. Additionally, they will slow down writes, as you need to update both the index and table on each write operation.

However, if you follow the guidance of the Cassandra documentation, understand your use case to ensure it fits, and dutifully benchmark your implementation, secondary indexes can work well. And thankfully, our data and query pattern was a good fit for a secondary index on the user_id column, so we created a test schema, benchmarked it, and determined that it would actually work well for us.

There is still one remaining problem with our Cassandra schema. Even with the index on user_id, determining the ranking of a user on a leaderboard is not straightforward. To answer this question, we need to know which number row, in order, the user is from the top of the leaderboard. There is no CQL operation that we can use to achieve this in an efficient way, so we have to query for all rows in the table, and manually determine the user ranking ranking in memory. This is not so bad for small leaderboards of a few hundred users, but many leaderboards are in the high tens of thousands of users in size. The cost of moving that amount of data across the network, combined with the computational cost of processing that many results is likely to be untenable.

This type of query is not handled well by most data stores, so we’re not really missing a whole lot by using Cassandra, but the situation is certainly less than ideal. To combat this problem, we’ll want to ensure that we both minimize the number of queries of this type as well as cache leaderboard data in such a wayas to make this class of queries more efficient. Stay tuned for part four of this series, where we will discuss this problem and the caching system in more detail.

Consistency Model

The other big consideration when using Cassandra is understanding its consistency model. Cassandra operates under last write wins (LWW) consistency, where each INSERT or UPDATE (really, an UPSERT) is blindly appended to the database, without checking if a duplicate record exists. Then, during reads, if Cassandra encounters two versions of the same row, it chooses the newer one. “Newer,” defined as the row with the latest timestamp.

The “timestamp” a row is written with can be used-defined, but generally is the current timestamp. This may sound like a scary way to do conflict resolution (there are plenty of ways it can break), but it’s incredibly fast and easy to reason about. In any running system you will likely see a small degree of data inconsistency, but nearly always that is a worthwhile tradeoff to the benefits of availability, replication and performance.

There are also ways to lower the rate of data inconsistency. Cassandra allows tunable consistency parameters, which you can use to ensure all read and write operations are executed on a quorum number of nodes. Doing so will hurt both read/write latency and availability (as a larger number of machines need to be healthy in order to complete the request), however, you can use this technique to prioritize consistency over availability if the performance hit is manageable for your application. We also run Cassandra configured with 3 replicas of data (r value of 3), allowing us to tolerate the loss of one node and still successfully complete reads and writes on a quorum number of nodes (2). Granted if we lose 2 nodes we will stop serving requests that require those nodes to establish a quorum, but losing two nodes is unlikely to happen in isolation without a more systemic widespread networking issue affecting many parts of the system.

The leaderboard system also structures its updates to reduce the likelihood of conflicting updates. Since leaderboard updates are partitioned by segment/user, the likelihood of conflicting updates is lowered dramatically, as no other actor in the system can be reading or writing the same row at the same moment. There are certainly scenarios where this fact could become false — i.e. a consumer pauses (perhaps due to GC) during message processing, a replacement consumer is started, and then the original consumers resumes processing. Generally these types of situations are incredibly rare, or at least rare enough that the engineering work and system performance hit incurred by engineering to prevent them are not worth the small amount of data inconsistency seen.

After some more benchmarking, testing, and dual writing leaderboard entries to Cassandra, as well as the old system, we eventually determined Cassandra would be a suitable data store for the leaderboards system.

Conclusion

Our leaderboard system is composed of two main components:

  1. A Kafka stream processing system, which consumes effort mutation messages logged from the Rails application, fetches the latest effort data from canonical storage, determines an update to apply, and then applies it to leaderboard storage.
  2. A Cassandra leaderboard store made up of a single CQL table keyed by segment and leaderboard type, with a secondary index on user_id to facilitate querying by user.

These two systems work in conjunction as a pipeline to transform mutations of efforts by our Rails application into a denormalized Cassandra index of the best effort for any given segment/user/leaderboard combination. This index can be used to serve leaderboard read requests quickly and efficiently, with some known drawbacks and deficiencies that can be solved through strategic caching.

While this final architecture may seem simple or obvious, arriving at it took many weeks of analysis, prototyping, and benchmarking. This article also skipped over a decent amount of extra functionality and implementation details of the final production system. Some of that detail will be discussed in part four of this series, where we will review accessory systems built in conjunction with leaderboards, such as a leaderboard aggregate counters and a caching system, as well as enumerating additional benefits of this architecture.

Continued Reading

For the next article in the “ Rebuilding the Segment Leaderboards Infrastructure” series, check out Part 4: Accessory Systems.