Tessa: 1,000,000,000 Strava Activities, 1 Spatiotemporal Dataset

This is the sixth post in Strava’s Intern Blog Series, where we give interns the opportunity to talk about the projects that they’ve been working on.

In May 2017, Strava surpassed one billion activities uploaded. The information extracted from each of these activities powers Strava features that I love: activity grouping, segments, and FlyBy. The backend Scala service responsible for these features? Tessa.

As I explained in my previous blog post, activities are indexed first by converting their geospatial and time data into a Tile and Timestamp pairing. A Tile is further broken down into microtiles, and a bitmask indicates which microtiles within a particular Tile were intersected during each minute of an activity. All this information is inserted into a Cassandra database, which allows for activities that took place in the same locations, or across the same Tiles, to be easily matched. This database of indexed activities becomes the starting point for processes such as activity grouping and segment matching.
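The indexing step above can be sketched in a few lines. This is only an illustration, not Tessa's actual code: the Web Mercator tiling, the zoom level, the 8×8 microtile grid, and the function names are all assumptions made for the example. It also samples individual GPS points rather than tracing the path between them.

```python
import math

ZOOM = 14                # assumed tile zoom level (not from the original post)
MICROTILES_PER_SIDE = 8  # assumed 8x8 grid, so a bitmask fits in 64 bits


def _clamp(value, low, high):
    return max(low, min(value, high))


def index_point(lat, lng, epoch_seconds):
    """Map one GPS sample to ((tile_x, tile_y), minute, microtile_bit)."""
    n = 2 ** ZOOM
    # Fractional Web Mercator tile coordinates.
    x = (lng + 180.0) / 360.0 * n
    y = (1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n
    tile_x = _clamp(int(x), 0, n - 1)
    tile_y = _clamp(int(y), 0, n - 1)
    # Position of the point inside its tile, on the microtile grid.
    micro_x = _clamp(int((x - tile_x) * MICROTILES_PER_SIDE), 0, MICROTILES_PER_SIDE - 1)
    micro_y = _clamp(int((y - tile_y) * MICROTILES_PER_SIDE), 0, MICROTILES_PER_SIDE - 1)
    bit = 1 << (micro_y * MICROTILES_PER_SIDE + micro_x)
    minute = epoch_seconds // 60
    return (tile_x, tile_y), minute, bit


def index_activity(points):
    """OR together the microtile bits seen for each (tile, minute) pair."""
    rows = {}
    for lat, lng, epoch_seconds in points:
        tile, minute, bit = index_point(lat, lng, epoch_seconds)
        rows[(tile, minute)] = rows.get((tile, minute), 0) | bit
    return rows
```

Each entry of the returned dictionary corresponds to one row to insert: a Tile, a minute-resolution Timestamp, and the bitmask of microtiles touched during that minute.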

A Closer Look at Cassandra

Apache Cassandra is a distributed database, which means that all data stored in it is partitioned across different nodes. To achieve this partitioning, each “row” of data has a partition key, which determines which node the data is stored on. All data that has the same partition key belongs to one partition. The remaining columns of the primary key, known as clustering columns, determine how the data is ordered within a partition.

To index an activity, Tessa stores four pieces of information for each Tile an activity intersects: the activity ID, the Timestamp corresponding to the minute of activity, the Tile itself, and the bitmask of the microtiles within the Tile intersected by the activity. Tessa uses the Tile as the partition key, and the Timestamp and the activity ID as the clustering columns that complete the primary key. Because queries to a Cassandra table must specify the partition key, this organization of Strava’s activity tile data allows Tessa to find all the activities that took place in a specific set of tiles, ordered by time and activity ID.

tile | timestamp | activity_id | bitmask

Tessa’s Cassandra table schema. Tile is the partition key, and (timestamp, activity_id) are the clustering columns that determine the ordering within each partition.
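To make the access pattern concrete, here is a small in-memory model of that layout. It is a sketch of the idea rather than anything Cassandra or Tessa actually runs: the class and method names are hypothetical, and a dictionary of sorted lists stands in for partitions and their ordering.

```python
import bisect
from collections import defaultdict


class ActivityTileTable:
    """Toy model: one sorted list of rows per partition key (the tile)."""

    def __init__(self):
        # tile -> list of (timestamp, activity_id, bitmask), kept sorted
        self._partitions = defaultdict(list)

    def insert(self, tile, timestamp, activity_id, bitmask):
        # Rows are ordered by timestamp first, then by activity ID.
        bisect.insort(self._partitions[tile], (timestamp, activity_id, bitmask))

    def query(self, tile, start=None, end=None):
        """Rows in one tile with start <= timestamp < end, in stored order."""
        rows = self._partitions.get(tile, [])
        # A 1-tuple sorts before every row that has the same timestamp.
        lo = 0 if start is None else bisect.bisect_left(rows, (start,))
        hi = len(rows) if end is None else bisect.bisect_left(rows, (end,))
        return rows[lo:hi]
```

Every read names a tile first, which mirrors the requirement that a query specify the partition key. The time range then becomes a cheap slice because the rows are already sorted.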

Storing one billion activities’ worth of activity tile data is an ongoing infrastructure challenge. As Strava’s user base has grown, certain partitions of Tessa’s data have become unsustainably large, because some geographic areas see more Strava activity than others. Large partitions can cause problems such as slow database reads. One remedy is a composite partition key: only rows that share the same value for every component of the key are hashed to the same partition. Following this approach, we converted the Timestamp of each data row into a week and an offset, and partitioned Tessa’s data by Tile and week. Within each partition, the offset and activity ID determine the ordering.

tile | week | offset | activity_id | bitmask

Tessa’s new Cassandra table schema, using a composite partition key of (tile, week).
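The Timestamp split is simple arithmetic. The sketch below assumes Timestamps are Unix epoch seconds, that weeks are counted from the epoch, and that the offset is the number of seconds into the week. The post does not state the actual units, and the function names are made up for illustration.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800


def split_timestamp(epoch_seconds):
    """Return (week, offset): weeks since the epoch and seconds into that week."""
    return divmod(epoch_seconds, SECONDS_PER_WEEK)


def join_timestamp(week, offset):
    """Inverse of split_timestamp."""
    return week * SECONDS_PER_WEEK + offset


def weeks_in_range(start_seconds, end_seconds):
    """Every week number that a time range [start, end] touches."""
    first = start_seconds // SECONDS_PER_WEEK
    last = end_seconds // SECONDS_PER_WEEK
    return list(range(first, last + 1))
```

One consequence of the composite key is that a time-range lookup for a tile has to name each (tile, week) partition it covers, which is what a helper like `weeks_in_range` would be for.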

Having the opportunity to implement this new data partitioning schema for Tessa has given me an appreciation of the unique data Strava uses to power its product, as well as an understanding of the scalability challenges any software infrastructure faces. While improvements to backend services like Tessa don’t result in a new product feature, it is nevertheless rewarding to work on a project that ensures existing features continue to work reliably.

Testing Spark Streaming

When an activity is uploaded to Strava, its space-time point streams have to be converted into activity tile data and inserted into a Cassandra database. Uploaded activities generate Kafka messages, which are published to named categories known as “topics.” Tessa subscribes to the event-activity topic and employs a Storm topology, which converts the stream data into database rows.

Storm is a stream processing engine that Strava is aiming to replace. While working on Tessa, I had the opportunity to explore one possible alternative, Spark Structured Streaming. Spark is an engine for distributed computation, allowing large amounts of data to be processed in real time. Structured Streaming extends the Spark engine with streaming capabilities. With a Structured Streaming job, a DataStreamReader reads messages from Kafka. Information about each activity is extracted from its message, and the corresponding activity stream is converted into a dataset of activity tiles. Meanwhile, a DataStreamWriter continuously writes the dataset of activity tiles to Cassandra.

A simplified comparison of Apache Storm and Apache Spark Structured Streaming, two approaches to stream processing.

One of the challenges in evaluating Structured Streaming as a stream processing alternative was determining how to monitor the jobs in production. At Strava, metrics are produced by application code and can be visualized in the form of graphs, which provide insight into an application’s performance. The three general types of metrics that are collected come in the form of timers, counters, and gauges:

  • Timers: measure how long something takes
  • Counters: measure how many times something occurs
  • Gauges: measure the value of something at a given point in time
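The difference between the three types is easiest to see side by side. The classes below are a minimal illustration and not the metrics library Strava uses. All names are hypothetical.

```python
import time


class Counter:
    """Counts how many times something occurs."""

    def __init__(self):
        self.count = 0

    def increment(self, n=1):
        self.count += n


class Gauge:
    """Reports the current value of something whenever it is sampled."""

    def __init__(self, read_value):
        self._read_value = read_value  # zero-argument callable

    def value(self):
        return self._read_value()


class Timer:
    """Records how long calls take, in seconds."""

    def __init__(self):
        self.durations = []

    def time(self, fn, *args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            # Recorded even if fn raises.
            self.durations.append(time.perf_counter() - start)
```

A counter only ever accumulates, a timer collects one duration per call, and a gauge holds no state of its own: it reads whatever the underlying value is at the moment it is sampled.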

In production, metrics are collected and stored in Graphite. Being able to generate metrics from a Structured Streaming job is essential to running one in production. One useful metric from Tessa’s existing Storm topology is the number of unprocessed Kafka messages. Because activities are continuously uploaded throughout the day, the number of new Kafka messages may grow faster than Strava’s infrastructure can process them. In Structured Streaming, a job periodically generates progress updates that can be parsed for a Kafka offset, which can be subtracted from the latest offset to obtain the number of unprocessed Kafka messages. Embedded within the code running a Spark job, a gauge can be configured to send the Kafka offset to Graphite every time a progress update is created.

Metrics are collected in Graphite and help Strava monitor its services and infrastructure.
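The unprocessed-message calculation described above can be sketched as follows. The shape of the `progress` dictionary is modeled on the per-partition offsets that Structured Streaming reports for a Kafka source, but the sample values, the metric path, and the function names are assumptions for illustration. How the latest offsets are obtained is left open here, so they are passed in separately.

```python
import time


def unprocessed_messages(progress, latest_offsets):
    """Sum (latest offset - processed offset) over every topic partition."""
    processed = {}
    for source in progress["sources"]:
        for topic, partitions in source["endOffset"].items():
            processed.setdefault(topic, {}).update(partitions)

    lag = 0
    for topic, partitions in latest_offsets.items():
        for partition, latest in partitions.items():
            # A partition with no processed offset yet counts in full.
            lag += latest - processed.get(topic, {}).get(partition, 0)
    return lag


def graphite_line(path, value, timestamp=None):
    """Format one metric in Graphite's plaintext form: 'path value timestamp'."""
    ts = int(time.time()) if timestamp is None else int(timestamp)
    return f"{path} {value} {ts}\n"


# Hypothetical progress update and latest offsets for two partitions.
progress = {"sources": [{"endOffset": {"event-activity": {"0": 120, "1": 95}}}]}
latest = {"event-activity": {"0": 150, "1": 100}}

lag = unprocessed_messages(progress, latest)
line = graphite_line("tessa.kafka.unprocessed", lag, timestamp=1500000000)
```

A gauge wired to this calculation would produce one such line each time a progress update arrives, and a steadily rising value would indicate that messages are arriving faster than the job can process them.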

Imprecise Metrics

Twelve weeks was quite the “FlyBy,” and it’s hard to believe that I only have a day left of my internship. It’s difficult to fully describe my experiences this summer, but in hopes of encapsulating them, I’ve aggregated some rough counter metrics (which unfortunately cannot be viewed in Graphite):

  • miles.biked = 0
  • embarcadero.tourists.tripped = only 1
  • blog.posts.written = 2
  • coletta.gelato.flavors.tried = 5
  • avocados.consumed = 6.5
  • engineering.tech.talk.pastries.eaten = 10
  • party.parrot.emojis.used = somewhere between 1–50
  • kombucha.drunk = definitely more than a liter
  • miles.run = > 450
  • fun.interning.Strava = immeasurable

A huge shoutout to Varun Pemmaraju and Micah Lerner for organizing the Strava Intern blog series and encouraging me to reflect on my time here. I’d also like to thank the Cult of the Party Parrot, my technical mentor Drew Robb, my manager Steve Lloyd, the rest of the Infra team, and all the passionate and dedicated employees at Strava for an unforgettable summer.