Strava Metro — Scaling Aligned edges

Published in

strava-engineering

7 min readMar 3, 2022

Strava’s Metroview product helps governments and transportation advocacy organizations build better transportation infrastructure by providing an interface into activities — commutes or for leisure — recorded by Strava athletes. One of the key aspects of transportation planning is modeling the impact of changes in infrastructure. As transportation infrastructure changes, Metro needs to adapt our underlying basemap to reflect those changes.

At Strava we use OpenStreetMaps (OSM), the open source world map, as our basemap. We use this map in both the Strava athlete-facing product and at Strava Metro. This map has regular changes thanks to contributions from organizations and individuals around the world (you can even make changes to the OpenStreetMap basemap here).

The core of the Metroview product is a dataset of Strava athlete activities matched to “edges”, which are basically sections of road/trail in OSM. GPS data uploaded to Strava is matched to the basemap for the Metroview dataset, similarly to how we match segments. As you might imagine, this dataset is one the largest datasets at Strava, since we have over 6 billion activities, each traversing many sections of road. On Metro, we update edge alignments once a month. This is a hefty workload even if we are only appending the 100 million new activities each month. The workload consists of both the step to actually align those edges to the basemap and the step to write those alignments to a database.

At Metro, we provide an interface into that dataset of aligned activities, giving the ability to slice and dice the data by time, activity type, and purpose. We also allow Metro partners to export data to load into GIS (geographic information systems) tools, giving them the ability to combine Strava data with other GIS-compatible data, like political boundaries, census data, terrains or localized transportation restrictions.

The Metroview product displays counts of activities across sections of road.

The problem — upgrading the basemap

Since Metroview on web is a relatively new product (launched in mid-2019), we haven’t yet needed to upgrade the basemap on the Metroview product. Of course, upgrading our basemap was soon to be a requirement for our platform, especially as the Strava athlete-facing product continued to update the basemap to support route building. This meant that continuing to operate on a legacy basemap held back Metro’s technology stack from much-needed updates to the rest of the Strava platform. However, there was the rather challenging problem of backfilling all of our activities to match against the new basemap and subsequently store in our existing Cassandra database for activity aligned edges. A back-of-the-envelope calculation of the time it would take to write to Cassandra at the throughput of our current write job indicated that the backfill would take around a month to complete. On top of that, we would need to drastically increase the size of our Cassandra cluster, potentially doubling the storage cost for this relatively low-touch dataset. Specifically, we don’t need the on-demand write performance of Cassandra for this data. Further, keeping the dataset up to date would soon prove unsustainable if we upgraded our basemap again in the future.

We needed a cheap, write-once, fast read data store optimized for the sort of access pattern we see in Metroview — lookups on edge traversals on a particular edge for a particular time frame. Fortunately for us, our edge ids are monotonically increasing alongside the geohash encoding of the edge’s location, meaning that adding an index to the store on edge ID would yield performance benefits via cache locality (see here for an intro to Strava’s Routemaster service). So the indexing we’d need on this data set is clear: a primary index on the edge id, where records with the same edge id are sorted by timestamp.

The solution

We settled on using S3 to back an in-memory cache of edge alignments. Why S3? Large datasets can be written to S3 from Spark easily, and storage costs are smaller than that of production databases. Though most don’t generally consider S3 as a primary database, this scenario works well for our relatively low read volume (Metro currently has about 2,000 partners). Therefore, the performance of concurrent reads to our datastore is not as vital an issue as it is on the athlete-facing product. The architecture of the solution is pictured below.

Generating the data set

Our edge alignments are in fact already stored in S3 to support features like routing (which will be discussed in a future blog post). However, the data is not indexed or formatted for real-time retrieval SLAs or the types of queries we get in Metro. To fix this, we wrote Spark jobs to transform the unindexed alignments data to a sorted and indexed byte-packed format. We partitioned the dataset by ranges of edge ids and sorted each partition by date and activity id.

In Spark, we first range partition the dataset by edge id, and then use a sortWithinPartitions for that secondary sort. This combination of operations results in a large memory footprint for each core, meaning we needed to avoid oversubscribing our Spark cores to reduce the memory footprint per executor.

We chose to manually write files with the sorted records within a mapPartitions. The mapPartitions operator takes a functional argument that operates on an iterator, which does not require all data in the partition to be stored in memory. We can group the records within the iterator and write each group with an S3 connection initialized per-partition. Thus, each partition (representing the records for a range of edge ids) is written into a set of fixed-size files in S3.

We wrote each record (see the schema below) as bytes into a ByteBuffer to create each file. This condenses the data more efficiently than Java or Kyro object serialization and has the added benefit of being performant on the read side. With a fixed number of records per byte file, we end up writing (numRecords / recordsPerFile) files to S3.

A 36 byte record to store activity alignments

Reading the data set in the server

The data in S3 is stored in an indexed, sorted, and byte-packed format. We load the index upon service restart or startup. Since the records and the files are of fixed size, we get a free index on the serialized array of objects once we read the data on the server. Each key in our index corresponds to a range of edge ids, pointing to a prefix in S3 containing a small set of byte-packed files. If a client requests an edge that isn’t already stored in the server cache, the server looks up which range the edge would be in, and retrieves the byte files from S3.

Since the files themselves contain bytes, we can load directly into a Java/Scala byte array upon load. Even better, Java System.arraycopy can be run concurrently on multiple threads loading into the same Byte array. This allows for multithreaded read of multiple files from the S3 location. Thus, the load is network bound within the AWS network. Once they are loaded into the byte array, we then use the pre-sorted property of the records in the array to look up edge ranges. This gives an efficient O(log n) search time (and O(m) processing time, m being the number of results) on subsequent edge loads near the first loaded edge.

The size of the cache is limited by the memory allocated to the service. However, by adding a consistent hashing scheme with ThriftMux, we can significantly increase the amount of available memory for the full distributed cache. Consistent hashing allocates certain subsets of the cache to each clustered service instance, giving us the ability to arbitrarily scale the size of the cache with the number of instances we assign to the service.

Challenges

The final solution was not exactly what we expected when we embarked on this project. We experimented with some different data formats. Parquet, being columnar, has poor access performance for production use cases. Elsewhere at Strava, we use PalDB as a binary S3-based store, but our requirement for a secondary index and the large size of the full dataset ruled the option out. Also, sorting data on read was also not performant enough for production. Ultimately, we have had good performance in a variety of distinct use cases for byte packed S3 storage in our Routemaster service, so emulating that solution turned out the best for us.

Result

The result of our swap over to this S3-backed cache over Cassandra saw enormous benefits in our load times. Anecdotally, we saw 2x to 10x improvements on edge selection loads in Metroview depending on whether the edge data was cached in the Metroview backend service. We expect to save quite a bit of cash, too, as we are no longer running an expensive hosted Cassandra database. We’ve enabled future basemap upgrades on Metro, and enabled Metro’s tech stack to come up-to-date with the rest of the Strava infrastructure.

There are some downsides, too. Namely, we significantly increased the memory footprint of our backend service to accommodate the new cache, and we need to maintain this new storage solution in-house.

Overall, we’re excited about the directions that Metroview can take us with these performance improvements. You can check out the Strava Metroview demo here.

Links

Advanced Apache Spark Training — Sameer Farooqui (Databricks)

Using Spark Transformations (FINRA)

Spark memory management