One of my favorite perks of working at Strava is being able to find co-workers who are always up for a run, whether it’s the weekly lunchtime WoW (Workout Wednesday) or a pre-dinner jog. As an example, take one of my most memorable runs this summer: a 5:30 AM climb from Mill Valley to the Mt. Tamalpais summit for the sunrise on a Monday morning before work. One co-worker looped around the empty streets of San Francisco at 4:30 AM, picking the rest of us up before driving across the Golden Gate Bridge into Mill Valley. The four of us and a dog set off at dawn, running steadily uphill. We broke out above the San Francisco fog layer after several steep ascents, and were greeted by the sun and a spectacular view of the clouds from above. We scrambled up the steep and narrow trail to the summit, snapped a photo, and took a longer, gentler descent to our starting point. Afterwards, while we stretched by the car, we all synced our activity to Strava (of course) and prepared to roll into the office for a productive day of work.
This summer, I learned that group activities (like our Mt. Tam run) not only tend to garner more kudos on Strava, but also involve a complicated backend Scala service we call Tessa. As a software engineering intern on Strava’s Infrastructure team, I spent a large part of my summer internship working on parts of Tessa. I had the opportunity to look into how Tessa takes activities and groups them together so that in the Strava Feed, an activity such as my run up Mt. Tam is displayed as “Lindy Zeng and 3 others.” The process of activity grouping is too complex for one (or even several) blog posts, but I want to share some insight into it by explaining how Tessa indexes an activity by storing it in a Cassandra database.
While an activity contains a lot of information, the part that Tessa works with is the activity’s stream of space-time points: a sequence of tuples, each pairing a latitude and longitude coordinate with the time, in seconds, at which it was recorded.
The space-time point shown is taken from my activity and corresponds to the summit of Mt. Tamalpais at 6:32 AM on July 10, 2017.
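To make the idea concrete, here is a minimal Python sketch of what such a stream might look like. It is not Tessa's actual representation: the `SpaceTimePoint` name, the use of Unix epoch seconds, and the summit coordinates (approximate, not read from my recorded activity) are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple


class SpaceTimePoint(NamedTuple):
    """Hypothetical shape of one point in an activity stream."""
    lat: float  # latitude in degrees
    lng: float  # longitude in degrees
    time: int   # seconds since the Unix epoch (an assumption)


# 6:32 AM Pacific Daylight Time (UTC-7) on July 10, 2017.
PDT = timezone(timedelta(hours=-7))
summit_time = int(datetime(2017, 7, 10, 6, 32, tzinfo=PDT).timestamp())

# Approximate coordinates for the Mt. Tamalpais summit, for illustration only.
summit = SpaceTimePoint(lat=37.929, lng=-122.578, time=summit_time)

# A real stream would hold one point per GPS sample, in time order.
stream = [summit]
print(summit.time)  # 1499693520
```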
It’s useful for geospatial coordinates to be stored in a database as a Geohash, which is where the concept of tiling comes in. The world can be tessellated, or arranged, into a grid of tiles. Each tile contains smaller tiles, each of which contains even smaller tiles, and so on. A tile can be represented by a string, and smaller tiles are formed by appending characters to the parent tile’s string “prefix.” Using the Geohash geocoding system, latitude-longitude coordinates can be hashed to tiles of varying size or specificity. The tiles that Tessa works with are six-character prefixes, which have side lengths of approximately 1 kilometer. These tiles are stored in a 30-bit format known as a Tile. Tessa also works with microtiles, which make up a Tile and have side lengths of approximately 50 meters.
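The relationship between a six-character Geohash and a 30-bit value can be sketched in a few lines of Python. This follows the standard public Geohash scheme (alternating longitude and latitude bisections, five bits per base-32 character) and is not Tessa's Scala code; the function names `geohash_bits` and `to_base32` are made up for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard Geohash alphabet


def geohash_bits(lat, lng, nbits):
    """Return the first nbits of the Geohash of (lat, lng) as an integer."""
    bounds = [[-180.0, 180.0], [-90.0, 90.0]]  # longitude range, latitude range
    coords = (lng, lat)
    value = 0
    for i in range(nbits):
        axis = i % 2  # even bits bisect longitude, odd bits bisect latitude
        lo, hi = bounds[axis]
        mid = (lo + hi) / 2
        bit = int(coords[axis] >= mid)
        # Keep the half of the range that contains the coordinate.
        bounds[axis][0 if bit else 1] = mid
        value = (value << 1) | bit
    return value


def to_base32(value, nchars):
    """Render an integer of 5 * nchars bits as a Geohash string."""
    return "".join(
        BASE32[(value >> (5 * (nchars - 1 - i))) & 31] for i in range(nchars)
    )


# Six characters * 5 bits per character = 30 bits.
tile = geohash_bits(57.64911, 10.40744, 30)
print(to_base32(tile, 6))       # u4pruy
# Dropping the last 5 bits yields the parent tile's five-character prefix.
print(to_base32(tile >> 5, 5))  # u4pru
```

Because each extra character only appends bits, a parent tile's string is always a prefix of its children's strings, which is what makes the nested grid of tiles work.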
Tessa indexes an activity by storing all the microtiles intersected by the activity’s stream of space-time points in a Cassandra database. For each minute of an activity, the space-time points are converted into a Tile-Timestamp pairing, along with a bitmask indicating which microtiles within the Tile were intersected, and inserted into a table. Using the table of indexed activities, an activity is coarsely matched to other activities by querying for activities that intersect the same tiles in a given time range. From this group of coarsely matched activities, Tessa begins a more sophisticated grouping based on the spatial and temporal similarity of the activities.
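Here is a rough Python sketch of that indexing and coarse-matching flow, under several loudly stated assumptions. An in-memory dictionary stands in for the Cassandra table, a microtile is modeled as 8 extra Geohash bits below the 30-bit tile (256 microtiles per tile, a layout I picked for illustration and not necessarily Tessa's), and the names `TileIndex`, `index_rows`, and `slack_minutes` are hypothetical.

```python
from collections import defaultdict

TILE_BITS = 30   # six-character Geohash prefix
MICRO_BITS = 8   # assumption: 2**8 = 256 microtiles per tile


def geohash_bits(lat, lng, nbits):
    """First nbits of the standard Geohash of (lat, lng), as an integer."""
    bounds = [[-180.0, 180.0], [-90.0, 90.0]]
    coords = (lng, lat)
    value = 0
    for i in range(nbits):
        lo, hi = bounds[i % 2]
        mid = (lo + hi) / 2
        bit = int(coords[i % 2] >= mid)
        bounds[i % 2][0 if bit else 1] = mid
        value = (value << 1) | bit
    return value


def index_rows(stream):
    """Map each (tile, minute) touched by the stream to a microtile bitmask."""
    rows = defaultdict(int)
    for lat, lng, t in stream:
        cell = geohash_bits(lat, lng, TILE_BITS + MICRO_BITS)
        tile = cell >> MICRO_BITS                # high 30 bits
        micro = cell & ((1 << MICRO_BITS) - 1)   # position within the tile
        minute = t - t % 60                      # truncate to the minute
        rows[(tile, minute)] |= 1 << micro
    return dict(rows)


class TileIndex:
    """In-memory stand-in for the table of indexed activities."""

    def __init__(self):
        # (tile, minute) -> {activity_id: microtile bitmask}
        self.table = defaultdict(dict)

    def insert(self, activity_id, stream):
        for key, mask in index_rows(stream).items():
            self.table[key][activity_id] = mask

    def coarse_matches(self, activity_id, stream, slack_minutes=1):
        """Other activities in the same tiles within +/- slack_minutes."""
        matches = set()
        for tile, minute in index_rows(stream):
            for d in range(-slack_minutes, slack_minutes + 1):
                matches.update(self.table.get((tile, minute + 60 * d), {}))
        matches.discard(activity_id)
        return matches


# Made-up streams of (lat, lng, epoch seconds) tuples.
lindy = [(37.9290, -122.5780, 1499693520), (37.9291, -122.5779, 1499693530)]
friend = [(37.9290, -122.5781, 1499693525)]
elsewhere = [(40.7128, -74.0060, 1499693520)]

index = TileIndex()
for name, stream in [("lindy", lindy), ("friend", friend), ("elsewhere", elsewhere)]:
    index.insert(name, stream)

print(index.coarse_matches("lindy", lindy))  # {'friend'}
```

The coarse match is deliberately cheap: it only asks whether two activities touched the same tile at about the same time, leaving the stored microtile bitmasks and the finer spatial and temporal comparison for the later grouping stage.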
And thanks to Tessa’s activity grouping, half the office seemed to know about my early morning adventure before I stepped through the door. By interning at Strava, I have explored San Francisco and beyond and been exposed to so many technologies I knew nothing about before (Scala, Cassandra, Mesos, just to name a few).
Here’s to a summer of more learning, running, and beautiful views, and to Strava for being there for it all.