Analytics at Strava

Published in strava-engineering, Sep 18, 2014

Just as our athletes are obsessed with tracking and analyzing their athletic pursuits, we here at Strava are obsessed with tracking our own performance. Instead of wattage and pace, though, we’re concerned with activity and engagement. In this post, I’ll describe the infrastructure that powers our internal analytics and reporting.

The system requirements for analytics are often quite different from those of powering an application, especially at the complexity and scale of Strava. Initially, most analytics ran against slave (replica) instances of the same SQL databases used in our backend. Eventually, our data outgrew what a single server could handle, forcing us to partition it across multiple servers. As a result, any query that required a join between data on different servers could no longer be expressed purely in SQL; the join had to be done in application code instead. That is not a problem for engineers, but it is a huge barrier for business analysts and other data-savvy but non-technical staff.
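To make the problem concrete, here is a minimal sketch of what a cross-server join looks like once it can no longer be written as a single SQL statement. Two in-memory SQLite databases stand in for two partitioned servers; the table names, columns, and data are hypothetical and are not taken from Strava's schema.

```python
import sqlite3

# Two independent databases stand in for two partitioned servers.
# All table and column names here are hypothetical.
athletes_db = sqlite3.connect(":memory:")
activities_db = sqlite3.connect(":memory:")

athletes_db.execute("CREATE TABLE athletes (id INTEGER PRIMARY KEY, country TEXT)")
athletes_db.executemany(
    "INSERT INTO athletes VALUES (?, ?)",
    [(1, "US"), (2, "FR"), (3, "US")],
)

activities_db.execute(
    "CREATE TABLE activities (id INTEGER PRIMARY KEY, athlete_id INTEGER)"
)
activities_db.executemany(
    "INSERT INTO activities VALUES (?, ?)",
    [(10, 1), (11, 1), (12, 2), (13, 3)],
)


def activities_per_country(athletes_conn, activities_conn):
    """Count activities per country by joining in application code."""
    # Step 1: pull the lookup side of the join from the first server.
    country_by_athlete = dict(
        athletes_conn.execute("SELECT id, country FROM athletes")
    )
    # Step 2: stream rows from the second server and join them by hand.
    counts = {}
    for (athlete_id,) in activities_conn.execute(
        "SELECT athlete_id FROM activities"
    ):
        country = country_by_athlete.get(athlete_id)
        if country is not None:
            counts[country] = counts.get(country, 0) + 1
    return counts


result = activities_per_country(athletes_db, activities_db)
```

With both tables in one database, the same question is a three-line `JOIN ... GROUP BY`. That is the gap a single warehouse closes for analysts.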

Enter Redshift, Amazon’s data warehouse solution. Every night, application data from the previous day is replicated to our Redshift cluster. Additionally, most features on Strava are instrumented via our logging infrastructure. Whenever an athlete views an activity or a feed on Strava, whether from the mobile app or the web, event data is fired off to Kafka, aggregated, and periodically saved to S3. This data is then loaded into Redshift, where it can be queried alongside the rest of our relational data.
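The last hop, from S3 into Redshift, is typically done with Redshift's `COPY` command, which bulk-loads files from an S3 prefix. The post does not say which mechanism or file layout is used here, so the sketch below is only an illustration: the bucket name, the `events/YYYY/MM/DD/` prefix layout, the table name, the IAM role, and the choice of gzipped JSON are all assumptions.

```python
from datetime import date


def build_copy_statement(table, bucket, day, iam_role):
    """Build a Redshift COPY statement for one day of event files in S3.

    The S3 layout (events/YYYY/MM/DD/) and gzipped JSON format are
    assumptions made for this example.
    """
    prefix = f"s3://{bucket}/events/{day:%Y/%m/%d}/"
    return (
        f"COPY {table}\n"
        f"FROM '{prefix}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "FORMAT AS JSON 'auto'\n"
        "GZIP;"
    )


# Hypothetical names throughout; nothing here is run against a cluster.
statement = build_copy_statement(
    table="events",
    bucket="example-bucket",
    day=date(2014, 9, 17),
    iam_role="arn:aws:iam::123456789012:role/example-redshift-load",
)
print(statement)
```

A nightly job could run one such statement per event table through any PostgreSQL-compatible client connected to the cluster.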

Data in Redshift is not indexed in the same way as in traditional OLTP databases. Instead, each table is defined with a distribution key and a sort key. The distribution key dictates which data lives on which node, while the sort key defines the order in which that data is stored. Defining the appropriate distribution keys is essential for SQL JOIN performance: when two tables share a distribution key, rows that join to each other already sit on the same node and do not have to be moved across the network. In our Redshift cluster, wherever possible, data is distributed using an athlete’s unique ID. As a result, all data relevant to any given athlete lives on the same node, making the majority of our analytical queries quite fast.
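The effect of a shared distribution key can be shown with a small simulation. This is a toy model, not Redshift itself: the modulo function stands in for Redshift's internal hashing, and the node count, tables, and rows are made up. It places two tables on nodes by `athlete_id`, keeps each node's rows in sort-key order, and then joins node by node without ever looking at another node's data.

```python
from collections import defaultdict

NUM_NODES = 4  # hypothetical cluster size


def node_for(athlete_id):
    """Pick a node for a distribution-key value (stand-in for real hashing)."""
    return athlete_id % NUM_NODES


def distribute(rows, dist_key, sort_key):
    """Place rows on nodes by dist_key and store each node's rows by sort_key."""
    nodes = defaultdict(list)
    for row in rows:
        nodes[node_for(row[dist_key])].append(row)
    for node_rows in nodes.values():
        node_rows.sort(key=lambda r: r[sort_key])
    return nodes


athletes = [
    {"athlete_id": 1, "country": "US"},
    {"athlete_id": 2, "country": "FR"},
    {"athlete_id": 5, "country": "GB"},
]
activities = [
    {"athlete_id": 1, "activity_id": 13},
    {"athlete_id": 5, "activity_id": 11},
    {"athlete_id": 2, "activity_id": 12},
    {"athlete_id": 1, "activity_id": 10},
]

# Both tables use athlete_id as the distribution key.
athletes_by_node = distribute(athletes, "athlete_id", "athlete_id")
activities_by_node = distribute(activities, "athlete_id", "activity_id")


def local_join(node):
    """Join activities to athletes using only the rows stored on one node."""
    return [
        (act["activity_id"], ath["country"])
        for act in activities_by_node[node]
        for ath in athletes_by_node[node]
        if act["athlete_id"] == ath["athlete_id"]
    ]


# The union of per-node joins equals the full join: no rows were shuffled.
joined = sorted(
    pair for node in range(NUM_NODES) for pair in local_join(node)
)
```

If the two tables were distributed on different keys, an athlete's activities could land on a different node than the athlete row, and the join would first have to move rows between nodes.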

Having all our data available in one place greatly simplifies the task of asking questions about athlete behavior. As an example, here’s a tidbit I pulled recently showing Strava usage over 2014 by day of the week. The orange line tracks the count of uploading active members by day, while the blue line tracks non-uploading active members (e.g., someone who has not uploaded an activity, but still logs into Strava to view, comment on, or give kudos to an activity).

As you can see, a fair number of people are browsing Strava, even when they aren’t being active. This is especially true of Mondays, and to a lesser extent, Fridays. This makes sense: these are the two days when anyone with a desk job spends his or her afternoon thinking wistfully of fresh air and open roads. At Strava, we’re doing everything we can to make those daydreams and memories more vivid.
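For readers curious how a day-of-week breakdown like the one above might be computed, here is a rough sketch. It runs against in-memory SQLite rather than Redshift so that it is self-contained, and the two tables (`active_days`, `upload_days`) and their rows are invented for the example. In Redshift the weekday would come from a date function such as `DATE_PART(dow, ...)` rather than SQLite's `strftime`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE active_days (athlete_id INTEGER, day TEXT);
CREATE TABLE upload_days (athlete_id INTEGER, day TEXT);
""")

# Hypothetical data: one row per athlete per day they were seen on the site.
conn.executemany("INSERT INTO active_days VALUES (?, ?)", [
    (1, "2014-09-15"),  # a Monday
    (2, "2014-09-15"),
    (3, "2014-09-15"),
    (1, "2014-09-20"),  # a Saturday
    (2, "2014-09-20"),
])
# Hypothetical data: one row per athlete per day they uploaded an activity.
conn.executemany("INSERT INTO upload_days VALUES (?, ?)", [
    (1, "2014-09-15"),
    (1, "2014-09-20"),
    (2, "2014-09-20"),
])

# strftime('%w', ...) returns '0' for Sunday through '6' for Saturday.
# The LEFT JOIN keeps every active athlete-day; a missing match in
# upload_days marks that athlete-day as non-uploading.
QUERY = """
SELECT strftime('%w', a.day)          AS weekday,
       SUM(u.athlete_id IS NOT NULL)  AS uploading,
       SUM(u.athlete_id IS NULL)      AS non_uploading
FROM active_days a
LEFT JOIN upload_days u
  ON u.athlete_id = a.athlete_id AND u.day = a.day
GROUP BY weekday
ORDER BY weekday
"""

rows = conn.execute(QUERY).fetchall()
for weekday, uploading, non_uploading in rows:
    print(weekday, uploading, non_uploading)
```

Because both tables are keyed by athlete, this is the kind of join that benefits from distributing data on the athlete's ID.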

Originally published at labs.strava.com by Carlin Eng.

Written by Strava Engineering