Getting the Most Out Of Your Sports Tracker

How to use Python & Strava to visualize almost anything for your activities

Daniel Issing
Nov 19, 2020 · 8 min read

If you’ve come across any of my articles before, you will have noticed that I’m very much into long-distance running (see “The What, How And Why of Ultrarunning”). I’m also a data scientist by trade, and I’ve been consistently logging my activities with Strava ever since I got serious about running in 2016. So what could be more natural than to try and analyze the resulting data myself? After playing around with it for a while, I realized that it might be interesting for other people as well, so I decided to publish a small series of blog posts on the topic.

Now, there are undoubtedly already a great many apps out there that can help you visualize the most minute details of every move you ever recorded. Strava has a fantastic API that allows for their development, and if you only want to display some extra stats without seeing what’s going on behind the scenes, signing up for one of them is surely the faster way. If, however, you’re a bit like me and enjoy tinkering with the data, testing different approaches and customizing things as much as you like, you’re in the right place. An additional advantage is that this approach is largely platform-independent: it will surely take some tweaking, but the basic ingredients (like .gpx files) should be available no matter which tracking app you use.

I’m splitting this series into three parts. First, we’ll take a look at the high-level table that summarizes some basic facts about each activity, and see how we can exploit it. Next, we’ll pick a single activity and analyze the .gpx file that comes with it, which provides information about altitude, position, speed and so forth. In the final post, we’ll combine both to visualize what neither of them could show separately. Good tutorials already exist on certain aspects of the first two, but I have not yet seen them woven together in a comprehensive manner.

So without further ado, let’s dive right into it.

Preparing the activities table

The .zip file Strava will send you by email after you request a bulk export of your data (from your account settings) contains a lot of stuff we don’t really need. Once unpacked, you can safely delete everything but the activities.csv file and the activities folder.

There’s an obvious downside to running the analysis on data obtained this way, namely that it is static. However, it is a lot more straightforward and quicker to set up than using the API, and in the end, it didn’t bother me much: for my purposes, it largely suffices to update the export every other month or so. So bear with me; if you truly want to use the API, you can probably still pick up a few ideas here for what to do with it.

All the scripts and notebooks I’ve been using can be found in this GitHub repo:

The first things we’ll notice when loading activities.csv into a data frame are that it contains a lot of really useless information (dew point, anyone?) and that it needs a lot of formatting: times are given in seconds, numbers carry way too many digits, and many columns are mostly empty. Since my preferred domain is trail running, I care not only about time and distance but also a lot about elevation, so I added two columns called km_effort (that’s distance in [km] plus elevation gain in [100m]) and avg_incline (elevation gain/distance). Finally, I filtered out activities that I don’t practice often, like swimming or snowboarding. This leaves us with a nicely cleaned-up frame like the one below:
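A minimal sketch of this cleaning step could look as follows. The column names (“Activity Type”, “Distance”, “Elevation Gain”) follow my Strava export, but check your own file, since they can vary between export versions; the tiny inline frame here merely stands in for the real pd.read_csv("activities.csv") call:

```python
import pandas as pd

# In practice you'd start from pd.read_csv("activities.csv"); the inline
# frame below stands in for it. Column names follow my Strava export --
# check your own file, they can differ between export versions.
df = pd.DataFrame({
    "Activity Type": ["Run", "Hike", "Swim"],
    "Distance": [15.0, 22.0, 1.5],           # km
    "Elevation Gain": [600.0, 1400.0, 0.0],  # m
})

# Keep only the sports we practice often.
df = df[df["Activity Type"].isin(["Run", "Ride", "Hike"])].copy()

# Derived columns: km_effort = distance [km] + elevation gain in [100 m],
# avg_incline = elevation gain / distance (a dimensionless fraction).
df["km_effort"] = df["Distance"] + df["Elevation Gain"] / 100
df["avg_incline"] = df["Elevation Gain"] / (df["Distance"] * 1000)

print(df[["Activity Type", "km_effort", "avg_incline"]])
```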

Plotting histograms

Distribution of distance per activity

For running, for example, there’s quite a sharp peak around the 15km mark, while hikes are almost flat. There are also some noticeable outliers, courtesy of a bunch of ultra races I have done in the past.
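The histograms themselves take only a few lines of matplotlib. Here is a sketch on synthetic distances (in practice you would group the cleaned activities frame the same way); the peak locations and spreads are made up to mimic the pattern described above:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic distances standing in for the cleaned frame: runs peak
# sharply around 15 km, hikes are spread out much more evenly.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Activity Type": ["Run"] * 200 + ["Hike"] * 100,
    "Distance": np.concatenate([rng.normal(15, 4, 200),
                                rng.uniform(5, 30, 100)]),
})

# One overlaid histogram per activity type.
fig, ax = plt.subplots()
for sport, grp in df.groupby("Activity Type"):
    ax.hist(grp["Distance"], bins=20, alpha=0.5, label=sport)
ax.set_xlabel("Distance [km]")
ax.set_ylabel("Number of activities")
ax.legend()
fig.savefig("distance_hist.png")
```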

Everything is correlated

Unsurprisingly, the ratio moving time/distance is generally lower for running than for hiking, and lower still for cycling. It also looks fairly linear at first glance, but there are exceptions. For running, for example, the ultra distances tend to take disproportionately longer than the other runs, which is not surprising: at some point, you will have to sit down and eat or even take a power nap, as fatigue takes its toll. For cycling, on the other hand, there’s an outlier past the 200km mark. Do I cycle faster the longer I sit on my saddle? Not necessarily; it’s just that a lot of bike tours I do tend to be relaxed, with sightseeing and lunch(es) in between, whereas the super-long ride really was an attempt to go as fast as I could.
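That ordering of the moving time/distance ratio is easy to check directly with a groupby. A sketch on made-up activities (the times and distances here are invented, not my real data):

```python
import pandas as pd

# Invented activities; in practice use the cleaned export frame.
df = pd.DataFrame({
    "Activity Type": ["Run", "Run", "Hike", "Hike", "Ride", "Ride"],
    "Distance": [10.0, 20.0, 10.0, 15.0, 40.0, 80.0],       # km
    "Moving Time": [3300, 7200, 7200, 11700, 5800, 12600],  # s
})

# Moving time / distance, expressed as minutes per kilometre.
df["min_per_km"] = df["Moving Time"] / 60 / df["Distance"]
pace = df.groupby("Activity Type")["min_per_km"].mean().sort_values()
print(pace)  # cycling fastest, then running, then hiking
```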

Okay, so let’s just try to fit a simple linear regression model to it. Here’s one such attempt for elevation and elapsed time:

That particular example looks quite messy and reinforces the point made above: it’s better to look at one activity type at a time instead of all of them at once. But you will no doubt notice an important problem with my particular dataset: there are a lot of values in the lower ranges and just a few entries past a certain ‘median’ threshold. This can be a problem depending on how the regression routine works; I simply relied on a predefined algorithm here. Under certain conditions, we might want to give equal weight to all data points; under others, stress long distances more.
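The weighting idea can be sketched with numpy’s polyfit, whose w parameter lets us up-weight the long activities. All numbers below are synthetic, generated to mimic the shape of the dataset (many short activities, a few long ones):

```python
import numpy as np

# Synthetic elevation-gain (x, in m) vs elapsed-time (y, in s) pairs:
# many short activities, a few long ones, roughly linear with noise.
rng = np.random.default_rng(1)
x = np.concatenate([rng.uniform(100, 600, 50), rng.uniform(1500, 3000, 5)])
y = 8 * x + rng.normal(0, 300, x.size)

# Equal weights vs. weights that stress the long activities.
slope_eq, intercept_eq = np.polyfit(x, y, deg=1)
slope_w, intercept_w = np.polyfit(x, y, deg=1, w=x / x.max())

print(slope_eq, slope_w)  # both close to the true slope of 8
```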

The next plot shows a regression analysis for the km-effort (remember, a measure of distance + elevation) versus the average speed (in m/s), now only for runs. It confirms our suspicion that we get slower the longer we run:

The weird outlier on the top left is due to a recording mistake (poor GPS reception during the run).

The correlation isn’t too impressive, but check what we get when comparing km-effort and elapsed time:

Not too bad, right? In fact, such a graph should allow me to predict the time I will need to finish my next race, and even roughly when I’ll arrive at the aid stations in case someone wants to meet me there, which is quite cool! I will come back to this topic later to see if we can do better than fitting a line, and what other metrics should enter such a prediction.
[One reason for postponing this is that the seaborn tools I used to create these graphs do not return any information about the fitted function, so I’d have to read off the values from the graph.]
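If you do want the coefficients right away, scipy.stats.linregress returns them directly instead of only drawing the line. A sketch on made-up km-effort/elapsed-time pairs; the race effort of 80 (say, 60 km with 2000 m of gain) is likewise hypothetical:

```python
import numpy as np
from scipy.stats import linregress

# Made-up km-effort vs. elapsed-time values standing in for the runs
# in the plot; seaborn draws the same line but won't return the numbers.
km_effort = np.array([10.0, 21.0, 36.0, 52.0, 75.0, 110.0])
elapsed_h = np.array([1.0, 2.2, 4.1, 6.3, 9.8, 15.5])

fit = linregress(km_effort, elapsed_h)
print(f"slope={fit.slope:.3f} h per km-effort, r^2={fit.rvalue**2:.3f}")

# Predict the finish time for a race with a known km-effort.
race_effort = 80.0  # hypothetical race: 60 km with 2000 m of gain
predicted_h = fit.intercept + fit.slope * race_effort
print(f"predicted finish: {predicted_h:.1f} h")
```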

Aggregating activities

Even if you don’t know me, it shouldn’t be too hard to guess which of the three is my favorite sport.

Instead of grouping by activity, we can also group by year, to compare how we’re doing this year relative to the last. It wouldn’t take much effort to change the code so that it shows only one type of activity, but I’ve summed them all together here as a rough measure of how active I was in a given year:
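The yearly aggregation is a one-liner once a year column exists. A sketch on an invented frame; the date column is called “Activity Date” in my export, but check yours:

```python
import pandas as pd

# Invented activities; in practice use the cleaned export frame. The
# date column is named "Activity Date" in my Strava export -- check yours.
df = pd.DataFrame({
    "Activity Date": ["2018-05-01", "2018-09-13", "2019-03-02",
                      "2019-07-20", "2020-06-11"],
    "Distance": [12.0, 30.0, 18.0, 45.0, 70.0],  # km
})
df["year"] = pd.to_datetime(df["Activity Date"]).dt.year

# Total distance per year, summed over all activity types.
per_year = df.groupby("year")["Distance"].sum()
print(per_year)
```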

The nice thing (for me, at least) is that I did better this year than ever before, but because of surgery, I won’t be able to add anything else in 2020. That’s life! On the bright side, I probably wouldn’t have had the time to write the code for these analyses otherwise.

A final remark on using the notebooks
