Learning from Strava

A casual runner’s lessons from Strava data analysis in search of a sub 4:00 marathon.

Published in Ordina Data · Nov 10, 2023

Leading up to my first marathon I recorded all training sessions in Strava. Unfortunately, I fell several minutes short of my sub 4:00 goal. In this blog I discuss how to leverage Strava data to improve my training and race result.

About Strava

Strava is a popular app for recording running, cycling and other activities. The in-app analyses and visualisations are helpful, but analysing my training effort as a whole requires more fine-grained data. Fortunately, Strava provides several export options.

Analysing Training Data

Activities are summarised in a .csv file with metrics such as distance, time and elevation. My activity summaries (plotted below) indicate:

Weekly training volume increased to 50+ km.
Most training sessions were between 5 and 10 km.
The longest training run recorded was 32 km.

To analyse my training effort as a whole, I regressed time (hours) on distance (km). The slope gives me a fair estimate of the pace I should be capable of on race day. With relatively few runs over 20 km, I used a Weighted Least Squares regression to assign greater weight to longer runs.
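A minimal sketch of such a weighted fit, using NumPy only. The runs below are made-up numbers rather than the actual Strava export, and weighting each run by its distance is an assumption for illustration; the post does not state which weights were used. The helper name `wls_fit` is hypothetical.

```python
import numpy as np

def wls_fit(x, y, w):
    """Weighted least squares for y = intercept + slope * x.

    Minimises sum(w_i * (y_i - intercept - slope * x_i)**2).
    """
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
    sw = np.sqrt(w)
    # Scaling rows by sqrt(w) turns WLS into an ordinary least squares problem.
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef[0], coef[1]

# Made-up training runs: distance in km, duration in hours.
distance_km = np.array([5.0, 8.0, 10.0, 12.0, 21.0, 32.0])
time_h = np.array([0.47, 0.76, 0.93, 1.14, 1.98, 3.02])

# Assumption: weight each run by its distance so long runs count more.
intercept, slope = wls_fit(distance_km, time_h, w=distance_km)
print(f"slope: {slope:.3f} hour/km, i.e. {slope * 60:.2f} min/km")
```

The slope is in hours per km, so multiplying by 60 gives the pace in minutes per km.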

The regression line below has an Adjusted R-Squared of 0.97. Since a value of 1.00 indicates a perfect fit of the model to the data, I was initially skeptical: with real datasets, a fit this good is often too good to be true.

In this case, my training regimen explains the model fit. I paced all training sessions around 5:41 min/km, the pace of a 4:00 marathon. The regression coefficient, 0.094 hour/km or 5:38 min/km, confirms that I followed this regimen consistently. A caveat in using this model to estimate my marathon time is that I only trained up to 32 km. Out-of-sample predictions (42.2 km) are inherently more uncertain.
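The pace arithmetic above can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the extrapolated finish time assumes the slope alone applies over the full 42.195 km (no intercept, no slowdown), which is exactly the out-of-sample caveat.

```python
MARATHON_KM = 42.195

def format_pace(minutes_per_km):
    """Format a pace in decimal minutes per km as m:ss."""
    total_seconds = round(minutes_per_km * 60)
    return f"{total_seconds // 60}:{total_seconds % 60:02d}"

def format_duration(hours):
    """Format decimal hours as h:mm."""
    total_minutes = round(hours * 60)
    return f"{total_minutes // 60}:{total_minutes % 60:02d}"

goal_pace = 4 * 60 / MARATHON_KM   # min/km needed for a 4:00 marathon
fitted_pace = 0.094 * 60           # regression coefficient in min/km

print(format_pace(goal_pace))                # 5:41
print(format_pace(fitted_pace))              # 5:38
print(format_duration(0.094 * MARATHON_KM))  # 3:58
```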

Analysing the Marathon race

Strava records each activity in detail in a .gpx file. As illustrated below, my latitude, longitude, elevation and heart rate were recorded every 4–5 seconds.

<trkpt lat="51.9129940" lon="4.4822980">
 <ele>8.0</ele>
 <time>2022-04-10T08:25:05Z</time>
 <extensions>
  <gpxtpx:TrackPointExtension>
   <gpxtpx:hr>141</gpxtpx:hr>
  </gpxtpx:TrackPointExtension>
 </extensions>
</trkpt>

The marathon activity amounts to 5634 data points, which can be parsed to a pandas DataFrame with the gpxpy library:

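import gpxpy
import pandas as pd

# gpx_path: path to the exported .gpx file of the activity
with open(gpx_path) as gpx_file:
    gpx = gpxpy.parse(gpx_file)

# One row per recorded track point
points = [
    {
        "time": p.time,
        "latitude": p.latitude,
        "longitude": p.longitude,
        "elevation": p.elevation,
    }
    for segment in gpx.tracks[0].segments
    for p in segment.points
]
df = pd.DataFrame(points)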
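The fields collected above do not include the heart rate, which sits inside the `<extensions>` element of each track point. One way to read it, sketched here with the standard library's ElementTree instead of gpxpy, is shown below. The namespace URIs are the ones commonly declared in GPX 1.1 files with Garmin's TrackPointExtension and are an assumption about the export; the wrapper document and the `heart_rates` helper are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: GPX 1.1 plus Garmin's TrackPointExtension v1.
NS = {
    "gpx": "http://www.topografix.com/GPX/1/1",
    "gpxtpx": "http://www.garmin.com/xmlschemas/TrackPointExtension/v1",
}

# Minimal stand-in for an export, wrapping the track point shown earlier.
GPX_TEXT = """<gpx xmlns="http://www.topografix.com/GPX/1/1"
     xmlns:gpxtpx="http://www.garmin.com/xmlschemas/TrackPointExtension/v1"
     version="1.1" creator="example">
 <trk><trkseg>
  <trkpt lat="51.9129940" lon="4.4822980">
   <ele>8.0</ele>
   <time>2022-04-10T08:25:05Z</time>
   <extensions>
    <gpxtpx:TrackPointExtension>
     <gpxtpx:hr>141</gpxtpx:hr>
    </gpxtpx:TrackPointExtension>
   </extensions>
  </trkpt>
 </trkseg></trk>
</gpx>"""

def heart_rates(gpx_text):
    """Return one heart rate per track point (None where it is missing)."""
    root = ET.fromstring(gpx_text)
    rates = []
    for trkpt in root.iterfind(".//gpx:trkpt", NS):
        hr = trkpt.find(".//gpxtpx:hr", NS)
        rates.append(int(hr.text) if hr is not None else None)
    return rates

print(heart_rates(GPX_TEXT))
```

The resulting list has one entry per track point, so it can be added as an extra column next to the time and position fields.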

Next, I computed the haversine distance between consecutive data points and aggregated the result into 1 km segments. The speed per segment is plotted below; the horizontal line represents a 5:38 min/km pace.
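A sketch of this step in plain Python. The function names and the input layout (tuples of elapsed seconds, latitude and longitude) are assumptions for illustration, and the time at each kilometre mark is linearly interpolated between track points, which may differ from how the original aggregation was done.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def km_split_paces(points):
    """Pace (min/km) for each completed 1 km segment.

    points: list of (seconds_since_start, latitude, longitude) tuples.
    """
    paces = []
    covered = 0.0                   # distance in km at the start of a step
    next_mark = 1.0                 # next kilometre mark to cross
    last_mark_time = points[0][0]
    for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
        step = haversine_km(la0, lo0, la1, lo1)
        while step > 0 and covered + step >= next_mark:
            # Interpolate the moment the kilometre mark was crossed.
            frac = (next_mark - covered) / step
            mark_time = t0 + frac * (t1 - t0)
            paces.append((mark_time - last_mark_time) / 60)
            last_mark_time = mark_time
            next_mark += 1.0
        covered += step
    return paces
```

Because every segment is exactly 1 km long, the minutes spent in a segment equal its pace in min/km; a final partial kilometre is ignored.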

The plot shows that I ran the first km slower than planned (I arrived late and got stuck in the crowd). I over-compensated in the next kilometers and continued to run faster than my goal pace. At around 30 km my pace really started to drop. This is called “hitting the wall”, which commonly occurs at this point in the marathon (as found by this study).

The plot below compares my race with a ‘constant pace 4:00 marathon’. The intersection shows that my margin had evaporated by the 37.5 km mark. At this point I lacked the energy to increase my pace, and I finished in 4:07.
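The comparison with a constant-pace plan comes down to a running difference between planned and actual elapsed time. A minimal sketch follows; the splits are made up to mimic a fast start followed by a slowdown and are not the actual race data, and the function names are hypothetical.

```python
from itertools import accumulate

MARATHON_KM = 42.195
GOAL_PACE = 4 * 60 / MARATHON_KM   # min/km for a constant-pace 4:00 marathon

def margin_per_km(split_minutes):
    """Minutes ahead (+) or behind (-) the constant-pace plan after each km."""
    elapsed = accumulate(split_minutes)
    return [GOAL_PACE * km - t for km, t in enumerate(elapsed, start=1)]

def first_km_behind(split_minutes):
    """First km mark at which the runner is behind plan, or None."""
    for km, margin in enumerate(margin_per_km(split_minutes), start=1):
        if margin < 0:
            return km
    return None

# Made-up splits: 30 km slightly faster than goal pace, then a slowdown.
splits = [5.6] * 30 + [6.2] * 12
print(first_km_behind(splits))
```

A positive margin means time in hand; the kilometre at which it turns negative corresponds to the intersection in the plot.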

Conclusions

I hope you enjoyed reading this blog. From a personal point of view it was interesting to quantify what I felt during the race. My main lessons for a future marathon are:

My training was far from optimal: most training plans advocate variation in pace and intensity.
Stick to the race plan and avoid running too fast in the first kilometers.
Replenish energy consistently during the race to avoid (or postpone) hitting the wall.
Arrive at the race well ahead of time to avoid getting stuck at the start.