A personal implementation of Modern Data Architecture: Getting Strava data into Google Cloud Platform

Matthew Reeve
Published in Slalom Data & AI
9 min read · May 26, 2020

Throughout the COVID-19 lockdown, I have noticed a trend for my exercise to be longer but less frequent. Gone were the short daily walks to the station; in were longer walks and rides every couple of days. I had been tracking this data with Strava for years, so I thought I would try to load the data into my reporting tool of choice to visualise this.

The question was, how would I get the data into my data warehouse?

In years gone by, I would have written one gigantic Alteryx or Python script to download my personal data from multiple sources, combine it, and store it in a PostgreSQL database sitting on a server behind my TV. That script needed running manually, and I had grown weary of the changes I had to make whenever it broke in its entirety because one of the sources changed its API specification, or removed it completely (looking at you, Moves and Sony Lifelog).

The lockdown provided me with time to update some of my broken data management pipelines. At Slalom, the best practice we encourage clients to follow is to build flexible, scalable, and decoupled processes. I also wanted to remove the need to run a server in my living room and move everything to the cloud.

I’m going to walk through how I transformed a large, schedule-triggered data collection job into one which works dynamically from a webhook, and which will still work in part if something breaks.

I selected Google Cloud Platform based on its always-free tier, and for this project I haven’t really exceeded this allowance. The architecture will work with AWS and Azure too, but will need some customising to use their components.

Architecture

At Slalom, we have an accelerator called Modern Data Architecture. The principle is to break a data load down into zones, each holding data in a clearly defined state. Each state includes a data standard, which is the technical requirement for data loads or transformations.

Breaking down the architecture like this means that each zone can have a clear set of use cases, which is particularly useful in business scenarios. For instance, you should rarely have consumers accessing your data from the Landing Zone, as this is data you haven’t validated, cleaned, or restructured.

Source Systems to Landing Zone

Despite actually using a Garmin device to track my runs, I decided to use my activity data from Strava, because it also holds activity data from other platforms that I load manually. Strava is already configured to load completed activities from Garmin. This is all done from the Strava GUI, and like all good OAuth integrations, I have to authenticate at Garmin and then confirm a scope, which is the authorisation Strava needs to read my Garmin data.

To get the data from Strava into my Landing Zone, we need a workflow to pull activities from the Strava API and save them as files. There are two components of the API which allow us to do this. The first is the subscription service, which tells us when a new activity is available and gives us its ID. The second is the activity endpoint, which we use to download that activity.

This is broadly what we need to achieve in our application:

  1. Authorise our platform from Strava (one-off). We have to provide the initial authorisation in a browser, but subsequently our application can refresh this automatically.
  2. Create a subscription (one-off).
  3. Receive a subscription message.
  4. Acknowledge the subscription message to the sender.
  5. Process the message.
  6. For each activity, download the corresponding data.

The first two you can do locally whilst you deploy the solution, so I won’t dwell on them now. However, I do need to dwell on point 4. When a webhook receives a subscription message, it is good practice to respond quickly. That means acknowledging the message straight away and deferring the real work; we must respond now, not after we’ve gone off to download each activity. The analogy here is receiving mail: you don’t open it whilst the postman is standing at your door.

It is therefore logical to split our application in two. To do this, we need an internal subscription service, Pub/Sub, to trigger the second function. Our first function becomes a relay, publishing a message from the external service to the internal one before returning a status code to the Strava API. This is a typical use case for Pub/Sub, and whilst it’s not relevant here, if Strava were giving us a list of activities, our webhook function could iterate over the list and create one Pub/Sub message per activity, triggering a range of parallel functions rather than the typical serial iteration.
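In outline, the relay can be as small as the sketch below: an HTTP-triggered Cloud Function in Python that echoes Strava’s subscription validation challenge, publishes the raw event to Pub/Sub, and returns straight away. The project variable and topic name are my own placeholders; the actual implementation lives in the strava-webhook repository linked later.

```python
# A minimal sketch of the relay, not the production code: acknowledge Strava
# immediately and hand the event to Pub/Sub for asynchronous processing.
# The project environment variable and topic name are illustrative.
import json
import os

from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(os.environ["GCP_PROJECT"], "strava-activity-events")


def strava_webhook(request):
    """HTTP-triggered Cloud Function acting as the Strava webhook callback."""
    if request.method == "GET":
        # Strava validates a new subscription with a GET carrying hub.challenge,
        # which must be echoed back in the response body.
        return {"hub.challenge": request.args.get("hub.challenge")}

    # POSTed events are relayed as-is; the heavy lifting happens downstream.
    event = request.get_json(silent=True) or {}
    publisher.publish(topic_path, data=json.dumps(event).encode("utf-8")).result()
    return ("EVENT_RECEIVED", 200)
```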

We then have a second function which subscribes to Pub/Sub, retrieves the activity ID, calls the Strava API activity endpoint, and saves the returned data to Cloud Storage. Ideally this job does little to no processing, but at the time of writing I am using the stravalib Python library to handle the API call, which returns each activity stream as a list of class objects, so I convert those objects back to plain lists before saving to file.
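A sketch of that second function might look like the following, assuming a background Cloud Function subscribed to the topic; the bucket name and token handling are placeholders, and the serialisation is greatly simplified compared with the real code in the strava-gcp-pull repository linked later.

```python
# A simplified sketch of the downloader: triggered by Pub/Sub, it fetches one
# activity from Strava and writes it to the landing-zone bucket. The bucket
# name and token handling are placeholders.
import base64
import json
import os

from google.cloud import storage
from stravalib import Client

LANDING_BUCKET = "strava-landing-zone"  # assumed bucket name


def download_activity(event, context):
    """Background Cloud Function subscribed to the internal Pub/Sub topic."""
    message = json.loads(base64.b64decode(event["data"]).decode("utf-8"))
    activity_id = message["object_id"]  # Strava events carry the activity ID here

    # Token handling is simplified; the real function refreshes it via Secret Manager.
    client = Client(access_token=os.environ["STRAVA_ACCESS_TOKEN"])
    activity = client.get_activity(activity_id)

    # Greatly simplified serialisation; the real code also converts each
    # activity stream back to plain lists before saving.
    payload = {
        "id": activity_id,
        "name": activity.name,
        "type": str(activity.type),
        "start_date": str(activity.start_date),
    }
    blob = storage.Client().bucket(LANDING_BUCKET).blob(f"activities/{activity_id}.json")
    blob.upload_from_string(json.dumps(payload), content_type="application/json")
```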

Loading data from the “Landing Zone” to “For Purpose Zone”

So by this point, we would have a JSON file per activity stored in our Cloud Storage bucket. We now want to load this into BigQuery, so that we can consume it.

In enterprise architectures, we should look at doing this in two stages. The first is to transform the data into a consumable structure: something the database can ingest directly and which matches the desired schema. The second is to ingest it into the data warehouse.

Earlier, we created JSON files resembling the data we obtained from source, but BigQuery cannot actually ingest these, because for JSON it requires newline-delimited JSON. That format is neither compressed nor quick to process, so a better option is to convert to Avro. At this point, we can also manipulate data we know will need changing, for instance creating geography fields from our latitude/longitude streams, or changing units. If you are doing data quality and governance checks, they will typically happen here too, consuming the data in the landing zone as part of a data management pipeline.
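As a rough illustration of such a conformance step, the sketch below flattens an activity’s latitude/longitude stream into newline-delimited JSON, adding a WKT point per sample that BigQuery could load into a GEOGRAPHY column. The field names are assumptions about the landing-zone files, and the Avro conversion is left out.

```python
# An illustrative conformance step: turn one landed activity file into
# newline-delimited JSON with a WKT point per sample. Field names are
# assumptions, not the actual landing-zone schema.
import json


def conform_activity(raw_json: str) -> str:
    activity = json.loads(raw_json)
    rows = []
    for lat, lng in activity.get("latlng_stream", []):
        rows.append({
            "activity_id": activity["id"],
            "point": f"POINT({lng} {lat})",  # WKT puts longitude first
        })
    # One JSON object per line: the layout BigQuery expects for JSON loads.
    return "\n".join(json.dumps(row) for row in rows)
```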

The resulting data sets represent your conformed data, which you know are in the structure you trust, and are now ready for ingestion.

BigQuery is capable of reading directly from external data sources, which is great if you do not need to consume the data regularly: you trade BigQuery-optimised storage for Cloud Storage, at the cost of query performance. Alternatively, you can use BigQuery load jobs to load the conformed data into native BigQuery storage.
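For the load-job route, the google-cloud-bigquery client keeps this to a few lines; the bucket, dataset, and table names below are placeholders.

```python
# A sketch of a BigQuery load job pulling conformed Avro files from Cloud
# Storage. The URI and table name are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
load_job = client.load_table_from_uri(
    "gs://strava-conformed/activities/*.avro",
    "my-project.strava.activities",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.AVRO,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    ),
)
load_job.result()  # block until the load job completes
```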

Having explained all that, it isn’t actually what I have implemented. I have minimal transformations to apply, and don’t need to pay for additional storage of my transformed data, so my deployment uses another Cloud Function to read a cloud storage object, make some minor modifications, and insert the data into BigQuery.
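In outline, that function might look like the sketch below; the table name and row shape are assumptions, and the real code is in the strava-gcs-to-gbq repository linked later.

```python
# A sketch of the loader: triggered when a file lands in the bucket, it reads
# the JSON, applies minor tweaks, and streams a row into BigQuery. Table name
# and row shape are assumptions; see strava-gcs-to-gbq for the real code.
import json

from google.cloud import bigquery, storage

TABLE_ID = "my-project.strava.activities"  # placeholder


def gcs_to_bigquery(event, context):
    """Background Cloud Function with a Cloud Storage finalise trigger."""
    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    activity = json.loads(blob.download_as_text())

    row = {
        "activity_id": activity["id"],
        "name": activity.get("name"),
        "source": "strava",  # keyword used later to tell source platforms apart
    }
    errors = bigquery.Client().insert_rows_json(TABLE_ID, [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```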

This function also needs to know when to be called. This is done by defining an event trigger monitoring the landing zone Cloud Storage bucket directly. You could add a Pub/Sub topic in the workflow here too, but it is unnecessary in our architecture.

Architecture Summary

This is the architecture diagram showing what I have described. The model can be adapted for other platforms: I also have it running for Fitbit (tracking my weight from my Aria scales), Garmin (for non-activity data), and subsets of it for data on flights I have taken and for my Google location history (which requires its own article).

This approach works for all platforms where you have similar flows. I have activity data coming from other sources too, so I can replicate this pipeline for those platforms, and the code shared allows me to add them to the same BigQuery table and identify these by a keyword ‘source’.

Whilst I have been focussed on collecting my geo-history, you may also want to retrieve body statistics, sleep, or food. In this situation, if I were using Garmin directly, I could use the first and second functions to download any data, and then modify the third to process them into different locations.

This network of functions and data at rest is a key component of Slalom’s Modern Data Architecture, and is generally considered best practice for data pipelines. We have small, manageable functions doing specific tasks, and each delivers data to a landmark where we can define standards (for instance, my BigQuery schema, or even more appropriately, the Avro files, had I created that storage bucket).

Here’s what I produced with the data: a map of all my London-based activities (courtesy of Tableau):

The rest of this article is intended as an implementation walk-through, so if you leave me here, I hope you found this useful, and thank you for reading.

Implementation

Each of the GitHub repositories listed below explains what is needed for it to work. There are some manual steps, which I will try to summarise here. If you use all of them together, unmodified, then the summary steps you need to complete are:

  1. Create a new Project in the GCP Console.
  2. Clone the repositories into Google Source Repositories, or modify the Terraform to deploy from in-line or Cloud Storage code. You don’t need to clone the Terraform one there, unless you want to set up CI/CD to update your infrastructure. That is outside the scope of this blog.
  3. Webhook repo: https://github.com/reevery/strava-webhook
  4. Activity download repo: https://github.com/reevery/strava-gcp-pull
  5. GCS to GBQ repo: https://github.com/reevery/strava-gcs-to-gbq
  6. Create your BigQuery table, which is not defined in code. There is an example schema in the third repo above. If you don’t want to deploy to BigQuery yet, then you can omit this and just not deploy the third Cloud Function.
  7. Clone my Terraform module locally.
  8. Define the variables for the Terraform module. I would recommend using a terraform.tfvars file.
  9. Configure your local Terraform environment to authenticate. This can be done by saving a service account JSON file locally, and creating an environment variable called GOOGLE_APPLICATION_CREDENTIALS to point at it.
  10. Run terraform validate to confirm syntax is correct, terraform plan to understand what will happen, and when you’re ready, terraform apply to deploy your infrastructure.
  11. Create a Strava API Subscription. See the documentation in strava-webhook for details.
  12. Follow the authentication documentation in strava-gcp-pull to configure the Strava App to notify you of new activities, and to store authorisation details in Secret Manager so that this function can download those new activities (a sketch of reading the token back from Secret Manager follows this list).
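For reference, reading that token back inside a function is only a few lines with the Secret Manager client; the secret name below is an assumption, and the repositories document the exact flow.

```python
# A sketch of reading the stored Strava credentials from Secret Manager.
# The secret name is an assumption; see strava-gcp-pull for the documented flow.
from google.cloud import secretmanager


def get_strava_refresh_token(project_id: str) -> str:
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/strava-refresh-token/versions/latest"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")
```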

And that should be it! Go for a run or a ride, and see if it all worked.

If it did, then you may wish to back-fill all of your old data into the cloud. This is documented in strava-gcp-pull/fetch_all.py, which will download all your activities to Cloud Storage and let the strava-gcs-to-gbq function do the rest. I suggest you execute this locally, because the Strava API will rate-limit you, and you’d be better off having a hard exception thrown than watching thousands of Cloud Functions all receive the same 429 Too Many Requests error.
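If you would rather script the backfill yourself, a rough local equivalent is sketched below; the bucket name, token handling, and serialisation are placeholders, and fetch_all.py remains the reference implementation.

```python
# A rough local backfill in the spirit of fetch_all.py: walk every activity
# and write each one to the landing zone, backing off when Strava rate-limits
# us. Bucket name, token handling, and serialisation are placeholders.
import json
import os
import time

from google.cloud import storage
from stravalib import Client
from stravalib.exc import RateLimitExceeded

bucket = storage.Client().bucket("strava-landing-zone")
client = Client(access_token=os.environ["STRAVA_ACCESS_TOKEN"])

for summary in client.get_activities():
    while True:
        try:
            activity = client.get_activity(summary.id)
            break
        except RateLimitExceeded:
            time.sleep(15 * 60)  # wait out Strava's 15-minute rate-limit window
    payload = {"id": activity.id, "name": activity.name, "type": str(activity.type)}
    bucket.blob(f"activities/{activity.id}.json").upload_from_string(json.dumps(payload))
```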

How we can help

Modern Data Architecture is more than just moving your data to the cloud. It is an architecture that is flexible, scalable, and decoupled. Whilst the example above is a personal implementation, every business should be taking the same approach.

Slalom has a strong team of Data and Cloud DevOps Architects, Engineers, and Analysts who can help you make the most of your data. Slalom is a Google Cloud Premier Partner and 2019 Google Cloud Breakthrough Partner of the Year. We can also deliver the same solution in AWS as a Premier Consulting Partner with over 1,400 certifications amongst our staff, or in Azure as a Gold Microsoft Partner.

Matthew Reeve is a Solution Architect in the London Office for Slalom, specialising in Data & Analytics architecture. His certifications include AWS Certified Solutions Architect, Tableau Server Certified Professional, and Tableau Desktop Certified Associate.

Matthew is grateful to Luca Zanconato for his assistance in validating the architecture in this document.
