Building a Data Pipeline for my own Strava Data
EtLT of my own Strava data using the Strava API, MySQL, Python, S3, Redshift, and Airflow
I built an EtLT pipeline to ingest my Strava data from the Strava API and load it into a Redshift data warehouse. The pipeline runs once a week using Airflow to extract any new activity data. The end goal is to use this data warehouse to build an automatically updating dashboard in Tableau and to trigger automatic re-training of my Strava Kudos Prediction model.
Data Extraction
My personal Strava activity data is first ingested incrementally using the Strava API and loaded into an S3 bucket. On each ingestion run, we query a MySQL database to get the date of the last extraction:
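A minimal sketch of this step, assuming a state-tracking table called last_extracted with a single last_updated timestamp column (the table, column, and connection details are illustrative):

```python
import mysql.connector

# Connect to the MySQL database that tracks pipeline state
# (connection details are placeholders).
conn = mysql.connector.connect(
    host="localhost", user="pipeline", password="****", database="strava_pipeline"
)
cursor = conn.cursor()

# Fetch the timestamp of the most recent successful extraction;
# fall back to an early date on the very first run.
cursor.execute("SELECT COALESCE(MAX(last_updated), '2000-01-01') FROM last_extracted;")
last_updated_warehouse = cursor.fetchone()[0]

cursor.close()
conn.close()
```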
We then make repeated calls to the REST API using the requests library until we have all activity data between now and last_updated_warehouse. We include a time.sleep() call to comply with Strava's rate limit of 100 requests per 15 minutes, and we wrap the response parsing in try: except: blocks to handle missing data on certain activities.
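A simplified sketch of that extraction loop, assuming last_updated_warehouse has been converted to an epoch timestamp and that only a handful of fields are kept (the real script extracts more):

```python
import time
import requests

STRAVA_ACTIVITIES_URL = "https://www.strava.com/api/v3/athlete/activities"

def extract_activities(access_token, last_updated_warehouse):
    """Page through all activities newer than the last extraction."""
    headers = {"Authorization": f"Bearer {access_token}"}
    activities, page = [], 1
    while True:
        response = requests.get(
            STRAVA_ACTIVITIES_URL,
            headers=headers,
            params={"after": last_updated_warehouse, "page": page, "per_page": 200},
        )
        batch = response.json()
        if not batch:  # no more pages left
            break
        for activity in batch:
            try:
                activities.append(
                    (activity["id"], activity["name"],
                     activity["kudos_count"], activity["start_date"])
                )
            except KeyError:
                # Skip activities with missing fields rather than failing the whole run
                continue
        page += 1
        time.sleep(10)  # stay comfortably under 100 requests per 15 minutes
    return activities
```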
Before exporting the data locally into a flat pipe-delimited .csv file, we perform a few minor transformations such as formatting dates and timezone columns. Hence the little ‘t’ in EtLT! After we save the data, it is then uploaded to an S3 bucket for later loading into the data warehouse.
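The "little t" and the S3 upload might look roughly like this (the bucket name and column layout are illustrative):

```python
import csv
from datetime import datetime

import boto3

def save_and_upload(activities, local_path="strava_activities.csv", bucket="strava-data-lake"):
    """Write pipe-delimited rows locally, then upload the file to S3."""
    with open(local_path, "w", newline="") as f:
        writer = csv.writer(f, delimiter="|")
        for activity_id, name, kudos_count, start_date in activities:
            # Minor transformation: parse the ISO date string before loading
            start_date = datetime.strptime(start_date, "%Y-%m-%dT%H:%M:%SZ")
            writer.writerow([activity_id, name, kudos_count, start_date])

    s3 = boto3.client("s3")
    s3.upload_file(local_path, bucket, local_path)
```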
Finally, we execute a query to update the MySQL database on the last date of extraction.
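The corresponding state update, again with illustrative table and connection details:

```python
from datetime import datetime, timezone

import mysql.connector

# Record the time of this extraction so the next run only pulls newer activities
conn = mysql.connector.connect(
    host="localhost", user="pipeline", password="****", database="strava_pipeline"
)
cursor = conn.cursor()
cursor.execute(
    "INSERT INTO last_extracted (last_updated) VALUES (%s);",
    (datetime.now(timezone.utc),),
)
conn.commit()
cursor.close()
conn.close()
```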
Data Loading
Once the data is loaded into the S3 data lake it is then loaded into our Redshift data warehouse. We load the data in two parts:
- We first load the data from the S3 bucket into a staging table with the same schema as our production table
- We then perform validation tests between the staging table and the production table (see the Data Validation section). If all critical tests pass, we remove duplicates by deleting from the production table any rows that also appear in the staging table, and then insert the full contents of the staging table into the production table (a sketch of this load step is shown below).
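A rough sketch of the two-step load using psycopg2, with the validation step running in between in the real pipeline (table names, the bucket path, and the IAM role are placeholders):

```python
import psycopg2

COPY_STAGING = """
    COPY activity_staging
    FROM 's3://strava-data-lake/strava_activities.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-s3-read'
    DELIMITER '|';
"""

UPSERT_PRODUCTION = """
    -- Remove rows from production that are about to be re-inserted with fresher data
    DELETE FROM activities
    USING activity_staging
    WHERE activities.activity_id = activity_staging.activity_id;

    -- Insert the full contents of the staging table
    INSERT INTO activities SELECT * FROM activity_staging;
"""

conn = psycopg2.connect(dbname="dev", host="my-cluster.redshift.amazonaws.com",
                        port=5439, user="awsuser", password="****")
with conn, conn.cursor() as cur:  # the context manager commits both statements together
    cur.execute(COPY_STAGING)
    cur.execute(UPSERT_PRODUCTION)
```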
Data Validation
We implement a simple framework in Python to execute SQL-based data validation checks in our data pipeline. Although it lacks many features we would expect in a production environment, it is a good start and provides some insight into how we could improve our infrastructure.
The validator.py script executes a pair of SQL scripts on Redshift and compares the two based on a comparison operator (>, <, =). The test then passes or fails based on the outcome of the two executed scripts. We execute this validation step after we upload our newly ingested data to the staging table but before we insert this table into the production table.
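A stripped-down sketch of what validator.py could look like (the comparison names follow the command shown below; the Slack notification and argument parsing in the real script are omitted):

```python
import sys

import psycopg2

def execute_scalar(cursor, sql_path):
    """Run a SQL script that returns a single value and return that value."""
    with open(sql_path) as f:
        cursor.execute(f.read())
    return cursor.fetchone()[0]

def main(script_1, script_2, comparison, severity):
    conn = psycopg2.connect(dbname="dev", host="my-cluster.redshift.amazonaws.com",
                            port=5439, user="awsuser", password="****")
    with conn, conn.cursor() as cur:
        value_1 = execute_scalar(cur, script_1)
        value_2 = execute_scalar(cur, script_2)

    passed = {
        "equals": value_1 == value_2,
        "greater_equals": value_1 >= value_2,
        "less_equals": value_1 <= value_2,
    }[comparison]

    if passed:
        print(f"Test passed: {value_1} {comparison} {value_2}")
    elif severity == "halt":
        print(f"Test failed: {value_1} {comparison} {value_2}")
        sys.exit(1)  # fail the task and halt downstream tasks
    else:
        print(f"Test failed (warn only): {value_1} {comparison} {value_2}")

if __name__ == "__main__":
    main(*sys.argv[1:5])
```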
As a starting point, I implemented checks for duplicates, a comparison of the number of activities in the staging table (Airflow is set to execute at the end of each week) against the average historical weekly activity count, and a comparison of the Kudos Count metric in the staging table against its historical distribution using a z-score. In other words, the last two checks test whether the new values fall within a 90% confidence interval, in either direction, of what we would expect based on history. For example, the following query computes the z-score for the total number of activities uploaded in a given week (i.e. the contents of the staging table).
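In the pipeline this query lives in sql/validation/weekly_activity_count_zscore.sql and is executed by validator.py; a hypothetical version, assuming an activities production table with a start_date column, is sketched here as a Python constant:

```python
# z-score of this week's activity count (the staging table) relative to the
# historical weekly distribution in the production table
WEEKLY_ACTIVITY_COUNT_ZSCORE = """
    WITH weekly_counts AS (
        SELECT DATE_TRUNC('week', start_date) AS activity_week,
               COUNT(*)                       AS activity_count
        FROM activities
        GROUP BY 1
    ),
    history AS (
        SELECT AVG(activity_count)    AS avg_count,
               STDDEV(activity_count) AS std_count
        FROM weekly_counts
    ),
    this_week AS (
        SELECT COUNT(*) AS activity_count FROM activity_staging
    )
    SELECT (this_week.activity_count - history.avg_count) / history.std_count
    FROM this_week, history;
"""
```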
By running the following in the terminal:

python src/validator.py sql/validation/weekly_activity_count_zscore.sql sql/validation/zscore_90_twosided.sql greater_equals warn

we compare the z-score from the previous query to the 90% confidence interval z-score (SELECT 1.645;). The 'warn' at the end of the command tells the script not to exit with an error but to warn us instead. If we instead add 'halt' to the end, the script exits with an error code and halts all further downstream tasks.
We also implement a system to send a notification to a given Slack channel with the validation test results. This validation system was inspired by the Data Validation in Pipelines chapter of James Densmore's excellent Data Pipelines book.
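The notification itself can be as simple as a POST to a Slack incoming webhook (the webhook URL and message format here are placeholders):

```python
import requests

def send_slack_notification(webhook_url, test_name, passed, value_1, value_2):
    """Post a short validation summary to a Slack channel via an incoming webhook."""
    outcome = "passed" if passed else "failed"
    message = f"Validation test {test_name} {outcome}: {value_1} vs {value_2}"
    requests.post(webhook_url, json={"text": message})
```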
We then combine all the tests into a shell script, validate_load_data.sh, which we run after loading the data from the S3 bucket into the staging table but before we insert this data into the production table. Running this pipeline on last week's data gives us the following output:
It’s great to see that our second test failed because I didn’t run anywhere near as much last week as I usually do!
Although this validation framework is very basic, it is a good foundation that can be built upon at a later date.
Data Transformations
Now that the data has been ingested into the data warehouse, the next step in the pipeline is data transformation. Data transformations in this case include both noncontextual manipulation of the data and modeling of the data with context and logic in mind. The benefit of the ELT methodology over the ETL framework here is that it gives us, the end users, the freedom to transform the data however we need, rather than being stuck with a fixed data model that we cannot change (or at least not without hassle). In my case, I am connecting my Redshift data warehouse to Tableau and building out a dashboard. We can, for example, build a data model to extract monthly statistics:
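A sketch of what such a model could look like, materialised as the activity_summary_monthly table mentioned later (the distance and kudos_count columns are assumptions about the activities table):

```python
# (Re)build a monthly aggregation table in Redshift
CREATE_ACTIVITY_SUMMARY_MONTHLY = """
    DROP TABLE IF EXISTS activity_summary_monthly;
    CREATE TABLE activity_summary_monthly AS
    SELECT
        DATE_TRUNC('month', start_date) AS activity_month,
        COUNT(*)                        AS total_activities,
        SUM(distance) / 1000.0          AS total_distance_km,
        SUM(kudos_count)                AS total_kudos,
        AVG(kudos_count)                AS avg_kudos_per_activity
    FROM activities
    GROUP BY 1;
"""

def build_monthly_summary(conn):
    """Execute the model against an open psycopg2 connection to Redshift."""
    with conn, conn.cursor() as cur:
        cur.execute(CREATE_ACTIVITY_SUMMARY_MONTHLY)
```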
We can also build more complicated data models. For example, we can get the week-by-week percentage change in total weekly kudos broken down by workout type:
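A hypothetical version of that model, using a window function (again assuming start_date, kudos_count, and workout_type columns):

```python
# Week-over-week percentage change in total kudos, broken down by workout type
WEEKLY_KUDOS_PCT_CHANGE = """
    WITH weekly_kudos AS (
        SELECT
            DATE_TRUNC('week', start_date) AS activity_week,
            workout_type,
            SUM(kudos_count)               AS total_kudos
        FROM activities
        GROUP BY 1, 2
    )
    SELECT
        activity_week,
        workout_type,
        total_kudos,
        100.0 * (total_kudos - LAG(total_kudos) OVER (
            PARTITION BY workout_type ORDER BY activity_week
        )) / NULLIF(LAG(total_kudos) OVER (
            PARTITION BY workout_type ORDER BY activity_week
        ), 0) AS pct_change_vs_previous_week
    FROM weekly_kudos
    ORDER BY workout_type, activity_week;
"""
```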
A further direction to take this would be to utilize a 3rd party tool such as dbt to implement data modeling.
Putting it All Together with Airflow
We create a DAG to orchestrate our data pipeline and set it to run weekly, i.e. once a week at midnight on Sunday morning. As seen in the diagram below, our DAG will do the following (a sketch of the DAG definition follows the list):
- First, extract any recent data using the Strava API and upload it to an S3 bucket
- It will then load this data into a staging table in our Redshift cluster
- The 3 validation tests will then be executed, messaging our Slack channel the results
- The staging table will then be inserted into the production table, removing any duplicates in the process
- Finally, a monthly aggregation data model will be created in a new table, activity_summary_monthly
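A condensed sketch of the DAG definition (the task callables imported from pipeline are assumed to be the functions described above, and the real DAG also wires in the Slack notifications):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

# Hypothetical module exposing the extract/load/transform functions described above
from pipeline import (extract_strava_data, load_staging_table,
                      upsert_production_table, build_monthly_summary)

with DAG(
    dag_id="strava_elt_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@weekly",  # midnight on Sunday morning
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_to_s3",
                             python_callable=extract_strava_data)
    load_staging = PythonOperator(task_id="load_staging_table",
                                  python_callable=load_staging_table)
    # Trailing space stops Airflow treating the .sh file itself as a Jinja template
    validate = BashOperator(task_id="validate_load_data",
                            bash_command="bash validate_load_data.sh ")
    upsert = PythonOperator(task_id="insert_into_production",
                            python_callable=upsert_production_table)
    build_model = PythonOperator(task_id="build_activity_summary_monthly",
                                 python_callable=build_monthly_summary)

    extract >> load_staging >> validate >> upsert >> build_model
```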
Data Visualization
With the data transformations done, we were then able to build an interactive dashboard in Tableau that updates automatically whenever new data is ingested into the data warehouse, which happens weekly. The dashboard I created was built to investigate how the kudos on my Strava activities change over time and by location. After finishing the project I shut down the Redshift cluster to avoid incurring further costs, but a screenshot of the dashboard can be seen below.
Unit Testing
Unit testing was performed using PyTest, and all tests can be found in the tests directory. For example, below is a unit test for the make_strava_api_request function. It asserts that a dictionary response is received and that the response contains an 'id' key holding an integer.
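A sketch of what that test might look like (the import path, function signature, and token fixture are assumptions):

```python
import pytest

# Hypothetical import path for the API helper under test
from src.extract_strava_data import make_strava_api_request

@pytest.fixture
def access_token():
    # In the real test suite the token would come from configuration or environment variables
    return "dummy-access-token"

def test_make_strava_api_request(access_token):
    """The API helper should return a parsed activity as a dictionary."""
    response = make_strava_api_request(access_token, page=1)

    # We expect a parsed JSON object rather than a raw HTTP response
    assert isinstance(response, dict)

    # Every Strava activity carries an integer 'id' field
    assert "id" in response
    assert isinstance(response["id"], int)
```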
Further Directions and Considerations
- Improve Airflow with Docker: I could have used the official Docker image of Airflow to run the pipeline in a container, which would have made the setup more reproducible and robust. This would also make deploying the pipeline at scale much easier!
- Implement more validation tests: For a real production pipeline, I would implement more validation tests all through the pipeline. I could, for example, have used an open-source tool like Great Expectations.
- Simplify the process: The pipeline could probably be run in a much simpler way. An alternative could be to use Cron for orchestration and PostgreSQL or SQLite for storage.
- Data streaming: To keep the dashboard consistently up to date, we could benefit from a streaming platform such as Kafka.
Concluding Remarks
To conclude, I built a data pipeline to extract, load, and transform my Strava data automatically. The pipeline runs once a week and automatically updates an interactive dashboard.
I hope you enjoyed it!

