Bringing Data to the House

How to Use Snowflake to Continuously Ingest Data from a REST API

DALLE 2: Complex data pipelines moving data into Snowflake data warehouse done in the style of a Talking Heads album cover.

This guide summarizes the steps taken to load data from a third-party REST API, the Strava API, into Snowflake. This is Part 1 of a series in which I document the process of building a Fitness Tracking Application. For a high-level architecture overview of the application, see this diagram. The diagram is a living document and will be updated as the application progresses.

I am often surprised when I speak to customers who don't know about the External Network Access feature. But then again, Snowflake continues to release new features and products at an eye-watering pace. Sometimes even I find it challenging to keep up, but that's what keeps my job exciting!

So what is the External Network Access feature? Well, to explain why it's so great and so important to what this application does, we need to go back briefly in history. I promise this won't take long!

As with every analytics use case, the data being analyzed must be consolidated somewhere. But as many of us know, that data is never in just one source, it's often not clean, and each source has its own little preferences about how you can authenticate with it, how often you can access it, and so on. Thanks to APIs, or Application Programming Interfaces, data has become uniformly accessible across a wide range of sources, but the challenge is integrating with all of them.

Working in cloud technology, I think about this scene more often than I care to admit.

In addition, databases and data warehouses are completely separate from these sources, which means that engineers would need to write code, build scripts, or purchase software to bridge the gap between source and destination.

The External Network Access feature is exciting to me as a developer because it reduces the number of tools needed to access many third-party data sources to just a single platform: Snowflake.

So let’s get into it!

The first step to accessing our data is to create a new Strava application, which gives us the client ID and client secret we'll need later.

Strava makes this very simple to do and has clear documentation on how to get started. Because the Strava API uses OAuth 2.0, we will need to follow a two-step authorization flow: first, generate an authorization code with the scope (permissions) needed to access the application's data, then exchange it for the access token that grants us repeated access to the API's endpoints. We can also use this token to obtain a refresh token, which is needed once the original access token expires.

For testing purposes, I chose to first use Postman to make sure I had all of the right information to call the API. I then created a python script to locally validate that I could access the API as expected. If you want to check out this code, I’ll link the repo here! Once I was able to get access to my athlete profile and activities locally, I was ready to move my application into Snowflake.

The Snowflake External Network Access feature essentially lets us condense those Python scripts into far fewer lines of code and gives us the ability to protect and control access to secrets using RBAC. External Network Access supports several authentication methods, but for Strava we'll create four kinds of objects to manage the OAuth 2.0 flow: a network rule that allows egress traffic to the Strava host, secrets that hold the client ID, client secret, and OAuth token, a security integration that handles the OAuth 2.0 token exchange, and an external access integration that ties the network rule and token secret together.

Take a look at the code below to see how it’s all pulled together!

// Playing around with External Network Access
USE ROLE ACCOUNTADMIN;
USE DATABASE {MY_DATABASE};

CREATE OR REPLACE NETWORK RULE strava_api_network_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('www.strava.com:443');

CREATE OR REPLACE SECRET strava_client_id
TYPE = GENERIC_STRING
SECRET_STRING = 'Replace with your client id';

CREATE OR REPLACE SECRET strava_client_secret
TYPE = GENERIC_STRING
SECRET_STRING = 'Replace with your client secret';

CREATE OR REPLACE SECURITY INTEGRATION strava_oauth
TYPE = API_AUTHENTICATION
AUTH_TYPE = OAUTH2
OAUTH_CLIENT_ID = 'Replace with your client id'
OAUTH_CLIENT_SECRET = 'Replace with your client secret'
OAUTH_TOKEN_ENDPOINT = 'https://www.strava.com/oauth/token?client_id={Replace with your client id}&client_secret={Replace with your client secret}'
OAUTH_AUTHORIZATION_ENDPOINT = 'https://www.strava.com/oauth/authorize'
OAUTH_GRANT = 'AUTHORIZATION_CODE'
OAUTH_ALLOWED_SCOPES = ('activity:read_all')
ENABLED = TRUE;

CREATE OR REPLACE SCHEMA SECRET_SCHEMA;
USE SCHEMA SECRET_SCHEMA;
CREATE OR REPLACE SECRET strava_oauth_token
TYPE = oauth2
API_AUTHENTICATION = strava_oauth;

SELECT SYSTEM$START_OAUTH_FLOW('{MY_DATABASE}.secret_schema.strava_oauth_token');

SELECT SYSTEM$FINISH_OAUTH_FLOW('Replace with url after "state" generated from SYSTEM$START_OAUTH_FLOW');

CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION strava_access_integration
ALLOWED_NETWORK_RULES = (strava_api_network_rule)
ALLOWED_AUTHENTICATION_SECRETS = (strava_oauth_token)
ENABLED = TRUE;

GRANT READ ON SECRET strava_oauth_token TO ROLE {MY_ROLE};
GRANT USAGE ON SCHEMA secret_schema TO ROLE {MY_ROLE};
GRANT USAGE ON INTEGRATION strava_access_integration TO ROLE {MY_ROLE};
GRANT EXECUTE TASK ON ACCOUNT TO ROLE {MY_ROLE};

⭐️ Best Practice ⭐️

A few lessons learned here to point out! First, when you define the NETWORK RULE that allows egress traffic out of Snowflake, the VALUE_LIST must match the format of the endpoint URL, in terms of both hostname and port number. For example, for a call to an encrypted ("https") endpoint, Snowflake will reach out on port 443, so that port should be included in the list of allowed network values. Second, in the case of the Strava API, the client_id and client_secret are expected on every call to the token endpoint, which is why they are appended to the OAUTH_TOKEN_ENDPOINT URL above.
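To illustrate the first point with a purely hypothetical endpoint (not one used in this project), an API served over a non-standard port would need that port spelled out explicitly in the rule:

CREATE OR REPLACE NETWORK RULE example_api_network_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('api.example.com:8443'); // hostname and port must both match the endpoint being called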

As a final step, I will create a User Defined Function (UDF) in Python that accesses my Strava data using the newly created External Access Integration object.

use schema data;
CREATE OR REPLACE FUNCTION get_strava_activities_by_date_range(startDate timestamp_ntz, endDate timestamp_ntz)
RETURNS VARIANT
LANGUAGE PYTHON
RUNTIME_VERSION = 3.8
HANDLER = 'get_activities_by_date_range'
EXTERNAL_ACCESS_INTEGRATIONS = (strava_access_integration)
PACKAGES = ('snowflake-snowpark-python','requests')
SECRETS = ('cred' = secret_schema.strava_oauth_token )
AS
$$
import _snowflake
import requests
import json
import datetime, calendar
from datetime import timedelta

session = requests.Session()
def get_activities_by_date_range(start, end):

    # convert the date range into the epoch seconds the Strava API expects
    start_time = datetime.datetime(start.year, start.month, start.day, 0, 0, 0)
    end_time = datetime.datetime(end.year, end.month, end.day, 0, 0, 0)
    start_epoch = calendar.timegm(start_time.timetuple())
    end_epoch = calendar.timegm(end_time.timetuple())

    # retrieve the OAuth access token from the secret referenced as 'cred'
    token = _snowflake.get_oauth_access_token('cred')
    url = "https://www.strava.com/api/v3/athlete/activities?after=" + str(start_epoch) + "&before=" + str(end_epoch)
    page = 1
    activities_json = True
    activities_obj = []

    # iterate through pages until no more activities are returned
    while activities_json:
        activities_json = session.get(url + f"&page={page}", headers={"Authorization": "Bearer " + token}).json()
        activities_obj.extend(activities_json)
        page += 1

    if len(activities_obj) == 0:
        return None

    return json.dumps(activities_obj)
$$;

If you would like to grant this object to a lower-level role, which you will most likely need to do since most of these objects require the ACCOUNTADMIN role to create anyway, you can follow the steps outlined in this section of the example walkthrough.
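For reference, a minimal version of those grants, using the same role placeholder as above, could look like this:

GRANT USAGE ON DATABASE {MY_DATABASE} TO ROLE {MY_ROLE};
GRANT USAGE ON SCHEMA data TO ROLE {MY_ROLE};
GRANT USAGE ON FUNCTION get_strava_activities_by_date_range(timestamp_ntz, timestamp_ntz) TO ROLE {MY_ROLE};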

We can start to preview the data by calling the function directly, but ultimately it is more performant to save the data in Snowflake so it can be queried on demand. To do this, I'll create a Snowflake Task that automates pulling the previous day's activities into a strava_activities_raw table, where I'll store the raw JSON response so I can slice and dice it by any dimension or metric later on.
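Here is a rough sketch of what that could look like; the warehouse placeholder, the 06:00 UTC schedule, and the load timestamp column are illustrative choices rather than requirements:

// Landing table for the raw JSON payloads
CREATE TABLE IF NOT EXISTS strava_activities_raw (
load_ts TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(), // when the batch was loaded
activities VARIANT // raw Strava JSON response
);

// Daily task that pulls the previous day's activities through the UDF
CREATE OR REPLACE TASK load_strava_activities_daily
WAREHOUSE = {MY_WAREHOUSE}
SCHEDULE = 'USING CRON 0 6 * * * UTC' // once per day at 06:00 UTC
AS
INSERT INTO strava_activities_raw (activities)
SELECT PARSE_JSON(get_strava_activities_by_date_range(
DATEADD(day, -1, CURRENT_DATE())::timestamp_ntz,
CURRENT_DATE()::timestamp_ntz)::varchar);

// Tasks are created suspended, so resume it to start the schedule
ALTER TASK load_strava_activities_daily RESUME;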

⭐️ Best Practice ⭐️

Landing the raw JSON response as-is is a very practical and common approach that many customers use to load data quickly and reduce the friction brought on by changing source schemas!

Finally, I want to start asking a few questions of the data to get a good idea of how I can use it. What have my activities looked like over time? What month has been the most active to date? These are very high-level questions, and as we start to bring in more sources, we can develop better questions or even use a model to make predictions.
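As a sketch of the kind of query this unlocks, counting activities and miles per month could look like the following. The table and column names match the landing table sketched above, and the field names come from the Strava activity schema:

SELECT DATE_TRUNC('month', a.value:start_date::timestamp_ntz) AS activity_month,
COUNT(*) AS activities,
ROUND(SUM(a.value:distance::float) / 1609.34, 1) AS total_miles // Strava reports distance in meters
FROM strava_activities_raw r,
LATERAL FLATTEN(input => r.activities) a
GROUP BY activity_month
ORDER BY activity_month;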

One of my long-term goals is to reduce my average mile time from 10:00 minutes/mile to around 9:30, and this could be a great use case for the FORECAST function. FORECAST is part of Snowflake's new Cortex ML functions and provides a built-in time-series model for generating predictions. Another good way to visualize this data is with Streamlit, Snowflake's built-in application framework.
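The forecasting step isn't built out yet, but a rough sketch could look like the following, built on a weekly pace view derived from the raw activities. The view name, column names, and 12-week horizon are illustrative choices:

// Average minutes per mile by week, for runs only
CREATE OR REPLACE VIEW pace_by_week AS
SELECT DATE_TRUNC('week', a.value:start_date::timestamp_ntz) AS week_start,
AVG((a.value:moving_time::float / 60) / (a.value:distance::float / 1609.34)) AS avg_min_per_mile
FROM strava_activities_raw r,
LATERAL FLATTEN(input => r.activities) a
WHERE a.value:type::string = 'Run'
AND a.value:distance::float > 0
GROUP BY week_start;

// Train a Cortex ML forecasting model on the weekly pace series
CREATE OR REPLACE SNOWFLAKE.ML.FORECAST pace_forecast(
INPUT_DATA => SYSTEM$REFERENCE('VIEW', 'pace_by_week'),
TIMESTAMP_COLNAME => 'WEEK_START',
TARGET_COLNAME => 'AVG_MIN_PER_MILE'
);

// Project the next 12 weeks of average pace
CALL pace_forecast!FORECAST(FORECASTING_PERIODS => 12);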

In a few short lines of code, we can see that activities are trending steadily upwards, which aligns with another of my goals: to simply always be doing more!

Next up! I will want to monitor the status of my tasks to make sure they complete successfully and, if not, notify me! Snowflake already provides a visualization of this, as shown here, but in the real world a pipeline failure would need to trigger an alert of some kind.

Snowflake Monitoring of Daily Data Loading Task

Popular options for notification channels include Slack, email, BI applications, and ticketing software. A sketch of one possible approach using a Snowflake alert is shown below.
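As one possible approach (a sketch, assuming an email notification integration named strava_email_int already exists, and reusing the warehouse placeholder and the load_strava_activities_daily task sketched earlier), a Snowflake alert could check task history every hour and send an email when the load task fails:

CREATE OR REPLACE ALERT strava_load_failure_alert
WAREHOUSE = {MY_WAREHOUSE}
SCHEDULE = '60 MINUTE'
IF (EXISTS (
SELECT 1
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
SCHEDULED_TIME_RANGE_START => DATEADD(hour, -1, CURRENT_TIMESTAMP()),
TASK_NAME => 'LOAD_STRAVA_ACTIVITIES_DAILY'))
WHERE STATE = 'FAILED'))
THEN CALL SYSTEM$SEND_EMAIL(
'strava_email_int', // assumed email notification integration
'me@example.com', // placeholder recipient
'Strava load task failed',
'The daily Strava activity load task reported a failure in the last hour.');

// Alerts are created suspended, so resume it to start checking
ALTER ALERT strava_load_failure_alert RESUME;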
