Ingestions in dbt: How to Load Data from REST APIs with Snowpark
We all know the status quo in the modern data stack: ingestion tools for ingestion and transformation tools for transformation. Use Fivetran, Airbyte or Meltano to get the data in, then run dbt models to transform it. But do we always need ingestion tools? Why should we subscribe to another SaaS platform or deploy an additional software component if we only deal with standard file ingestions and web APIs?
dbt can do the entire ELT process alone, and I will show you how.
Keep it Simple, Stupid
While we always want to use the best tool for the job, sometimes, in the data world, good enough, boring, and simple beats the perfect combination of 10+ components. If dbt-core and Snowflake can manage our entire data pipeline, why introduce additional complexity? With dbt alone, we can keep all of our ingestion and transformation code, job history, logs, and schedules in a single location — such as dbt Cloud — even if it takes a little bit — just a little — of preparation and coding.
Ingesting Github data to Snowflake with dbt
Let’s consider an easy use case: I want to visualize the dbt-core project’s traction by looking at the number of commits over its lifetime, using only dbt and Snowflake. Since dbt — and especially dbt Cloud — by design doesn’t let you run arbitrary code on its platform, the only option is to invoke HTTP calls at the backend database level. In one of my previous posts, I shared my StarSnow project, which calls external HTTP endpoints from SQL as external functions — precisely what we need here.
In our case, we need to interface with the commits endpoint of the GitHub API. The only input parameter we have to pass is the repository name — dbt-core in our example. The GitHub API then returns the commit information in JSON format, one semi-structured entry per commit.
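For reference, each entry in the returned array looks roughly like this (heavily trimmed; the field names follow the public GitHub API):

```json
{
  "sha": "…",
  "commit": {
    "author": { "name": "…", "date": "2021-09-01T10:00:00Z" },
    "message": "…"
  }
}
```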
The process flow is illustrated in the diagram below. The steps we have to execute:
- Call StarSnow external function from our dbt model with the GitHub API specific request object (URL, HTTP method, headers, etc.)
- Snowflake will retrieve the response object for our StarSnow request
- dbt wraps the results in a CREATE TABLE statement and materializes the result
It is worth mentioning that for any flat-file ingestion, we can simply use Snowpipe — it’s well integrated with dbt. And honestly, with web services and flat-file ingestions, we can cover many use cases.
Now let’s make things happen. First, install StarSnow, then create a model that invokes the commits endpoint, as shown here:
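A minimal sketch of such a model — assuming StarSnow is deployed as an external function named STARSNOW_REQUEST that takes an HTTP method and a request object, and using the dbt-labs/dbt-core repository path; adjust both to your own setup:

```sql
-- models/stg_commits.sql
-- One row, one VARIANT column holding the raw GitHub response.
{{ config(materialized='table') }}

select starsnow_request(
    'GET',
    object_construct(
        'url', 'https://api.github.com/repos/dbt-labs/dbt-core/commits'
    )
) as response
```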
In the stg_commits.sql model, I issue a SELECT statement that reads from the StarSnow external function. dbt turns this into a CREATE TABLE statement and stores the results in the stg_commits table. The table will contain only one row and one column, of the VARIANT type. This field stores the data returned by the GitHub API in JSON format.
Now we can build our model that flattens this JSON object, moving each commit to a separate row. Since Snowflake is great at flattening and pivoting semi-structured data, we can combine lateral flatten and the colon operator to format the response object into our desired table shape: one row per commit, with the necessary fields pivoted to individual columns.
The results look great, well, except for the number of rows. It seems we have only 30 rows. What went wrong?
Snowpark enters the game.
Well, life is not just single GET requests. We forgot to implement pagination, which is needed whenever we want more than a few dozen records from our source systems. Implementing pagination in pure SQL is painful, so wouldn’t it be great to use Python code to implement these control structures?
dbt Python models
The good news is that, starting from version 1.3, we can use the Python programming language to build models in dbt. Our model will have a .py file extension instead of .sql, and will be deployed and executed as a Snowpark function. In Snowflake’s case, a dbt Python model must implement a single model function that takes dbt and database session objects as parameters and returns a data frame as the execution result.
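The contract boils down to a skeleton like this (a sketch; the actual transformation goes where the comment is):

```python
def model(dbt, session):
    # Configure the model, just like {{ config(...) }} in a SQL model.
    dbt.config(materialized="table")

    # `session` is a Snowpark Session; build and return a DataFrame.
    # dbt.ref() resolves upstream models, e.g. stg_commits.
    df = dbt.ref("stg_commits")
    return df
```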
All we have to do is rewrite stg_commits.sql from SQL to Python and add the pagination logic. Luckily, Snowpark code can also easily invoke external functions and other UDFs using the call_udf function, allowing us to use the StarSnow HTTP library natively.
All-in-all, the following model will do what we need:
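Here is a sketch of such a model. The STARSNOW_REQUEST signature, the repository path, and the response field layout are assumptions you may need to adjust for your StarSnow deployment:

```python
import json

PAGE_SIZE = 100             # GitHub's maximum per_page value
MAX_PAGES = 50              # hard safety limit on iterations
REPO = "dbt-labs/dbt-core"  # assumed repository path


def model(dbt, session):
    dbt.config(materialized="table", packages=["snowflake-snowpark-python"])

    # Imported inside the function so the file parses without Snowpark installed.
    import snowflake.snowpark.functions as F

    ret = None
    for page in range(1, MAX_PAGES + 1):
        request = {
            "url": f"https://api.github.com/repos/{REPO}/commits",
            "params": {"per_page": PAGE_SIZE, "page": page},
        }
        # One-row frame holding the request; STARSNOW_REQUEST is the
        # external function deployed by StarSnow (assumed signature).
        df = session.create_dataframe(
            [[json.dumps(request)]], schema=["req"]
        ).select(
            F.call_udf(
                "STARSNOW_REQUEST", F.lit("GET"), F.parse_json(F.col("req"))
            ).alias("response")
        )
        ret = df if ret is None else ret.union_all(df)

        # Stop when the response no longer advertises a next page.
        # Assumed StarSnow response shape: {"headers": {...}, "data": [...]}.
        body = json.loads(df.collect()[0]["RESPONSE"])
        if "next" not in body.get("headers", {}).get("link", ""):
            break

    return ret
```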
The code above does just what you expect:
- Define a few constants (dbt v1.3 does not allow jinja templates for python models yet)
- Start an iteration to capture pages one by one (a for loop)
- Define a data frame with the URL to invoke and the request parameters
- Execute STARSNOW_REQUEST with the parameters defined in the data frame
- Store the results in the ret variable, using union_all to concatenate pages
- Break out of the iteration if no next page is mentioned in the response
Now let’s try to execute it end-to-end:
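With the models in place, the run itself is a plain dbt invocation (model names as in this example):

```shell
# Build the staging model and the flattened commits model
dbt run --select stg_commits commits
```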
It works like a charm. Finally, I can see the commit trends of dbt-core in my Snowsight UI:
If you want to try this demo, all the code is pushed to the https://github.com/tfoldi/dbt-github-api-ingestion repository. Don’t forget to deploy StarSnow to your AWS account before you run the models. It should work out of the box on Amazon, but it hasn’t been tested on GCP or Azure — as always, surprises could happen.
Next Steps
Well, the possibilities are endless. For instance, with a little effort, we could port Singer taps to Snowpark, allowing dbt to interface seamlessly with hundreds of source systems. Imagine the possibilities.
Also, while this post heavily relies on StarSnow and Snowflake, there is no reason why you should not be able to do the same on Amazon Redshift with external functions or Google BigQuery with remote functions.
Solutions like this allow the community to easily transform dbt into a full-fledged ELT solution. I’m sure it will happen.