Ingestions in dbt: How to Load Data from REST APIs with Snowpark

Tamas Foldi
HCLTech-Starschema Blog
Oct 17, 2022


We all know the status quo in the modern data stack: ingestion tools for ingestion and transformation tools for transformation. Use Fivetran, Airbyte or Meltano to get the data in, then run dbt models to transform it. But do we always need ingestion tools? Why should we subscribe to another SaaS platform or deploy an additional software component if we only deal with standard file ingestions and web APIs?

dbt can do the entire ELT process alone, and I will show you how.

Snowboarder in a Snowpark sliding on a dbt ramp
Image generated by Dall-E AI

Keep it Simple, Stupid

While we always want to use the best tool for the job, sometimes, in the data world, good enough, boring, and simple beats the perfect combination of 10+ components. If dbt-core and Snowflake can manage our entire data pipeline, why introduce additional complexity? With dbt alone, we can keep all of our ingestion and transformation code, job history, logs, and schedules in a single location, such as dbt Cloud, even if we have to do a little (just a little) preparation and coding.

Ingesting Github data to Snowflake with dbt

Let’s consider an easy use case: I want to visualize the dbt-core project’s traction by looking at the number of commits over its lifetime, using dbt and Snowflake only. Since dbt, and especially dbt Cloud, by design doesn’t allow you to run arbitrary code on its platform, the only option is to invoke HTTP calls at the backend database level. In one of my previous posts, I shared my StarSnow project, which calls external HTTP endpoints from SQL as external functions: precisely what we need here.

In our case, we need to interface with the commits endpoint of the GitHub API. The only input parameter we have to pass is the repository's name — dbt-core in our example. The GitHub API will then return the commit information in JSON format, one semi-structured entry per commit.
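For reference, here is what the endpoint and its query parameters look like; the owner/repo values are the ones this post targets, and the pagination parameters are GitHub's documented defaults:

```python
# The GitHub REST API commits endpoint we want StarSnow to call.
owner, repo = "dbt-labs", "dbt-core"
url = f"https://api.github.com/repos/{owner}/{repo}/commits"
# GitHub returns 30 commits per page by default; per_page can raise that to 100
params = {"per_page": 100, "page": 1}
```

Note the default page size of 30: it will matter later.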

The process flow is illustrated in the diagram below. The steps we have to execute:

  1. Call the StarSnow external function from our dbt model with the GitHub API-specific request object (URL, HTTP method, headers, etc.)
  2. Snowflake retrieves the response object for our StarSnow request
  3. dbt wraps the results in a CREATE TABLE statement and materializes the result

End-to-end architecture for flat file and web API ingestions with dbt only, no ingestion tools required.

It is worth mentioning that for any flat file ingestions, we can simply use Snowpipe, which is well integrated with dbt. And honestly, with web services and flat file ingestions, we can cover many use cases.
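For completeness, a Snowpipe is essentially a COPY statement wrapped in a pipe object; here is a minimal sketch with made-up stage, table, and pipe names:

```sql
-- Hypothetical sketch of a Snowpipe for flat file ingestion;
-- the stage, table, and pipe names are made up
create pipe raw.github_files_pipe auto_ingest = true as
    copy into raw.github_files
    from @raw.github_stage
    file_format = (type = 'JSON');
```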

Now let’s make things happen. First, install StarSnow, then create a model that invokes the commit endpoint, as shown here:

In my dbt project I made separate folders under models for ingestions and transformations

In the stg_commits.sql model, I issue a select statement that reads from the StarSnow external function. dbt will turn this into a CREATE TABLE statement and store the results in the stg_commits table. The table will contain only one row and one column, with a VARIANT type. This field stores the data returned by the GitHub API in JSON format.
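A minimal sketch of what stg_commits.sql might look like; the STARSNOW_REQUEST argument shape (method, URL, request parameters) is an assumption, so verify it against your StarSnow deployment:

```sql
-- Hypothetical sketch of stg_commits.sql; the STARSNOW_REQUEST
-- signature (method, url, params) is assumed
select
    starsnow_request(
        'GET',
        'https://api.github.com/repos/dbt-labs/dbt-core/commits',
        null
    ) as response
```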

We have a single row with a single column containing all commits. This is the response from the API call, as-is.

Now we can build our model that flattens this JSON object, moving each commit to a separate row. Since Snowflake is great at flattening and pivoting semi-structured data, we can combine lateral flatten and the colon operator to reshape the response into our desired table format: one row per commit, with the necessary fields pivoted to individual columns.
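A hedged sketch of such a flattening model; the response:data path and the field names are assumptions about how StarSnow wraps the JSON body, so adjust them to the actual stored shape:

```sql
-- Hypothetical sketch: one row per commit; response:data and the
-- commit field paths are assumed
select
    c.value:sha::string                         as commit_sha,
    c.value:commit:author:name::string          as author_name,
    c.value:commit:author:date::timestamp_ntz   as authored_at,
    c.value:commit:message::string              as message
from {{ ref('stg_commits') }},
     lateral flatten(input => response:data) as c
```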

Our commit model will split and flatten the API response.
The model results: we just ingested and transformed our first Web API with dbt cloud only

The results look great, well, except for the number of rows. It seems we have only 30 rows. What went wrong?

Snowpark enters the game.

Well, life is not just single GET requests. We forgot to implement pagination, which is needed if we want more than a few dozen records from our source systems. Implementing pagination in pure SQL is painful, so wouldn’t it be great to use Python code for these control structures?

dbt Python models

The good news is that starting from version 1.3, we can use the Python programming language to build models in dbt. Our model will have a .py file extension instead of .sql, and will be deployed and executed as a Snowpark function. In Snowflake’s case, a dbt Python model must implement a single model function that takes dbt and database session objects as parameters and returns a data frame as the execution result.

Example dbt Python model returning a single-row table
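The model contract itself is simple. A minimal sketch might look like this; the model(dbt, session) signature is dbt's documented contract, while the example data is made up:

```python
# Minimal sketch of a dbt Python model; on Snowflake, `session` is a
# Snowpark Session and create_dataframe builds a one-row data frame.
def model(dbt, session):
    dbt.config(materialized="table")  # Python models must materialize a table
    return session.create_dataframe([[1, "hello world"]], schema=["ID", "MSG"])
```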

All we have to do is rewrite stg_commits.sql from SQL to Python code and add pagination logic. Luckily, Snowpark code can also easily invoke external functions and other UDFs via the call_udf function, allowing us to use the StarSnow HTTP library natively.

All-in-all, the following model will do what we need:

Our stg_paginated_commits.py dbt Python model implements pagination using StarSnow and Snowpark

The code above does just what you expect:

  1. Define a few constants (dbt v1.3 does not allow Jinja templates in Python models yet)
  2. Start an iteration to capture pages one by one (a for loop)
  3. Define a data frame with the URL to invoke and the request parameters
  4. Execute STARSNOW_REQUEST with the parameters defined in the data frame
  5. Store the results in the ret variable, using union_all to concatenate the pages
  6. Break out of the iteration when no next page is mentioned in the response
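Stripped of the Snowpark specifics, the control flow of steps 2-6 boils down to the loop below; fetch_page is a hypothetical stand-in for one STARSNOW_REQUEST call:

```python
# Hypothetical sketch of the pagination loop; fetch_page(page) stands in
# for one STARSNOW_REQUEST call and returns (rows, has_next_page).
MAX_PAGES = 100  # safety cap so a misbehaving API cannot loop forever

def fetch_all(fetch_page):
    ret = []
    for page in range(1, MAX_PAGES + 1):
        rows, has_next = fetch_page(page)  # one HTTP request per page
        ret.extend(rows)                   # the union_all equivalent
        if not has_next:                   # no "next" link in the response
            break
    return ret
```

With a stubbed two-page source, fetch_all(lambda p: {1: (["a", "b"], True), 2: (["c"], False)}[p]) returns ["a", "b", "c"].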

Now let’s try to execute it end-to-end:

My first Snowpark web API model on dbt cloud

It works like a charm. Finally, I can see the commit trends of dbt-core in my Snowsight UI:

Something happened during the Summer of 2018

If you want to try this demo, all the code is pushed to the https://github.com/tfoldi/dbt-github-api-ingestion repository. Don’t forget to deploy StarSnow to your AWS account before you run the models. It should work out of the box on AWS, but it hasn’t been tested on GCP or Azure, so, as always, surprises could happen.

Next Steps

Well, the possibilities are endless. For instance, with a little effort, we could port Singer taps to Snowpark, allowing dbt to interface seamlessly with hundreds of source systems. Imagine the possibilities.

Also, while this post relies heavily on StarSnow and Snowflake, there is no reason why you should not be able to do the same on Amazon Redshift with external functions or Google BigQuery with remote functions.

Solutions like this allow the community to easily transform dbt into a full-fledged ELT solution. I’m sure it will happen.



Helping enterprises to become more data driven @ HCLTech, co-founder & former CEO @ Starschema