How to ETL from web APIs into an open data lakehouse with Python, Iceberg, and Snowflake

Storing raw data from APIs is a common data lake use case: write JSON files into object storage now, load them into tables later. But landing JSON or Parquet files only to rewrite them into Iceberg, Delta Lake, or Hudi creates duplicative storage and processing costs. Wouldn’t it be nice to write API payloads directly into tables and forgo those extra costs? In this blog post, you’ll see an example of how Snowflake can serve as the ETL engine and metastore in an open data lakehouse architecture built on Apache Iceberg.

Overview of architecture

Here’s an illustration of a typical architecture for loading data from APIs into a data lake. However, data lakes lack ACID transactions and efficient pruning, both of which are critical for reliable analytics and good performance. To get ACID transactions and better performance, files are typically copied from the data lake into a data warehouse, which creates duplicative storage costs and additional compute costs.

To optimize costs and keep data operations reliable, the ideal solution is to write directly from the source(s) into tables. My next blog post shows how you can write from APIs directly into Snowflake’s native table format, which has some additional benefits, but use cases that prioritize compute interoperability are better suited to Iceberg tables.

Example dataset

Since we’re talking about Iceberg and web APIs, for this demo we’ll use Python to get data from the GitHub API. Specifically, we’ll pull GitHub activity for the Delta Lake, Apache Iceberg, and Apache Hudi projects. For demonstration purposes, we’ll use this data to analyze and compare activity across the three projects.

Let’s dig in

You’ll need a few things in order to follow along:

  1. Snowflake account with access to the ACCOUNTADMIN role. A free trial on Standard Edition will suffice. See #2 below when selecting your cloud provider and region.
  2. Cloud storage account in AWS, GCP, or Azure in the same region as your Snowflake account. You’ll need the ability to create and edit IAM roles/policies.
  3. A GitHub account.

Create GitHub personal access token

This demo will make a few thousand requests to the GitHub API, which exceeds the rate limit of 60 requests per hour for unauthenticated users. If you don’t already have one, create a GitHub personal access token with the following scopes:

  • public_repo
  • read:org
  • read:user
  • read:project

Create Snowflake objects

Create a database, schemas, a warehouse, and an external volume for this demo. Because there isn’t a native Iceberg catalog service in Azure, I’ll show how Snowflake can be used as the Iceberg catalog for Azure users. I’m using Azure Blob Storage, but if you’re using Snowflake on AWS or GCP, external volume setup instructions for S3 and GCS are documented here.

CREATE DATABASE IF NOT EXISTS github;
USE DATABASE github;
CREATE SCHEMA IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS curated;
CREATE SCHEMA IF NOT EXISTS analytics;
USE SCHEMA public;
CREATE WAREHOUSE IF NOT EXISTS github
STATEMENT_TIMEOUT_IN_SECONDS = 3600
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 3600;

CREATE OR REPLACE EXTERNAL VOLUME iceberg_vol
STORAGE_LOCATIONS =
(
(
NAME = '<your name>'
STORAGE_PROVIDER = 'AZURE'
STORAGE_BASE_URL = 'azure://<storage-account>.blob.core.windows.net/<container>/'
AZURE_TENANT_ID = '<your azure tenant ID>'
)
);

Regardless of cloud or region, there are some additional steps needed to finish setting up the external volume in your storage account and Snowflake. Full instructions are here.
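
Once those steps are complete, it’s worth confirming that Snowflake can actually access the storage location before creating any Iceberg tables. Here’s a minimal check, assuming the external volume is named iceberg_vol as above:

DESC EXTERNAL VOLUME iceberg_vol;

-- Runs read, write, and delete checks against the volume's active storage location
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_vol');

If the verification reports a failure, revisit the IAM role/policy (or Azure role assignment) granted to Snowflake for your bucket or container.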

To allow Snowflake to call the GitHub API endpoints directly, create a network rule, secret, and external access integration.

CREATE OR REPLACE NETWORK RULE github.public.gh_nw_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('api.github.com');

-- Store the GitHub personal access token for API calls
CREATE OR REPLACE SECRET github.public.gh_token
TYPE = GENERIC_STRING
SECRET_STRING = '<your github token>';

-- Create integration using previous objects
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION gh_integration
ALLOWED_NETWORK_RULES = (github.public.gh_nw_rule)
ALLOWED_AUTHENTICATION_SECRETS = (github.public.gh_token)
ENABLED = true;
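
Before building the larger UDTFs, you can optionally smoke-test the network rule, secret, and integration together with a small scalar UDF of my own (the gh_rate_limit function below is not part of the demo pipeline). It calls GitHub’s rate-limit endpoint and returns the remaining request quota for the authenticated token, which is also a handy way to confirm you won’t hit the rate limits mentioned earlier.

CREATE OR REPLACE FUNCTION github.public.gh_rate_limit()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'get_rate_limit'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests

def get_rate_limit():
    # Authenticate with the stored secret and ask GitHub for the current rate-limit status
    token = _snowflake.get_generic_secret_string('cred')
    response = requests.get('https://api.github.com/rate_limit',
                            headers={"Authorization": 'token ' + token})
    core = response.json()['resources']['core']
    return str(core['remaining']) + ' of ' + str(core['limit']) + ' core requests remaining'
$$;

SELECT github.public.gh_rate_limit();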

Create Python functions

Create a Python table function (UDTF) that gets the latest list of repositories (“repos”), and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_projects()
RETURNS TABLE (repo STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_projects'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
AS
$$
import _snowflake
import requests
import simplejson as json

class load_projects:
    def process(self):
        # Get every repo in the delta-io GitHub org
        response = requests.get('https://api.github.com/orgs/delta-io/repos')
        delta_json_obj = json.loads(response.text)
        delta_proj_repos = []
        for i in delta_json_obj:
            delta_proj_repos.append(i['full_name'])

        # Add the Iceberg and Hudi repos, then emit one row per repo
        all_proj_repos = ['apache/iceberg','apache/hudi']
        all_proj_repos.extend(delta_proj_repos)

        for repo in all_proj_repos:
            yield (repo,)
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.repos (
repo STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/repos/'
AS
SELECT *
FROM TABLE(github.public.load_projects());

You will now see data and metadata files in your storage account.

Iceberg table’s metadata files in Azure blob storage
Iceberg table’s Parquet data files in Azure blob storage
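
You can also confirm the load from the Snowflake side. As a quick check (using the table and volume names defined above), count the rows and ask Snowflake for the location of the table’s latest Iceberg metadata file, which other engines can use to read the table:

SELECT COUNT(*) AS num_repos FROM github.raw.repos;

-- Returns a JSON object including the latest metadata file location for this Iceberg table
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('github.raw.repos');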

Create a Python UDTF that loops through each repo and gets all of its contributors, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_project_contributors(repo STRING)
RETURNS TABLE (login STRING, url STRING, contributions INTEGER)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_project_contributors'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests
import simplejson as json

class load_project_contributors:
    def process(self, repo):
        # First page of contributors for this repo (100 per page)
        url = 'https://api.github.com/repos/' + str(repo) + '/contributors?per_page=100&page=1'
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        contributors = response.json()
        contributors_list = []

        for contributor in contributors:
            contributors_list.append({
                "login": contributor["login"],
                "url": contributor["url"],
                "contributions": contributor["contributions"]
            })

        # Follow pagination links until there is no "next" page
        while 'next' in response.links.keys():
            response = requests.get(response.links['next']['url'], headers={"Authorization": 'token ' + mytoken})
            contributors = response.json()

            for contributor in contributors:
                contributors_list.append({
                    "login": contributor["login"],
                    "url": contributor["url"],
                    "contributions": contributor["contributions"]
                })

        # Emit one row per contributor
        for contributor in contributors_list:
            yield (contributor['login'], contributor['url'], contributor['contributions'])
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.all_project_contributors (
repo STRING,
login STRING,
url STRING,
contributions INTEGER
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/all_project_contributors/'
AS
SELECT
repo,
c.*
FROM github.raw.repos,
TABLE(github.public.load_project_contributors(repo)) c;
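
At this point, a quick sanity check (optional) shows how many contributors were captured per repo:

SELECT
repo,
COUNT(*) AS num_contributors
FROM github.raw.all_project_contributors
GROUP BY 1
ORDER BY 2 DESC;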

Create a Python UDTF that loops through each contributor and gets additional profile metadata, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_profile(url STRING)
RETURNS TABLE (login STRING, company STRING, email STRING, created_at STRING, updated_at STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_profile'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests

class load_profile:
    def process(self, url):
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        profile = response.json()

        login = profile['login']
        company = profile['company']
        email = profile['email']
        created_at = profile['created_at']
        updated_at = profile['updated_at']

        yield (login, company, email, created_at, updated_at)
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.unique_profiles (
login STRING,
company STRING,
email STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/unique_profiles/'
AS
SELECT
p.login,
p.company,
p.email,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP
FROM (SELECT DISTINCT url
FROM github.raw.all_project_contributors),
TABLE(github.public.load_profile(url)) p;

Create a Python UDTF that loops through each repo and gets all of its pull requests, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_project_pulls(repo STRING)
RETURNS TABLE (id INTEGER, number INTEGER, created_at STRING, updated_at STRING, merged_at STRING, state STRING, login STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_project_pulls'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests
import simplejson as json

class load_project_pulls:
    def process(self, repo):
        # First page of pull requests for this repo (all states, 100 per page)
        url = 'https://api.github.com/repos/' + str(repo) + '/pulls?state=all&per_page=100&page=1'
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        pulls = response.json()
        pulls_list = []

        for pull in pulls:
            pulls_list.append({
                "id": pull["id"],
                "number": pull["number"],
                "state": pull["state"],
                "created_at": pull["created_at"],
                "updated_at": pull["updated_at"],
                "merged_at": pull["merged_at"],
                "login": pull["user"]["login"]
            })

        # Follow pagination links until there is no "next" page
        while 'next' in response.links.keys():
            response = requests.get(response.links['next']['url'], headers={"Authorization": 'token ' + mytoken})
            pulls = response.json()

            for pull in pulls:
                pulls_list.append({
                    "id": pull["id"],
                    "number": pull["number"],
                    "state": pull["state"],
                    "created_at": pull["created_at"],
                    "updated_at": pull["updated_at"],
                    "merged_at": pull["merged_at"],
                    "login": pull["user"]["login"]
                })

        # Emit one row per pull request
        for pull in pulls_list:
            yield (pull['id'], pull['number'], pull['created_at'], pull['updated_at'], pull['merged_at'], pull['state'], pull['login'])
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.all_project_pulls (
repo STRING,
id INTEGER,
number INTEGER,
created_at TIMESTAMP,
updated_at TIMESTAMP,
merged_at TIMESTAMP,
state STRING,
login STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/all_project_pulls/'
AS
SELECT
r.repo,
p.id,
p.number,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP,
p.merged_at::TIMESTAMP,
state,
login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_pulls(repo)) p;

Create a Python UDTF that loops through each repo and gets all of its commits, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_project_commits(repo STRING)
RETURNS TABLE (id STRING, committed_at STRING, author_login STRING, committer_login STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_project_commits'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests
import simplejson as json

class load_project_commits:
    def process(self, repo):
        # First page of commits for this repo (100 per page)
        url = 'https://api.github.com/repos/' + str(repo) + '/commits?per_page=100&page=1'
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        commits = response.json()
        commits_list = []

        for commit in commits:
            # The author/committer objects can be null when a commit isn't linked to a GitHub account
            author_login = commit["author"]["login"] if commit["author"] else None
            committer_login = commit["committer"]["login"] if commit["committer"] else None

            commits_list.append({
                "id": commit["sha"],
                "author_login": author_login,
                "committer_login": committer_login,
                "committed_at": commit["commit"]["author"]["date"]
            })

        # Follow pagination links until there is no "next" page
        while 'next' in response.links.keys():
            response = requests.get(response.links['next']['url'], headers={"Authorization": 'token ' + mytoken})
            commits = response.json()

            for commit in commits:
                author_login = commit["author"]["login"] if commit["author"] else None
                committer_login = commit["committer"]["login"] if commit["committer"] else None

                commits_list.append({
                    "id": commit["sha"],
                    "author_login": author_login,
                    "committer_login": committer_login,
                    "committed_at": commit["commit"]["author"]["date"]
                })

        # Emit one row per commit
        for commit in commits_list:
            yield (commit['id'], commit['committed_at'], commit['author_login'], commit['committer_login'])
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.all_project_commits (
repo STRING,
id STRING,
committed_at TIMESTAMP,
author_login STRING,
committer_login STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/all_project_commits/'
AS
SELECT
r.repo,
c.id,
c.committed_at::TIMESTAMP,
c.author_login,
c.committer_login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_commits(repo)) c;

Schedule and orchestrate Python functions with Tasks

You can use Tasks to schedule and orchestrate a pipeline that calls the functions created above to fetch the latest data. The schedule specified in the task definitions below is the first of every month, but you can change it to whatever CRON schedule you’d like. The downstream tasks are set up to run only after their upstream task completes.

CREATE OR REPLACE TASK github.public.repos
WAREHOUSE = github
SCHEDULE = 'USING CRON 0 0 1 * * America/Los_Angeles'
AS
INSERT OVERWRITE INTO github.raw.repos
SELECT *
FROM TABLE(github.public.load_projects());

-- Task to load contributors data when the github.public.repos task completes getting updated list of repos
CREATE OR REPLACE TASK github.public.all_project_contributors
WAREHOUSE = github
AFTER github.public.repos
AS
INSERT OVERWRITE INTO github.raw.all_project_contributors
SELECT
repo,
c.*
FROM github.raw.repos,
TABLE(github.public.load_project_contributors(repo)) c;

-- Task to load profile data when the github.public.all_project_contributors task completes getting the updated list of contributors
CREATE OR REPLACE TASK github.public.unique_profiles
WAREHOUSE = github
AFTER github.public.all_project_contributors
AS
INSERT OVERWRITE INTO github.raw.unique_profiles
SELECT
p.login,
p.company,
p.email,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP
FROM (SELECT DISTINCT url
FROM github.raw.all_project_contributors),
TABLE(github.public.load_profile(url)) p;

-- Task to load pull request data when the github.public.repos task completes getting updated list of repos
CREATE OR REPLACE TASK github.public.all_project_pulls
WAREHOUSE = github
AFTER github.public.repos
AS
INSERT OVERWRITE INTO github.raw.all_project_pulls
SELECT
r.repo,
p.id,
p.number,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP,
p.merged_at::TIMESTAMP,
state,
login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_pulls(repo)) p;

-- Task to load commit data when the github.public.repos task completes getting updated list of repos
CREATE OR REPLACE TASK github.public.all_project_commits
WAREHOUSE = github
AFTER github.public.repos
AS
INSERT OVERWRITE INTO github.raw.all_project_commits
SELECT
r.repo,
c.id,
c.committed_at::TIMESTAMP,
c.author_login,
c.committer_login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_commits(repo)) c;

-- By default, new tasks are in a suspended state. Activate or "resume" the full DAG as follows
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('github.public.repos');

After creating these tasks, you can see them in a directed acyclic graph (“DAG”) along with a run history.

Task DAG and run history
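
If you don’t want to wait for the next scheduled run, you can trigger the root task manually and then query the run history. Here’s a minimal sketch using the task names from above and the TASK_HISTORY table function:

EXECUTE TASK github.public.repos;

-- Recent task runs in this database, newest first
SELECT
name,
state,
scheduled_time,
completed_time,
error_message
FROM TABLE(github.information_schema.task_history())
WHERE database_name = 'GITHUB'
ORDER BY scheduled_time DESC;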

Create curated views

Now that the tables are loaded with raw data, you can do additional cleaning and processing to produce more usable views.

-- Add project grouping of repos to contributors
CREATE OR REPLACE VIEW github.curated.all_project_contributors AS
SELECT
CASE
WHEN repo = 'apache/iceberg' THEN 'iceberg'
WHEN repo = 'apache/hudi' THEN 'hudi'
ELSE 'delta'
END AS project,
*
FROM github.raw.all_project_contributors;

-- Parsing out clean company string from profiles table
CREATE OR REPLACE VIEW github.curated.unique_profiles AS
WITH base AS (
SELECT
login,
RTRIM(REPLACE(LOWER(company), '@', '')) AS company,
email,
SPLIT_PART(email, '@', 2) AS email_domain,
created_at,
updated_at
FROM github.raw.unique_profiles
)
SELECT
login,
email,
email_domain,
created_at,
updated_at,
CASE
WHEN CONTAINS(base.company,'databricks') = TRUE THEN 'databricks'
WHEN CONTAINS(base.company,'tabular') = TRUE THEN 'tabular'
WHEN CONTAINS(base.company,'netflix') = TRUE THEN 'netflix'
WHEN CONTAINS(base.company,'adobe') = TRUE THEN 'adobe'
WHEN CONTAINS(base.company,'alibaba') = TRUE THEN 'alibaba'
WHEN CONTAINS(base.company,'amazon') = TRUE THEN 'amazon'
WHEN CONTAINS(base.company,'aws') = TRUE THEN 'amazon'
WHEN CONTAINS(base.company,'apple') = TRUE THEN 'apple'
WHEN CONTAINS(base.company,'bytedance') = TRUE THEN 'bytedance'
WHEN CONTAINS(base.company,'cisco') = TRUE THEN 'cisco'
WHEN CONTAINS(base.company,'cloudera') = TRUE THEN 'cloudera'
WHEN CONTAINS(base.company,'cockroach') = TRUE THEN 'cockroach'
WHEN CONTAINS(base.company,'dremio') = TRUE THEN 'dremio'
WHEN CONTAINS(base.company,'ebay') = TRUE THEN 'ebay'
WHEN CONTAINS(base.company,'eventual-inc') = TRUE THEN 'eventual-inc'
WHEN CONTAINS(base.company,'freelance') = TRUE THEN 'freelance'
WHEN CONTAINS(base.company,'huawei') = TRUE THEN 'huawei'
WHEN CONTAINS(base.company,'ibm') = TRUE THEN 'ibm'
WHEN CONTAINS(base.company,'linkedin') = TRUE THEN 'linkedin'
WHEN CONTAINS(base.company,'microsoft') = TRUE THEN 'microsoft'
WHEN CONTAINS(base.company,'netease') = TRUE THEN 'netease'
WHEN CONTAINS(base.company,'salesforce') = TRUE THEN 'salesforce'
WHEN CONTAINS(base.company,'selectdb') = TRUE THEN 'selectdb'
WHEN CONTAINS(base.company,'snowflake') = TRUE THEN 'snowflake'
WHEN CONTAINS(base.company,'starburst') = TRUE THEN 'starburst'
WHEN CONTAINS(base.company,'tencent') = TRUE THEN 'tencent'
WHEN CONTAINS(base.company,'uber') = TRUE THEN 'uber'
WHEN CONTAINS(base.company,'apache') = TRUE THEN 'apache'
ELSE base.company
END AS gh_company,
CASE
WHEN CONTAINS(base.email_domain,'amazon') = TRUE THEN 'amazon'
WHEN CONTAINS(base.email_domain,'apache') = TRUE THEN 'apache'
WHEN CONTAINS(base.email_domain,'apple') = TRUE THEN 'apple'
WHEN CONTAINS(base.email_domain,'bytedance') = TRUE THEN 'bytedance'
WHEN CONTAINS(base.email_domain,'cloudera') = TRUE THEN 'cloudera'
WHEN CONTAINS(base.email_domain,'databricks') = TRUE THEN 'databricks'
WHEN CONTAINS(base.email_domain,'ebay') = TRUE THEN 'ebay'
WHEN CONTAINS(base.email_domain,'intel') = TRUE THEN 'intel'
WHEN CONTAINS(base.email_domain,'linkedin') = TRUE THEN 'linkedin'
WHEN CONTAINS(base.email_domain,'microsoft') = TRUE THEN 'microsoft'
WHEN CONTAINS(base.email_domain,'robinhood') = TRUE THEN 'robinhood'
WHEN CONTAINS(base.email_domain,'tabular') = TRUE THEN 'tabular'
WHEN CONTAINS(base.email_domain,'tencent') = TRUE THEN 'tencent'
WHEN CONTAINS(base.email_domain,'uber') = TRUE THEN 'uber'
END AS email_company,
COALESCE(gh_company, email_company) AS company
FROM base;
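
Before moving on, it can help to spot-check how well the company parsing worked, for example by looking at the most common company values (including how many profiles remain NULL):

SELECT
company,
COUNT(*) AS num_profiles
FROM github.curated.unique_profiles
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;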

-- Add project-level grouping of repos from pull requests table
CREATE OR REPLACE VIEW github.curated.all_project_pulls AS
SELECT
CASE
WHEN repo = 'apache/iceberg' THEN 'iceberg'
WHEN repo = 'apache/hudi' THEN 'hudi'
ELSE 'delta'
END AS project,
*
FROM github.raw.all_project_pulls;

-- Add project-level grouping of commits, coalesce logins
CREATE OR REPLACE VIEW github.curated.all_project_commits AS
SELECT
CASE
WHEN repo = 'apache/iceberg' THEN 'iceberg'
WHEN repo = 'apache/hudi' THEN 'hudi'
ELSE 'delta'
END AS project,
*,
COALESCE(author_login, committer_login) AS login
FROM github.raw.all_project_commits;

Create analytical views and charts

With these additional fields and cleanup in place, we can more easily aggregate the data into views for analytics.

-- Total number of pull requests by project
CREATE OR REPLACE VIEW github.analytics.pulls_by_project AS
SELECT
project,
COUNT(*) AS num_pulls
FROM github.curated.all_project_pulls
GROUP BY 1
ORDER BY 2 DESC;

-- Running total number of pull requests by project per day
CREATE OR REPLACE VIEW github.analytics.pulls_by_project_day AS
WITH base AS (
SELECT
project,
DATE_TRUNC('DAY', created_at) AS created_at,
COUNT(*) AS num_pulls
FROM github.curated.all_project_pulls
GROUP BY 1, 2
ORDER BY 1, 2
)
SELECT
project,
created_at,
num_pulls,
SUM(num_pulls) OVER (PARTITION BY project ORDER BY created_at) AS running_sum_pulls
FROM base
GROUP BY 1, 2, 3
ORDER BY 1, 2;

-- Unique committers by company per project
CREATE OR REPLACE VIEW github.analytics.project_company_committers AS
WITH project_company_committers AS (
SELECT
c.project,
p.company,
COUNT(DISTINCT c.login) AS company_committers
FROM github.curated.all_project_commits c
JOIN github.curated.unique_profiles p
ON c.login = p.login
GROUP BY 1, 2
ORDER BY 1, 3 DESC
), project_committers AS (
SELECT
project,
SUM(company_committers) AS project_committers
FROM project_company_committers
GROUP BY 1
)
SELECT
c.project,
c.company,
c.company_committers,
p.project_committers,
c.company_committers / p.project_committers * 100 AS pct_committers
FROM project_company_committers c
JOIN project_committers p
ON c.project = p.project
ORDER BY 1, 5 DESC;

You can explore the results of queries as charts directly in Snowflake’s web UI (“Snowsight”). Let’s start by comparing the total pull requests over time for the three projects.

SELECT * FROM github.analytics.pulls_by_project_day;
Running total pull requests by project

As of April 13, 2024, it appears Apache Hudi has the most pull requests at 8,039, followed by Apache Iceberg at 7,061 and Delta Lake at 4,187.

Pull requests are just proposed changes to a repo, whereas commits are accepted changes. If you look at the distinct people who have made at least one commit to each project, what percentage work at which companies?

SELECT
company,
pct_committers
FROM github.analytics.project_company_committers
WHERE project = 'iceberg'
ORDER BY 2 DESC;
Percent of unique committers for Apache Iceberg by company

It looks like some of the companies with the highest percentage of committers in Iceberg are Amazon and the Apache Software Foundation at 4%, Tencent, Apple, Netflix, Dremio, and a long tail of others. Company association couldn’t be identified for 44% of Apache Iceberg committers.

SELECT
company,
pct_committers
FROM github.analytics.project_company_committers
WHERE project = 'hudi'
ORDER BY 2 DESC;
Percent of unique committers for Apache Hudi by company

The companies with the highest percent of committers are Apache Software Foundation at 4%, Amazon, ByteDance, Tencent, Alibaba, Uber, and OneHouse at 1–2%, and a long tail of others. Company association couldn’t be identified for 63% of Apache Hudi committers.

SELECT
company,
pct_committers
FROM github.analytics.project_company_committers
WHERE project = 'delta'
ORDER BY 2 DESC;
Percent of unique committers for Delta Lake by company

For Delta Lake, Databricks appears as the primary contributor at 22% of all project committers. This could be interpreted as a positive sign that Databricks heavily invests in Delta Lake, or as a negative sign that other companies are unmotivated or unable to contribute at the same level as Databricks. Interpretation aside, the distribution of committers by company is noticeably different for Delta Lake in comparison to Apache Iceberg and Apache Hudi.

Wrapping up

To summarize, this example demonstrates two points:

  1. Snowflake supports an open data lakehouse architecture with Apache Iceberg. Snowflake can be the ETL engine and metastore while storing data in your bucket in a fully open source format.
  2. Features like External Network Access and Snowpipe Streaming allow you to cut out the cost of writing and storing raw files in object storage and then copying them into tables.

Related things to check out

If you’re interested in exploring what else you can do with Iceberg on Snowflake, check out my next blog post, where I use a similar pattern to write API data directly into tables but instead store nested JSON payloads in Snowflake’s VARIANT data type, without as much upfront schema definition.
