How to ETL from web APIs into an open data lakehouse with Python, Iceberg, and Snowflake

Storing raw data from APIs is a common data lake use case: write JSON files into object storage now, load them into tables later. But landing JSON or Parquet files only to rewrite them into Iceberg, Delta Lake, or Hudi creates duplicative storage and processing costs. Wouldn’t it be nice to write API payloads directly into tables and forgo those extra costs? In this blog post, you’ll see an example of how Snowflake can serve as the ETL engine and metastore in an open data lakehouse architecture built on Apache Iceberg.

Overview of architecture

Here’s an illustration of a typical architecture for loading data from APIs into a data lake. However, data lakes lack ACID transactions and efficient pruning, both of which are critical for reliable analytics and good performance. To get ACID transactions and better performance, files are typically copied from the data lake into a data warehouse, which creates duplicative storage costs and additional compute costs.

To optimize costs and keep data operations reliable, the ideal solution is to write directly from the source(s) into tables. My next blog post shows how you can write from APIs directly into Snowflake’s native table format, which has some additional benefits, but use cases that prioritize compute interoperability are better suited to Iceberg tables.

Example dataset

Since we’re talking about Iceberg and web APIs, for this demo we’ll use Python to get data from the GitHub API. Specifically, we’ll pull GitHub activity for the Delta Lake, Apache Iceberg, and Apache Hudi projects. For demonstration purposes, we’ll use this data to analyze and compare activity across the three projects.

Let’s dig in

You’ll need a few things in order to follow along:

  1. Snowflake account with access to the ACCOUNTADMIN role. A free trial on Standard Edition will suffice. See #2 below when selecting your cloud provider and region.
  2. Cloud storage account in AWS, GCP, or Azure in the same region as your Snowflake account. You’ll need the ability to create and edit IAM roles/policies.
  3. A GitHub account.

Create GitHub personal access token

This demo will make a few thousand requests to the GitHub API, which exceeds the rate limit of 60 requests per hour for unauthenticated users. If you don’t already have one, create a GitHub personal access token with the following scopes:

  • public_repo
  • read:org
  • read:user
  • read:project

Create Snowflake objects

Create a database, schemas, a warehouse, and an external volume for this demo. Because there isn’t a native Iceberg catalog service in Azure, I’ll show how Snowflake can be used as the Iceberg catalog for Azure users. I’m using Azure Blob Storage, but if you’re using Snowflake on AWS or GCP, external volume setup instructions for S3 and GCS are documented here.

CREATE DATABASE IF NOT EXISTS github;
USE DATABASE github;
CREATE SCHEMA IF NOT EXISTS raw;
CREATE SCHEMA IF NOT EXISTS curated;
CREATE SCHEMA IF NOT EXISTS analytics;
USE SCHEMA public;
CREATE WAREHOUSE IF NOT EXISTS github
STATEMENT_TIMEOUT_IN_SECONDS = 3600
STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 3600;

CREATE OR REPLACE EXTERNAL VOLUME iceberg_vol
STORAGE_LOCATIONS =
(
(
NAME = '<your name>'
STORAGE_PROVIDER = 'AZURE'
STORAGE_BASE_URL = 'azure://<storage-account>.blob.core.windows.net/<container>/'
AZURE_TENANT_ID = '<your azure tenant ID>'
)
);

Regardless of cloud or region, there are some additional steps needed to finish setting up the external volume in your storage account and Snowflake. Full instructions are here.
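
Once those steps are complete, it’s worth confirming that Snowflake can actually access the storage location before creating any Iceberg tables. Here’s a minimal check, assuming the external volume is named iceberg_vol as above:

DESC EXTERNAL VOLUME iceberg_vol;

-- Runs read, write, and delete checks against the volume's active storage location
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('iceberg_vol');

If the verification reports a failure, revisit the IAM role/policy (or Azure role assignment) granted to Snowflake for your bucket or container.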

To allow Snowflake to call the GitHub API endpoints directly, create a network rule, secret, and external access integration.

CREATE OR REPLACE NETWORK RULE github.public.gh_nw_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('api.github.com');

-- Store the GitHub personal access token for API calls
CREATE OR REPLACE SECRET github.public.gh_token
TYPE = GENERIC_STRING
SECRET_STRING = '<your github token>';

-- Create integration using previous objects
CREATE OR REPLACE EXTERNAL ACCESS INTEGRATION gh_integration
ALLOWED_NETWORK_RULES = (github.public.gh_nw_rule)
ALLOWED_AUTHENTICATION_SECRETS = (github.public.gh_token)
ENABLED = true;
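
Before building the larger UDTFs, you can optionally smoke-test the network rule, secret, and integration together with a small scalar UDF of my own (the gh_rate_limit function below is not part of the demo pipeline). It calls GitHub’s rate-limit endpoint and returns the remaining request quota for the authenticated token, which is also a handy way to confirm you won’t hit the rate limits mentioned earlier.

CREATE OR REPLACE FUNCTION github.public.gh_rate_limit()
RETURNS STRING
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'get_rate_limit'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests

def get_rate_limit():
    # Authenticate with the stored secret and ask GitHub for the current rate-limit status
    token = _snowflake.get_generic_secret_string('cred')
    response = requests.get('https://api.github.com/rate_limit',
                            headers={"Authorization": 'token ' + token})
    core = response.json()['resources']['core']
    return str(core['remaining']) + ' of ' + str(core['limit']) + ' core requests remaining'
$$;

SELECT github.public.gh_rate_limit();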

Create Python functions

Create a Python table function (UDTF) that gets the latest list of repositories (“repos”), and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_projects()
RETURNS TABLE (repo STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_projects'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
AS
$$
import _snowflake
import requests
import simplejson as json

class load_projects:
    def process(self):
        # Get every repo in the delta-io GitHub org
        response = requests.get('https://api.github.com/orgs/delta-io/repos')
        delta_json_obj = json.loads(response.text)
        delta_proj_repos = []
        for i in delta_json_obj:
            delta_proj_repos.append(i['full_name'])

        # Add the Iceberg and Hudi repos, then emit one row per repo
        all_proj_repos = ['apache/iceberg','apache/hudi']
        all_proj_repos.extend(delta_proj_repos)

        for repo in all_proj_repos:
            yield (repo,)
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.repos (
repo STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/repos/'
AS
SELECT *
FROM TABLE(github.public.load_projects());

You will now see data and metadata files in your storage account.

Iceberg table’s metadata files in Azure blob storage
Iceberg table’s Parquet data files in Azure blob storage
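
You can also confirm the load from the Snowflake side. As a quick check (using the table and volume names defined above), count the rows and ask Snowflake for the location of the table’s latest Iceberg metadata file, which other engines can use to read the table:

SELECT COUNT(*) AS num_repos FROM github.raw.repos;

-- Returns a JSON object including the latest metadata file location for this Iceberg table
SELECT SYSTEM$GET_ICEBERG_TABLE_INFORMATION('github.raw.repos');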

Create a Python UDTF that loops through each repo and gets all of its contributors, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_project_contributors(repo STRING)
RETURNS TABLE (login STRING, url STRING, contributions INTEGER)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_project_contributors'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests
import simplejson as json

class load_project_contributors:
    def process(self, repo):
        # First page of contributors for this repo (100 per page)
        url = 'https://api.github.com/repos/' + str(repo) + '/contributors?per_page=100&page=1'
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        contributors = response.json()
        contributors_list = []

        for contributor in contributors:
            contributors_list.append({
                "login": contributor["login"],
                "url": contributor["url"],
                "contributions": contributor["contributions"]
            })

        # Follow pagination links until there is no "next" page
        while 'next' in response.links.keys():
            response = requests.get(response.links['next']['url'], headers={"Authorization": 'token ' + mytoken})
            contributors = response.json()

            for contributor in contributors:
                contributors_list.append({
                    "login": contributor["login"],
                    "url": contributor["url"],
                    "contributions": contributor["contributions"]
                })

        # Emit one row per contributor
        for contributor in contributors_list:
            yield (contributor['login'], contributor['url'], contributor['contributions'])
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.all_project_contributors (
repo STRING,
login STRING,
url STRING,
contributions INTEGER
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/all_project_contributors/'
AS
SELECT
repo,
c.*
FROM github.raw.repos,
TABLE(github.public.load_project_contributors(repo)) c;
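
At this point, a quick sanity check (optional) shows how many contributors were captured per repo:

SELECT
repo,
COUNT(*) AS num_contributors
FROM github.raw.all_project_contributors
GROUP BY 1
ORDER BY 2 DESC;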

Create a Python UDTF that loops through each contributor and gets additional profile metadata, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_profile(url STRING)
RETURNS TABLE (login STRING, company STRING, email STRING, created_at STRING, updated_at STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_profile'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests

class load_profile:
    def process(self, url):
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        profile = response.json()

        login = profile['login']
        company = profile['company']
        email = profile['email']
        created_at = profile['created_at']
        updated_at = profile['updated_at']

        yield (login, company, email, created_at, updated_at)
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.unique_profiles (
login STRING,
company STRING,
email STRING,
created_at TIMESTAMP,
updated_at TIMESTAMP
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/unique_profiles/'
AS
SELECT
p.login,
p.company,
p.email,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP
FROM (SELECT DISTINCT url
FROM github.raw.all_project_contributors),
TABLE(github.public.load_profile(url)) p;

Create a Python UDTF that loops through each repo and gets all of its pull requests, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_project_pulls(repo STRING)
RETURNS TABLE (id INTEGER, number INTEGER, created_at STRING, updated_at STRING, merged_at STRING, state STRING, login STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_project_pulls'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests
import simplejson as json

class load_project_pulls:
    def process(self, repo):
        # First page of pull requests for this repo (all states, 100 per page)
        url = 'https://api.github.com/repos/' + str(repo) + '/pulls?state=all&per_page=100&page=1'
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        pulls = response.json()
        pulls_list = []

        for pull in pulls:
            pulls_list.append({
                "id": pull["id"],
                "number": pull["number"],
                "state": pull["state"],
                "created_at": pull["created_at"],
                "updated_at": pull["updated_at"],
                "merged_at": pull["merged_at"],
                "login": pull["user"]["login"]
            })

        # Follow pagination links until there is no "next" page
        while 'next' in response.links.keys():
            response = requests.get(response.links['next']['url'], headers={"Authorization": 'token ' + mytoken})
            pulls = response.json()

            for pull in pulls:
                pulls_list.append({
                    "id": pull["id"],
                    "number": pull["number"],
                    "state": pull["state"],
                    "created_at": pull["created_at"],
                    "updated_at": pull["updated_at"],
                    "merged_at": pull["merged_at"],
                    "login": pull["user"]["login"]
                })

        # Emit one row per pull request
        for pull in pulls_list:
            yield (pull['id'], pull['number'], pull['created_at'], pull['updated_at'], pull['merged_at'], pull['state'], pull['login'])
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.all_project_pulls (
repo STRING,
id INTEGER,
number INTEGER,
created_at TIMESTAMP,
updated_at TIMESTAMP,
merged_at TIMESTAMP,
state STRING,
login STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/all_project_pulls/'
AS
SELECT
r.repo,
p.id,
p.number,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP,
p.merged_at::TIMESTAMP,
state,
login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_pulls(repo)) p;

Create a Python UDTF that loops through each repo and gets all of its commits, and an Iceberg table to store the results.

CREATE OR REPLACE FUNCTION github.public.load_project_commits(repo STRING)
RETURNS TABLE (id STRING, committed_at STRING, author_login STRING, committer_login STRING)
LANGUAGE PYTHON
RUNTIME_VERSION = 3.11
HANDLER = 'load_project_commits'
EXTERNAL_ACCESS_INTEGRATIONS = (gh_integration)
PACKAGES = ('requests','simplejson')
SECRETS = ('cred' = github.public.gh_token)
AS
$$
import _snowflake
import requests
import simplejson as json

class load_project_commits:
    def process(self, repo):
        # First page of commits for this repo (100 per page)
        url = 'https://api.github.com/repos/' + str(repo) + '/commits?per_page=100&page=1'
        mytoken = _snowflake.get_generic_secret_string('cred')
        response = requests.get(url, headers={"Authorization": 'token ' + mytoken})
        commits = response.json()
        commits_list = []

        for commit in commits:
            # The author/committer objects can be null when a commit isn't linked to a GitHub account
            author_login = commit["author"]["login"] if commit["author"] else None
            committer_login = commit["committer"]["login"] if commit["committer"] else None

            commits_list.append({
                "id": commit["sha"],
                "author_login": author_login,
                "committer_login": committer_login,
                "committed_at": commit["commit"]["author"]["date"]
            })

        # Follow pagination links until there is no "next" page
        while 'next' in response.links.keys():
            response = requests.get(response.links['next']['url'], headers={"Authorization": 'token ' + mytoken})
            commits = response.json()

            for commit in commits:
                author_login = commit["author"]["login"] if commit["author"] else None
                committer_login = commit["committer"]["login"] if commit["committer"] else None

                commits_list.append({
                    "id": commit["sha"],
                    "author_login": author_login,
                    "committer_login": committer_login,
                    "committed_at": commit["commit"]["author"]["date"]
                })

        # Emit one row per commit
        for commit in commits_list:
            yield (commit['id'], commit['committed_at'], commit['author_login'], commit['committer_login'])
$$;

CREATE OR REPLACE ICEBERG TABLE github.raw.all_project_commits (
repo STRING,
id STRING,
committed_at TIMESTAMP,
author_login STRING,
committer_login STRING
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'iceberg_vol'
BASE_LOCATION = 'raw/all_project_commits/'
AS
SELECT
r.repo,
c.id,
c.committed_at::TIMESTAMP,
c.author_login,
c.committer_login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_commits(repo)) c;

Schedule and orchestrate Python functions with Tasks

You can use Tasks to schedule and orchestrate a pipeline that calls the functions created above to fetch the latest data. The schedule specified in the task definitions below is the first of every month, but you can change it to whatever CRON schedule you’d like. The downstream tasks are set up to run only after their upstream task completes.

CREATE OR REPLACE TASK github.public.repos
WAREHOUSE = github
SCHEDULE = 'USING CRON 0 0 1 * * America/Los_Angeles'
AS
INSERT OVERWRITE INTO github.raw.repos
SELECT *
FROM TABLE(github.public.load_projects());

-- Task to load contributors data when the github.public.repos task completes getting updated list of repos
CREATE OR REPLACE TASK github.public.all_project_contributors
WAREHOUSE = github
AFTER github.public.repos
AS
INSERT OVERWRITE INTO github.raw.all_project_contributors
SELECT
repo,
c.*
FROM github.raw.repos,
TABLE(github.public.load_project_contributors(repo)) c;

-- Task to load profile data when the github.public.all_project_contributors task completes getting the updated list of contributors
CREATE OR REPLACE TASK github.public.unique_profiles
WAREHOUSE = github
AFTER github.public.all_project_contributors
AS
INSERT OVERWRITE INTO github.raw.unique_profiles
SELECT
p.login,
p.company,
p.email,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP
FROM (SELECT DISTINCT url
FROM github.raw.all_project_contributors),
TABLE(github.public.load_profile(url)) p;

-- Task to load pull request data when the github.public.repos task completes getting updated list of repos
CREATE OR REPLACE TASK github.public.all_project_pulls
WAREHOUSE = github
AFTER github.public.repos
AS
INSERT OVERWRITE INTO github.raw.all_project_pulls
SELECT
r.repo,
p.id,
p.number,
p.created_at::TIMESTAMP,
p.updated_at::TIMESTAMP,
p.merged_at::TIMESTAMP,
state,
login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_pulls(repo)) p;

-- Task to load commit data when the github.public.repos task completes getting updated list of repos
CREATE OR REPLACE TASK github.public.all_project_commits
WAREHOUSE = github
AFTER github.public.repos
AS
INSERT OVERWRITE INTO github.raw.all_project_commits
SELECT
r.repo,
c.id,
c.committed_at::TIMESTAMP,
c.author_login,
c.committer_login
FROM (SELECT repo
FROM github.raw.repos) r,
TABLE(github.public.load_project_commits(repo)) c;

-- By default, new tasks are in a suspended state. Activate or "resume" the full DAG as follows
SELECT SYSTEM$TASK_DEPENDENTS_ENABLE('github.public.repos');

After creating these tasks, you can see them in a directed acyclic graph (“DAG”) along with a run history.

Task DAG and run history
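
If you don’t want to wait for the next scheduled run, you can trigger the root task manually and then query the run history. Here’s a minimal sketch using the task names from above and the TASK_HISTORY table function:

EXECUTE TASK github.public.repos;

-- Recent task runs in this database, newest first
SELECT
name,
state,
scheduled_time,
completed_time,
error_message
FROM TABLE(github.information_schema.task_history())
WHERE database_name = 'GITHUB'
ORDER BY scheduled_time DESC;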

Create curated views

Now that the tables are loaded with raw data, you can do additional cleaning and processing to produce more usable views.

-- Add project grouping of repos to contributors
CREATE OR REPLACE VIEW github.curated.all_project_contributors AS
SELECT
CASE
WHEN repo = 'apache/iceberg' THEN 'iceberg'
WHEN repo = 'apache/hudi' THEN 'hudi'
ELSE 'delta'
END AS project,
*
FROM github.raw.all_project_contributors;

-- Parsing out clean company string from profiles table
CREATE OR REPLACE VIEW github.curated.unique_profiles AS
WITH base AS (
SELECT
login,
RTRIM(REPLACE(LOWER(company), '@', '')) AS company,
email,
SPLIT_PART(email, '@', 2) AS email_domain,
created_at,
updated_at
FROM github.raw.unique_profiles
)
SELECT
login,
email,
email_domain,
created_at,
updated_at,
CASE
WHEN CONTAINS(base.company,'databricks') = TRUE THEN 'databricks'
WHEN CONTAINS(base.company,'tabular') = TRUE THEN 'tabular'
WHEN CONTAINS(base.company,'netflix') = TRUE THEN 'netflix'
WHEN CONTAINS(base.company,'adobe') = TRUE THEN 'adobe'
WHEN CONTAINS(base.company,'alibaba') = TRUE THEN 'alibaba'
WHEN CONTAINS(base.company,'amazon') = TRUE THEN 'amazon'
WHEN CONTAINS(base.company,'aws') = TRUE THEN 'amazon'
WHEN CONTAINS(base.company,'apple') = TRUE THEN 'apple'
WHEN CONTAINS(base.company,'bytedance') = TRUE THEN 'bytedance'
WHEN CONTAINS(base.company,'cisco') = TRUE THEN 'cisco'
WHEN CONTAINS(base.company,'cloudera') = TRUE THEN 'cloudera'
WHEN CONTAINS(base.company,'cockroach') = TRUE THEN 'cockroach'
WHEN CONTAINS(base.company,'dremio') = TRUE THEN 'dremio'
WHEN CONTAINS(base.company,'ebay') = TRUE THEN 'ebay'
WHEN CONTAINS(base.company,'eventual-inc') = TRUE THEN 'eventual-inc'
WHEN CONTAINS(base.company,'freelance') = TRUE THEN 'freelance'
WHEN CONTAINS(base.company,'huawei') = TRUE THEN 'huawei'
WHEN CONTAINS(base.company,'ibm') = TRUE THEN 'ibm'
WHEN CONTAINS(base.company,'linkedin') = TRUE THEN 'linkedin'
WHEN CONTAINS(base.company,'microsoft') = TRUE THEN 'microsoft'
WHEN CONTAINS(base.company,'netease') = TRUE THEN 'netease'
WHEN CONTAINS(base.company,'salesforce') = TRUE THEN 'salesforce'
WHEN CONTAINS(base.company,'selectdb') = TRUE THEN 'selectdb'
WHEN CONTAINS(base.company,'snowflake') = TRUE THEN 'snowflake'
WHEN CONTAINS(base.company,'starburst') = TRUE THEN 'starburst'
WHEN CONTAINS(base.company,'tencent') = TRUE THEN 'tencent'
WHEN CONTAINS(base.company,'uber') = TRUE THEN 'uber'
WHEN CONTAINS(base.company,'apache') = TRUE THEN 'apache'
ELSE base.company
END AS gh_company,
CASE
WHEN CONTAINS(base.email_domain,'amazon') = TRUE THEN 'amazon'
WHEN CONTAINS(base.email_domain,'apache') = TRUE THEN 'apache'
WHEN CONTAINS(base.email_domain,'apple') = TRUE THEN 'apple'
WHEN CONTAINS(base.email_domain,'bytedance') = TRUE THEN 'bytedance'
WHEN CONTAINS(base.email_domain,'cloudera') = TRUE THEN 'cloudera'
WHEN CONTAINS(base.email_domain,'databricks') = TRUE THEN 'databricks'
WHEN CONTAINS(base.email_domain,'ebay') = TRUE THEN 'ebay'
WHEN CONTAINS(base.email_domain,'intel') = TRUE THEN 'intel'
WHEN CONTAINS(base.email_domain,'linkedin') = TRUE THEN 'linkedin'
WHEN CONTAINS(base.email_domain,'microsoft') = TRUE THEN 'microsoft'
WHEN CONTAINS(base.email_domain,'robinhood') = TRUE THEN 'robinhood'
WHEN CONTAINS(base.email_domain,'tabular') = TRUE THEN 'tabular'
WHEN CONTAINS(base.email_domain,'tencent') = TRUE THEN 'tencent'
WHEN CONTAINS(base.email_domain,'uber') = TRUE THEN 'uber'
END AS email_company,
COALESCE(gh_company, email_company) AS company
FROM base;
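
Before moving on, it can help to spot-check how well the company parsing worked, for example by looking at the most common company values (including how many profiles remain NULL):

SELECT
company,
COUNT(*) AS num_profiles
FROM github.curated.unique_profiles
GROUP BY 1
ORDER BY 2 DESC
LIMIT 20;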

-- Add project-level grouping of repos from pull requests table
CREATE OR REPLACE VIEW github.curated.all_project_pulls AS
SELECT
CASE
WHEN repo = 'apache/iceberg' THEN 'iceberg'
WHEN repo = 'apache/hudi' THEN 'hudi'
ELSE 'delta'
END AS project,
*
FROM github.raw.all_project_pulls;

-- Add project-level grouping of commits, coalesce logins
CREATE OR REPLACE VIEW github.curated.all_project_commits AS
SELECT
CASE
WHEN repo = 'apache/iceberg' THEN 'iceberg'
WHEN repo = 'apache/hudi' THEN 'hudi'
ELSE 'delta'
END AS project,
*,
COALESCE(author_login, committer_login) AS login
FROM github.raw.all_project_commits;

Create analytical views and charts

With these additional fields and cleanup in place, we can more easily aggregate the data into views for analytics.

-- Total number of pull requests by project
CREATE OR REPLACE VIEW github.analytics.pulls_by_project AS
SELECT
project,
COUNT(*) AS num_pulls
FROM github.curated.all_project_pulls
GROUP BY 1
ORDER BY 2 DESC;

-- Running total number of pull requests by project per day
CREATE OR REPLACE VIEW github.analytics.pulls_by_project_day AS
WITH base AS (
SELECT
project,
DATE_TRUNC('DAY', created_at) AS created_at,
COUNT(*) AS num_pulls
FROM github.curated.all_project_pulls
GROUP BY 1, 2
ORDER BY 1, 2
)
SELECT
project,
created_at,
num_pulls,
SUM(num_pulls) OVER (PARTITION BY project ORDER BY created_at) AS running_sum_pulls
FROM base
GROUP BY 1, 2, 3
ORDER BY 1, 2;

-- Unique committers by company per project
CREATE OR REPLACE VIEW github.analytics.project_company_committers AS
WITH project_company_committers AS (
SELECT
c.project,
p.company,
COUNT(DISTINCT c.login) AS company_committers
FROM github.curated.all_project_commits c
JOIN github.curated.unique_profiles p
ON c.login = p.login
GROUP BY 1, 2
ORDER BY 1, 3 DESC
), project_committers AS (
SELECT
project,
SUM(company_committers) AS project_committers
FROM project_company_committers
GROUP BY 1
)
SELECT
c.project,
c.company,
c.company_committers,
p.project_committers,
c.company_committers / p.project_committers * 100 AS pct_committers
FROM project_company_committers c
JOIN project_committers p
ON c.project = p.project
ORDER BY 1, 5 DESC;

You can explore the results of queries as charts directly in Snowflake’s web UI (“Snowsight”). Let’s start by comparing the total pull requests over time for the three projects.

SELECT * FROM github.analytics.pulls_by_project_day;
Running total pull requests by project

As of April 13, 2024, it appears Apache Hudi has the most pull requests at 8,039, followed by Apache Iceberg at 7,061 and Delta Lake at 4,187.

Pull requests are just proposed changes to a repo, whereas commits are accepted changes. If you look at the distinct people who have made at least one commit to each project, what percentage work at which companies?

SELECT
company,
pct_committers
FROM github.analytics.project_company_committers
WHERE project = 'iceberg'
ORDER BY 2 DESC;
Percent of unique committers for Apache Iceberg by company

It looks like some of the companies with the highest percentage of committers in Iceberg are Amazon and the Apache Software Foundation at 4%, Tencent, Apple, Netflix, Dremio, and a long tail of others. Company association couldn’t be identified for 44% of Apache Iceberg committers.

SELECT
company,
pct_committers
FROM github.analytics.project_company_committers
WHERE project = 'hudi'
ORDER BY 2 DESC;
Percent of unique committers for Apache Hudi by company

The companies with the highest percent of committers are Apache Software Foundation at 4%, Amazon, ByteDance, Tencent, Alibaba, Uber, and OneHouse at 1–2%, and a long tail of others. Company association couldn’t be identified for 63% of Apache Hudi committers.

SELECT
company,
pct_committers
FROM github.analytics.project_company_committers
WHERE project = 'delta'
ORDER BY 2 DESC;
Percent of unique committers for Delta Lake by company

For Delta Lake, Databricks appears as the primary contributor at 22% of all project committers. This could be interpreted as a positive sign that Databricks heavily invests in Delta Lake, or as a negative sign that other companies are unmotivated or unable to contribute at the same level as Databricks. Interpretation aside, the distribution of committers by company is noticeably different for Delta Lake in comparison to Apache Iceberg and Apache Hudi.

Wrapping up

To summarize, this example demonstrates two points:

  1. Snowflake supports an open data lakehouse architecture with Apache Iceberg. Snowflake can be the ETL engine and metastore while storing data in your bucket in a fully open source format.
  2. Features like External Network Access and Snowpipe Streaming allow you to cut out the cost of writing and storing raw files in object storage and then copying them into tables.

Related things to check out

If you’re interested in exploring what else you can do with Iceberg on Snowflake, check out my next blog post, where I use a similar pattern to write API data directly into tables but instead store nested JSON payloads in Snowflake’s VARIANT data type, without as much upfront schema definition.
