Loading all Hacker News comments into Snowflake (in less than 1 minute)

This is a quick demo showing how to export the Hacker News archive out of BigQuery and load it into Snowflake. It's quick, easy, and a great way to show off Snowflake's semi-structured data support and its recursive SQL capabilities.

Felipe Hoffa
Dec 10, 2021
Image generated by AI (by author)

From BigQuery to Snowflake in less than 1 minute

Hacker News in BigQuery

11 GB of data in almost 30 million rows (by author)
Export to GCS, Parquet+SNAPPY (by author)
Exported in 16 seconds (by author)
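
The export shown in the screenshots can also be scripted with BigQuery's EXPORT DATA statement. A minimal sketch, assuming the public bigquery-public-data.hacker_news.full table and a hypothetical bucket name:

export data options(
  uri='gs://my-bucket/hn-20211123/hn-*.parquet',  -- hypothetical bucket
  format='PARQUET',
  compression='SNAPPY'
) as
select * from `bigquery-public-data.hacker_news.full`;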

Import to Snowflake (even cross-cloud)
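
Snowflake reads the exported files through an external stage pointing at the GCS bucket. A stage like @fh_gcp_stage might be created roughly like this (the bucket path and storage integration name are hypothetical, not from the original post):

create stage fh_gcp_stage
url = 'gcs://my-bucket/'  -- hypothetical bucket
storage_integration = my_gcs_integration;  -- hypothetical integration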

list @fh_gcp_stage
pattern='hn-20211123.*';
47 Parquet files, ~156 MB each (by author)
create or replace table hacker_news(v variant);

copy into hacker_news
from @fh_gcp_stage/hn-20211123/
pattern='hn-20211123/hn-.*'
file_format = (type='PARQUET');
Loaded in 43 seconds (by author)
Loaded in 28 seconds (by author)
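
A quick sanity check (not in the original screenshots) is to confirm the row counts match what BigQuery showed, broken down by item type:

select v:type::string type, count(*) c
from hacker_news
group by 1
order by c desc;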

Key differences once the data is in Snowflake

Compressed data

select bytes, row_count
from temp.information_schema.tables
where table_name='HACKER_NEWS';
Table has been compressed to less than half the size (by author)
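
To make the comparison with the 11 GB BigQuery export concrete, the same metadata can be expressed in gigabytes; a small variation of the query above (my own, not from the original post):

select round(bytes / power(1024, 3), 2) size_gb, row_count
from temp.information_schema.tables
where table_name='HACKER_NEWS';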

Querying semi-structured data

select *
from hacker_news
where v:id = '25978000';
Searching for a specific story (by author)
select *
from hacker_news
where v:parent = '25978000';
15 top-level comments for this arbitrary story id (by author)
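
Individual attributes inside the variant column can be cast to proper types on the fly. A sketch pulling typed columns for the highest-scored stories (the column picks and ordering here are my own, not from the original post):

select v:id::number id
  , v:by::string author
  , v:time::timestamp posted_at
  , v:title::string title
  , v:score::number score
from hacker_news
where v:type = 'story'
order by score desc nulls last
limit 10;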

Recursive queries

with data as (
  select v:by author, v:time t, v:parent parent, v:id id, v:type type,
    v:title title, substr(v:text, 0, 20) || '' text  -- || '' coerces the variant to a string
  from hacker_news
)
select author, t::timestamp, text
  , sys_connect_by_path(id, '->') path
  --, connect_by_root title
from data
start with id = 25978000
connect by parent = prior id
  and type = 'comment'
;
Recursive results, finding comments on all depths (by author)
Recursive query profile (by author)
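
CONNECT BY also exposes a LEVEL pseudo-column with each row's depth in the tree. A small variation (my own, not from the original post) counting items per depth under the same story:

with data as (
  select v:parent parent, v:id id, v:type type
  from hacker_news
)
select level depth, count(*) n  -- level 1 is the story itself
from data
start with id = 25978000
connect by parent = prior id
  and type = 'comment'
group by depth
order by depth;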

In summary

Thanks to

Want more?
