Loading all Hacker News comments into Snowflake (in less than 1 minute)

This is a quick demo of how to export the Hacker News archive from BigQuery and load it into Snowflake. It’s easy, and a great way to show off Snowflake’s semi-structured data support and its recursive SQL capabilities.

Image generated by AI (by author)

From BigQuery to Snowflake in less than 1 minute

Watch on YouTube

Hacker News in BigQuery

If you are a BigQuery user, you can find the Hacker News archive table at console.cloud.google.com/bigquery?p=bigquery-public-data&d=hacker_news&t=full&page=table.

This table contains an updated copy of Hacker News, as seen in my previous posts (from when I worked with the BigQuery team). Right now it holds more than 11 GB of data, in almost 30 million rows:

11 GB of data, in almost 30 million rows (by author)

To load a copy of this table into Snowflake, we can export it in different formats to GCS — for example, Parquet:

Export to GCS, Parquet+SNAPPY (by author)
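
If you prefer SQL to the console UI, the same export can be scripted with BigQuery’s EXPORT DATA statement. This is only a sketch of that step; the bucket name and path prefix below are placeholders, not the ones used in this demo:

export data options (
  uri = 'gs://my-bucket/hn-20211123/hn-*.parquet',  -- placeholder bucket and prefix
  format = 'PARQUET',
  compression = 'SNAPPY',
  overwrite = true
) as
select *
from `bigquery-public-data.hacker_news.full`;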

16 seconds later the operation has finished:

Exported in 16 seconds (by author)

Import to Snowflake (even cross cloud)

On my Snowflake account I have already configured a secure integration with Google’s Cloud Storage, so it’s straightforward to check that the new files are there with a SQL query:

list @fh_gcp_stage
pattern='hn-20211123.*';
47 Parquet files, ~156MB each (by author)

Note that even when a Snowflake account lives in AWS, the integration with Google Cloud Storage is seamless.
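
For reference, that integration boils down to a storage integration plus a stage pointing at the bucket. Here’s a minimal sketch of the one-time setup, with placeholder names and bucket (the GCS service account that Snowflake creates still needs read access granted on the Google side):

create storage integration fh_gcp_integration
  type = external_stage
  storage_provider = 'GCS'
  enabled = true
  storage_allowed_locations = ('gcs://my-bucket/');  -- placeholder bucket

create stage fh_gcp_stage
  url = 'gcs://my-bucket/'
  storage_integration = fh_gcp_integration;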

We can create a schema-less table (a single VARIANT column) and load these Parquet files into it as a native Snowflake table:

create or replace table hacker_news(v variant);

copy into hacker_news
from @fh_gcp_stage/hn-20211123/
pattern='hn-20211123/hn-.*'
file_format = (type='PARQUET');

It took 43 seconds to create this semi-structured table with an S warehouse — while also moving the data from GCP to AWS:

Loaded in 43 seconds (by author)
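
To try the same load with more compute, the warehouse can be resized on the fly before re-running the copy; a one-liner sketch with a placeholder warehouse name:

alter warehouse my_wh set warehouse_size = 'LARGE';  -- my_wh is a placeholder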

With an L warehouse the time goes down to 28 seconds (while moving the data from GCP to AWS too):

Loaded in 28 seconds (by author)

Our data is now ready to query.

Key differences once the data is in Snowflake

Compressed data

It’s worth noting that Snowflake charges storage for compressed data. When moving data from BigQuery to Snowflake, data usually compresses by a factor of 4 or 5. In this case, since most of the content is text that doesn’t compress as heavily in columnar format, the same table still uses about half the storage in Snowflake:

select bytes, row_count
from temp.information_schema.tables
where table_name='HACKER_NEWS'
Table has been compressed to less than half the size (by author)
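
As a quick sanity check, we can express that in GB and compare it to the ~11 GB reported by BigQuery. This is a rough, hand-rolled comparison against BigQuery’s reported (uncompressed) size, not an official metric:

select round(bytes / power(1024, 3), 2) compressed_gb,
       round(11 / (bytes / power(1024, 3)), 1) approx_ratio_vs_bigquery
from temp.information_schema.tables
where table_name = 'HACKER_NEWS';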

Querying semi-structured data

We loaded the Parquet files into a semi-structured Snowflake table with a single column of the VARIANT type. This is a special type in Snowflake that parses semi-structured data on load and prepares it for querying. It’s especially useful when loading JSON objects with varying schemas, and it helps us here too.

For example, to search one specific story by id:

select *
from hacker_news
where v:id = '25978000';
Searching for a specific story (by author)

The result looks like a JSON object, but it’s more than that: notice that we could navigate into it with the colon notation in where v:id = '25978000'.
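
The colon path returns VARIANT values, which can be cast to regular SQL types with ::. A small sketch of pulling typed columns out of the raw table (handy as the basis for a view):

-- v:time holds epoch seconds, so casting it to timestamp works here
select v:id::number id,
       v:by::string author,
       v:time::timestamp t,
       v:title::string title,
       v:score::number score
from hacker_news
where v:type = 'story'
limit 10;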

To search for all comments in that story, we can use the parent attribute:

select *
from hacker_news
where v:parent = '25978000';

This finds 15 top-level comments:

15 top level comments for this arbitrary story id (by author)
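
Going one more level down by hand already gets tedious: each of those comments can have replies, those replies can have replies, and so on. For example, counting the direct replies to each top-level comment takes a self-join (a throwaway sketch, still only one level deep):

select c.v:parent::number top_level_comment, count(*) direct_replies
from hacker_news c
join hacker_news p on c.v:parent::number = p.v:id::number
where p.v:parent = '25978000'
  and c.v:type = 'comment'
group by 1
order by direct_replies desc;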

Recursive queries

A traditionally hard problem in SQL is finding all the comments that have those comments as parents, then the comments of those comments, and so on: in other words, every comment left on that specific story, at any depth. That’s not a hard problem in Snowflake, thanks to its recursive SQL support, including the CONNECT BY clause:

with data as (
select v:by author, v:time t, v:parent parent, v:id id, v:type type, v:title title,
    substr(v:text, 0, 20)||'' text
from hacker_news
)
select author, t::timestamp, text
, sys_connect_by_path(id,'->') path
--, connect_by_root title
from data
start with id = 25978000
connect by parent = prior id
and type='comment'
;

And that’s how we can find all comments left on this story, regardless of their depth:

Recursive results, finding comments on all depths (by author)

The query profile for these recursive results looks pretty interesting too:

Recursive query profile (by author)
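
CONNECT BY is not the only way to do this: Snowflake also supports standard recursive CTEs, so the same traversal can be written with WITH RECURSIVE. Here’s a rough equivalent sketch (not the query profiled above; the text is truncated the same way just for readability):

with recursive thread (id, author, t, text) as (
    -- anchor: the story itself
    select v:id::number, v:by, v:time, v:text
    from hacker_news
    where v:id = 25978000
  union all
    -- step: comments whose parent is already in the thread
    select c.v:id::number, c.v:by, c.v:time, c.v:text
    from hacker_news c
    join thread p on c.v:parent::number = p.id
    where c.v:type = 'comment'
)
select author, t::timestamp, substr(text, 0, 20)||'' text
from thread;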

For more fun analysis of these comments, stay tuned.

In summary

  • We were able to load 11 GB of data from BigQuery in GCP to Snowflake in AWS in less than 1 minute.
  • Snowflake charges storage for compressed data (instead of uncompressed).
  • Snowflake works seamlessly with semi-structured data in schema-less tables (thanks to the variant type).
  • Snowflake can navigate hierarchical table structures thanks to its recursive SQL support (including the connect by clause).

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check reddit.com/r/snowflake for the most interesting Snowflake news.
