Loading 261GB of Reddit Comments into Snowflake

I used to load reddit comments into BigQuery; now it’s time to upgrade my pipelines to Snowflake — and to share some of the nice surprises I found along the way. Let’s get started with 261GB of them.

Querying for the subreddits with the most comments

Loading and analyzing reddit comments in Snowflake is fun, and a great way to show off its powers. These are some of the Snowflake features that delighted me and that I didn’t get with BigQuery:

  • Working with semi-structured data in Snowflake is fast, easy, and fun.
  • Snowflake understands JSON objects on load, and optimizes the syntax and storage for them — without the need to predefine a schema.
  • Snowflake stores data compressed — in this case with a ratio better than 1:10 compared with the original files.
  • Snowflake supports advanced SQL syntax for recursive queries — which is awesome to analyze threaded comments.
  • Snowflake supports not only JavaScript UDFs, but also Java UDFs and external functions — which opens up a world of possibilities within the database.

Download and recompress into S3

Jason Baumgartner shares billions of reddit comments and submissions through PushShift.io. These are the archives that I’ll load into Snowflake, but first I need to download them.

For example, all the reddit comments from December 2020 live in a 19GB file:

Monthly reddit comments archive by PushShift.io

Download

The first problem with getting these files is how slow the server’s bandwidth is. Each file takes around 17 hours to download onto a VM in AWS EC2:

17 hours to download a 16GB file

Instead of everyone downloading these files for hours, we could share the results of this whole process within Snowflake and the Data Cloud — but someone needs to get it started. So let’s keep working on this.
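For reference, once a table is loaded, sharing it through the Data Cloud takes only a few statements. This is a minimal sketch with hypothetical names (the reddit.public.comments table and the consumer account are placeholders):

-- minimal sketch: share an already-loaded table with another Snowflake account
-- (database, schema, table, and account names here are hypothetical)
create share reddit_share;
grant usage on database reddit to share reddit_share;
grant usage on schema reddit.public to share reddit_share;
grant select on table reddit.public.comments to share reddit_share;
alter share reddit_share add accounts = consumer_account;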

By the way, I launched a t4g.medium instance on EC2 to pull this off. These Arm-based AWS Graviton2 “instances accumulate CPU credits when a workload is operating below the baseline threshold and utilize credits when running above the baseline threshold”. This means the instance “accumulates” CPU power while the files are being downloaded, and then these “credits” are used to make the re-compression part way faster than the baseline.

As an implementation note: I tried the GCP Storage Transfer Service to download these files. An advantage of this approach is that you don’t need to deploy an instance to perform these lengthy transfers, but it kept failing for me.

GCP Storage Transfer Service: Good service, didn’t work.

Re-compress

The next challenge is that this big 18GB .zst file needs to be split into smaller .gz compressed files. After a little bit of research, I settled on a one-line CLI pipeline that decompresses the .zst stream, splits it into ~1GB chunks of whole lines, and gzips each chunk:

unzstd --long=31 -c ../RC_2020-12.zst | split -C 1000000000 -d -a4 - reddit_comments_202012_ --filter='gzip > $FILE.gz'
# via https://stackoverflow.com/a/68718176/132438
unzstd+split+gzip at work

Push to S3 and Snowflake

Because I chose a Graviton2 VM, the easiest way to make the files available to Snowflake was to move them to S3 (snowsql is not yet available on Arm):

aws s3 cp --recursive . s3://my-s3/202108/reddit/

Then Snowflake can see these files by following the docs to create a storage integration:

use role accountadmin;

create or replace storage integration s3_int_reddit
  type = external_stage
  storage_provider = s3
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::4815162342:role/my_role'
  storage_allowed_locations = ('s3://my-s3/');

desc integration s3_int_reddit;

grant create stage on schema public to role sysadmin;
grant usage on integration s3_int_reddit to role sysadmin;

use role sysadmin;

create or replace stage my_ext_stage_reddit
  url = 's3://my-s3/'
  storage_integration = s3_int_reddit;

list @my_ext_stage_reddit;

Now we can ingest these .json.gz files from S3 into a Snowflake native table:

create or replace table reddit_comments_sample(v variant)
;
copy into reddit_comments_sample
from @my_ext_stage_reddit/202108/reddit/
file_format = (type=json)
;
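To double-check what just landed, the COPY_HISTORY table function lists the files, row counts, and errors from recent loads into the table. A quick sketch:

select file_name, row_count, error_count, status
from table(information_schema.copy_history(
  table_name => 'REDDIT_COMMENTS_SAMPLE',
  start_time => dateadd(hour, -1, current_timestamp())));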

What’s interesting about this table on Snowflake:

  • Snowflake natively understands JSON objects, and stores them in a semi-columnar format.
  • Snowflake auto-detects files compressed with gzip.
  • The destination table has only one column that will store the JSON object with the VARIANT type.
  • Snowflake stores data compressed. For example, 140GB of raw JSON-encoded comments use only 11.8GB of storage once in Snowflake (see the quick check after this list).
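One quick way to verify that last point is SHOW TABLES, which reports the bytes each table actually occupies in Snowflake:

show tables like 'REDDIT_COMMENTS_SAMPLE';
-- the BYTES column reflects compressed storage, ROWS the number of loaded comments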

“Snowflake offers built-in functions and SQL extensions for traversing, flattening, and nesting of semi-structured data, with support for popular formats such as JSON and Avro. Automatic schema discovery and columnar storage make operations on schema-less, semi-structured data nearly as fast as over plain relational data, without any user effort.”

(White paper for SIGMOD, by Snowflake founders Thierry, Benoit, and team, 2016)

Fun with SQL

Once the comments are ingested into our Snowflake table, we can start having fun with SQL.

While writing this I’m still waiting for the unzstd+split+gzip process to finish, so I ran these tests with only the first 140 gzip files (each covering roughly 1GB of raw comments, per our split instructions).

Let’s see how many comments and time ranges this table contains:

select count(*) comments
, min(v:created_utc)::timestamp since
, max(v:created_utc)::timestamp until
from reddit_comments_sample;
-- 84,448,007 COMMENTS
-- 2020-11-30 16:00:00.000 SINCE
-- 2020-12-14 04:40:50.000 UNTIL

This query ran in 0.116 seconds, and you can immediately notice some convenient SQL syntax in Snowflake:

  • v is the name we gave to the column containing our variant.
  • :created_utc is the name of the key that reddit chose to contain the timestamp in seconds since epoch. Snowflake allows us to navigate JSON objects easily with this syntax.
  • ::timestamp is how we cast the UNIX epoch seconds to a regular timestamp in Snowflake. This syntax is easy to type and understand (a quick sketch follows this list).
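As a sketch, these three expressions return the same value for each comment; the bracket syntax and an explicit TO_TIMESTAMP call are just alternative spellings:

select v:created_utc::timestamp             path_and_cast
     , v['created_utc']::timestamp          bracket_and_cast
     , to_timestamp(v:created_utc::number)  explicit_conversion
from reddit_comments_sample
limit 3;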

To see all the keys in each reddit JSON comment object, we can use some SQL to flatten it:

select key, value
from (
select *
from reddit_comments_sample
limit 1
), table(flatten(v))
limit 100;
This JSON object has 48 easy-to-navigate keys with Snowflake

We can write a query like this to count the number of comments for the most popular subreddits and the average comment score:

select v:subreddit::string sub, count(*) c, avg(v:score) avg_score
from reddit_comments_sample
group by 1
order by 2 desc
limit 10
Subreddits with the most comments, first half of December 2020, and average score

And we can get a quick visualization within the new Snowflake UI (Snowsight):

Subreddits with the most comments, first half of December 2020

This query to look for the subreddits with the most awards is interesting:

select v:subreddit::string sub
, count_if(v:all_awardings > '[]') count_awards
, count_awards/count(*) ratio_with_awards
, sum(v:all_awardings[0]:coin_price) sum_coin_price
, avg(v:all_awardings[0]:coin_price) avg_coin_price
from reddit_comments_sample
group by sub
order by 2 desc
limit 100
Comments on /r/conservative have an above-average 0.9% chance of getting reddit gold

What’s interesting in the query syntax:

  • v:all_awardings[0]:coin_price shows how to navigate the original JSON object with SQL — as Snowflake manages to navigate and store semi-structured data smartly.
  • count_awards gets defined as a column alias first, and then it can be reused laterally in the same select to compute count_awards/count(*) (a tiny sketch follows this list).
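This lateral aliasing works anywhere in a select list. A tiny toy sketch:

select count(*) c
     , c / 1000000 as c_in_millions  -- reuses the alias defined just above
from reddit_comments_sample;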

And instead of writing a query, Snowsight lets us quickly re-sort to find the subreddits within this top 100 with the highest chance of comment gold:

Subreddits with the highest chance for comments receiving reddit gold

Update: Now the whole month is in

Once I got the whole month out of pushshift.io, I was able to load it. Total storage in Snowflake is 19.9GB, compared to 261GB uncompressed. That’s a 1:13 compression ratio. It took 6 minutes to ingest 191 million comments using an L warehouse:

create or replace table reddit_comments_202012(v variant)
cluster by (v:subreddit::string, v:created_utc::timestamp)
;
copy into reddit_comments_202012
from @my_ext_stage_reddit/202108/reddit/
file_format = (type=json)
;
6 minutes to load and cluster 191,317,162 comments on an L warehouse
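Since the table was created with a clustering key, SYSTEM$CLUSTERING_INFORMATION can show how well the micro-partitions line up with it. A quick sketch:

-- clustering depth and overlap statistics for the clustered table
select system$clustering_information(
  'reddit_comments_202012',
  '(v:subreddit::string, v:created_utc::timestamp)');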

Next steps

As seen above, once the data is in Snowflake, we can have fun!

Language detection with Java UDFs

For example, an experiment detecting the language of each comment with a Java UDF over a random sample of 20,000 comments shows us the most common languages on reddit — other than English (a sketch of such a UDF follows the chart):

Most common languages detected on 20,000 reddit comments (other than English)
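The real experiment bundles a language-detection library as a jar on a stage, but the overall shape of a Snowflake Java UDF is easy to sketch inline. Everything below is a simplified placeholder: the function name is made up and the detection logic is a toy stand-in for a proper library:

create or replace function detect_language_sketch(body string)
returns string
language java
handler = 'LangSketch.detect'
as
$$
class LangSketch {
    public static String detect(String body) {
        // toy placeholder: a real UDF would call a language-detection library
        // imported from a stage via the IMPORTS clause
        if (body == null) return null;
        return body.matches(".*[áéíóúñ¿¡].*") ? "es" : "en";
    }
}
$$;

-- run it over a random sample of 20,000 comments
select detect_language_sketch(v:body::string) lang, count(*) c
from reddit_comments_sample sample (20000 rows)
group by 1
order by 2 desc;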

Recursive queries

Snowflake supports the CONNECT BY syntax to traverse trees. For example, to recursively find the longest threads on /r/dataisbeautiful you can use this query:

with data as (
select *
from reddit_comments_sample
where v:subreddit = 'dataisbeautiful'
)
select *
from (
select level, v:parent_id, sys_connect_by_path(v:author, ' -> ')
from data
start with substr(v:parent_id, 1, 3) = 't3_'
connect by prior v:id = substr(v:parent_id, 4)
)
order by level desc
limit 100
Finding long threads on /r/dataisbeautiful

Dynamic data masking and more

Let’s save some fun for future posts. I’ll do my best to load the full reddit corpus from PushShift.io into Snowflake, and share more interesting use cases!
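As a teaser, here is a minimal sketch of what a masking policy could look like for these comments (the policy name and the allowed role are hypothetical, and it would still need to be attached to a string column):

-- hypothetical policy: only SYSADMIN sees the raw author name
create or replace masking policy mask_author as (val string)
returns string ->
  case when current_role() = 'SYSADMIN' then val else '***masked***' end;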

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check reddit.com/r/snowflake for the most interesting Snowflake news.
