Loading 261GB of Reddit Comments into Snowflake

I used to load reddit comments onto BigQuery, now it’s time to upgrade my pipelines to Snowflake — and to share some of the nice surprises I found. Let’s get started with 261GB of them.

Felipe Hoffa
Aug 11, 2021 · 7 min read
Querying for the subreddits with the most comments

Loading and analyzing reddit comments in Snowflake is fun, and a great way to show its powers. These are some of the Snowflake features that delighted me, that I didn’t get with BigQuery:

  • Working with semi-structured data in Snowflake is fast, easy, and fun.
  • Snowflake understands JSON objects on load, and optimizes the syntax and storage for them — without the need to predefine a schema.
  • Snowflake stores data compressed — in this case with a ratio better than 1:10 compared with the original files.
  • Snowflake supports advanced SQL syntax for recursive queries — which is awesome to analyze threaded comments.
  • Snowflake supports not only JavaScript UDFs, but also Java UDFs and external functions — which opens up a world of possibilities within the database.

Download and recompress into S3

Jason Baumgartner shares through PushShift.io billions of reddit comments and submissions. These are the archives that I’ll load into Snowflake, but first I need to download them.

For example, all the reddit comments from December 2020 live in a 19GB file:

Monthly reddit comments archive by PushShift.io

The first problem to get these files is how slow this server’s bandwidth is. Each file is taking me around 17 hours of downloading on a VM in AWS EC2:

17 hours to download a 16GB file

Instead of downloading all these files for hours, we could share the results of this whole process within Snowflake and the Data Cloud — but someone needs to get this process started. So let’s keep working on this.

By the way, I launched a t4g.medium instance on EC2 to pull this off. These Arm-based AWS Graviton2 “instances accumulate CPU credits when a workload is operating below the baseline threshold and utilize credits when running above the baseline threshold”. This means the instance “accumulates” CPU power while the files are being downloaded, and then these “credits” are used to make the re-compression part way faster than the baseline.

As an implementation note: I tried the GCP Storage Transfer Service to download these files. An advantage to this approach is that you don’t need to deploy an instance to perform these lengthy file transfers, but it kept failing.

GCP Storage Transfer Service: Good service, didn’t work.

The next challenge is that this big 18GB .zst file needs to be split into smaller .gz compressed files. After a little bit of research, this was my chosen CLI pipeline to pull this off in one line:

unzstd+split+gzip at work

Because I chose a Graviton2 VM, the easiest way to make the files available to Snowflake was to move them to S3 (snowsql is not yet available on Arm):

Then Snowflake can see these files by following the docs to create a storage integration:

Now we can ingest these .json.gz files from S3 into a Snowflake native table:

What’s interesting about this table on Snowflake:

  • Snowflake natively understands json objects, and stores them in a semi-columnar format.
  • Snowflake auto-detect files compressed with gzip.
  • The destination table has only one column that will store the JSON object with the VARIANT type.
  • Snowflake stores data compressed. For example, 140GB of raw json encoded comments use only 11.8GB of storage once in Snowflake.

Snowflake offers built-in functions and SQL extensions for traversing, flattening, and nesting of semi-structured data, with support for popular formats such as JSON and Avro. Automatic schema discovery and columnar storage make operations on schema-less, semi-structured data nearly as fast as over plain relational data, without any user effort. White Paper for SIGMOD (by Snowflake founders Thierry, Benoit and Team, 2016)

Fun with SQL

Once the comments are ingested into our Snowflake table, we can start having fun with SQL.

While writing this I’m still waiting for the unzstd+split+gz process to finish, so I ran these tests with only the first 140 gzip files (each with 100GB of comments, per our instructions to split).

Let’s see how many comments and time ranges this table contains:

This query ran in 0.116 seconds, and you can immediately notice some convenient SQL syntax in Snowflake:

  • v is the name we gave to the column containing our variant.
  • :created_utc is the name of the key that reddit chose to contain the timestamp in seconds since epoch. Snowflake allows us to navigate JSON objects easily with this syntax.
  • ::timestamp is how we cast in Snowflake the UNIX time to a regular timestamp. This syntax is easy to type and understand.

To see all the keys in each reddit JSON comment object, we can use some SQL to flatten it:

This JSON object has 48 easy-to-navigate keys with Snowflake

We can write a query like this to count the number of comments for the most popular subreddits and the average comment score:

Subreddits with the most comments, first half of December 2020, and average score

And we can get a quick visualization within the new Snowflake UI (Snowsight):

Subreddits with the most comments, first half of December 2020

This query to look for the subreddits with most awards is interesting:

Comments on /r/conservative have an above average 0.9% chance to get reddit gold

What’s interesting in the query syntax:

  • v:all_awardings[0]:coin_price shows how to navigate the original JSON object with SQL — as Snowflake manages to navigate and store semi-structured data smartly.
  • count_awards gets defined as a column first, and then this alias can be laterally used in the same select to get count_awards/count(*).

And instead of writing a query, Snowsight lets us quickly re-sort to find the subreddits within this top 100 with the highest chance of comment gold:

Subreddits with the highest chance for comments receiving reddit gold

Once I got the whole month out of pushshift.io, I was able to load it. Total storage in Snowflake is 19.9GB, compared to 261GB uncompressed. That’s a 1:13 compression ratio. It took 6 minutes to ingest 191 million comments using an L warehouse:

6 minutes to load and cluster 191,317,162 comments on an L warehouse

Next steps

As seen above, once the data is in Snowflake, we can have fun!

For example, an experiment detecting the language of each comment with a Java UDF over some random 20,000 comments shows us the most common languages on reddit — other than English:

Most common languages detected on 20,000 reddit comments (other than English)

Snowflake supports the CONNECT BY syntax to traverse trees. For example, to recursively find the longest threads on /r/dataisbeautiful you can use this query:

Finding long threads on /r/dataisbeautiful

Dynamic data masking and more

Let’s save some fun for future posts. I’ll do my best to load the full reddit corpus from PushShift.io into Snowflake, and share more interesting use cases!

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. Check reddit.com/r/snowflake for the most interesting Snowflake news.


Articles for engineers, by engineers.