Loading 261GB of Reddit Comments into Snowflake

I used to load reddit comments into BigQuery; now it's time to upgrade my pipelines to Snowflake, and to share some of the nice surprises I found along the way. Let's get started with 261GB of comments.

Querying for the subreddits with the most comments

Download and recompress into S3

Monthly reddit comments archive by PushShift.io

Download

17 hours to download a 16GB file
GCP Storage Transfer Service: Good service, didn’t work.

Re-compress

# decompress the zstd archive, split it into ~1GB chunks, and re-compress each chunk with gzip
# (via https://stackoverflow.com/a/68718176/132438)
unzstd --long=31 -c ../RC_2020-12.zst \
  | split -C 1000000000 -d -a4 - reddit_comments_202012_ --filter='gzip > $FILE.gz'
unzstd+split+gzip at work

Push to S3 and Snowflake

aws s3 cp --recursive . s3://my-s3/202108/reddit/
use role accountadmin;

create or replace storage integration s3_int_reddit
  type = external_stage
  storage_provider = s3
  enabled = true
  storage_aws_role_arn = 'arn:aws:iam::4815162342:role/my_role'
  storage_allowed_locations = ('s3://my-s3/');

desc integration s3_int_reddit;

grant create stage on schema public to role sysadmin;
grant usage on integration s3_int_reddit to role sysadmin;

use role sysadmin;

create or replace stage my_ext_stage_reddit
  url = 's3://my-s3/'
  storage_integration = s3_int_reddit;

list @my_ext_stage_reddit;

create or replace table reddit_comments_sample(v variant);

copy into reddit_comments_sample
from @my_ext_stage_reddit/202108/reddit/
file_format = (type=json);
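To verify that every file actually made it in, something like Snowflake's COPY_HISTORY table function should do the trick (just a sanity check, assuming the load happened within the last day):

-- optional sanity check: what did that COPY load, and did any file error out?
select file_name, row_count, error_count, status
from table(information_schema.copy_history(
  table_name => 'reddit_comments_sample',
  start_time => dateadd('hour', -24, current_timestamp())
))
order by last_load_time desc;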

Fun with SQL

select count(*) comments
, min(v:created_utc)::timestamp since
, max(v:created_utc)::timestamp until
from reddit_comments_sample;
-- 84,448,007 COMMENTS
-- 2020-11-30 16:00:00.000 SINCE
-- 2020-12-14 04:40:50.000 UNTIL
select key, value
from (
select *
from reddit_comments_sample
limit 1
), table(flatten(v))
limit 100;
This JSON object has 48 keys that are easy to navigate with Snowflake
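Since every comment lands in a single VARIANT column, individual keys can also be pulled out directly with path notation, no flattening needed. A quick sketch:

-- sketch: pull a few of those keys straight out of the VARIANT with path notation
select v:author::string author
     , v:subreddit::string subreddit
     , v:score::int score
     , v:body::string body
from reddit_comments_sample
limit 10;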
select v:subreddit::string sub, count(*) c, avg(v:score) avg_score
from reddit_comments_sample
group by 1
order by 2 desc
limit 10
Subreddits with the most comments, first half of December 2020, and average score
Subreddits with the most comments, first half of December 2020
select v:subreddit::string sub
, count_if(v:all_awardings > '[]') count_awards
, count_awards/count(*) ratio_with_awards
, sum(v:all_awardings[0]:coin_price) sum_coin_price
, avg(v:all_awardings[0]:coin_price) avg_coin_price
from reddit_comments_sample
group by sub
order by 2 desc
limit 100
Comments on /r/conservative have an above-average 0.9% chance to get reddit gold
Subreddits with the highest chance for comments receiving reddit gold
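Note that the query above only looks at all_awardings[0], the first award on each comment. To count the coins across every award, a LATERAL FLATTEN over the array would look roughly like this (a sketch, not what I ran for the chart above):

-- sketch: sum coin_price over every award, not just the first element of all_awardings
select v:subreddit::string sub
     , count(distinct a.seq) comments_with_awards  -- seq identifies the source comment row
     , sum(a.value:coin_price::number) sum_coin_price
from reddit_comments_sample
   , lateral flatten(input => v:all_awardings) a
group by 1
order by 3 desc
limit 100;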

Update: Now the whole month is in

create or replace table reddit_comments_202012(v variant)
cluster by (v:subreddit::string, v:created_utc::timestamp)
;
copy into reddit_comments_202012
from @my_ext_stage_reddit/202108/reddit/
file_format = (type=json)
;
6 minutes to load and cluster 191,317,162 comments on an L warehouse
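Since the table is clustered by subreddit and timestamp, Snowflake can tell you how well the micro-partitions line up with those keys. A quick check, assuming the table name above:

-- optional check: how well do the micro-partitions match the clustering keys?
select system$clustering_information(
  'reddit_comments_202012',
  '(v:subreddit::string, v:created_utc::timestamp)'
);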

Next steps

Language detection with Java UDFs

Most common languages detected on 20,000 reddit comments (other than English)
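I'll leave the Java UDF itself for a follow-up post. On the SQL side, the call would look roughly like this, with detect_language() as a hypothetical stand-in for that UDF (filtering out English is left out of the sketch):

-- hypothetical: detect_language() stands in for a Java UDF defined separately
select detect_language(v:body::string) lang
     , count(*) c
from reddit_comments_sample sample (20000 rows)
group by 1
order by 2 desc
limit 20;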

Recursive queries

with data as (
select *
from reddit_comments_sample
where v:subreddit = 'dataisbeautiful'
)
select *
from (
select level, v:parent_id, sys_connect_by_path(v:author, ' -> ')
from data
start with substr(v:parent_id, 1, 3) = 't3_'
connect by prior v:id = substr(v:parent_id, 4)
)
order by level desc
limit 100
Finding long threads on /r/dataisbeautiful
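Snowflake also supports WITH RECURSIVE, so the same thread walk can be written as a recursive CTE. A rough sketch of the equivalent query (depth and author_path are just my own column names):

-- sketch: the same thread walk as a recursive CTE instead of CONNECT BY
with recursive threads (depth, id, author_path) as (
  select 1, v:id::string, v:author::string
  from reddit_comments_sample
  where v:subreddit = 'dataisbeautiful'
    and substr(v:parent_id, 1, 3) = 't3_'
  union all
  select t.depth + 1, c.v:id::string, t.author_path || ' -> ' || c.v:author::string
  from reddit_comments_sample c
  join threads t
    on substr(c.v:parent_id, 4) = t.id
  where c.v:subreddit = 'dataisbeautiful'
)
select *
from threads
order by depth desc
limit 100;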

Want more?

Felipe Hoffa

Data Cloud Advocate at Snowflake ❄️. Originally from Chile, now in San Francisco and around the world. Previously at Google. Let’s talk data.