How to load the Stack Overflow XML dumps into Snowflake

We can use the Stack Overflow data dumps to analyze the health of any tag and much more. But before analyzing, let’s load this data while showing off Snowflake’s abilities to parse XML and to read files compressed with exotic formats.


Imagine an ideal world

Before joining Snowflake I wrote many posts analyzing Stack Overflow’s data — now it’s time for me to share how you can load these dumps into your account.

It would be awesome if the Stack Overflow dumps were already shared in the Snowflake Marketplace. But since they are not (yet), this is a great opportunity to show off Snowflake’s ability to read XML.

I also wish Stack Overflow shared these dumps compressed with gz for easy ingestion — but they chose to compress with 7z instead. This gives us the opportunity to show off how Snowflake can process binary files — like decompressing and parsing a 7z archive on the fly.

Let’s get started.

Reading a .7z file with a Snowflake UDF

Let’s start by downloading the Users.7z Stack Overflow dump, and then putting it into a Snowflake stage:

> wget https://archive.org/download/stackexchange/stackoverflow.com-Users.7z

> snowsql

> put file://./stackoverflow.com-Users.7z @~ auto_compress=false;

| source | target | source_size | target_size | source_compression | target_compression | status | message |
|----------------------------+----------------------------+-------------+-------------+--------------------+--------------------+----------+---------|
| stackoverflow.com-Users.7z | stackoverflow.com-Users.7z | 716792093 | 716792096 | NONE | NONE | UPLOADED | |

That’s a 716MB compressed file, which won’t fit in memory once decompressed. That’s OK: we can write a Python UDTF (a SQL table function) that returns each line as a row:

create or replace function read_lz7()
returns table(l string)
language python
runtime_version=3.8
handler='X'
packages = ('libarchive', 'python-libarchive-c')
imports = ('@~/stackoverflow.com-Users.7z')
as $$
import os
import sys
import libarchive

class X:
    def process(self):
        my_file = os.path.join(sys._xoptions["snowflake_import_directory"], 'stackoverflow.com-Users.7z')
        early_stop_counter = 0
        data = ''

        with libarchive.file_reader(my_file) as e:
            for entry in e:
                # Stream the archive block by block, so the decompressed file
                # never needs to fit in memory all at once.
                for block in entry.get_blocks():
                    #early_stop_counter += 1
                    #if early_stop_counter > 100:
                    #    return
                    data += block.decode('ISO-8859-1')
                    lines = data.splitlines()
                    # Keep any partial last line in the buffer for the next block.
                    if lines and not data.endswith('\n'):
                        data = lines.pop()
                    else:
                        data = ''
                    for line in lines:
                        yield (line,)
        # Flush whatever is left in the buffer after the last block.
        if data:
            yield (data,)
$$;

Notes:

  • It’s easy to decompress a .7z file in Python using python-libarchive-c (ready to be used inside Snowflake thanks to Anaconda).
  • Java UDFs can access arbitrary files, but Python UDFs need to have them listed in imports (for now, stay tuned).
  • I had trouble processing whole lines, so I asked Stack Overflow to improve my code.
  • The function doesn’t return until the whole file is processed. For debugging purposes and a faster return, you can use the commented early_stop_counter.
  • (Decompression might fail with very large files: I had no issues with Users.7z, but I’m still debugging Posts.7z.)
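Once the UDTF is created, a quick sanity check of my own is to count the lines it yields. Note that this still scans the whole archive, and the count includes the XML header and wrapper lines, so it should be slightly higher than the number of users:

select count(*) total_lines
from table(read_lz7());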

With that UDTF in hand, we can now materialize the 7z file into a traditional table, while parsing it as XML inside Snowflake:

select  x:"@Id"::int id
, x:"@DownVotes"::int downvotes
, x:"@UpVotes"::int upvotes
, x:"@Views"::int views
, x:"@Reputation"::int reputation
, x:"@CreationDate"::timestamp creation
, x:"@LastAccessDate"::timestamp last_acces
, x:"@DisplayName"::string display_name
, x:"@AboutMe"::string about
from (
select parse_xml(l) x
from table(read_lz7())
where check_xml(l) is null
)
;
The earliest Stack Overflow users — many of them still active
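To actually persist those rows, the same query can be wrapped in a CREATE TABLE AS statement. A minimal sketch, where the table name stackoverflow_users202303 is my own choice (keep the full column list from the query above):

create or replace table stackoverflow_users202303 as
select x:"@Id"::int id
, x:"@DisplayName"::string display_name
, x:"@Reputation"::int reputation
, x:"@CreationDate"::timestamp creation
-- ...plus the rest of the columns from the query above
from (
select parse_xml(l) x
from table(read_lz7())
where check_xml(l) is null
)
;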

Reading a .7z file with an AWS EC2 VM

Before having the ability to decompress 7z within Snowflake, I had been using AWS VMs to download and decompress the archives. You can do the same in your favorite cloud — just choose a VM close to your Snowflake account.

These were my steps to get the much larger Posts.7z (18GB):

  • Start a t4g.medium EC2 VM with 60GB of storage.
  • Then run the following script to download the dump, install p7zip, re-compress as gz, and then send to S3:
wget https://archive.org/download/stackexchange/stackoverflow.com-Posts.7z
# >2 hours

sudo yum-config-manager --enable epel
sudo amazon-linux-extras install epel
sudo yum install -y p7zip

7za x -so stackoverflow.com-Posts.7z | gzip > stackoverflow.com-Posts.gz

aws configure
aws s3 cp stackoverflow.com-Posts.gz s3://my-s3/202303/

Notes:

  • The 18GB file decompresses into ~90GB, but by re-compressing it on the fly into gz we end up with a ~23GB file.
  • For faster ingestion into Snowflake, I should have split it into multiple .gz files — but I skipped that part this time.

Once in S3, we can read this .gz into a Snowflake table — just set up the S3 bucket as an external stage:
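If the stage doesn’t exist yet, here is a minimal sketch of creating it. The stage name matches the one used below, while my_s3_integration is a placeholder for a storage integration (or inline credentials) that you would have configured for your bucket:

create or replace stage my_ext_stage
url = 's3://my-s3/'
storage_integration = my_s3_integration;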

list @my_ext_stage;

create table stackoverflow_posts_raw202303 (r string);

copy into stackoverflow_posts_raw202303
from @my_ext_stage/202303/stackoverflow.com-Posts.gz
FILE_FORMAT = (type = csv field_delimiter = '\0');

Notes:

  • Use a small warehouse for this step. It will take around 22 minutes regardless of the size of the warehouse (unless you split the gz as suggested earlier; see the sketch after these notes).
  • This used to be slower, but Snowflake keeps getting faster.
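If you did split the archive into multiple .gz files, COPY can load them in parallel across the warehouse. A sketch, assuming the parts were uploaded under the same prefix with names like stackoverflow.com-Posts-part-001.gz (a naming scheme of my own):

copy into stackoverflow_posts_raw202303
from @my_ext_stage/202303/
pattern = '.*stackoverflow.com-Posts-part-.*[.]gz'
FILE_FORMAT = (type = csv field_delimiter = '\0');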

Those are raw lines. To transform them into XML and then into a normal-looking table, we can run the following steps. Note that from now on, larger warehouses will make everything faster:

create or replace table stackoverflow_posts_xml202303 as
select parse_xml(r) x
from stackoverflow_posts_raw202303
where check_xml(r) is null
-- 27s 3XL
;

create or replace table stackoverflow_posts202303
as
select x:"@Id"::number id
, x:"@CreationDate"::timestamp creation_date
, x:"@LastActivityDate"::timestamp last_activity_date
, x:"@LastEditDate"::timestamp last_edit_date
, x:"@LastEditorUserId"::number last_edit_user_id
, x:"@OwnerUserId"::number owner_user_id
, x:"@ParentId"::number parent_id
, x:"@PostTypeId"::number type_id
, x:"@Score"::number score
, split(trim(x:"@Tags", '><'), '><') tags
, x:"@Title"::string title
, x:"@Body"::string body
, x:"@AcceptedAnswerId"::number accepted_answer_id
, x:"@ViewCount"::number view_count
, x:"@AnswerCount"::number answer_count
, x:"@CommentCount"::number comment_count
, x:"@FavoriteCount"::number favorite_count
, x:"@ClosedDate"::timestamp closed_date
, x:"@CommunityOwnedDate"::timestamp community_owned_date
, x:"@OwnerDisplayName"::string owner_display_name
, x:"@LastEditorDisplayName"::string last_editor_display_name
from stackoverflow_posts_xml202303
order by (type_id, id)
-- 18s 3XL
;

select max(creation_date)
from stackoverflow_posts202303
-- 2023-03-05 06:02:46.817
;

Note that you don’t necessarily need to transform the XML table into a traditional one. A view can be pretty fast too — as Snowflake stores the XML in columnar format:

create or replace view stackoverflow_posts202303_view
as
select x:"@Id"::number id
, x:"@CreationDate"::timestamp creation_date
, x:"@LastActivityDate"::timestamp last_activity_date
, x:"@LastEditDate"::timestamp last_edit_date
, x:"@LastEditorUserId"::number last_edit_user_id
, x:"@OwnerUserId"::number owner_user_id
, x:"@ParentId"::number parent_id
, x:"@PostTypeId"::number type_id
, x:"@Score"::number score
, split(trim(x:"@Tags", '><'), '><') tags
, x:"@Title"::string title
, x:"@Body"::string body
, x:"@AcceptedAnswerId"::number accepted_answer_id
, x:"@ViewCount"::number view_count
, x:"@AnswerCount"::number answer_count
, x:"@CommentCount"::number comment_count
, x:"@FavoriteCount"::number favorite_count
, x:"@ClosedDate"::timestamp closed_date
, x:"@CommunityOwnedDate"::timestamp community_owned_date
, x:"@OwnerDisplayName"::string owner_display_name
, x:"@LastEditorDisplayName"::string last_editor_display_name
from stackoverflow_posts_xml202303
;

Keeping a history of views

To build a timeline of which tags and questions get more views per quarter, I’ve been keeping an archive of the view counts from each snapshot.
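The insert below assumes the history table already exists; a minimal sketch of a definition matching the inserted columns (the exact schema is my assumption):

create table if not exists stacko_history (
id number
, score number
, view_count number
, quarter date
);

With that table in place, each new snapshot gets appended: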

insert into stacko_history
select id, score, view_count, '2023-03-01'::date
from stackoverflow_posts202303_view
where type_id=1;

With that historical table we can find out the total number of page views that Stack Overflow gets per quarter:

select quarter
, sum(view_count) total_views
, total_views - lag(total_views) over(order by quarter) quarter_views
from stacko_history
group by 1
order by quarter desc
;
Total number of page views that Stack Overflow gets per quarter

I was surprised to see that the number had been growing slowly since 2017, until a huge peak of 2.5 billion views in the 2022-09 quarter, followed by a huge drop in the 2022-12 quarter. But then I remembered that the September snapshot came out pretty late:

select max(creation_date)
from stackoverflow_posts202303;
-- 2023-03-05

select max(creation_date)
from stackoverflow_posts202212;
-- 2022-12-04

select max(creation_date)
from stackoverflow_posts202209;
-- 2022-09-25

With 20 extra days, can you see what drove that peak? This is the life of a data scientist/engineer: surprises can often be explained by anomalies in data collection, or by simple facts like February having fewer days than the other months.

With that said, the following query applies a quick & dirty fix for one quarter having 112 days, and the following one only 70:

select quarter 
, sum(view_count) total_views
, ((total_views-lag(total_views) over(order by quarter)) *
coalesce(
{'2022-09-01':90/112, '2022-12-01':90/70}[quarter]::float, 1))::int quarter_views
from stacko_history
group by 1
order by quarter desc
;
Total number of page views that Stack Overflow gets per quarter (adjusted)

This adjusted chart shows a normal peak of page views for Stack Overflow in the December quarter, with a normal March drop.

A story of pageviews per quarter per tag

With that we can now see how many pageviews per quarter Snowflake is getting on Stack Overflow — compared to other products in the data analytics world:

with interesting as (
select array_construct(
'snowflake-cloud-data-platform'
, 'google-bigquery'
, 'databricks'
, 'amazon-redshift'
, 'azure-synapse'
) tags
), views_per_question_tag as (
select a.id
, array_agg(array_construct(quarter, a.view_count)) within group (order by quarter desc) views
, any_value(b.tags) tags
, any_value(title) title
from stacko_history a
join (
select *
from stackoverflow_posts202303_view
where type_id=1 -- questions
) b
on a.id=b.id
join interesting
where arrays_overlap(interesting.tags, b.tags)
group by 1
)

select tag.value::string tag, v.value[0]::date quarter, sum(v.value[1]) total_views, count(*) c
, total_views-lag(total_views) over(partition by tag order by quarter) q_views_uncorrected
, (q_views_uncorrected * coalesce(
{'2022-09-01':90/112, '2022-12-01':90/70}[quarter]::float, 1))::int q_views
from views_per_question_tag, table(flatten(tags)) tag, table(flatten(views)) v
join interesting
where arrays_overlap(tag.value, interesting.tags)
group by 1, 2
qualify q_views>0
order by tag, quarter desc
;
Page views per quarter for each tag on Stack Overflow (adjusted)
Growth in Stack Overflow quarterly pageviews per tag

And here is my favorite piece of news: Snowflake was the only one of the major players that got an increasing number of page views in the March 2023 quarter. This is thanks to our awesome users, teammates, and everyone working hard to ask and answer questions on the Snowflake tag on Stack Overflow. Thanks all!

Next steps

  • This is only the start — what else can you find out in the Stack Overflow archives?
  • How about sharing a Streamlit app to make these results interactive for every tag? Stay tuned.

Want more?

I’m Felipe Hoffa, Data Cloud Advocate for Snowflake. Thanks for joining me on this adventure. You can follow me on Twitter and LinkedIn. And subscribe to reddit.com/r/snowflake for the most interesting Snowflake news.
