Big data stories in seconds: Hacker News and BigQuery

After having a lot of fun with reddit’s data on BigQuery (collected by @jasonbaumgart; see the announcement and Max Woolf’s how-to), it was time to play with another forum that attracts a lot of attention: Hacker News.

Firebase hosts the official Hacker News API, and my friends at Firebase (Jenny Tong, @JamesTamplin) helped me obtain a dump of all Hacker News stories and comments since 2007. With this data in BigQuery, it was time to start querying.

Let’s start by visualizing Hacker News growth 2007–2015:

It’s interesting to see how growth has been stagnant since 2012. Why? Not sure. In the meantime, I left sample code for this and other visualizations in an IPython/Jupyter notebook. Also make sure to read the comments on the “Hacker News on BigQuery” announcement post (thx Max Woolf).

Other visualizations in that notebook include the best times to post on Hacker News to get more than, let’s say, 30 votes:
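To make the idea concrete, here is a minimal Python sketch of that kind of analysis. It is not the notebook's actual query: it assumes each story is available as a (Unix timestamp, score) pair, and the sample rows and the function name are made up for illustration. It groups stories by posting hour (UTC) and computes the share that cleared the 30-vote mark.

```python
from collections import defaultdict
from datetime import datetime, timezone


def share_above_threshold_by_hour(stories, threshold=30):
    """Return {hour_utc: fraction of stories with score > threshold}.

    `stories` is an iterable of (unix_timestamp, score) pairs. This row
    layout is an assumption for illustration, not the real table schema.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for posted_at, score in stories:
        hour = datetime.fromtimestamp(posted_at, tz=timezone.utc).hour
        totals[hour] += 1
        if score > threshold:
            hits[hour] += 1
    return {hour: hits[hour] / totals[hour] for hour in sorted(totals)}


# Hypothetical sample rows: (Unix timestamp, score)
SAMPLE_STORIES = [
    (1420113600, 45),   # 2015-01-01 12:00 UTC
    (1420115400, 3),    # 2015-01-01 12:30 UTC
    (1420135200, 31),   # 2015-01-01 18:00 UTC
    (1420137000, 30),   # 2015-01-01 18:30 UTC (exactly 30 does not count)
    (1420137900, 120),  # 2015-01-01 18:45 UTC
]

shares = share_above_threshold_by_hour(SAMPLE_STORIES)
```

On the full dataset the same grouping would run as a single aggregate query in BigQuery. The Python version only shows the logic on a handful of rows.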

The most fun part of having a dataset in BigQuery is the ability to start combining it with others. For example, GitHub. When a project gets posted to the Hacker News homepage it generates a lot of attention: can we measure this?

I left the instructions to combine both datasets in a reddit /r/bigquery post.

But the fun doesn’t end there! My latest experiment is looking at the story of Bitcoin through Hacker News and reddit:

Here is the how-to for combining the 3 datasets (another /r/bigquery post).

There are so many more discoveries to make here! For example, I recently saw a presentation from Kodok Márton where he looks for the most famous books by finding Amazon links.
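As a rough sketch of how link counting like this could work (not Kodok Márton's actual method, which isn't reproduced here), the Python below pulls Amazon item IDs out of comment text and tallies them. The URL shapes it recognizes (`/dp/<ID>` and `/gp/product/<ID>`), the sample comments, and the function name are all assumptions for illustration.

```python
import re
from collections import Counter

# Assumed URL shapes: amazon.com/dp/<ID>, amazon.com/<title>/dp/<ID>
# and amazon.com/gp/product/<ID>, where <ID> is a 10-character item ID.
AMAZON_ITEM = re.compile(
    r"amazon\.com/(?:[^\s/]+/)?(?:dp|gp/product)/([A-Z0-9]{10})"
)


def count_amazon_items(texts):
    """Count in how many texts each Amazon item ID is linked."""
    counts = Counter()
    for text in texts:
        # set() so a comment that links the same item twice counts once
        counts.update(set(AMAZON_ITEM.findall(text)))
    return counts


# Hypothetical comments, made up for this example
SAMPLE_COMMENTS = [
    "Loved it: https://www.amazon.com/Some-Book-Title/dp/0262510871/ref=x",
    "Same book, http://www.amazon.com/dp/0262510871 is the short link",
    "Classic: http://www.amazon.com/gp/product/0131103628",
    "No links in this one.",
]

item_counts = count_amazon_items(SAMPLE_COMMENTS)
top_items = item_counts.most_common(3)
```

Mapping the IDs back to book titles would need a separate lookup, which this sketch leaves out.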

The best part? It only takes seconds to answer your questions once you find this data in BigQuery. If you’ve never done it, find out how — you’ll be running queries like these in less than 5 minutes from now.

November 9th update: Deedy made it to the Hacker News frontpage with a full analysis of these 9 years of HN.