Hacker News on BigQuery: Now with daily updates — So what are the top domains?

We published a copy of all Hacker News contents in BigQuery back in 2015. It was time for an update, and even better than that, how about daily updates? In this post let’s look at the HN favorite sources, and changes in the best times to post.

Felipe Hoffa
Mar 17, 2017 · 5 min read

Thanks to the BigQuery Public Datasets Program, we now update Hacker News in BigQuery daily. Nothing better than having fresh data to analyze every day. To celebrate the occasion, I want to look at the top domains that Hacker News uses as sources. I recently did something similar for reddit, which was full of interesting surprises.

So what are the top domains shared in Hacker News during 2017?

Most frequent domains on HN, 2017. Source: Hacker News copy shared in BigQuery
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') domain, COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY domain ORDER BY c DESC LIMIT 10

That’s interesting, but not what most users see. Let’s rank by number of posts that have scored at least 40 upvotes — the ones that make it to the front page:

Most frequent domains with score>40 on HN, 2017. Source: Hacker News copy shared in BigQuery
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1 ORDER BY 3 DESC LIMIT 10

Certainly Hacker News likes content hosted on sites like github.com and the NYTimes. But some of those radios look abysmal. What domains have the best chance of getting more than 40 upvotes?

Domains which have landed the most % of submissions in the HN frontpage, 2017 (score>40, count>30). Source: Hacker News copy shared in BigQuery
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>30
ORDER BY 4 DESC LIMIT 20

Oh, that’s cool! As a Googler I’m also proud to see that 3 of our blogs are on the top 10 of Hacker News worthy content. And a shoutout to Gwern Branwen, the only individual author in the top 20.

Let’s look at all the domains that have submitted at least 300 posts already this year:

All domains with > 300 posts on HN, 2017, ranked by chances of getting >40 upvotes. Source: Hacker News copy shared in BigQuery
#standardSQL
SELECT REGEXP_EXTRACT(url, '//([^/]*)/?') top_domains_2017, COUNT(*) count, COUNTIF(score>40) score_gt_40
, ROUND(100*COUNTIF(score>40)/COUNT(*),2) chances_of_homepage
FROM `bigquery-public-data.hacker_news.full`
WHERE url!='' AND EXTRACT(YEAR FROM timestamp)=2017
GROUP BY 1
HAVING count>300
ORDER BY 4 DESC

Lessons learned:

  • Main stream sites like the washingtonpost.com, bbc.com, bloomberg.com, and nytimes.com have the greatest chance of producing front-page worthy content. All above 10%.
  • TechCrunch continues to be one of the HN favorite tech sources (>9%).
  • For self hosting the best platform seems to be github.com (>7%). Meanwhile self hosted sites like medium.com and hackernoon.com (both Medium platform) have a < 3% chance of reaching the front page. So should I move this post out of Medium and into a GitHub project for HN to read it?
  • Wikipedia and Youtube videos have a low chance of frontpage (~2%).
  • Don’t use link shorteners (goo.gl, youtu.be): They exhibit 0% chance of frontpage.

Looks consistent with the 2016 results:

All domains with > 1350 posts to HN in 2016, ranked by chances of getting >40 upvotes. Source: Hacker News copy shared in BigQuery

And who remembers these top domains from 2008 and 2009?

Best time to submit for a front page chance

Many want to know: What’s the best time to submit a story to HN for a front page chance (see Max Woolf’s analysis)? Turns it out it varies week by week. Plotted here for each week during the last 4 years, which was the hour with the highest chance of front page:

Turns out there was a clear pattern for years: Don’t submit in the middle of the California day. That changed during 2016 — was there an algorithm or behavioral change driving this? Only the admins know.

But what I do know is that Saturdays behave way different than Mondays, for sure. I’ll let you play with that:

#standardSQL
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY month ORDER BY chances_of_homepage DESC) rn
FROM (
SELECT TIMESTAMP_TRUNC(timestamp, WEEK) month
, EXTRACT(HOUR FROM timestamp AT TIME ZONE 'US/Pacific') hour
, ROUND(100*COUNTIF(score>40)/COUNT(*),4) chances_of_homepage
, COUNT(*) c
FROM `bigquery-public-data.hacker_news.full`
WHERE type='story' AND EXTRACT(YEAR FROM timestamp)>2012
AND EXTRACT(DAYOFWEEK FROM timestamp)=2
GROUP BY 1,2
)
)
WHERE rn=1 ORDER BY 1,2

Want more?

Want more stories? Check my Medium, follow me on twitter, and subscribe to reddit.com/r/bigquery. And try BigQuery — every month you get a full terabyte of analysis for free.

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store