Which is the most popular article on Hackernews in 2022H1?

Jinpeng Zhang
8 min read · Oct 10, 2022

Background

As a Hackernews user, I have a bunch of questions that I want to find answers to.

  • Which was the most popular article on Hackernews in 2022?
  • Who submitted the most articles in the past 3 months? Who is the 2nd one? ...
  • Who submitted the most comments in the past 7 days? Who is the 2nd one? ...
  • In a day, when is the most active period for article/comment submissions?
  • How many comments were submitted this year? ...

Construct a live collection of Hackernews’ Data

To answer these questions, I need a live collection of Hackernews data. Fortunately, Hackernews provides a set of APIs (https://hackernews.api-docs.io/v0/overview/introduction) to retrieve both its historical and latest data:

Get the max item id

curl --request GET \
--url 'https://hacker-news.firebaseio.com/v0/maxitem.json?print=pretty'

Get a specified item (an item can be a story/article, a comment, a job, etc.)

curl --request GET \
--url 'https://hacker-news.firebaseio.com/v0/item/{item-id}.json?print=pretty'

Get recently changed items and profiles

curl --request GET \
--url 'https://hacker-news.firebaseio.com/v0/updates.json?print=pretty'

Now I have the APIs to grab Hackernews data; I also need a continuously running crawler to fetch newly generated data.

I wrote a Hackernews crawler in Rust: https://github.com/zhangjinpeng1987/hackernews-crawler. It monitors changes to the max item id and fetches newly submitted items. You can check the repo for more implementation details.

I also need a database to store these items. At the beginning, I chose MySQL 8.0 as the database. The table schema is:

USE hackernews;
CREATE TABLE `items` (
`id` int(10) NOT NULL,
`deleted` tinyint(4) DEFAULT '0',
`type` varchar(16) DEFAULT NULL,
`who` varchar(255) DEFAULT NULL,
`time` int(11) DEFAULT NULL /*Creation date of the item, in Unix Time*/,
`dead` tinyint(4) DEFAULT '0',
`kids` text DEFAULT NULL,
`title` text DEFAULT NULL,
`content` text DEFAULT NULL,
`score` int(10) DEFAULT NULL,
`url` text DEFAULT NULL,
`parent` int(10) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `_idx_time` (`time`)
);
CREATE TABLE `maxitem` (
`id` int(10) NOT NULL,
`maxid` int(10) DEFAULT NULL,
PRIMARY KEY (`id`)
);
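
For each item the crawler fetches, it writes one row into items and advances the stored max id so it can resume after a restart. The exact statements live in the crawler repo; below is only a minimal sketch of what such an upsert might look like, with made-up values for illustration:

-- Hypothetical upsert for one fetched item (illustrative values only):
INSERT INTO items (id, type, who, time, title, score, url)
VALUES (30000001, 'story', 'someuser', 1642633200, 'An example title', 1, 'https://example.com')
ON DUPLICATE KEY UPDATE score = VALUES(score), title = VALUES(title);

-- Record crawl progress so the crawler can resume where it left off:
INSERT INTO maxitem (id, maxid) VALUES (1, 30000001)
ON DUPLICATE KEY UPDATE maxid = VALUES(maxid);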

As we can see, the items table has two indices. One is the primary index on id, which is typically used by point queries like:

SELECT who, type, time, url FROM items WHERE id=30000001;
UPDATE items SET score=score+1 WHERE id=30000001;

The other is the time index, which is typically used to query data in a relatively narrow time range:

SELECT COUNT(*) FROM items WHERE time > UNIX_TIMESTAMP('2022-07-01');

I set the crawler's start item id to 30000001, which corresponds to roughly 2022-01-20, and the crawler fetched all items from that point on.

Run Queries

Which was the most popular article on Hackernews in 2022?

Let’s go back to the questions we asked at the beginning of this article. Which was the most popular article on Hackernews in 2022? There are two ways to judge how popular an article is: by its score, or by its total number of comments.

Find the article with the highest score

The query looks like:

SELECT id AS article, score FROM items WHERE time > UNIX_TIMESTAMP('2022-01-01')
ORDER BY score DESC LIMIT 10;

As we can see, the article https://news.ycombinator.com/item?id=31261533 has the highest score.

Find the article with the most comments

The query looks like:

SELECT parent AS article, COUNT(*) AS comment_cnt FROM items
WHERE time > UNIX_TIMESTAMP('2022-01-01') AND type='comment'
GROUP BY article ORDER BY comment_cnt DESC LIMIT 10;

The answer is https://news.ycombinator.com/item?id=30934529.
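
Note that a comment's parent field points to its immediate parent, which may itself be a comment, so strictly speaking this query ranks items by the number of direct replies. If we also want the winning articles' titles, we can join the aggregate back to items; a sketch:

SELECT i.id, i.title, c.comment_cnt
FROM items i
JOIN (
    SELECT parent, COUNT(*) AS comment_cnt FROM items
    WHERE time > UNIX_TIMESTAMP('2022-01-01') AND type='comment'
    GROUP BY parent
) c ON c.parent = i.id
ORDER BY c.comment_cnt DESC LIMIT 10;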

Why are these queries so slow?

Although we got the answers, both of the above queries are slow. The 2nd query even took more than half a minute.

There are about 2 million Hackernews items in total in 2022, and the disk size of these 2 million rows is about 2GB. Although both queries can use the _idx_time index, they still need to walk through all 2 million rows, which makes them slow.

Slowness is just one shortcoming. What is more serious is that running such expensive queries has a huge impact on online transactional processing queries like point selects and point updates.

Redesign the schema & index to speed up such ad-hoc queries?

Maybe we could redesign the schema, for example by partitioning the items table by time and adding a partitioned index on score. But that would make point queries like update/select where id=xxx difficult, so I gave up on this approach; see the sketch below.
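
For illustration, here is a sketch of what such a time-partitioned table might look like (a hypothetical items_by_time table I did not actually create). MySQL requires the partitioning column to be part of every unique key, so the primary key would have to become (id, time), and a lookup by id alone would have to probe every partition:

CREATE TABLE items_by_time (
`id` int(10) NOT NULL,
`type` varchar(16) DEFAULT NULL,
`who` varchar(255) DEFAULT NULL,
`time` int(11) NOT NULL,
`score` int(10) DEFAULT NULL,
PRIMARY KEY (`id`, `time`) /* id alone can no longer be the primary key */
)
PARTITION BY RANGE (`time`) (
PARTITION p2022h1 VALUES LESS THAN (UNIX_TIMESTAMP('2022-07-01')),
PARTITION pmax VALUES LESS THAN MAXVALUE
);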

In a day, when is the most active period for article/comment submissions?

SELECT HOUR(FROM_UNIXTIME(time)) AS hour, COUNT(*) AS number_of_articles
FROM items WHERE time > UNIX_TIMESTAMP('2022-01-01')
AND type='story' GROUP BY hour ORDER BY number_of_articles DESC;

SELECT HOUR(FROM_UNIXTIME(time)) AS hour, COUNT(*) AS number_of_comments
FROM items WHERE time > UNIX_TIMESTAMP('2022-01-01')
AND type='comment' GROUP BY hour ORDER BY number_of_comments DESC;

Not bad: both queries returned in less than 3 seconds. But if the total number of items were 10x or 100x larger, it would take 30s or 300s to get the answer.

Anyway, we can draw some conclusions: hackers were more active at night, producing more comments at 1 AM than at 11 PM, but more articles at 11 PM than at 1 AM. Maybe new comments lag behind new articles. Interesting.
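
One caveat: FROM_UNIXTIME() uses the session time zone, so the hour buckets above depend on the server's time zone setting. To make the buckets unambiguous, you can pin the session to UTC first; the offset syntax below works even without the named time zone tables loaded:

-- Pin the session to UTC, then re-run the hourly histogram:
SET time_zone = '+00:00';
SELECT HOUR(FROM_UNIXTIME(time)) AS hour, COUNT(*) AS number_of_comments
FROM items WHERE time > UNIX_TIMESTAMP('2022-01-01')
AND type='comment' GROUP BY hour ORDER BY number_of_comments DESC;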

Who submitted the most articles/comments in the past 3 months? Who is the 2nd one?

Who submitted the most articles in the past 3 months? The query is:

SELECT who AS author, COUNT(*) AS number_of_articles FROM items
WHERE time > UNIX_TIMESTAMP('2022-04-01') AND type='story' AND who <> ''
GROUP BY author ORDER BY number_of_articles DESC LIMIT 10;

As we can see, the user Tomte submitted 1406 articles in the past 3 months, nearly 8 articles every day.

Who submitted the most comments in the past 3 months? The query is:

SELECT who AS commenter, COUNT(*) AS number_of_comments FROM items
WHERE time > UNIX_TIMESTAMP('2022-04-01') AND type='comment' AND who <> ''
GROUP BY commenter ORDER BY number_of_comments DESC LIMIT 10;

We can see that the user ncmncm is the No.1 commenter, submitting nearly 15 comments every day.

How many comments were submitted this year?

SELECT COUNT(*) FROM items WHERE time > UNIX_TIMESTAMP('2022-01-01') AND type='comment';
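
For context, a small variant (not one of the original questions) breaks this year's items down by type:

SELECT type, COUNT(*) AS cnt FROM items
WHERE time > UNIX_TIMESTAMP('2022-01-01')
GROUP BY type ORDER BY cnt DESC;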

Any way to run these queries faster?

Use an analytical database

We could use an analytical system like ClickHouse or Spark to replace MySQL as the primary database, but then we would have to abandon the online transactional querying abilities:

  • Updating/querying an item by id, or a profile by user, becomes hard.
  • Querying the latest articles/comments within a few seconds is a big challenge, because analytical systems usually batch writes to improve performance, so there is a notable data delay.

Fortunately, there is a ClickHouse playground (https://play.clickhouse.com/play?user=play) that contains the Hackernews data, although it only refreshes every hour. We can run our queries there.

But, unfortunately, these queries didn't produce results there.

Introduce a data warehouse downstream

Of course, we could introduce a downstream data warehouse like Spark. But there are two shortcomings:

  • You need to maintain both MySQL and Spark, plus the data replication pipeline between them.
  • You may not see real-time results.

Use a real-time HTAP database like TiDB

I deployed a TiDB 6.1 cluster on the same machine as MySQL 8.0 and ran the above queries. BTW, I followed this doc and created a columnar replica for the items table. For those who don't want to deploy and maintain a TiDB cluster, you can try TiDB Cloud here.

ALTER TABLE items SET TIFLASH REPLICA 1;
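
Replica creation is asynchronous. You can check its progress through the information_schema.tiflash_replica system table; PROGRESS reaches 1 once the replica is fully synced:

SELECT TABLE_SCHEMA, TABLE_NAME, AVAILABLE, PROGRESS
FROM information_schema.tiflash_replica
WHERE TABLE_SCHEMA = 'hackernews' AND TABLE_NAME = 'items';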

Comparison between TiDB and MySQL

Why is TiDB so fast?

TiDB is an open-source NewSQL database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. There are two types of storage engine inside TiDB: the row storage engine TiKV and the columnar storage engine TiFlash. TiDB can leverage the low latency of the row engine and the fast scan ability of the columnar engine.

Dual storage engines of TiDB
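
You can check which engine the optimizer picks by EXPLAINing a query, and you can restrict reads to the columnar replica with the tidb_isolation_read_engines session variable. For example:

-- Prefer TiFlash for reads in this session ('tidb' stays for system tables):
SET SESSION tidb_isolation_read_engines = 'tiflash,tidb';
-- The plan should now show the table scan running on TiFlash:
EXPLAIN SELECT parent AS article, COUNT(*) AS comment_cnt FROM items
WHERE time > UNIX_TIMESTAMP('2022-01-01') AND type='comment'
GROUP BY article ORDER BY comment_cnt DESC LIMIT 10;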

You can get more technical details from the TiDB VLDB paper: https://www.vldb.org/pvldb/vol13/p3072-huang.pdf. I plan to give a more detailed interpretation of this paper in a future blog post.


Jinpeng Zhang

Engineering Director at PingCAP, focusing on large-scale distributed storage, databases, change data capture, etc.