Scraping Hacker News
A simple Python program
Hacker News (HN) is one of my major sources of information, but it is also a major source of pain, especially if I want to follow a certain topic daily or weekly (e.g. driverless cars). I have to:
- First enter the query in the search box.
- Order the query results by time.
- Click through the links on multiple pages of results and check the contents one by one.
One rainy afternoon, I thought to myself: “There should be a better way!” So I turned on my coffee maker, and by that evening I had written a notebook that does the following:
- Queries the Hacker News (HN) API with your query of choice.
- Gets the links to the latest news on your query.
- Scrapes the content of the links which are returned by HN API.
- De-duplicates the content.
- Saves all this information in an Excel file.
The notebook relies on a handful of packages:
- Requests communicates with the REST API used by HN.
- Pandas organizes the data from the HN API in a dataframe.
- BeautifulSoup is an awesome package for scraping content from links.
- files from google.colab downloads the final Excel file to your local disk (you can also save the final data in a Google Sheet).
- Note: you don’t need simplejson to run the code, but it comes in handy if you want to inspect the API response.
Query HN’s API
The code makes a request to the API URL based on the specified source and returns the results. Here are a couple of pointers about the code:
- You can extend the function to other sources (e.g. Reddit) simply by turning the source variable into a list and extending the if-statement with new sources.
- The HN API returns results sorted by date when you use search_by_date in the API call (handy!).
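The notebook itself isn’t reproduced here, but a minimal sketch of what api_call might look like follows. The endpoint is the real Algolia-backed HN search API; the function and parameter names (build_request, n_pages) are my assumptions, not the article’s exact code:

```python
import requests
import pandas as pd

def build_request(query, page, source="hn"):
    """Return (url, params) for one results page. Extending this
    if-statement (and allowing a list of sources) would add new
    sources such as Reddit."""
    if source == "hn":
        # search_by_date returns the newest stories first
        return ("https://hn.algolia.com/api/v1/search_by_date",
                {"query": query, "tags": "story", "page": page})
    raise ValueError(f"unknown source: {source}")

def api_call(query="autonomous vehicle", n_pages=6, source="hn"):
    """Query the HN search API page by page and collect the hits
    into a dataframe of dates, titles and URLs."""
    hits = []
    for page in range(n_pages):
        url, params = build_request(query, page, source)
        response = requests.get(url, params=params)
        response.raise_for_status()
        hits.extend(response.json()["hits"])
    return pd.DataFrame(hits)[["created_at", "title", "url"]]
```

Keeping the URL construction in its own helper makes the multi-source extension a one-line change per source.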
The scraper function’s input is the dataframe that we generated with api_call; it contains the URLs of all the links returned by the HN API. The function goes through the URLs, scrapes each page and extracts its contents: I take all the text in the HTML that sits inside <p></p> tags and concatenate it with newlines (\n).
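A sketch of that scraper, assuming the dataframe columns from the api_call sketch above (the column names “url” and “content” are my assumptions):

```python
import requests
import pandas as pd
from bs4 import BeautifulSoup

def extract_paragraphs(html):
    """Join the text of all <p> tags with newlines."""
    soup = BeautifulSoup(html, "html.parser")
    return "\n".join(p.get_text() for p in soup.find_all("p"))

def scraper(df):
    """Fill a 'content' column by scraping each URL in the dataframe."""
    contents = []
    for url in df["url"]:
        try:
            response = requests.get(url, timeout=10)
            contents.append(extract_paragraphs(response.text))
        except Exception:
            contents.append("")  # unreachable pages become empty rows
    df["content"] = contents
    return df
```

Swallowing request failures into empty strings keeps one dead link from aborting the whole run; the empty rows are dropped later anyway.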
Example: “autonomous vehicle”
In this example I demonstrate the steps I take, from the API call to downloading the Excel file.
- Using the api_call function, we read the first 6 pages of HN results. Note that I use the default query from the function definition; if you are interested in another query, just add query = “<your query>” to the function input.
- The scraper function fills the content column of the dataframe using URLs provided by HN API.
- I then remove the empty rows, de-duplicate rows whose contents share the same title, and reset the index of the dataframe.
- The next couple of lines generate a string for the date on which the function is run. I use this string in the name of the Excel file.
- We can then download that file from Google Drive / Colaboratory with a single line of Python.
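Those post-processing steps might look roughly like this; the column names and helper names are my assumptions, while files.download is the real Colab helper mentioned earlier (it only exists inside Colab, so it is commented out here):

```python
from datetime import date
import pandas as pd
# from google.colab import files  # available only inside Colab

def clean(df):
    """Drop empty rows, de-duplicate by title, and reset the index."""
    df = df[df["content"] != ""]
    df = df.drop_duplicates(subset="title")
    return df.reset_index(drop=True)

def excel_name(today=None):
    """Build a file name containing the run date, e.g. hn_2020-01-02.xlsx."""
    today = today or date.today()
    return f"hn_{today.isoformat()}.xlsx"

# df = clean(df)
# name = excel_name()
# df.to_excel(name, index=False)
# files.download(name)  # pushes the file to your local disk
```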
There are many ways to proceed from here, e.g.:
- Add Boolean query capability by making multiple calls and applying set operations on the dataframes.
- Add more sources, such as Reddit or Quora, to the api_call function.
- A nice user interface also doesn’t hurt!
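As a sketch of the first idea: if each call returns a dataframe with a unique url column, Boolean AND and OR reduce to set operations on those columns (the helper names here are hypothetical):

```python
import pandas as pd

def and_query(df_a, df_b):
    """Boolean AND: keep stories that appear in both result sets,
    matching on the url column."""
    return df_a[df_a["url"].isin(df_b["url"])].reset_index(drop=True)

def or_query(df_a, df_b):
    """Boolean OR: union of the two result sets, de-duplicated by url."""
    both = pd.concat([df_a, df_b])
    return both.drop_duplicates(subset="url").reset_index(drop=True)
```

So “self-driving AND regulation” would be and_query(api_call(query="self-driving"), api_call(query="regulation")).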
* What is Google Colaboratory?
Colaboratory is basically a Jupyter notebook that runs in your browser. It is connected to your Google Drive and …wait for it… behaves like a Google Doc, i.e. you can share the notebook and multiple people can edit the same code at the same time. You can also read and write datasets from/to Google Drive, even as Google Sheets.