Scraping Medium’s home stories with headless Google Chrome
Using headless Google Chrome has become easier than ever with the release of Puppeteer, Google Chrome’s Node library. The library lets us open any web page and scrape its content automatically, as if we were doing it through the Chrome DevTools console. Awesome, right?
Maybe you have heard about scraping and its possibilities several times but never found the time to try it. This article is a walkthrough of web scraping using NodeJS and headless Google Chrome.
Let’s get started
As you may know, Medium’s homepage contains a feed of curated content that you might be interested in, grouped into categories such as Technology, Creativity, Entrepreneurship, and today’s top stories.
What if I want to capture those stories to see what is trending today and how trending topics change over time? Well, that is a perfect use case for automated scraping!
Determining which information is essential for us
Automated scraping lets us extract the information we want from any specific page, but first we need to decide which data is relevant.
The page features the current trending stories and the category each one belongs to, so we can single out the story identifier, title, link, date, description, author, and image as the most important attributes.
Taking that into account, the best way to structure our information is to group the stories by category at the first level, with an array of story information inside each category item.
```
[
  {
    "name": "Category Name",
    "stories": [
      {
        "title": "Story Title",
        "description": "A brief description of the story",
        "author": "Jesús Botella"
      }
    ]
  }
]
```
Configuring the environment
Google Chrome’s Puppeteer library makes this process easier than before. To install it, we only need to have a Node project and run the following command in the console:
npm i puppeteer
In addition to the library itself, the command downloads a recent Chromium build that is guaranteed to work with the API, so we don’t have to worry about browser compatibility.
After that, everything is set to start coding the main script. Pretty easy, right?
Launching our automated headless Google Chrome instance
Now that we know the data to extract, and the environment is ready, let’s tell the browser what it needs to do.
This automated browser behaves the same way it would if a user were controlling it through the GUI.
The interesting lines are lines 4 to 11.
In line 5, we launch the browser instance. In line 8, we create a new page instance, which is like opening a new tab in our browser. Then, in line 11, we navigate to the medium.com page with an option called waitUntil.
This setting means that page navigation won’t be considered finished until network activity has stayed idle for one second. Therefore, any page.evaluate calls or other actions in the browser context will wait until that point, which is useful when scraping PWAs that render content after the initial load.
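The original snippet is embedded as a gist, so here is a hedged sketch of the launch sequence instead. The function name is my own, and the waitUntil value shown (networkidle2) is the option name used by recent Puppeteer versions:

```javascript
// Sketch of the launch sequence, assuming puppeteer is installed
// (npm i puppeteer). scrapeMediumHome is an illustrative name.
async function scrapeMediumHome() {
  const puppeteer = require('puppeteer');

  // Launch the headless browser instance
  const browser = await puppeteer.launch();

  // Create a new page, which is like opening a new tab
  const page = await browser.newPage();

  // Navigate and wait until network activity has been idle
  await page.goto('https://medium.com/', { waitUntil: 'networkidle2' });

  // ...extraction with page.evaluate would happen here...

  await browser.close();
}
```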
Just a few lines of code to navigate where we want, but now what? Now we want to extract the data, right?
Getting the data and executing code inside our browser instance
Puppeteer allows us to run code inside the page context using the page.evaluate method. The function body is transferred to the browser instance through the DevTools protocol and executed just as it would be via the browser’s console.
All the code I used in the example is plain JS, but Puppeteer lets you inject any library you need (jQuery too).
If you want to use any ES6 features, you will need to check the bundled browser’s version to know whether it supports them.
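To make the browser-context boundary concrete, here is a hedged toy example (getPageTitle is my own name, not from the article): the callback passed to page.evaluate runs inside the tab, where browser globals are available.

```javascript
// Toy illustration: the callback runs in the page context, so it can
// read browser globals like `document` that don't exist in Node.
async function getPageTitle(page) {
  return page.evaluate(() => document.title);
}
```

Anything the callback returns must be serializable, since it travels back to Node over the DevTools protocol.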
This is the code I used to retrieve all the information from the page:
Dividing the whole process into steps: we need to gather all the sections first, and then extract each section’s name as well as its stories’ information. For that kind of work, we use auxiliary functions.
These functions, such as extractHomeFeedSections, extractSectionName, and extractPostsInformation, are the ones that retrieve the data from the HTML elements held in the DOM tree.
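The gist with the real implementation isn’t reproduced here, but a minimal sketch of how those helpers could fit together might look like this (every selector is a placeholder; Medium’s actual class names differ):

```javascript
// Placeholder selectors: 'section', 'h2', 'article', and 'h3' stand in
// for Medium's real markup, which this sketch does not reproduce.
function extractSectionName(section) {
  const heading = section.querySelector('h2');
  return heading ? heading.innerText : null;
}

function extractPostsInformation(section) {
  return Array.from(section.querySelectorAll('article')).map((post) => ({
    title: post.querySelector('h3').innerText,
  }));
}

function extractHomeFeedSections(doc) {
  // Group the page into { name, stories } objects, one per category
  return Array.from(doc.querySelectorAll('section')).map((section) => ({
    name: extractSectionName(section),
    stories: extractPostsInformation(section),
  }));
}
```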
The data extraction breaks down into two parts:
- Retrieving the nodes that we want to extract the info from
The easiest way to do this is by using the querySelectorAll and querySelector functions from the Document and Element DOM APIs.
- Getting the inner content or any property of the nodes we found
We can usually find the info we want inside the text content of the node we retrieved, but this is not always the case. It may be within a style property or any other attribute of the HTML tag.
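For instance, a story image URL often lives in an inline background-image style rather than in the node’s text. A hedged helper could parse it like this (the style format shown is an assumption about the markup, not Medium’s documented format):

```javascript
// Assumes the node carries something like
//   style="background-image: url('https://cdn.example.com/image.png')"
// which is a guess about the markup for illustration purposes.
function extractImageUrl(node) {
  const style = node.getAttribute('style') || '';
  const match = style.match(/url\(["']?([^"')]+)["']?\)/);
  return match ? match[1] : null;
}
```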
So, if we take a look at our extractStoriesInformation function, we can see the two phases I mentioned.
The first phase locates all the story elements inside each section element by querying each node. Then, we look for the inner components that contain our relevant information.
In the second phase, once we have the nodes, we extract the information by reading their innerText or by getting an attribute of the queried node.
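Those two phases can be sketched as follows (the selectors are invented for illustration and do not match Medium’s real markup):

```javascript
// Phase 1: query the story nodes; Phase 2: read innerText / attributes.
// '.story', '.story-title', and 'a' are illustrative selectors.
function extractStoriesInformation(sectionElement) {
  // Phase 1: locate the story elements inside the section
  const storyNodes = Array.from(sectionElement.querySelectorAll('.story'));

  // Phase 2: pull the relevant data out of each node
  return storyNodes.map((story) => ({
    title: story.querySelector('.story-title').innerText,
    link: story.querySelector('a').getAttribute('href'),
  }));
}
```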
If you want to better understand each chunk of code, take a look at the whole section markup of the Medium page. Here I give you a sneak peek of the categories container:
What do we receive when page.evaluate resolves?
Page.evaluate returns a Promise that resolves to the return value of the inner function, so we need to wait for the response using the await keyword or the Promise.then method. Given our function body, the variable assignment will receive an array containing all the categories and their stories.
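As a toy illustration of both waiting styles (the callbacks here return a literal instead of walking the DOM, and collectSections is an illustrative name):

```javascript
// Both forms below wait for the value computed inside the page context.
async function collectSections(page) {
  // Using await:
  const viaAwait = await page.evaluate(() => [{ name: 'Technology' }]);

  // Using Promise.then:
  const viaThen = await page
    .evaluate(() => [{ name: 'Technology' }])
    .then((sections) => sections);

  return { viaAwait, viaThen };
}
```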
Checking the final result
The result will be printed in the console when executing the main script.
There will be N array items (as many as there are story sections), each with 4 or 5 stories (depending on the section layout). Here is a little excerpt of the information I got:
```
{
  "title": "It’s Reboot Time for “Operating Systems”",
  "description": "Thinking beyond programming languages"
}
```
As you can see, we got all the information we wanted from the stories, grouped by category and ready to use for whatever we want.
Here you have the complete example of what I gathered today: https://github.com/jesusbotella/medium-stories-scraping/blob/master/trendingStoriesExample.json.