Web Scraping With Express And Puppeteer
Josh Hicks

Here’s a way to fix the scraper code and improve its quality.

It also shows why XPath selectors would have made the code simpler than CSS selectors.

Medium scraper code is not valid:

The scraper code for the Medium part is broken because the CSS selectors it uses are not precise enough: they do not target the intended HTML nodes. As a result, the extracted Medium article titles do not match their associated article links.

This is easy to verify in the console of your preferred web browser, by pasting and running the corresponding JavaScript code:

Figure: extracted Medium article titles don’t match the Medium article links

The trick here is to query the DOM with a selector that matches only the nodes we want, and then filter the resulting nodes with a node.querySelector('.graf--title') criterion (meaning we keep only the “a” nodes that have a descendant node with the 'graf--title' class).

// Select the first-child links that carry the article URL,
// keep only those containing a title node, then build one object per article.
Array.from(document.querySelectorAll('div.postArticle-content a:first-child[data-action-value]'))
  .filter(node => node.querySelector('.graf--title'))
  .map(link => ({
    title: link.querySelector('.graf--title').textContent,
    link: link.getAttribute('data-action-value')
  }));

We would not have had this issue if, instead of using CSS selectors, we had used XPath selectors, such as:

//div[@class='postArticle-content']/a[1][@data-action-value][.//*[contains(@class, 'graf--title')]]
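To try the XPath expression in the browser console, something like the following sketch could be used. It relies on the standard document.evaluate() DOM API; the helper names selectAll and extractArticlesByXPath are hypothetical and not part of the original article's code.

```javascript
// Numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE in the DOM standard.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// The XPath expression from the text above.
const ARTICLE_LINK_XPATH =
  "//div[@class='postArticle-content']" +
  "/a[1][@data-action-value][.//*[contains(@class, 'graf--title')]]";

// Hypothetical helper: returns the nodes matched by an XPath as a plain array.
function selectAll(xpath, doc = document) {
  const result = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  return Array.from({ length: result.snapshotLength }, (_, i) => result.snapshotItem(i));
}

// Hypothetical helper: builds one { title, link } object per matched link node.
function extractArticlesByXPath(doc = document) {
  return selectAll(ARTICLE_LINK_XPATH, doc).map((link) => ({
    title: link.querySelector('.graf--title').textContent,
    link: link.getAttribute('data-action-value'),
  }));
}
```

In the console, calling extractArticlesByXPath() with no argument runs the query against the current page. No separate filter step is needed, because the XPath predicate already requires a descendant with the title class.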

We now get the desired results:

Figure: extracted Medium article titles matching the right article links

Improve scraper code:

Declarative programming is when you say what you want, and imperative programming is when you say how to get what you want.

The code of the scraper part (the data extraction step) is mostly written in the imperative style. Instead of stating what we want to achieve, it only spells out how to do it.

It should be the other way around: state what you want to achieve, rather than how to do it.

Instead of using complicated logic with a mix of .forEach() and .map() calls, a single chain of Array.from(), .filter() and .map() that returns the result directly is enough.
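As a minimal sketch of that declarative style, the extraction step could be written as one function built from the selectors shown earlier. The function name extractArticles and its optional doc parameter are assumptions made for illustration; the original scraper's structure may differ.

```javascript
// Hypothetical helper. Everything it needs lives inside the function body,
// so Puppeteer can serialize it and run it in the page context.
function extractArticles(doc = document) {
  const linkSelector = 'div.postArticle-content a:first-child[data-action-value]';
  const titleSelector = '.graf--title';

  return Array.from(doc.querySelectorAll(linkSelector))
    // What we want: only links that actually contain a title node...
    .filter((link) => link.querySelector(titleSelector) !== null)
    // ...each turned into a { title, link } object.
    .map((link) => ({
      title: link.querySelector(titleSelector).textContent,
      link: link.getAttribute('data-action-value'),
    }));
}

// Possible use with Puppeteer (page is an already opened Page):
//   const articles = await page.evaluate(extractArticles);
```

When passed to page.evaluate(), the function is called without arguments inside the page, so doc falls back to the page's own document. The parameter mainly makes the function easy to exercise outside a browser.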

Also, the headless browser only needs to be launched once: there is no need to create a new instance every time our Express server receives a request.

This way, we avoid the browser launch overhead on each request and always reuse the same instance. Opening a new web page (browser tab) takes less time than launching a new browser.
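One way to share a single browser instance is sketched below. The helpers createBrowserProvider and withPage are hypothetical names, and the route path and URL constant in the wiring comment are placeholders; only puppeteer.launch(), browser.newPage() and page.close() are assumed from Puppeteer's API.

```javascript
// Hypothetical helper: wraps any async "launch" function so it runs at most once.
function createBrowserProvider(launch) {
  let browserPromise = null;
  return function getBrowser() {
    if (browserPromise === null) {
      // Cache the promise (not the resolved value), so requests arriving
      // while the browser is still starting share the same launch.
      browserPromise = launch();
    }
    return browserPromise;
  };
}

// Hypothetical helper: runs a task in a fresh tab of the shared browser.
async function withPage(getBrowser, task) {
  const browser = await getBrowser();
  const page = await browser.newPage();
  try {
    return await task(page);
  } finally {
    await page.close(); // close the tab, keep the browser running
  }
}

// Possible wiring in the Express app (needs the express and puppeteer packages):
//   const puppeteer = require('puppeteer');
//   const getBrowser = createBrowserProvider(() => puppeteer.launch());
//   app.get('/medium', async (req, res) => {
//     const data = await withPage(getBrowser, async (page) => {
//       await page.goto(MEDIUM_URL);
//       return page.evaluate(/* extraction function */);
//     });
//     res.json(data);
//   });
```

Note that in this sketch a failed launch stays cached as a rejected promise, so a real server would probably want to reset it and retry.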

Hope this helps :)