
Web Scraping With Express And Puppeteer

Josh Hicks
Oct 23, 2018 · 5 min read

Using Headless Browsers To Surf The Web

Today we’re going to build an app that scrapes two websites for anything that matches a certain keyword. The two websites will be Medium.com and YouTube.com, and we’ll be looking for any article or video that contains the words “Headless Browser”. The hope is that the resulting list of resources will help us expand our knowledge on this topic.

In this project we’ll be using the following technologies:

Node will be the environment that’s used to run our code. Express will be used to run the server that serves the template. We will create the HTML content with the Pug template engine. Last but not least is the star of the show, Puppeteer! This is the headless browser tool that will make this whole thing possible.

What Is A Headless Browser, And Why Does It Exist?

It’s important to know a little bit about why headless browsers are so interesting and useful. The concept of headless browsers can be a little hard to wrap your head around (no pun intended!) at first. The simplest way of thinking about it is as a regular web browser without a GUI (Graphical User Interface). That means you can do everything a normal browser can do, but it must be done in a different way. Normally, this is done from code or from some command line environment.

Here’s what Puppeteer has to say about it on their website:

Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.

This doesn’t sound very useful at first. I mean, who wants to read the news in a command line terminal? The true power of this technology is revealed when you start thinking about one main concept: automation!

Headless browsers are normally used for a few main things:

  • Automated Testing
  • Taking Screenshots
  • Interacting With Websites
  • Scraping Data

That last one sounds pretty darn useful! Let’s start building our demo app so you can really see how creative and fun this tool can be.
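Before we get to the app itself, here’s a quick taste of the “Taking Screenshots” item from the list above. Treat this as a minimal sketch rather than code from the project: the takeScreenshot name and the injectable launch parameter are my own for illustration, while launch, newPage, goto, screenshot and close are standard Puppeteer calls.

```javascript
// Sketch: open a page in headless Chrome and save a screenshot.
// `takeScreenshot` and the `launch` parameter are illustrative names.
// By default `launch` loads Puppeteer lazily, so nothing starts until you call it.
async function takeScreenshot(
  url,
  path,
  launch = (options) => require('puppeteer').launch(options)
) {
  // Headless is already the default; set this to false to watch the browser work.
  const browser = await launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url);
    await page.screenshot({ path });
    return path;
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}

// Usage (requires `npm i puppeteer`):
// takeScreenshot('https://example.com', 'example.png').then(console.log);
```

No window ever opens, but the PNG on disk shows exactly what a regular browser would have rendered.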


Building The App

You can find the full code on my GitHub account here.

Step 1: Project Setup

  • Make sure you have Node v7.6.0 or greater installed
  • Run npm i express pug puppeteer --save
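The Node version matters because the code in this kind of project leans on async/await, which Node enabled by default starting with 7.6.0. If you want the app to complain early on an older runtime, a check along these lines could go at the top of your entry file. The isAtLeast helper is a hypothetical name, not part of Node.

```javascript
// Sketch: check that the running Node version is at least 7.6.0,
// the first release with async/await enabled by default.
// `isAtLeast` is a hypothetical helper name.
function isAtLeast(version, minimum) {
  const a = version.split('.').map(Number);
  const b = minimum.split('.').map(Number);
  // Compare major, minor and patch numerically, most significant first.
  for (let i = 0; i < 3; i++) {
    if ((a[i] || 0) > (b[i] || 0)) return true;
    if ((a[i] || 0) < (b[i] || 0)) return false;
  }
  return true; // the versions are equal
}

// process.versions.node is a string such as "18.17.0" (no leading "v").
if (!isAtLeast(process.versions.node, '7.6.0')) {
  console.error(`Node ${process.versions.node} is too old; please install 7.6.0 or newer.`);
}
```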

Step 2: The Express Server

./server.js

I won’t go into too much detail about how Express works, but I will explain what’s going on here in plain English as well as I can.

The first thing we do at the top of the file is to import what we need. That includes our “scraper” functionality that we put in another module. Then we’re telling Express that we’ll be using Pug as our template engine. After that, we are setting up a listener for any GET request that comes through on our base route of / . Lastly, at the bottom we’re starting the actual server.

The most important part of this file is what happens inside the callback function in the GET request handler. Here we are creating two new promises based on values returned from our scraper utility functions. If you’re not familiar with Promises, check out this article. After those two promises are created, we then pass them into a Promise.all() which will wait until each one has resolved and then execute another callback. Inside that final callback we are taking the data that was returned and using it to send back a page to the browser for our users to view.
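To make that flow concrete, here is a rough sketch of what such a server file can look like. It is not a copy of the file in the repo: createIndexHandler and startServer are illustrative names, the 'index' view name and the videos key are assumptions, and the sketch uses the promises returned by the scraper functions directly instead of wrapping them in new ones.

```javascript
// Sketch of ./server.js. `createIndexHandler` and `startServer` are illustrative
// names; the `videos` key and the 'index' view name are assumptions.

// Builds the GET / handler. The scraper is passed in so the handler is easy to test.
function createIndexHandler(scraper) {
  return (req, res) => {
    // Both scrapes start immediately and run in parallel.
    const mediumArticles = scraper.scrapeMedium();
    const youtubeVideos = scraper.scrapeYoutube();

    // Promise.all resolves once both are done and keeps the input order.
    return Promise.all([mediumArticles, youtubeVideos])
      .then(([articles, videos]) => {
        res.render('index', { data: { articles, videos } });
      })
      .catch((err) => {
        console.error(err);
        res.status(500).send('Scraping failed');
      });
  };
}

// Wires everything into Express. Requires `npm i express pug` and ./utils/scraper.js.
function startServer(port = 3000) {
  const express = require('express');
  const scraper = require('./utils/scraper');
  const app = express();

  app.set('view engine', 'pug'); // render templates from ./views with Pug
  app.get('/', createIndexHandler(scraper));

  return app.listen(port, () => console.log(`Listening on port ${port}`));
}

// Uncomment to run the server:
// startServer();
```

If either scrape rejects, Promise.all rejects too, which is why the sketch adds a catch that sends a 500 instead of leaving the request hanging.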

Now we’re ready to take a look at what these “Scraper” functions are actually doing.

Step 3: The Scraper Functions

./utils/scraper.js

This is where the magic happens!

Here we have two main functions, scrapeMedium and scrapeYoutube. These will be the main functionality of this app and also the part where Puppeteer comes in.

Both functions essentially do the same thing: open a Puppeteer headless browser instance, go to a specified website, find the elements you want, and grab the data within.

Sounds simple, but it’s actually a very powerful concept!

The trick to this part is to know exactly what you’re looking for ahead of time. In this case I had to go to each target website, visually look at the information on the page, inspect the elements and then make note of the target class/element names I wanted. Then using basic JavaScript document queries, I am able to grab all the data I need.
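As a rough illustration of those document queries, the sketch below grabs the text and link from every element that matches a selector. The collectLinks name and the selector in the usage comment are made up; $$eval is a real Puppeteer Page method that runs document.querySelectorAll in the page for you.

```javascript
// Sketch: pull the text and href out of every element matching a selector.
// `collectLinks` is an illustrative name; `page` is a Puppeteer Page.
async function collectLinks(page, selector) {
  // $$eval runs document.querySelectorAll(selector) inside the page and hands
  // the matched elements to the callback, which also runs inside the page.
  return page.$$eval(selector, (elements) =>
    elements.map((el) => ({
      title: el.textContent.trim(),
      url: el.href,
    }))
  );
}

// Usage, with a made-up selector (inspect the real page to find yours):
// const articles = await collectLinks(page, 'a.example-article-link');
```

Class names on sites like these change often, so expect to revisit your selectors whenever the results suddenly come back empty.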

Puppeteer has a great API for dealing with websites, with methods that can help you do pretty much anything you would want to do, and great documentation to match.

Once each function has retrieved all the data it needs, it closes the headless browser and returns the data it found as a promise.
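Put together, one of these functions might look something like the sketch below. It follows the open, navigate, query, close sequence described above, but the search URL, the selector and the launch parameter are assumptions for illustration, not the values used in the repo.

```javascript
// Sketch of one scraper function for ./utils/scraper.js.
// The search URL, the selector and the `launch` parameter are assumptions.
async function scrapeMedium(
  launch = () => require('puppeteer').launch()
) {
  const browser = await launch();
  try {
    const page = await browser.newPage();
    await page.goto('https://medium.com/search?q=headless%20browser');

    // This callback is serialized and run inside the page, so it can only
    // use what is passed to it as arguments.
    const articles = await page.evaluate((selector) => {
      return Array.from(document.querySelectorAll(selector)).map((el) => ({
        title: el.textContent.trim(),
        url: el.href,
      }));
    }, 'a.example-article-link'); // hypothetical selector

    return articles;
  } finally {
    // Runs on success and on failure, so no Chrome process is left behind.
    await browser.close();
  }
}
```

Because it is an async function, callers get a promise back, which is what lets the server wait for both scrapers with Promise.all().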

Now it’s time to actually display the data we scraped!

Step 4: Add Template To Display Results

At this point we have all the data we need and we’re ready to show it to the user. This will happen when our Express server returns the compiled Pug template. Let’s see what the Pug template looks like.

If you’re not familiar with Pug or other template engines like EJS or Handlebars, the main goal is to add extra functionality to your plain HTML files. Sometimes they also try to simplify your code. That happens to be the case with Pug. You will immediately notice that there are no angle brackets in this file. It saves a lot of extra typing by removing those.

In this file you can see that we’re looping through data.articles and for each article we’re creating a link to that resource. The data object was passed in to this file in the callback function in the Express server.
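For a feel of that loop, here is a small sketch that keeps a few lines of Pug in a string and renders them with pug.render. In the real app Express renders a template file for you, so this is only a way to experiment; the title and url fields on each article and the renderResults helper are assumptions.

```javascript
// Sketch: the kind of Pug markup described above, rendered from a string.
// The `title` and `url` fields on each article are assumptions.
const resultsTemplate = [
  'h2 Medium Articles',
  'ul',
  '  each article in data.articles',
  '    li',
  '      a(href=article.url)= article.title',
].join('\n');

// `renderResults` is an illustrative helper. Pug is loaded lazily
// (requires `npm i pug`) unless a compatible object is passed in.
function renderResults(data, pug = require('pug')) {
  // The second argument doubles as the template's local variables.
  return pug.render(resultsTemplate, { data });
}

// Usage:
// renderResults({ articles: [{ title: 'Hello', url: 'https://example.com' }] });
```

Indentation is what nests the li inside the ul here, which is how Pug gets away without closing tags.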

At this point, start the Node server by running npm start, then visit localhost:3000 or http://127.0.0.1:3000 in your browser. You will see two lists of links! 😃

Summary

Let’s recap what we’ve accomplished here.

First, we created an Express server to handle GET requests to your website and return an HTML file to the user. Then, we covered the basics of Puppeteer and how powerful it can be to run a headless browser with your code. Finally, we built the template that will take all that hard-earned data and display it for the user.

WOW! 😱

If you’re not already full of ideas about how you can use this technology to do all kinds of cool stuff, let me give you a bonus assignment.

Bonus Work

Take the current project and instead of using the hard-coded “headless browser” search criteria, make it use search terms entered by the user.
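If you want a nudge to get started, one possible first step is turning whatever the user typed into safe search URLs. This is only a sketch: buildSearchUrls is a hypothetical helper, and the two URL patterns are assumptions about how each site’s search page is addressed.

```javascript
// Sketch: build search URLs from a user-supplied term.
// `buildSearchUrls` is a hypothetical helper; the URL patterns are assumptions.
function buildSearchUrls(term) {
  // encodeURIComponent escapes spaces, "&", "+" and other special characters.
  const query = encodeURIComponent(term.trim());
  return {
    medium: `https://medium.com/search?q=${query}`,
    youtube: `https://www.youtube.com/results?search_query=${query}`,
  };
}

// In an Express handler the term could come from the query string,
// e.g. GET /?q=web+scraping gives req.query.q === 'web scraping':
// const urls = buildSearchUrls(req.query.q || 'headless browser');
```

From there, the scraper functions would need to accept a URL (or the term itself) as a parameter instead of relying on a hard-coded one.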


I hope you enjoyed this article and you have some good ideas about how to use Puppeteer in your next project. Feel free to let me know if there are any mistakes or things I missed in this article. And of course, please leave any questions or comments you have and I’ll do my best to respond.

Written by Josh Hicks

Software engineer, writer, traveler, weight lifter. Find more from me at www.hirejoshhicks.com
