Using Headless Browsers To Surf The Web
Today we’re going to build an app that scrapes two websites, Medium.com and YouTube.com, for anything that matches a certain keyword. Here we’ll be looking for any article or video that contains the words “Headless Browser”. The hope is that the resulting list of resources will help us expand our knowledge of this topic.
In this project we’ll be using the following technologies:
Node will be the environment that’s used to run our code. Express will be used to run the server that serves the template. We will create the HTML content with the Pug template engine. Last but not least is the star of the show, Puppeteer! This is the headless browser tool that will make this whole thing possible.
What Is A Headless Browser, And Why Does It Exist?
It’s important to know a little bit about why headless browsers are so interesting and useful. The concept can be a little hard to wrap your head around (no pun intended!) at first. The simplest way to think about it is as a regular web browser without a GUI (Graphical User Interface). That means you can do everything a normal browser can do, but it must be done in a different way. Normally, this is done from a script or in some command line environment.
Here’s what Puppeteer has to say about it on their website:
Puppeteer is a Node library which provides a high-level API to control Chrome or Chromium over the DevTools Protocol. Puppeteer runs headless by default, but can be configured to run full (non-headless) Chrome or Chromium.
This doesn’t sound very useful at first. I mean, who wants to read the news in a command line terminal? The true power of this technology is revealed when you start thinking about one main concept: automation!
Headless browsers are normally used for a few main things:
- Automated Testing
- Taking Screenshots
- Interacting With Websites
- Scraping Data
That last one sounds pretty darn useful! Let’s start building our demo app so you can really see how creative and fun this tool can be.
Building The App
You can find the full code on my GitHub account here.
Step 1: Project Setup
- Make sure you have Node v7.6.0 or greater installed
- Install the dependencies: npm i express pug puppeteer --save
Step 2: The Express Server
I won’t go into too much detail about how Express works, but I will explain what’s going on here in plain English as well as I can.
The first thing we do at the top of the file is import what we need. That includes our “scraper” functionality, which we put in another module. Then we tell Express that we’ll be using Pug as our template engine. After that, we set up a listener for any GET request that comes through on our base route of /. Lastly, at the bottom, we start the actual server.
The most important part of this file is what happens inside the callback function in the GET request handler. Here we are creating two new promises based on values returned from our scraper utility functions. If you’re not familiar with Promises, check out this article. After those two promises are created, we then pass them into a Promise.all() which will wait until each one has resolved and then execute another callback. Inside that final callback we are taking the data that was returned and using it to send back a page to the browser for our users to view.
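To make that flow concrete, here is a minimal sketch of what such a handler could look like. It is not the code from the repository: the names makeHomeHandler, startServer, scrapeMedium, the 'index' view, and the { articles, videos } data shape are all assumptions made for illustration. The scraper module is passed in as an argument so the handler can be tried with stand-in functions.

```javascript
// Sketch of the GET "/" handler described above.
// Assumed names (not taken from the article's code): makeHomeHandler,
// startServer, scrapeMedium, the 'index' view, and the data shape.
function makeHomeHandler(scraper) {
  return function homeHandler(req, res) {
    // Each scraper function returns a promise. Starting both before
    // waiting lets the two scrapes run at the same time.
    const mediumArticles = Promise.resolve(scraper.scrapeMedium());
    const youtubeVideos = Promise.resolve(scraper.scrapeYoutube());

    return Promise.all([mediumArticles, youtubeVideos])
      .then(([articles, videos]) => {
        // Express renders the 'index' view with these locals.
        res.render('index', { data: { articles, videos } });
      })
      .catch((err) => {
        // If either scrape rejects, Promise.all rejects and we end up here.
        res.status(500).send('Scraping failed: ' + err.message);
      });
  };
}

// Wiring the handler into Express. express is required inside the function
// so the handler above can be exercised without starting a server.
function startServer(scraper, port = 3000) {
  const express = require('express');
  const app = express();
  app.set('view engine', 'pug'); // use Pug for res.render()
  app.get('/', makeHomeHandler(scraper));
  return app.listen(port, () => console.log(`Listening on port ${port}`));
}
```

One thing worth noticing is the catch at the end: Promise.all() rejects as soon as either promise rejects, so without it a failed scrape would leave the request hanging.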
Now we’re ready to take a look at what these “Scraper” functions are actually doing.
Step 3: The Scraper Functions
This is where the magic happens!
Here we have two main functions, one for each site, including scrapeYoutube. These provide the main functionality of this app and are also where Puppeteer comes in.
Both functions essentially do the same thing: open a Puppeteer headless browser instance, go to a specified website, find the elements you want, and grab the data within.
Sounds simple, but it’s actually a very powerful concept!
Puppeteer has a great API for dealing with websites, with methods that can help you do pretty much anything you would want to do, not to mention great documentation to match.
Once a function has retrieved all the data it needs, it closes the headless browser and returns a promise that resolves with the data it found.
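The steps above (launch, navigate, select, extract, close) can be sketched as a single helper. This is only an illustration under a few assumptions: the name scrapeLinks is made up, each result is assumed to be a link element matched by a CSS selector, and the Puppeteer module is passed in as an argument (normally the value of require('puppeteer')).

```javascript
// Sketch of one scraper function. Hypothetical name and signature:
// scrapeLinks(puppeteer, url, selector).
async function scrapeLinks(puppeteer, url, selector) {
  // Puppeteer launches the browser in headless mode by default.
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    // Wait until network activity has mostly settled before reading the DOM.
    await page.goto(url, { waitUntil: 'networkidle2' });
    // $$eval runs the callback inside the page against every element that
    // matches the selector and sends the plain-object result back to Node.
    return await page.$$eval(selector, (anchors) =>
      anchors.map((a) => ({ title: a.textContent.trim(), url: a.href }))
    );
  } finally {
    // Always close the browser, even if navigation or extraction fails.
    await browser.close();
  }
}
```

A call might look like scrapeLinks(require('puppeteer'), searchUrl, 'a.result-title'), where 'a.result-title' is a placeholder. The real selectors for Medium and YouTube have to be found by inspecting the live pages, and they can change whenever those sites update their markup.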
Now it’s time to actually display the data we scraped!
Step 4: Add Template To Display Results
At this point we have all the data we need and we’re ready to show it to the user. This will happen when our Express server returns the compiled Pug template. Let’s see what the Pug template looks like.
If you’re not familiar with Pug or other template engines like EJS or Handlebars, their main goal is to add extra functionality to your plain HTML files. Sometimes they also try to simplify your code. That happens to be the case with Pug. You will immediately notice that there are no angle brackets in this file. Pug saves a lot of extra typing by removing those.
In this file you can see that we’re looping through data.articles and for each article we’re creating a link to that resource. The data object was passed into this template in the callback function in the Express server.
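As a rough idea of what such a template could look like, here is a sketch with the Pug source held in a JavaScript string so it can be rendered directly. Only data.articles comes from the description above; data.videos, the headings, the title and url properties, and the renderResults helper are assumptions. In the real project the same source would live in a .pug file inside the views folder.

```javascript
// Pug source, one array entry per line. Indentation defines the nesting,
// which is why there are no angle brackets or closing tags.
const templateSource = [
  'doctype html',
  'html',
  '  head',
  '    title Headless Browser Resources',
  '  body',
  '    h2 Articles',
  '    ul',
  '      each article in data.articles',
  '        li',
  '          a(href=article.url)= article.title',
  '    h2 Videos',
  '    ul',
  '      each video in data.videos',
  '        li',
  '          a(href=video.url)= video.title',
].join('\n');

// Compiles the template to an HTML string. `pug` is the module returned by
// require('pug'); the second argument to pug.render carries the locals.
function renderResults(pug, data) {
  return pug.render(templateSource, { data });
}
```

Calling renderResults(require('pug'), { articles: [], videos: [] }) would produce a page with two empty lists, which is a quick way to check the template compiles before wiring it to the scrapers.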
At this point, when you start the Node server by running npm start and then visit http://127.0.0.1:3000 in your browser, you will see two lists of links! 😃
Let’s recap what we’ve accomplished here.
First, we created an Express server to handle GET requests to your website and return an HTML file to the user. Then, we covered the basics of Puppeteer and how powerful it can be to run a headless browser with your code. Finally, we built the template that will take all that hard-earned data and display it for the user.
If you’re not already full of ideas about how you can use this technology to do all kinds of cool stuff, let me give you a bonus assignment.
Take the current project and instead of using the hard-coded “headless browser” search criteria, make it use search terms entered by the user.
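If you want a starting point for that assignment, one piece of it is turning whatever the user typed into search URLs for both sites. The sketch below is one possible approach, not part of the original project: the function name buildSearchUrls is made up, and the two search URL formats are assumptions about how Medium and YouTube structure their search pages.

```javascript
// Builds the two search URLs from a user-supplied term.
// Hypothetical helper; the URL formats are assumptions about each site.
function buildSearchUrls(term) {
  // Fall back to the article's original keyword when the input is empty.
  const query = String(term || '').trim() || 'headless browser';

  const medium = new URL('https://medium.com/search');
  medium.searchParams.set('q', query); // searchParams handles the encoding

  const youtube = new URL('https://www.youtube.com/results');
  youtube.searchParams.set('search_query', query);

  return { medium: medium.href, youtube: youtube.href };
}
```

In an Express handler, the term could come from the query string, for example buildSearchUrls(req.query.q) for a request to /?q=web+scraping. Letting URL and searchParams do the encoding avoids broken URLs when the user types spaces or characters like & and +.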
I hope you enjoyed this article and you have some good ideas about how to use Puppeteer in your next project. Feel free to let me know if there are any mistakes or things I missed in this article. And of course, please leave any questions or comments you have and I’ll do my best to respond.