A Brief Introduction to Web Scraping with Puppeteer

Juan de Dios · Yellowme · Jul 28, 2020

At my current job we work with a lot of information from the internet (for analytics purposes), and we often need to retrieve specific data from a variety of websites.

One of my tasks is to retrieve this data and transform it into a conventional format. When I started working on it I thought: that’s simple, I just need to find a resource like a web service or a file, make an HTTP request, and voilà!

I had always worked by consuming HTTP endpoints or downloading files; it’s the most common way to get data nowadays. But I found some websites that don’t offer either of these options.

So, I did some research into how to read the raw data from an HTML page and extract the most important information. Finally, I found a nice Node.js library for the job: Puppeteer.

What is Puppeteer?

Puppeteer is an open-source Node.js library that provides a high-level API to control Chrome or Chromium over the DevTools Protocol.

It’s perfect for these cases:

  • Generate PDF files and take screenshots.
  • Read HTML elements from URLs.
  • Automate form submission and simulate keyboard events, DOM events, etc.
  • Write E2E tests.

Requirements

  • Node.js
  • Yarn (optional).

Note: prior to v1.18.1, Puppeteer required at least Node v6.4.0. Versions v1.18.1 through v2.1.0 rely on Node 8.9.0+. Starting with v3.0.0, Puppeteer relies on Node 10.18.1+. All examples below use async/await, which is only supported in Node v7.6.0 or greater.

Getting started

First of all, we need to initialize a Node.js project. Run this command anywhere in your file system from your terminal:

> yarn init

Next, we need to install the puppeteer dependency by typing this command:

> yarn add puppeteer

The last step is to add a new file called index.js:

> touch index.js

Excellent! Now we are ready to work!

Let’s get started

In this example, I’m going to work on one of my favorite manga readers of all time: MangaPlus. This page is a bit challenging because all of the information loads asynchronously and some images are lazy-loaded, so we need to figure out how to make it work and find a way to retrieve the data.

Official website of Shonen Jump (MangaPlus).

Let’s write the initial code!

Do you remember when I mentioned the Chrome DevTools? The Puppeteer API allows us to open a Chrome browser and navigate to the page we’ll work on. It’s pretty easy:
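Here’s a minimal sketch of that initial script (I’m assuming the MangaPlus updates URL; swap in whichever page you want to work with):

const puppeteer = require('puppeteer');

(async () => {
  // headless: false opens a visible Chrome window so we can watch it work
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();

  // Navigate to the page we want to scrape
  await page.goto('https://mangaplus.shueisha.co.jp/updates');

  // ...scraping code goes here...

  await browser.close();
})();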

Run it, and a Chrome window opens with the page loaded.

Customize the viewport size

The current result looks a little bit weird. The web browser and page look smaller than a native Google Chrome window. Puppeteer uses a predefined viewport and browser size by default. If you want to change these values (because the default behavior affects some functionality), you will need to pass more options to the launch method, like these:
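Here’s a sketch of those launch options; defaultViewport: null lets the page fill the window, and the window-size value is just an example you can tune:

const browser = await puppeteer.launch({
  headless: false,
  defaultViewport: null,            // let the page use the real window size
  args: ['--window-size=1366,768'], // Chrome flag to set the window size
});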

Now it looks more awesome!

List the popular mangas

Our first goal is to retrieve the list of mangas (on the right side of the page). Puppeteer runs vanilla JavaScript behind the scenes. The following method allows us to run code as if we were working in the Chrome DevTools console:

await page.evaluate()

So, we must find the container selector of the list of mangas, extract the information, and return the result, like this:
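Here’s a sketch of that extraction. The .ranking-item and .title selectors are assumptions for illustration; the real class names on MangaPlus will differ:

const mangas = await page.evaluate(() => {
  // Grab every item in the popular-mangas list (assumed selector)
  const items = document.querySelectorAll('.ranking-item');
  return Array.from(items).map((item) => ({
    title: item.querySelector('.title')?.textContent.trim() ?? '',
    image: item.querySelector('img')?.src ?? '',
    url: item.querySelector('a')?.href ?? '',
  }));
});
console.log(mangas);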

You can paste the inner code of the evaluate callback into the Chrome DevTools console and it should work! But wait… our script prints a different result than the browser does. What happened?

It’s simple: the page loads its data asynchronously. In other words, when the page has finished its initial load, it’s very possible that the information is still being fetched. To fix this, Puppeteer has another nice method that allows us to “wait for something”; when the element or condition appears, the next code runs:

await page.waitForFunction()

The next code means: once this element appears on the HTML page, please run the next lines:
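A sketch of that wait, reusing the assumed .ranking-item selector from above:

await page.waitForFunction(
  () => document.querySelectorAll('.ranking-item').length > 0
);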

Run the code again and the script’s output now matches what the browser shows.

Fix the image source

The result still has a flaw: every object has the same image URL. If we look at the image behavior on the page, we will notice that each element of the list doesn’t load its image on the first load; instead, it shows a loading spinner and then swaps in the correct image. If we scroll down the page, we see the same behavior.

We could write some recursive method using the waitForFunction helper, or we can simulate scrolling down and wait a few seconds until we reach the bottom of the page:
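Here’s a sketch of the scroll-down approach, assuming the list lives in the main document scroll; the step size and delays are assumptions you can tune:

// Scroll in small steps so the lazy-loaded images get triggered
await page.evaluate(async () => {
  await new Promise((resolve) => {
    let totalHeight = 0;
    const distance = 200;
    const timer = setInterval(() => {
      window.scrollBy(0, distance);
      totalHeight += distance;
      if (totalHeight >= document.body.scrollHeight) {
        clearInterval(timer);
        resolve();
      }
    }, 100);
  });
});
// Give the last images a moment to finish loading
await new Promise((resolve) => setTimeout(resolve, 2000));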

Redirect to other pages

Great! Our goal is complete! If you wish, you can follow the URL of each manga and get more information about it. To avoid missing-data issues, use this code:

await Promise.all([
  page.waitForNavigation(),
  page.goto(manga.url)
]);

The above code says: “please navigate to the manga’s URL and wait for it to finish loading.” It’s very important to use these lines every time you need to go to another page.
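For example, here’s a sketch of visiting every manga from the list we scraped earlier; the .description selector is an assumption:

for (const manga of mangas) {
  await Promise.all([
    page.waitForNavigation(),
    page.goto(manga.url)
  ]);
  // Pull extra details from the manga's own page (assumed selector)
  manga.description = await page.evaluate(
    () => document.querySelector('.description')?.textContent.trim() ?? ''
  );
}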

Now the information looks more complete!

Conclusion

In this post we learned:

  1. Puppeteer is a robust, easy, and beautiful Node.js tool for reading HTML pages and returning specific information using CSS selectors, or for simulating page behavior.
  2. We can work with synchronous or asynchronous information and wait for certain elements.
  3. How to redirect from the current page to another, and how to correctly wait for its data to finish loading.

Thank you for reading!

Resources:

https://github.com/Kami-Juan/mangaplus-scrapper
