A Step-by-Step Guide to Scraping Data with Puppeteer and Node.js

Arif Rahman
Published in Zetta Tech
May 8, 2023 · 5 min read

Web scraping is the process of extracting data from websites. In some cases, it may be necessary to automate this process and extract data from multiple pages or websites. This is where Puppeteer comes in.

Puppeteer is a Node.js library that provides a high-level API to control headless Chrome or Chromium over the DevTools protocol. With Puppeteer, we can automate tasks such as clicking buttons, filling out forms, and scraping data from websites.

In this article, we will use Puppeteer and Node.js to scrape data from a website. We will walk through the code provided and explain how it works, step by step.

Prerequisites

Before we get started, you will need to have Node.js and Puppeteer installed on your machine. To install Node.js, visit the Node.js website and download the appropriate version for your operating system. To install Puppeteer, run the following command in your terminal:

npm install puppeteer

Once you have Puppeteer installed, you can start using it to automate tasks and scrape data from websites.

Getting Started

Let’s start by looking at the code provided. The code imports two Node.js modules: puppeteer and fs. The puppeteer module provides the API for controlling headless Chrome, while the fs module provides the API for interacting with the file system.

const puppeteer = require('puppeteer');
const fs = require('fs');

Next, the code defines an anonymous async function that will contain all of our scraping logic and invokes it immediately, a pattern known as an async IIFE (Immediately Invoked Function Expression).

(async () => {
  // Code goes here
})();
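One refinement worth knowing about, and an assumption on my part since the article's code does not include it, is wrapping the body in try/finally so cleanup always runs even when a step throws. A minimal sketch:

```javascript
// Sketch: an async IIFE whose finally block always runs, mirroring
// where `await browser.close()` would go in a real Puppeteer script.
async function runScrape(work) {
  let cleanedUp = false;
  try {
    await work();
  } finally {
    // In the real script this would be: await browser.close();
    cleanedUp = true;
  }
  return cleanedUp;
}

(async () => {
  const done = await runScrape(async () => {
    // scraping logic goes here
  });
  console.log('cleanup ran:', done); // → cleanup ran: true
})();
```

Without this, an error partway through scraping leaves the browser process open.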

Inside the async function, we create a new browser instance using puppeteer.launch(). This method returns a Promise that resolves to a Browser object, which we store in the browser variable.

const browser = await puppeteer.launch({
  headless: false,
  args: [
    '--start-maximized',
    '--disable-blink-features=AutomationControlled',
    '--disable-web-security',
    '--allow-running-insecure-content'
  ],
  defaultViewport: null
});

The launch() method takes an options object as its argument. In this case, we set headless to false so that a visible browser window opens. We also pass some additional Chromium arguments: --start-maximized maximizes the window, and --disable-blink-features=AutomationControlled stops Chrome from advertising itself as automated (for example via the navigator.webdriver flag), which some sites use to detect and block bots.

Next, we create a new page instance using browser.newPage(). This method returns a Promise that resolves to a Page object, which we store in the page variable.

const page = await browser.newPage();

We set the user agent of the page using page.setUserAgent(). The user agent is the string that the browser sends to websites to identify itself. By setting a specific user agent, we can make the browser appear as a different browser or device to the website.

await page.setUserAgent(
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
);

Now that we have a page instance, we can navigate to a website using page.goto(). In this case, we are navigating to the license lookup page for opticians in South Carolina.

await page.goto('https://verify.llron.gov/LicLookup/Optician/Optician.aspx');

Once the page has loaded, we can interact with it using Puppeteer’s API. In this case, we want to fill out the form on the page with some search criteria and click the search button to retrieve a list of matching opticians.

We start by finding the relevant form elements on the page using page.$(). This method takes a CSS selector as its argument and returns a Promise that resolves to the first element that matches the selector. We store the form elements in variables for later use.

const lastNameInput = await page.$('#txtLastName'); 
const firstNameInput = await page.$('#txtFirstName');
const cityInput = await page.$('#txtCity');
const licenseNumberInput = await page.$('#txtLicNumber');
const searchButton = await page.$('#btnSearch');

Next, we fill out the form fields using each element handle’s type() method, which takes the text to type as its argument.

await lastNameInput.type('Doe'); 
await firstNameInput.type('John');
await cityInput.type('Columbia');
await licenseNumberInput.type('4');

Finally, we click the search button. Clicking it triggers a navigation to the results page, and calling page.waitForNavigation() only after the click can miss a fast navigation. The idiomatic pattern is to start waiting for the navigation and perform the click together, then await both Promises:

await Promise.all([
  page.waitForNavigation(),
  searchButton.click()
]);

Now that the search results have loaded, we can scrape the data we are interested in. In this case, we want to scrape the name, license number, and status of each optician in the search results.

We start by finding the table element that contains the search results using page.$(). We then find all the rows in the table using the element handle’s $$('tr') method, which returns a Promise that resolves to an array of ElementHandle objects, one for each row in the table.

const table = await page.$('#ResultsGrid'); 
const rows = await table.$$('tr');

We can then loop through the rows, grab each row’s cells with row.$$('td'), and use $eval() to read the text we need from specific elements within each cell.

for (const row of rows) {
  const columns = await row.$$('td');
  // Skip the header row, which contains <th> cells rather than <td>
  if (columns.length < 3) continue;
  const name = await columns[0].$eval('a', el => el.textContent.trim());
  const licenseNumber = await columns[1].$eval('span', el => el.textContent.trim());
  const status = await columns[2].$eval('span', el => el.textContent.trim());
  console.log({ name, licenseNumber, status });
}

In this code, we use the $eval() method to evaluate a function on the element matched by the CSS selector. The function takes the matched element as its argument and returns the text content of the element, trimmed of any whitespace.
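Because the callback passed to $eval() is ordinary JavaScript, it can be tested in isolation. A minimal sketch, using a plain object as a stand-in for the matched DOM element (the stand-in is an assumption for illustration, not a Puppeteer API):

```javascript
// The same extraction function used with $eval() above.
const extractText = el => el.textContent.trim();

// A stand-in for a matched element, e.g. the <a> inside the name cell.
const fakeAnchor = { textContent: '  Doe, John  ' };

console.log(extractText(fakeAnchor)); // → Doe, John
```

Keeping the callback this small also matters because $eval() serializes it and runs it inside the page, where variables from your Node.js scope are not available.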

Finally, we close the browser using browser.close().

await browser.close();

Putting it All Together

Here is the complete code for scraping data from the South Carolina license lookup site. Note that this version targets the optometrist search form, which uses dropdowns for license type and status rather than the text inputs shown in the walkthrough above:

const puppeteer = require('puppeteer');
const fs = require('fs');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    args: [
      '--start-maximized',
      '--disable-blink-features=AutomationControlled',
      '--disable-web-security',
      '--allow-running-insecure-content'
    ],
    defaultViewport: null
  });

  const page = await browser.newPage();

  await page.setUserAgent(
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36'
  );

  await page.goto('https://verify.llronline.com/LicLookup/optometry/LicLookup.aspx?div=24');

  await page.select('#cphMainContentArea_ddlLicType', 'OD');
  await page.select('#cphMainContentArea_ddlLicStatus', 'Active');

  // Start waiting for the navigation before clicking to avoid a race
  await Promise.all([
    page.waitForNavigation(),
    page.click('#cphMainContentArea_btnSearch')
  ]);

  const data = await page.evaluate(() => {
    const rows = Array.from(document.querySelectorAll('#cphMainContentArea_tblLicenseList tr'));
    // Skip the header row, then build one record per result row
    return rows.slice(1).map(row => {
      const columns = row.querySelectorAll('td');
      return {
        licenseNumber: columns[0].textContent.trim(),
        lastName: columns[1].textContent.trim(),
        firstName: columns[2].textContent.trim(),
        status: columns[3].textContent.trim(),
        expirationDate: columns[4].textContent.trim()
      };
    });
  });

  console.log(data);

  fs.writeFile('output.json', JSON.stringify(data, null, 2), err => {
    if (err) throw err;
    console.log('Data written to file');
  });

  await browser.close();
})();

Let’s break down this code step by step:

  1. We import the puppeteer and fs modules.
  2. We create an anonymous async function using an async IIFE.
  3. We launch a new browser instance using puppeteer.launch() and store it in the browser variable.
  4. We create a new page instance using browser.newPage() and store it in the page variable.
  5. We set the user agent of the page using page.setUserAgent().
  6. We navigate to the South Carolina optometrist license lookup page using page.goto().
  7. We select the OD license type and Active status using page.select().
  8. We click the search button with page.click() and wait for the results page to load, awaiting it together with page.waitForNavigation() so the navigation is not missed.
  9. We extract the data from the search results table using page.evaluate().
  10. We log the data to the console and write it to a file using fs.writeFile().
  11. We close the browser using browser.close().

Conclusion

In this article, we have demonstrated how to use Puppeteer and Node.js to scrape data from a website. We walked through the code step by step, explaining how each part works. With Puppeteer, you can automate tasks and extract data from websites with ease. However, it’s important to use web scraping responsibly and ethically, and to always respect the website’s terms of service and robots.txt file.
