Automated Web Scraping Solution with Puppeteer and Express.js for BetterHelp Counselor Data

Arif Rahman
Zetta Tech
May 28, 2023

If you’re looking for a solution using Puppeteer and Express.js to scrape counselor data from BetterHelp’s counselor directory, you’ve come to the right place. In this article, I’ll guide you through creating a web scraping application that allows you to run the script yourself and obtain the results within a few hours. We’ll leverage Puppeteer for web scraping and Express.js for creating a simple server to trigger the scraping process. Let’s get started!

Prerequisites

Before we begin, ensure you have Node.js and npm (Node Package Manager) installed on your machine. You can download them from the official Node.js website: https://nodejs.org

Step 1: Set up the project

  1. Create a new project directory for your web scraping application.
  2. Open a terminal or command prompt and navigate to the project directory.
  3. Initialize a new Node.js project by running the following command:
npm init -y

This will create a package.json file in your project directory.

Step 2: Install the dependencies

We need to install Puppeteer and Express.js as dependencies for our project. Run the following command to install them:

npm install puppeteer express

This command installs Puppeteer and Express.js, along with their dependencies, in your project directory. Note that Puppeteer also downloads a compatible build of Chromium during installation, so this step can take a little while.

Step 3: Create the Express.js server

  1. Create a new file called server.js in your project directory.
  2. Open server.js in a text editor and add the following code:
const express = require('express');
const app = express();
const puppeteer = require('puppeteer');

const PORT = 3000; // Choose the desired port number

app.get('/', async (req, res) => {
  try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Your scraping logic with Puppeteer goes here

    await browser.close();

    res.send('Scraping completed successfully!');
  } catch (error) {
    console.error('Scraping failed:', error);
    res.status(500).send('Scraping failed.');
  }
});

app.listen(PORT, () => {
  console.log(`Server is running on http://localhost:${PORT}`);
});

Step 4: Implement the scraping logic with Puppeteer

Inside the route handler in server.js, we'll use Puppeteer to navigate to the JSON feed URL and visit each counselor's page to extract the required information. Update the code inside the try block as follows:

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.betterhelp.com/api/counselor_directory?show_all=1');

const data = await page.evaluate(() => {
  // Your scraping logic here:
  // extract the required information from the page, collect it in a
  // JavaScript object or array, and return it.
  const scrapedData = [];

  return scrapedData;
});

await browser.close();
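Note that the feed URL above returns JSON rather than an HTML page, so instead of evaluating DOM selectors against it you can read the response body directly and then visit each counselor's page in turn. The sketch below is a minimal example; it assumes the feed returns an array of counselor entries that each expose a slug field, so inspect the real response and adjust the field names accordingly.

// A minimal sketch, assuming the feed returns a JSON array of counselor
// entries that each have a `slug` field (adjust to the real response shape).
const response = await page.goto('https://www.betterhelp.com/api/counselor_directory?show_all=1');
const counselors = await response.json(); // parse the raw JSON body

for (const counselor of counselors) {
  const counselorURL = `https://www.betterhelp.com/${counselor.slug}`; // assumed field name
  await page.goto(counselorURL);

  // Extract the fields you need from the counselor's page here (see Steps 5 and 7).
}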

Step 5: Extract the required information

Customize the logic inside the page.evaluate() callback to extract the desired information. The callback runs in the browser context, so use standard DOM APIs such as document.querySelector() and document.querySelectorAll() to select elements with CSS selectors. Store the extracted information in a JavaScript object or array, scrapedData, and return it.
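Because the class names on BetterHelp's pages can change, it also helps to guard against selectors that don't match anything. Here is a minimal sketch with a small helper; the selectors are hypothetical placeholders and should be replaced with the ones you find in the page's markup.

const data = await page.evaluate(() => {
  // Helper: return an element's text, or an empty string if the selector
  // doesn't match, so one missing field doesn't crash the whole scrape.
  const text = (selector) =>
    document.querySelector(selector)?.innerText.trim() ?? '';

  const scrapedData = {
    counselorName: text('.counselor-card__name'),            // placeholder selector
    specialization: text('.counselor-card__specialization'), // placeholder selector
    location: text('.counselor-card__location'),             // placeholder selector
  };

  return scrapedData;
});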

Step 6: Run the server

Save the changes to server.js. Open a terminal or command prompt, navigate to your project directory, and run the following command:

node server.js

The server will start running on http://localhost:3000 (or the specified port if you changed it). You can access the scraping functionality by visiting the root URL (http://localhost:3000/).
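For example, you can trigger the scrape from a second terminal with curl:

curl http://localhost:3000/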

Step 7: Detailed Specification and CSV Generation

To generate CSV files that match your specification, update the scraping logic inside the page.evaluate() function to extract the required fields and store them in a JavaScript object or array.

For example, to extract information from the top part of the page, you can update the scraping logic as follows:

const data = await page.evaluate(() => {
  const counselors = Array.from(document.querySelectorAll('.counselor-card'));

  const topInfo = counselors.map((counselor) => {
    const counselorName = counselor.querySelector('.counselor-card__name').innerText;
    const specialization = counselor.querySelector('.counselor-card__specialization').innerText;
    const location = counselor.querySelector('.counselor-card__location').innerText;
    const experience = counselor.querySelector('.counselor-card__experience').innerText;
    const bio = counselor.querySelector('.counselor-card__bio').innerText;

    return { counselorName, specialization, location, experience, bio };
  });

  return topInfo;
});

To extract information from the reviews part of the page, you can update the scraping logic as follows:

const data = await page.evaluate(() => {
  const reviews = Array.from(document.querySelectorAll('.counselor-review'));

  const reviewsInfo = reviews.map((review) => {
    const reviewerName = review.querySelector('.counselor-review__name').innerText;
    const rating = review.querySelector('.counselor-review__rating').innerText;
    const content = review.querySelector('.counselor-review__content').innerText;

    return { reviewerName, rating, content };
  });

  return reviewsInfo;
});
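If you visit each counselor's page in a loop (as sketched in Step 4), you can run both extractions on every page and accumulate the rows into two arrays, one per CSV file. The sketch below assumes the same hypothetical selectors as above and the counselors list taken from the JSON feed.

const topInfoRows = [];
const reviewRows = [];

for (const counselor of counselors) {
  await page.goto(`https://www.betterhelp.com/${counselor.slug}`); // slug field is an assumption

  // Top-of-page fields for this counselor (placeholder selectors).
  const topInfo = await page.evaluate(() => ({
    counselorName: document.querySelector('.counselor-card__name')?.innerText ?? '',
    specialization: document.querySelector('.counselor-card__specialization')?.innerText ?? '',
    location: document.querySelector('.counselor-card__location')?.innerText ?? '',
    experience: document.querySelector('.counselor-card__experience')?.innerText ?? '',
    bio: document.querySelector('.counselor-card__bio')?.innerText ?? '',
  }));
  topInfoRows.push(topInfo);

  // All reviews on this counselor's page (placeholder selectors).
  const reviews = await page.evaluate(() =>
    Array.from(document.querySelectorAll('.counselor-review')).map((review) => ({
      reviewerName: review.querySelector('.counselor-review__name')?.innerText ?? '',
      rating: review.querySelector('.counselor-review__rating')?.innerText ?? '',
      content: review.querySelector('.counselor-review__content')?.innerText ?? '',
    }))
  );
  reviewRows.push(...reviews);
}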

Once you have the extracted information in the data variable, you can generate the CSV files using a library like csv-writer. Install it by running the following command:

npm install csv-writer

Then, update the route handler in server.js to generate and save the CSV files:

const createCsvWriter = require('csv-writer').createObjectCsvWriter;

// ...

app.get('/', async (req, res) => {
  try {
    // ...

    const csvWriter = createCsvWriter({
      path: 'top_info.csv',
      header: [
        { id: 'counselorName', title: 'Counselor Name' },
        { id: 'specialization', title: 'Specialization' },
        { id: 'location', title: 'Location' },
        { id: 'experience', title: 'Experience' },
        { id: 'bio', title: 'Bio' },
      ],
    });

    await csvWriter.writeRecords(data);

    // Generate and save the second CSV file for reviews information

    res.send('Scraping completed successfully!');
  } catch (error) {
    // ...
  }
});
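The second CSV file for the reviews follows the same pattern. Here is a minimal sketch, assuming the reviews were collected into a separate array (called reviewRows here) and that reviews_info.csv is the desired output name:

const reviewsCsvWriter = createCsvWriter({
  path: 'reviews_info.csv', // example output file name
  header: [
    { id: 'reviewerName', title: 'Reviewer Name' },
    { id: 'rating', title: 'Rating' },
    { id: 'content', title: 'Content' },
  ],
});

await reviewsCsvWriter.writeRecords(reviewRows); // reviewRows: array of review objects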

Step 8: Throttling the Scrape Speed

To throttle the scraping speed and avoid overwhelming the server or violating any rate limits, you can introduce a delay between requests. One way to achieve this is to await a Promise that wraps setTimeout() inside the loop. For example, you can modify the loop that visits each counselor's URL as follows:

for (const counselor of counselors) {
  const counselorSlug = counselor.getAttribute('slug');
  const counselorURL = `https://www.betterhelp.com/${counselorSlug}`;

  // Visit the counselor's URL and extract information

  // Add a delay of 1 second before the next iteration
  await new Promise((resolve) => setTimeout(resolve, 1000));
}

This introduces a 1-second delay between requests, throttling the scraping speed.
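If you prefer, you can factor the delay into a small helper and add a bit of random jitter so requests are not perfectly evenly spaced. A minimal sketch (the jitter range is an arbitrary choice):

// Sleep helper with optional random jitter (base delay plus up to jitterMs extra).
const sleep = (baseMs, jitterMs = 0) =>
  new Promise((resolve) => setTimeout(resolve, baseMs + Math.random() * jitterMs));

// Inside the loop: wait between 1 and 1.5 seconds before the next counselor page.
await sleep(1000, 500);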

Conclusion

By following the steps outlined in this article, you can create a web scraping application using Puppeteer and Express.js to scrape counselor data from BetterHelp’s counselor directory. You will be able to run the script yourself, generate CSV files containing the required information, and control the scraping speed by introducing delays between requests. Remember to handle errors gracefully and customize the scraping logic based on your specific requirements. Happy scraping!
