Scrape a site with Node and Cheerio in 5 minutes

Dylan Sather
7 min read · Aug 14, 2019


Website scraping is a common problem with a common toolset. Two approaches dominate the web today:

  • Automate a browser to navigate a site programmatically, using tools like Puppeteer or Selenium.
  • Make an HTTP request to a website, retrieving data on the page using tools like Cheerio or BeautifulSoup.

The first approach — driving a real browser programmatically — is typical for projects where you’re running automated website tests, or capturing screenshots of your site.

The second approach has limitations. For example, Cheerio “is not a browser” and “does not produce a visual rendering, apply CSS, load external resources, or execute JavaScript”. But this approach is simple, and often sufficient, especially when you’re learning how scraping works.

Scraping 101: fetch sample HTML, extract some basic text from it

In this tutorial, we’re going to scrape a website using Cheerio and Node.js. We’ll run our code on Pipedream (I’m on the team building Pipedream). I’ll also show you how to send yourself an email with some content from the page, and you’ll see how to save it to an Amazon S3 bucket for future analysis.

To get the most out of this tutorial, you should know how to read basic Node (or JavaScript), and understand HTTP.

What is Pipedream?

Pipedream is a developer automation platform that lets you run any Node code, for free, without managing a server. You write the code, Pipedream runs it. Your code can run as a cron job, or be triggered by an HTTP request (like a webhook from some SaaS service).

A quick overview of Pipedream. Click on the GIF to learn more.

I taught at a programming bootcamp before Pipedream. Students learned how to scrape websites, analyze data from Twitter, and build highly complex apps on their local machines. But it took hours for them to deploy that same app to a place where it could run as a cron job or as a public-facing website. They had never run a server or worked with cloud platforms, which made it harder to push their work live.

Pipedream provides a hosted environment to run Node code. There’s no server to run or cloud resources to provision. You sign up with GitHub or Google, write code, and we run that code for you. You get built-in logging, error handling, and more. It’s a lot like AWS Lambda or other cloud function services, but simpler to use.

Enough about that. Let’s get to the code!

Step 1 — Use axios and Cheerio to scrape example.com

https://example.com is the simplest possible webpage, so it’s a great site to reinforce scraping fundamentals.

example.com

We’ll use 2 npm packages to scrape this site:
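  • axios, a promise-based HTTP client for Node
  • cheerio, a server-side implementation of jQuery’s core API for parsing and traversing HTML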

Here’s the code (a minimal version of the workflow’s code step):
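const axios = require("axios");
const cheerio = require("cheerio");

// Fetch the HTML at a given URL and load it into Cheerio
const fetchHTML = async (url) => {
  const { data } = await axios.get(url);
  return cheerio.load(data);
};

// Pipedream runs code steps inside an async function,
// so top-level await works here
const $ = await fetchHTML("https://example.com");

// Print the full HTML of the page
console.log($.html());

// Print the text of the first h1 tag
console.log($("h1").text());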

The fetchHTML function makes an HTTP GET request to whatever URL you pass it using axios.get, downloading the site’s HTML. cheerio.load loads that HTML as a DOM-like object we can use to parse the website’s content.

If you’re familiar with jQuery, you’ll feel at home with Cheerio. Cheerio implements the $ object, using the same concepts for selecting specific elements from the DOM (your webpage).

$.html() “renders” the webpage. In other words, it returns a string representation of the HTML on the page.

$('h1').text() returns the text within the first h1 tag. You can use other selectors like this to find elements that meet some condition (e.g. elements with a particular class or id), then read or modify them using methods like add or remove. See the Cheerio docs to learn more.
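For example, given a page with these (hypothetical) classes and ids, you could write:

$(".headline").text();      // combined text of all elements with class "headline"
$("#main a").attr("href");  // href attribute of the first link inside #main
$("p").first().text();      // text of the first paragraph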

Running this code on Pipedream

Let’s run this code to see how this works. Open this Pipedream scraping workflow in a new tab.

You’ll see two steps in this workflow:

Run this Node code on a schedule

The Cron Scheduler Source lets us run any code on a schedule. The Run Node.js Code step below the source includes the code we reviewed above.

Click the green Fork button in the top-right to create a copy of this workflow in your Pipedream account:

If you haven’t signed up, you’ll be asked to. Login happens through your GitHub or Google account. You can run this workflow up to 25 million times per month for free.

The code in this workflow is public. You can share the URL of your workflow with anyone, and they can fork and use it just like you did with mine. When your scraping code runs, however, all logs and data are private to your account.

Notice that your cron job is turned Off by default. This lets you test and modify the code before you turn it on. To manually run your workflow, click the Run Now button:

Click Run Now, and you’ll see the logs show up once your job finishes

This runs your forked workflow with the press of a button. No need to npm install axios or Cheerio. No need to deploy your code somewhere. Pretty cool.

Once it’s done, you’ll see how long it took to run:

If you scroll to the bottom of the code step, you’ll see the HTML and h1 tag we pulled from https://example.com:

This fork is yours to modify. Change the https://example.com URL to your own URL. Play around with Cheerio selectors to get just the content you need, then Save and Run Now anytime you’d like to test your code.

Once your code looks good, you can schedule the cron job to run whenever you’d like by selecting the appropriate option from the Cron Scheduler source:

You can use cron expressions to schedule your job at any frequency

Read more about cron jobs in our docs.
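A standard cron expression has five fields: minute, hour, day of month, month, and day of week. A few common examples:

0 * * * *       run every hour, on the hour
*/15 * * * *    run every 15 minutes
0 9 * * 1       run at 9:00 AM every Monday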

Step 2 — Send the results somewhere

You’ll probably want to do more than console.log the content you just parsed. You may want to email yourself some of the content, or save it for later analysis.

We’ll walk through two examples here:

  • Email yourself page content
  • Save content to Amazon S3

Email yourself page content

This Pipedream workflow implements the same scraping logic as above, but it also emails you the content of the h1 tag of your site. You can fork that workflow and click Run Now just like you did above.

This code (a minimal version of the workflow’s code step, with an illustrative subject line):
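const axios = require("axios");
const cheerio = require("cheerio");

const fetchHTML = async (url) => {
  const { data } = await axios.get(url);
  return cheerio.load(data);
};

const $ = await fetchHTML("https://example.com");

// $send.email is available in any Pipedream code step.
// The subject and body text here are illustrative.
$send.email({
  subject: "Scraped data from example.com",
  text: `The h1 tag of the page contains: ${$("h1").text()}`,
});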

sends this email:

$send.email is a built-in Pipedream function — available in any code step — that you can use to send yourself an email. Pass a text property like I do above to send a plaintext body, or html to send HTML emails. Read more about $send.email here.

The primary limitation is that you can only email yourself (the email address tied to the account you signed up with). If you need to email someone else, you can use the Nodemailer package or any transactional email service, like SendGrid or Mandrill.
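For example, here’s a sketch using Nodemailer over SMTP (the host, addresses, and credentials below are placeholders):

const nodemailer = require("nodemailer");

// Placeholder SMTP settings: substitute your provider's host and credentials
const transporter = nodemailer.createTransport({
  host: "smtp.example.com",
  port: 587,
  auth: { user: "you@example.com", pass: process.env.SMTP_PASSWORD },
});

// Assumes $ was loaded by fetchHTML, as in the code step above
await transporter.sendMail({
  from: "you@example.com",
  to: "someone-else@example.com",
  subject: "Scraped data",
  text: `The h1 tag of the page contains: ${$("h1").text()}`,
});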

Save results to Amazon S3

This workflow implements the same scraping logic as above, but it also stores the full HTML of the page you scraped in an Amazon S3 bucket.

Amazon S3 lets you store any data — HTML documents, JSON, anything — cheaply and securely in the cloud. If you need to store and analyze the data you get from web scraping, a place like S3 is a common choice.

To use this workflow, you’ll need an existing S3 bucket where you want to store your data. You’ll also need to add this Bucket Policy to that bucket to allow Pipedream to store data there.
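A bucket policy is a JSON document attached to the bucket. Its general shape looks like the sketch below; the actual principal to allow is listed in the Pipedream docs, so the account ID here is a placeholder:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::<PIPEDREAM_ACCOUNT_ID>:root" },
      "Action": "s3:PutObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}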

Once that’s done, add your bucket name to the Bucket field in the Send to Amazon S3 action:

Add your bucket name in the Bucket field. You can modify the Prefix and Payload fields, too.

Click Run Now to run your job like you did above. It’ll take roughly 60 seconds for the data to get delivered to your S3 bucket. Once you see a Success message in the Result section below the S3 action:

you should see the HTML in your bucket, within the website-scraping-data prefix.

Notice that we stored our HTML in a property of the $event object in the code step:

$event.html = $.html()

$event — “dollar event” — is a JavaScript object you can use to store data between steps of a workflow. Here we save the HTML from our site to an html property of $event, then reference it in the Payload field of the S3 action.
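For example, a code step can stash several values for downstream steps (the property names below are arbitrary):

// In a code step: save data for later steps
$event.html = $.html();
$event.title = $("h1").text();

// A later step, or an action field like the S3 Payload,
// can then reference $event.html or $event.title.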

The name of your S3 bucket will not be visible when users view your public workflow, but the Prefix and Payload parameters (e.g. $event.html) will be, so that others can use those default values. For example, this is what you’ll see on my public view of the workflow:

Learn more

axios and Cheerio both have detailed docs you should read to learn more about those tools. If you have any questions about the code above, feel free to comment below!

We’d also love to hear what you think of Pipedream — please reach out or comment below with any questions or feedback. You can read more about the Pipedream Node.js execution environment or the platform at large.
