Website scraping is a common problem with a common toolset. Two approaches dominate the web today:
- Automate a browser to navigate a site programmatically, using tools like Puppeteer or Selenium.
- Make an HTTP request for a page, then parse the returned HTML using tools like Cheerio or BeautifulSoup.
The first approach — driving a real browser programmatically — is typical for projects where you’re running automated website tests, or capturing screenshots of your site.
In this tutorial, we’re going to scrape a website using Cheerio and Node.js. We’ll run our code on Pipedream (I’m on the team building Pipedream). I’ll also show you how to send yourself an email with some content from the page, and you’ll see how to save it to an Amazon S3 bucket for future analysis.
What is Pipedream?
Pipedream is a developer automation platform that lets you run any Node code, for free, without managing a server. You write the code, Pipedream runs it. Your code can run as a cron job, or be triggered by an HTTP request (like a webhook from some SaaS service).
I taught a programming bootcamp before Pipedream. Students learned how to scrape websites, analyze data from Twitter, and build highly complex apps on their local machines. But it took hours for them to deploy that same app to a place where it could run as a cron job or as a public-facing website. They’d never run a server or worked with cloud platforms, which made it harder to push their work live.
Pipedream provides a hosted environment to run Node code. There’s no server to run or cloud resources to provision. You sign up with Github or Google, write code, and we run that code for you. You get built-in logging, error handling, and more. It’s a lot like AWS Lambda or other cloud functions services, but simpler to use.
Enough about that. Let’s get to the code!
Step 1 — Use axios and Cheerio to scrape example.com
https://example.com is the simplest possible webpage, so it’s a great site to reinforce scraping fundamentals.
We’ll use 2 npm packages to scrape this site:
- axios — makes the HTTP GET request to https://example.com.
- cheerio — parses the HTML that’s returned, so we can grab specific content on the page.
Here’s the code:
The fetchHTML function makes an HTTP GET request to whatever URL you pass it using axios.get, downloading the site’s HTML. cheerio.load parses that HTML into a DOM-like object we can use to query the website’s content.
If you’re familiar with jQuery, you’ll feel at home with Cheerio. Cheerio implements the $ object, using the same concepts for selecting specific elements from the DOM (your webpage).
$.html() “renders” the webpage. In other words, it returns a string representation of the HTML on the page.
$('h1').text() returns the text within the first h1 tag. You can use other selectors like this to find elements that meet some condition (e.g. elements with a given class or id), then read or modify them using methods like remove. See the Cheerio docs to learn more.
Running this code on Pipedream
Let’s run this code to see how this works. Open this Pipedream scraping workflow in a new tab.
You’ll see two steps in this workflow:
The Cron Scheduler Source lets us run any code on a schedule. The Run Node.js Code step below the source includes the code we reviewed above.
Click the green Fork button in the top-right to create a copy of this workflow in your Pipedream account:
If you haven’t signed up, you’ll be asked to. Login happens through your Github or Google account. You can run this workflow up to 25 million times per month for free.
The code in this workflow is public. You can share the URL of your workflow with anyone, and they can fork and use it just like you did with mine. When your scraping code runs, however, all logs and data are private to your account.
Notice that your cron job is turned Off by default. This lets you test and modify the code before you turn it on. To manually run your workflow, click the Run Now button:
This runs your forked workflow with the press of a button. No need to npm install axios or Cheerio. No need to deploy your code somewhere. Pretty cool.
Once it’s done, you’ll see how long it took to run:
If you scroll to the bottom of the code step, you’ll see the HTML and h1 tag we pulled from https://example.com:
This fork is yours to modify. Change the https://example.com URL to your own URL. Play around with Cheerio selectors to get just the content you need, then Save and Run Now anytime you’d like to test your code.
Once your code looks good, you can schedule the cron job to run whenever you’d like by selecting the appropriate option from the Cron Scheduler source:
Step 2 — Send the results somewhere
You’ll probably want to do more than console.log the content you just parsed. You may want to email yourself some of the content, or save it for later analysis.
We’ll walk through two examples here:
- Email yourself page content
- Save content to Amazon S3
Email yourself page content
This Pipedream workflow implements the same scraping logic as above, but it also emails you the content of the h1 tag of your site. You can fork that workflow and click Run Now just like you did above. Running the workflow sends this email:
$send.email is a built-in Pipedream function, available in any code step, that you can use to send yourself an email. Pass a text property like I do above to send a plaintext body, or an html property to send HTML emails. Read more about $send.email in the Pipedream docs.
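As a rough sketch of the call shape (buildEmailPayload here is a hypothetical helper for illustration, not part of Pipedream's API):

```javascript
// Build the email payload in plain JS, then hand it to Pipedream's
// built-in $send.email inside a code step.
function buildEmailPayload(h1Text) {
  return {
    subject: "Scraped content from example.com",
    text: `The page's h1 tag contains: ${h1Text}`, // plaintext body
    // html: `<b>${h1Text}</b>`,                   // or send an HTML body instead
  };
}

// Inside a Pipedream code step, you'd call something like:
// $send.email(buildEmailPayload($("h1").text()));
```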
The primary limitation is that you can only email yourself (the email address tied to the account you signed up with). If you need to email someone else, you can use the Nodemailer package or any transactional email service, like Sendgrid or Mandrill.
Save results to Amazon S3
This workflow implements the same scraping logic as above, but it also stores the full HTML of the page you scraped in an Amazon S3 bucket.
Amazon S3 lets you store any data — HTML documents, JSON, anything — cheaply and securely in the cloud. If you need to keep and analyze the data you get from web scraping, it’s common to store it in a place like S3.
Once you’ve created an S3 bucket (or picked an existing one), add your bucket name to the Bucket field in the Send to Amazon S3 action:
Click Run Now to run your job like you did above. It’ll take roughly 60 seconds for the data to get delivered to your S3 bucket. Once you see a Success message in the Result section below the S3 action, you should see the HTML in your bucket, under the prefix specified in the action.
Notice that we stored our HTML in a property of the $event object in the code step:

$event.html = $.html()

We set the html property of $event, then reference it in the Payload field of the S3 action.
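A minimal sketch of that hand-off (the $event stand-in below is for illustration; Pipedream provides the real object at runtime):

```javascript
// In a Pipedream workflow, $event is passed from step to step.
// Here we fake it with a plain object to show the hand-off.
const $event = {}; // provided by the Pipedream runtime in a real workflow

// At the end of the code step, stash the rendered HTML on $event.
// This string stands in for $.html() from the scraping code above.
const html = "<html><body><h1>Example Domain</h1></body></html>";
$event.html = html;

// The S3 action's Payload field then references $event.html,
// so the full HTML lands in your bucket.
```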
The name of your S3 bucket will not be visible when users view your public workflow, but the Prefix and Payload parameters (e.g. $event.html) will be, so that others can use those default values. For example, this is what you’ll see on my public view of the workflow:
We’d also love to hear what you think of Pipedream — please reach out or comment below with any questions or feedback. You can read more about the Pipedream Node.js execution environment or the platform at large.