Monitor Websites with Cheerio and DataFire

Bobby Brennan
DataFire.io
6 min read · Sep 20, 2017

One of my favorite new features in DataFire is full Node.js support, including npm modules. Tasks that used to be cumbersome can now be delegated to one of the healthiest open-source communities out there.

One library I consistently add to my DataFire projects is cheerio, a jQuery-like HTML parser. Cheerio makes it wonderfully easy to convert an HTML page into structured data. When combined with DataFire, that structured data can fuel a REST API, trigger alerts, or be stored for analysis.

As an example, let’s set up a DataFire project that retrieves download stats for projects on npm. We’ll use the DataFire.io website to build the project, but if you’re more comfortable coding, you can follow along using the open-source framework as well. You can also fork this project on GitHub or on DataFire.io.

Here are the steps we’ll take:

  1. Create a new project on DataFire, and add the cheerio library
  2. Create an action that scrapes the stats for an npm package from its page on npmjs.com
  3. Create a path trigger (i.e. a URL), GET /{package_name}/stats, which will respond with the stats for that package
  4. Bonus: create a task trigger that will ping our Slack channel every morning with the latest stats for the DataFire package

1. Creating the Project

First, log into your DataFire account and create a new project. Head over to Integrations -> NPM to add the cheerio library:

This will allow us to call require('cheerio') inside our action.

2. Creating the Action

Now head to the Actions tab and create a new action named get_npm_stats.

Now let’s add the package_name input. Head to the action’s Setup page and click Add an input, setting the title to package_name. You can also add some validation to sanity-check the input:
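To give a rough idea of what that sanity check could look like in code, here is a minimal sketch. The JSON-Schema-style fields and the pattern are assumptions for illustration (a simplified version of npm's naming rules), and isValidPackageName is a hypothetical helper, not part of DataFire.

```javascript
// Hypothetical input definition, written in a JSON-Schema-like style.
// The pattern is a simplified take on npm's naming rules: lowercase,
// URL-safe characters, with an optional "@scope/" prefix.
const packageNameInput = {
  title: 'package_name',
  type: 'string',
  maxLength: 214,
  pattern: '^(?:@[a-z0-9~-][a-z0-9._~-]*/)?[a-z0-9~-][a-z0-9._~-]*$',
};

// Returns true only for non-empty strings that fit the definition above.
function isValidPackageName(value) {
  return typeof value === 'string'
    && value.length > 0
    && value.length <= packageNameInput.maxLength
    && new RegExp(packageNameInput.pattern).test(value);
}

console.log(isValidPackageName('datafire'));
console.log(isValidPackageName('../secret'));
```

Rejecting odd input early matters here because the package name is concatenated straight into a URL in the next step.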

Now we need to tell the action what to do. Click Next to go to the Steps page, and add a step that uses the action http/get to retrieve the package’s page from npmjs.com:

Note that we’ve set the url input to "https://www.npmjs.com/package/" + input.package_name. As you type, the corresponding code appears in the code editor.
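The code DataFire generates isn't reproduced here, but the idea behind this step can be sketched in plain Node.js. The snippet below is not DataFire's code: npmPageUrl and httpGet are hypothetical helpers, with httpGet standing in for the http/get action by using Node's built-in https module.

```javascript
const https = require('https');

// Build the page URL the same way the step does: base URL + package name.
function npmPageUrl(packageName) {
  return 'https://www.npmjs.com/package/' + packageName;
}

// Stand-in for http/get. Resolves with { statusCode, body }, where body is
// the raw HTML as a string. Redirects are not followed in this sketch.
function httpGet(url) {
  return new Promise((resolve, reject) => {
    https.get(url, res => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', chunk => { body += chunk; });
      res.on('end', () => resolve({ statusCode: res.statusCode, body }));
    }).on('error', reject);
  });
}

// Usage (performs a real network request, so it is left commented out):
// httpGet(npmPageUrl('datafire')).then(response => console.log(response.statusCode));
```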

For the next step, we’ll add a bit of custom logic that tells cheerio what to extract from the webpage.

To break it down:

  1. The line .then(response => { starts a new step, naming the result of the previous step response.
  2. The line let $ = require('cheerio').load(response.body) passes the HTML we got from npmjs.com to the cheerio library, creating a jQuery-like interface.
  3. For each of daily/weekly/monthly downloads, we extract the number as a string using a CSS selector like $('.daily-downloads').text(), and convert it to a number with the unary + operator.
  4. Finally, we return an object that might look something like this:
{
  name: 'datafire',
  downloads: {
    daily: 85,
    weekly: 215,
    monthly: 1898
  }
}
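Put together, the extraction step described above could look roughly like the sketch below. A few assumptions: the .weekly-downloads and .monthly-downloads selectors are guessed by analogy with .daily-downloads and may not match the page's current markup; the helper names are hypothetical; and stripping thousands separators before the + conversion is an addition of this sketch, since +'1,898' would otherwise be NaN.

```javascript
// Convert text such as "1,898" into the number 1898.
// Returns NaN when no digits are found (e.g. the selector matched nothing).
function toNumber(text) {
  const digits = String(text).replace(/[^0-9]/g, '');
  return digits ? +digits : NaN;
}

// Given a cheerio-style `$` function, pull out the three download counts.
function extractStats($, packageName) {
  const count = selector => toNumber($(selector).text());
  return {
    name: packageName,
    downloads: {
      daily: count('.daily-downloads'),
      weekly: count('.weekly-downloads'),   // assumed selector
      monthly: count('.monthly-downloads'), // assumed selector
    },
  };
}

// Load the HTML from the previous step into cheerio, then extract.
// cheerio is required lazily, mirroring the inline require in the step.
function statsFromResponse(response, packageName) {
  const $ = require('cheerio').load(response.body);
  return extractStats($, packageName);
}
```

Keeping extractStats separate from the cheerio.load call makes the selector logic easy to exercise with a stubbed `$`, without hitting npmjs.com.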

Now let’s click Run to test the action:

Looks good! Now we can set up triggers, which will run this action in response to certain events.

3. The Path Trigger

DataFire makes it incredibly simple to deploy a REST API — path triggers will fire your actions whenever a particular URL is called. In this case, we’ll create a path trigger at GET /{package_name}/stats.

First, head to the Triggers -> Paths page to create a new path trigger:

You can make the path anything you want, but be sure to include {package_name}, which is what we named the input for our action. The get_npm_stats action you created in step 2 should already be selected.
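To make the role of {package_name} concrete, here is a small sketch of how a path template can be matched against an incoming URL path and turned into action inputs. It is an illustration only, not DataFire's router, and matchPath is a hypothetical helper.

```javascript
// Match a template like "/{package_name}/stats" against a request path.
// Returns an object of extracted parameters, or null if the path doesn't fit.
function matchPath(template, path) {
  const names = [];
  const pattern = template
    .split('/')
    .map(part => {
      const placeholder = /^\{(\w+)\}$/.exec(part);
      if (!placeholder) {
        // Literal segment: escape any regex metacharacters.
        return part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
      }
      names.push(placeholder[1]);
      return '([^/]+)'; // a parameter matches exactly one path segment
    })
    .join('/');

  const match = new RegExp('^' + pattern + '$').exec(path);
  if (!match) return null;

  const params = {};
  names.forEach((name, i) => {
    params[name] = decodeURIComponent(match[i + 1]);
  });
  return params;
}

console.log(matchPath('/{package_name}/stats', '/cheerio/stats'));
```

The extracted object has the same key as the action's input, which is why the placeholder name in the path has to match the input name.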

Click Deploy to launch the API. Note that dev deployments are free, but will shut down after a period of inactivity. Alternatively, visit the Project -> Settings tab for instructions on how to download and run your project on your own hardware.

Now we can use the API! We’ve deployed a public prod server where you can test it out — try the following URLs:

https://npm-stats.prod.with-datafire.io/datafire/stats

https://npm-stats.prod.with-datafire.io/cheerio/stats

https://npm-stats.prod.with-datafire.io/express/stats

4. Bonus: Posting Stats to Slack

Lastly, we want to get an alert in our Slack channel every morning with the latest download stats for DataFire. To do this, we’ll create a new action that calls our first action, followed by slack/chatPostMessage.

The Action

Create a new action called post_stats_to_slack (see step 2 for details). Choose the get_npm_stats action for the first step, and set the label to stats:

Here we’ve set package_name to 'datafire', but you can set it to any package name.

Next, click Hide step details and add a new step, setting the action to slack/chatPostMessage:

For the input, we’ll set the channel to 'general' and text to be a JSON dump of stats:
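In code, the two steps might be wired together roughly like this. buildSlackMessage and postStatsToSlack are hypothetical names, and the getStats and postMessage arguments are placeholders for the get_npm_stats and slack/chatPostMessage actions, each assumed to return a promise.

```javascript
// Build the input for the Slack step: the channel name plus a
// pretty-printed JSON dump of the stats object.
function buildSlackMessage(stats) {
  return {
    channel: 'general',
    text: JSON.stringify(stats, null, 2),
  };
}

// Chain the two steps: fetch the stats, then post them.
function postStatsToSlack(getStats, postMessage) {
  return getStats({ package_name: 'datafire' })
    .then(stats => postMessage(buildSlackMessage(stats)));
}
```

Pretty-printing with an indent of 2 is just a readability choice for the Slack message; a plain JSON.stringify(stats) would work as well.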

Finally, we need to add a Slack account. Head to the Integrations tab and click Add a new account under Slack:

Make sure the im:write scope is selected — we’ll need that permission in order to access slack/chatPostMessage.

Clicking Add Account will bring you to Slack, where you can add the authorization:

Now if you go back to your action and hit Run, you’ll get a new message in your Slack channel:

The Task Trigger

Last, all we have to do is tell DataFire to run the post_stats_to_slack action every day. Head to Triggers -> Tasks and create a new task:

That’s it! Deploy to prod and your task will run every 24 hours.

Wrapping Up

I can’t overstate how excited I am that npm modules are now supported in DataFire. A project like this would have been much more difficult without help from the cheerio library.

If you want to take this project further, feel free to fork it on GitHub or on DataFire.io!

Bobby Brennan
DataFire.io

I’m a Software Engineer, specializing in Dev Tools, NLP, and Machine Learning