Make a Web Scraper with AWS Lambda and the Serverless Framework

Harriet Ryder · Published in Northcoders · Jul 23, 2017 · 9 min read
THE application for anyone seeking donkey-related employment

In this tutorial, I’ll walk you through the basics of making a web scraper with Node.js and AWS Lambda. My scraper will check daily whether The Donkey Sanctuary has any new job listings and will send me an SMS if they do.

Before you start, you should be familiar with:

  • Node.js and modern JavaScript
  • NPM
  • The Document Object Model
  • Basic Linux command line
  • Basic donkey care 🦄

If you have never used Amazon Web Services before, read about Lambda here, sign up for AWS, and then take a look at this article, which walks you through making your first Lambda function. The principle of AWS is that all parts of your application, from storage to computing power, are provisioned by Amazon and hosted in a cloud environment (i.e. on Amazon’s computers), allowing you to build serverless applications that scale automatically. You don’t have to worry about building and managing servers, as Amazon does all that for you. A Lambda function is simply a function that lives in the cloud, runs whenever it’s needed, and is triggered by events or API calls.

We’ll use the serverless framework to build our Lambda function. It’s okay if you haven’t used it before. I’ll explain it as I go! Later, we’ll also use another AWS feature, DynamoDB, to store data from one day to the next. Finally, we’ll integrate with Nexmo to send a text message with our daily updates.

Why a scraper?

Imagine you want to compile a list of the latest recipes that are posted on a certain website. Of course, extracting this information from the site can be easily done with your eyes by simply navigating to the page and having a look, but imagine you want to check the same thing every day and build a spreadsheet from the data over a long period. Not such an enticing task now. Time for a bit of automation!

Scraping is the term given to the process of grabbing the HTML from a page and programmatically extracting information from it: for example, new recipes added, the day’s stock prices, weather reports, betting odds, sports results, new job listings, etc. It’s a very handy thing to know how to do.

Step 1: Serverless Setup

Take a look at the Quick Start guide to the serverless framework and make sure you have completed the prerequisites. Serverless will take all the pain out of configuring our AWS environment and allow us to develop and test locally until we’re finally ready to deploy everything to the cloud.
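
If you haven’t set it up yet, the gist of it is the following (assuming you’ve already created an AWS access key and secret for your account):

$ npm install -g serverless
$ serverless config credentials --provider aws --key YOUR_KEY --secret YOUR_SECRET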

Let’s make a new serverless project:

$ serverless create --template aws-nodejs --path donkeyjob
$ cd donkeyjob

The project initialises with a serverless.yml file. YAML is a language often used for configuration files, and it’s this file that contains all the configuration for our AWS resources. We can get rid of all the comments and make do with the following for now:

service: donkeyjob

provider:
  name: aws
  runtime: nodejs6.10

functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs

Because we said we wanted a function called getdonkeyjobs, we’re going to export a function with that name from handler.js. This is the function that we’ll eventually deploy to AWS and that will be triggered every day to get our job listings.

Over in handler.js, let’s create that basic function. Lambda functions receive an event, a context and a callback. Right now, let’s just add some boilerplate. The rest of the file can be deleted.
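
Something like this will do (the response body is just a placeholder):

// handler.js
module.exports.getdonkeyjobs = (event, context, callback) => {
  callback(null, { message: 'Hello world' });
};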

And let’s test it locally (note that we won’t be deploying anything to AWS for a little while yet)…

$ serverless invoke local --function getdonkeyjobs

And we should see our “Hello world” response.

Step 2: Scrape, scrape, scrape!

Let’s build out the scraping functionality. My aim is to make a request to the Donkey Sanctuary Jobs page and parse the HTML to generate an array of jobs in the format:
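
// The shape below matches the item we'll store in DynamoDB later;
// the closing dates are ISO strings, courtesy of moment
[
  { job: 'Donkey Feeder',
    closing: '2017-07-21T00:00:00.000Z',
    location: 'Leeds, UK' },
  { job: 'Chef',
    closing: '2017-07-21T00:00:00.000Z',
    location: 'Sheffield, UK' }
]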

We use axios to make the request for the page contents and then pass the HTML string on to a parsing function, which can be tested in isolation. Inside the parsing function we use the library cheerio to parse the HTML and pull out the information we want. Cheerio works a lot like jQuery, but it’s perfect for server-side use because you can feed it an HTML string (i.e. the response you receive from a GET request for a page) and it will create a document object model that you can traverse and manipulate. Moment is a handy library for working with dates and allows us to easily create a standard ISO string format.

In order to use cheerio, we need to know precisely how to traverse the DOM and how to select the elements we want. To do this, you’ll need to spend some time investigating the HTML structure of the page you’re scraping using the dev tools in your browser, and remember that if the structure of that HTML changes in the future, your scraper could be rendered useless.
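
Here’s a sketch of how that might look (install the libraries first with npm install --save axios cheerio moment; the URL, CSS selectors and date format below are illustrative, so check the real page with your dev tools):

// handler.js — a sketch; selectors and URL are illustrative
const axios = require('axios');
const cheerio = require('cheerio');
const moment = require('moment');

// Hypothetical address; use the real jobs page URL
const JOBS_URL = 'https://www.thedonkeysanctuary.org.uk/vacancies';

// Parse an HTML string into an array of job objects
function parseJobs (html) {
  const $ = cheerio.load(html);
  return $('.job-listing').map((i, el) => ({
    job: $(el).find('.job-title').text().trim(),
    // Parse whatever date format the page uses into an ISO string
    closing: moment($(el).find('.closing-date').text().trim(), 'DD MMMM YYYY').toISOString(),
    location: $(el).find('.job-location').text().trim()
  })).get();
}

module.exports.getdonkeyjobs = (event, context, callback) => {
  axios.get(JOBS_URL)
    .then(response => callback(null, parseJobs(response.data)))
    .catch(callback);
};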

Now if we test our function we should see our array of jobs:

$ serverless invoke local --function getdonkeyjobs

Step 3: Set Up DynamoDB

Before we get carried away and think about how to use Nexmo to send an update to our user’s phone, we need to do a bit more work. We’ve found a bunch of jobs — great — but we’re going to be checking the jobs page every day once this thing’s deployed to AWS and we don’t want to send an update every day if nothing’s changed.

We need to persist our data from one day to the next to be able to figure out whether there are new jobs or not. We cannot do this with a Lambda function alone, which only allows you to save temporary data. However, it’s a perfect job for a database! And of course AWS has its very own database we can use — DynamoDB. I went with DynamoDB because it’s non-relational and simple(ish).

We first need to configure DynamoDB as an AWS resource and give our Lambda function permission to interact with it. Now the serverless.yml looks like this:
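
Something along these lines (a sketch — the table name, key and throughput values are our own choices; the IAM statement is what grants the function access to the table):

service: donkeyjob

provider:
  name: aws
  runtime: nodejs6.10
  iamRoleStatements:
    - Effect: Allow
      Action:
        - dynamodb:Scan
        - dynamodb:PutItem
        - dynamodb:DeleteItem
      Resource: "arn:aws:dynamodb:*:*:table/donkeyjobs"

functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs

resources:
  Resources:
    donkeyjobsTable:
      Type: AWS::DynamoDB::Table
      Properties:
        TableName: donkeyjobs
        AttributeDefinitions:
          - AttributeName: listingId
            AttributeType: S
        KeySchema:
          - AttributeName: listingId
            KeyType: HASH
        ProvisionedThroughput:
          ReadCapacityUnits: 1
          WriteCapacityUnits: 1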

And we’ll need to deploy this to AWS to actually create the DynamoDB resource. We can test a Lambda function locally before interacting with AWS because it’s just a function, but it makes sense that we can’t test how a database works without actually having a database.

So we run:

$ serverless deploy

Which deploys our application to AWS and creates the resources we’ve requested in the configuration file.

Step 4: Interact with DynamoDB

Now we can actually start using our database. To do this, we need to install and use a package called aws-sdk (the AWS Software Development Kit), which makes interacting with DynamoDB easier. See the AWS SDK docs for examples of how to Create, Read, Update and Delete using the SDK.
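
The SDK comes preinstalled in the Lambda environment, but we need it locally for testing, along with lodash, which we’ll use for the comparison shortly:

$ npm install --save aws-sdk lodash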

Here’s what we want to do when we scrape a new list of jobs (sketched in code after this list):

  • Retrieve yesterday’s jobs from the database (at the beginning, there won’t be any) with the dynamo.scan method

NB. We’ll be storing just one single Item at a time in the database — yesterday’s jobs. We never need to store more than that. We’ll store this item in the format:

{
  jobs: [
    { job: 'Donkey Feeder',
      closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
      location: 'Leeds, UK' },
    { job: 'Chef',
      closing: 'Fri Jul 21 2017 00:00:00 GMT+0100',
      location: 'Sheffield, UK' }
  ],
  listingId: 'Fri Jul 21 2017 14:25:35 GMT+0100 (BST)'
}
  • Compare yesterday’s jobs with today’s to find the difference, using some handy lodash methods
  • Delete yesterday’s jobs from the database with the dynamo.delete method
  • Save today’s jobs in their place with the dynamo.put method
  • Call the callback with the new jobs
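
Here’s a sketch of those steps (minimal error handling; the table name is the one we configured above, and findNewJobs is a helper name of our own choosing):

// handler.js — a sketch of the database logic
const AWS = require('aws-sdk');
const _ = require('lodash');

const dynamo = new AWS.DynamoDB.DocumentClient();

// Given today's scraped jobs, work out which are new, replace
// yesterday's item with today's, and pass the new jobs on
function findNewJobs (todaysJobs, callback) {
  dynamo.scan({ TableName: 'donkeyjobs' }, (err, data) => {
    if (err) return callback(err);
    const yesterday = data.Items[0]; // we only ever store one item
    const newJobs = _.differenceWith(todaysJobs, yesterday ? yesterday.jobs : [], _.isEqual);

    // Save today's jobs as the new single item
    const saveToday = () => dynamo.put({
      TableName: 'donkeyjobs',
      Item: { listingId: new Date().toString(), jobs: todaysJobs }
    }, err => callback(err, newJobs));

    if (yesterday) {
      // Remove yesterday's item first, then save today's
      dynamo.delete({
        TableName: 'donkeyjobs',
        Key: { listingId: yesterday.listingId }
      }, err => err ? callback(err) : saveToday());
    } else {
      saveToday();
    }
  });
}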

We can test our function locally by running

$ serverless invoke local --function getdonkeyjobs

And we should see our callback is called with an array of all the jobs listed on The Donkey Sanctuary today, because as far as we are concerned they are all ‘new’. There were no jobs already in our database from yesterday.

If you open the AWS console now, go to DynamoDB, find your donkeyjobs table and take a look at the items, you should see today’s data saved there.

If you run the function locally again, you should see that the jobs array is empty. That’s because this time we’re comparing the jobs against what’s already in the database, and unless a new job was added in the last few minutes, nothing has changed.

Step 5: Send a text with Nexmo

Now we have a list of new jobs, let’s send an SMS to our user informing them of all the exciting donkey jobs they could be busy applying for!

First, sign up with Nexmo. It gives you $2 of free credit to play around with, which is plenty. Once you sign up, you should end up at the dashboard, which gives you a key and a secret. You’ll need these to send a text message from Nexmo.

We can use the nexmo npm package to handle the request to send a text really easily. Install it and bring it into your handler.js file. Before calling the final callback on our getdonkeyjobs handler, we can send any text message we want:
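
Something like this (a sketch — install it with npm install --save nexmo; the environment variable names are our own, the numbers are placeholders, and it’s best to keep the key and secret out of source control):

// handler.js
const Nexmo = require('nexmo');

const nexmo = new Nexmo({
  apiKey: process.env.NEXMO_API_KEY,      // from the Nexmo dashboard
  apiSecret: process.env.NEXMO_API_SECRET
});

// sendSms(from, to, text, callback); 'to' must include the country code
nexmo.message.sendSms('DONKEYJOBS', '447700900000', 'New donkey jobs have been listed!',
  (err, responseData) => {
    if (err) console.error(err);
  });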

To test this out, we’ll have to empty the DynamoDB table so that the Lambda actually does think there are new jobs (we can delete the item from the AWS console) and then we can run our function locally again.

And with any luck, we should have received a text message!

Now the last thing to do is format our text message a little better. We can create a new helper function for this, which receives the list of jobs and returns a formatted message listing out the deadlines, locations and job titles of everything that’s available.
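
For example (a sketch — formatJobsMessage is a name of our own choosing, and moment is already required above):

// Turn an array of job objects into a readable SMS body
function formatJobsMessage (jobs) {
  // One line per job: title, location and a friendly closing date
  const lines = jobs.map(j =>
    `${j.job}, ${j.location} (closes ${moment(j.closing).format('Do MMM YYYY')})`
  );
  return `New donkey jobs!\n${lines.join('\n')}`;
}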

Remember that we’ll have to keep clearing the table when we want to test our function (there are probably better ways to do this, but it’s easy enough to just delete the Item on the AWS console for now).

And now that we’re done, we can deploy our whole application to AWS one final time:

$ serverless deploy

Step 6: Configure the Lambda to run every day

Once we’ve deployed the function, we can test it to check everything works:
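
For example, by invoking the deployed function from the command line (note there’s no local this time):

$ serverless invoke --function getdonkeyjobs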

And we can also choose to run the function automatically once a day. In the Lambda console, we select ‘Add Trigger’, choose ‘CloudWatch Events’ from the list and then fill in the required fields. We can use the schedule expression rate(1 day) to run it daily.
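
Alternatively, if you’d rather keep this in version control, the serverless framework can configure the schedule for us; add a schedule event to the function in serverless.yml and redeploy:

functions:
  getdonkeyjobs:
    handler: handler.getdonkeyjobs
    events:
      - schedule: rate(1 day)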

And that’s it!

🦄 🦄 🦄 🦄 🦄 🦄 🦄

Thanks for reading! I would love to hear from you if this article was helpful or you have suggestions for improvement.

Full code available here: https://github.com/harrietty/AWS-donkeyjob-scraper

Harriet is the Head of Community at Northcoders
