Web Scraping with Phantom.js on AWS Lambda

Recently I have been working on an idea I had for a better job search site, vantechjobs.

The main idea was to give a better sense of a job by pulling keywords & showing them as a preview. This means you could have 10 jobs with the title ‘Software Engineer’, but because we can see the keywords from the advertisement, the difference between each of them is immediately clear.

Preview of a job search site I’ve been working on: https://www.vantechjobs.co/

To test this idea, I needed a large enough set of job advertisements to test my app’s keyword extraction logic. Many APIs I found online which offered job advertisements did not support a free tier, or were limited in what kind of information they could offer. Web scraping offered a simple, free way to get the data needed to build out this prototype.

The implementation I have is an AWS Lambda function using Phantom.js and Cheerio. I’m pulling jobs from the Amazon careers site.

I’ve put a sample version of this on GitHub. This post explains that implementation.

When starting out with web scraping, the most obvious place to start is with simple HTTP requests to the site you’re interested in. This is fine in some cases, but most likely the site you’ll be accessing is a Single Page Application using JavaScript, so you’ll need a way to execute the JavaScript on the page to get the results you’re expecting.

Phantom allows us to do this. However, unlike many packages you’ll get from npm, Phantom must be compiled for the platform it is being deployed on. This means that we need to build the project in a Linux environment for our app to run on AWS Lambda.

Phantom fires an event when the page has finished loading. However, if the web app makes further API calls after that point, we need to wait for them to finish before capturing the result. We can do this by adding an arbitrary wait after the finished-loading event has fired.
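The wait can be sketched roughly as follows. This is a minimal sketch, not the code from the sample project: it assumes the promise-based `phantom` wrapper from npm, where `page.open(url)` resolves to a status string and `page.property('content')` resolves to the rendered HTML. The names `fetchRenderedHtml` and `settleMs` are made up for illustration.

```javascript
// Resolve after `ms` milliseconds.
function delay(ms) {
  return new Promise(function (resolve) {
    setTimeout(resolve, ms);
  });
}

// Open `url`, wait `settleMs` for follow-up API calls to finish, then
// return the rendered HTML.
// Assumption: `page` behaves like a page object from the `phantom` npm
// wrapper, i.e. open(url) resolves to 'success' or 'fail' and
// property('content') resolves to the current HTML of the page.
async function fetchRenderedHtml(page, url, settleMs) {
  const status = await page.open(url);
  if (status !== 'success') {
    throw new Error('Failed to load ' + url + ' (status: ' + status + ')');
  }
  await delay(settleMs);
  return page.property('content');
}
```

With the real wrapper, `page` would come from `phantom.create()` followed by `instance.createPage()`. A fixed delay is simple but fragile: too short and the content is missing, too long and the function runs for longer than it needs to.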

Once we have the HTML of the page, I want to extract all the URLs for individual jobs. Amazon is kind enough to place an HTML class on each of these links. Using Cheerio, we can parse the HTML, use jQuery selector syntax to get the DOM elements we’re interested in, then read the href attribute of each one.

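var cheerio = require('cheerio');

// Parse the rendered HTML captured from Phantom
var $ = cheerio.load(html);
var urls = [];

// Collect the href of every element with the job-link class
$('.job-link').each(function () {
  var url = $(this).attr('href');
  if (url) {
    urls.push(url);
  }
});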

Now that we have all the URLs, we can make a request for each of these, then use Cheerio again to get the main content section.
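The loop over the URLs might look something like this. It is only a sketch under a few assumptions: the hrefs may be relative, so they are resolved against the page they came from, and `fetchHtml` and `extractContent` are hypothetical helpers passed in by the caller (for example, a Phantom or plain HTTP fetch, and a Cheerio lookup of whatever selector holds the main content).

```javascript
// Turn the extracted hrefs into absolute, de-duplicated URLs.
// Assumption: hrefs may be relative to `baseUrl`, the page they came from.
function toAbsoluteUrls(hrefs, baseUrl) {
  const seen = new Set();
  for (const href of hrefs) {
    seen.add(new URL(href, baseUrl).href);
  }
  return Array.from(seen);
}

// Visit each URL in turn and collect the extracted content.
// `fetchHtml(url)` should resolve to an HTML string and
// `extractContent(html)` should return the text we care about;
// both are hypothetical helpers supplied by the caller.
async function scrapeJobs(urls, fetchHtml, extractContent) {
  const jobs = [];
  for (const url of urls) {
    try {
      const html = await fetchHtml(url);
      jobs.push({ url: url, content: extractContent(html) });
    } catch (err) {
      // One failed page should not abort the whole run.
      console.error('Skipping ' + url + ': ' + err.message);
    }
  }
  return jobs;
}
```

Requesting the pages one at a time is slower than firing them all at once, but it is gentler on the site being scraped and keeps memory use predictable inside Lambda.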

To deploy to AWS Lambda, we need a ZIP package containing the project and its production dependencies. The aws-sdk package is available in the Lambda environment by default, so we can exclude it. The build.sh script in this project creates the ZIP package we need.

Once we deploy the ZIP from S3 to Lambda, we need to modify a few settings in Lambda. By default, Lambda uses a timeout of 3 seconds and assigns 128 MB of memory to the process. We can increase the timeout to 2 minutes and raise the memory allowance to 1024 MB.
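These settings can be changed in the Lambda console. As an alternative, here is a rough sketch of doing the same thing from code. It assumes `lambda` is a Lambda client from the AWS SDK for JavaScript (v2), such as `new AWS.Lambda()`, and the helper names are invented for this example.

```javascript
// Build the parameters for Lambda's UpdateFunctionConfiguration call.
// Timeout is given in seconds and MemorySize in MB.
function buildLambdaSettings(functionName) {
  return {
    FunctionName: functionName,
    Timeout: 120,     // 2 minutes
    MemorySize: 1024  // MB
  };
}

// Apply the settings using the supplied client.
// Assumption: `lambda` is an AWS SDK v2 Lambda client, whose request
// objects expose .promise().
function applyLambdaSettings(lambda, functionName) {
  return lambda
    .updateFunctionConfiguration(buildLambdaSettings(functionName))
    .promise();
}
```

Calling `applyLambdaSettings(new AWS.Lambda(), 'job-scraper')` would update a function named `job-scraper`, which is a placeholder name here.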

With the content of these jobs, I can generate the relevant keywords, which I use to show a more useful preview of each job advertisement.

To keep the search site fresh, we want to run this app on a regular basis, pulling in new job advertisements each time. AWS Lambda allows us to define a schedule for when the app should run. An app like this is well suited to AWS Lambda, as we only pay for the time the app is executing, and that time is well within the AWS Free Tier.
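As a rough illustration of the scheduled entry point, the sketch below wraps a scrape routine in a Lambda-style handler. It assumes a Node.js runtime that supports async handlers, and `runScrape` is a hypothetical function standing in for the scraping steps described above. The schedule itself lives on the rule that triggers the function, not in the code.

```javascript
// Example schedule expressions for the rule that triggers the function:
//   rate(1 day)          run once every day
//   cron(0 12 * * ? *)   run every day at 12:00 UTC

// Wrap a scrape routine in a Lambda-style handler.
// `runScrape` is a hypothetical function that resolves to an array of jobs.
function createHandler(runScrape) {
  return async function handler(event) {
    // Scheduled invocations arrive with source "aws.events".
    const scheduled = Boolean(event && event.source === 'aws.events');
    const jobs = await runScrape();
    console.log('Scraped ' + jobs.length + ' jobs (scheduled: ' + scheduled + ')');
    return { scheduled: scheduled, jobCount: jobs.length };
  };
}
```

In the Lambda entry file, the result would be exported with something like `exports.handler = createHandler(runScrape);`.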