Nightmarishly good scraping with Nightmare.js and async/await

Ændra Rininsland · Published in Journocoders · 13 min read · Dec 5, 2016

Website scraping is a fact of life if you’re doing data journalism, and the traditional wisdom has always been to use something like Python’s Beautiful Soup library to parse the pages you request. Ladies and gentlemen, let me tell you — the traditional wisdom is wrong! Use Nightmare.js!

“Why?”, you ask? There are ultimately two ways you can write code to scrape a webpage:

  1. Make a request against a parameterized URL, retrieve the HTML at that address, then parse it somehow. Extract useful data and links for the next request; decipher the URL structure so you can run it as a loop (there’s a rough sketch of this approach just after the list).
  2. Automate a web-browser to make requests and navigate through a site, extracting data from each page.
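
For contrast, here’s roughly what approach 1 looks like in Node. The request and cheerio packages (and the URL) below are purely illustrative and aren’t used anywhere else in this tutorial:

const request = require('request');
const cheerio = require('cheerio');

// Fetch a (made-up) parameterized URL, then parse the returned HTML
request('https://example.com/products?page=1', (err, res, html) => {
  if (err) throw err;
  const $ = cheerio.load(html);
  // Pull out whatever you need with CSS selectors
  $('.product-title').each((i, el) => console.log($(el).text()));
});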

Nightmare.js takes the latter approach, and even though it seems similar to the first, there is one significant advantage: because it’s an actual web browser (it’s based on Electron, the same Chromium-derived framework that powers the Atom text editor) that can submit forms, run JavaScript and save cookies, your scraper will bypass a lot of the annoying code that can otherwise trip it up. Not only that, it means websites that render mostly on the client side are still scrape-able — if you’ve ever been thrown by needing to make an AJAX request to return a form in your scraping, today is your day to be awesome!

Gif via http://giphy.com/gifs/KlJ4wKeb8VMqs. Hey, want to know what a totally not awesome thing to do before your first coffee? Search “nightmare” on Giphy. 😱

“But Ændrew!”, you say. “JavaScript is really annoying and all that asynchronous single-threaded stuff is really hard to keep track of while writing simple scrapers!”

And you’re not wrong! It can be kind of annoying and frustrating! But today we’re using Node 7 with native async/await, which dramatically simplifies writing code to manage a big series of requests! This stuff is so new you had to use Babel to get it to work until recently—let’s live dangerously and use the latest version of NodeJS to do awesome things!

Install NodeJS 7

First we need to install Node 7.x.

On OS X, I prefer n for managing NodeJS versions. Install via Homebrew:

$ brew install n

Then install the absolute latest, most cutting-edge version of NodeJS by simply typing:

$ n latest

💥 You’re done! 💥 Skip forward!*

*Oh, you don’t have Homebrew? Uh, you’ll need to install Xcode and a bunch of other crap to get that to work, which you might just not have time for right now. Just go to the NodeJS website and use the installer. Choose the version on the right — I’m using 7.2.0, though it’s quite likely that won’t be the version listed there for longer than the next fifteen minutes. You want Node 7+ at any rate.

On Windows, I think Nodist might be your best bet? Check it out. I haven’t used Windows in like a decade though, so you probably know better than me. Also, if this all works in Windows, just know I’ll be hella impressed with myself (alas, if not, please leave a comment 😭).

Initialise a new package

Okay, let’s create a new directory and initialise a package.json file there. If you’re new to NodeJS and npm, package.json is used to manage dependencies and download stuff. It’s neato.

$ mkdir scraping-project
$ cd scraping-project
$ npm init -y

You should now be in a new directory called scraping-project/ containing a default package.json file. The last command (npm init -y) generated the latter, if you’re curious.

Let’s install Nightmare and get ready to scrape! We’ll also install d3-dsv so we can easily write our data out to CSV at the end:

$ npm install nightmare d3-dsv --save

This will download the above to the node_modules/ folder within the current directory. You can also just clone this repo and run npm install to get the completed tutorial.

Map your scraping process

It’s worth planning your scraper’s workflow before you code anything—five minutes of planning is worth an hour’s worth of code. Generally, a scraping process is broken down into several parts, but usually it starts with some sort of unique identifier. This might be a product code, a company code, a postal code or some other identifier that is granular enough to let your scraper pick out the bits you need while it runs.

In our example, we’re going to input land registry title numbers taken from a CSV file into the UK Land Registry to get the corresponding address. We’re then going to output that as a CSV for later analysis. We may have gotten these original identifiers from some other process — for instance, they may have been compiled from reading through reams of legal documents, or even from another scraper.

First, download the data from here:

https://gist.githubusercontent.com/aendrew/76f3fae770e647ae7b0ea865ba0c4418/raw/eb047d0b74cc539a7fa873b1594e28e1dc310f6f/tesco-title-numbers.csv

Save it as tesco-title-numbers.csv in your project directory.

We’re ready to start coding! Create a new file in your text editor of choice — Imma call mine index.js because I’m a free-spirited, fun-loving individual, but you can call yours whatever.

In your new .js file, write the following:

const { csvFormat } = require('d3-dsv');
const Nightmare = require('nightmare');
const { readFileSync, writeFileSync } = require('fs');
const numbers = readFileSync('./tesco-title-numbers.csv',
  { encoding: 'utf8' }).trim().split('\n');
console.dir(numbers);

This imports a few things, including csvFormat from d3-dsv (which we’ll use later to output our results as CSV), Nightmare itself, and a few functions from the NodeJS filesystem (fs) module. We then use readFileSync() to load in our CSV file and convert it to an array by splitting the resulting string on the newline (\n) character. Note that if this were a real CSV file (i.e., with a header row and whathaveyou) we could have parsed it into an array using csvParse() from the d3-dsv package, but given how simple our input is, that would be overkill in this instance. Lastly, we output the resulting array using console.dir().
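
For reference, if your input did have a header row, parsing it with csvParse() would look something like this; the file name and column name below are hypothetical and aren’t used in this tutorial:

const { csvParse } = require('d3-dsv');
const { readFileSync } = require('fs');

// Hypothetical file with a header row, e.g. "title_number,store_name"
const raw = readFileSync('./some-headered-file.csv', { encoding: 'utf8' });
const rows = csvParse(raw); // Array of objects keyed by the header row
const titleNumbers = rows.map(row => row.title_number);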

In your project directory, run:

$ node index.js

…And you’ll see an array of property title numbers. Fantastic, we’re well on our way.

The Nightmare Cometh!

Next thing to do is start scraping using Nightmare. We’re going to run a loop to plug each of the title numbers into the form located at:

https://eservices.landregistry.gov.uk/wps/portal/Property_Search

This site is hilariously coded and has totally unparseable URLs. The second we navigate anywhere, that already-long URL turns into a mess that looks like:

https://eservices.landregistry.gov.uk/www/wps/portal/!ut/p/b1/04_Sj9CPykssy0xPLMnMz0vMAfGjzOKNjSxMDA1NjDwsjM3MDTxN3dyNDUNMjQ1MjPWDU_P0C7IdFQG9k5Tz/

That’s not even a serious query, that’s just the search form. Good luck figuring out how to automate a query using just the URL! Why’s there an exclamation mark in the middle of that? What is the significance of that really long indecipherable string? We’re using Nightmare.js, though, so who cares? It doesn’t matter.

Let’s write a function that gets the address from the land title record and returns it as an object. This is going to get a bit weird if you’ve written JavaScript before but haven’t done much with promises, so please bear with me.

We’re going to write an async function, which is why I made you jump through the hoops of installing Node 7. In Node 7, async/await still sits behind the --harmony flag, so we invoke Node with the following syntax:

$ node --harmony <file>

Marking a function as async lets the Node interpreter know the function will always return a promise — that is, a value that may eventually be resolved. The cool bit about this is that we can await other promises, and write code that acts normally even while a bunch of asynchronous processes are firing. I’ll write another post shortly going into more detail about this all and why it’s awesome, but suffice to say for the moment that you need to pay attention to this because Nightmare returns promises for everything.
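
If you haven’t seen the syntax before, here’s a tiny, contrived example; the delay() helper is made up purely for illustration:

// A made-up helper that resolves after a given number of milliseconds
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

const example = async () => {
  console.log('Waiting…');
  await delay(1000); // Execution pauses here until the promise resolves
  console.log('…done, one second later!');
};

example(); // Returns a promise, like every async function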

Remove the “console.dir” line and replace it with the following:

const START = 'https://eservices.landregistry.gov.uk/wps/portal/Property_Search';

const getAddress = async id => {};

Here we assign our starting point and our getAddress() asynchronous function. If you’re new to ES6, the => bit means it’s an arrow function, which lexically binds this to the enclosing scope (not that it matters here, because we won’t be using the this keyword anywhere; I mainly use the “fat arrow” for the sake of brevity). We’re going to fill out that function by writing code between the curly braces:

const getAddress = async id => {
  console.log(`Now checking ${id}`);
  const nightmare = new Nightmare({ show: true });

  // Go to initial start page, navigate to Detail search
  try {
    await nightmare
      .goto(START)
      .wait('.bodylinkcopy:first-child')
      .click('.bodylinkcopy:first-child');
  } catch(e) {
    console.error(e);
  }
};

We’ve added a line to log which identifier we’re on and initialised Nightmare using the new keyword, to which we pass an options object. In this instance, we’ve set the show property to true so we can monitor what the browser is doing while it works.
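
show isn’t the only option you can pass here. For instance, Nightmare also accepts a waitTimeout setting, which controls how long .wait() hangs around before throwing (30 seconds by default). We stick with the defaults in this tutorial, but an options object might look like:

// Not used in this tutorial, just an illustration of the options object
const nightmare = new Nightmare({
  show: true,        // Display the Electron window while scraping
  waitTimeout: 30000 // How long .wait() waits before throwing (ms)
});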

We then do our first action, which is wrapped in a try/catch block. One cool thing about async/await is that you handle errors using try/catch blocks, which is a pleasant and straightforward way of handling things when your code fails for whatever reason (and, if you’ve never scraped before — trust me, it will, because web servers are the worst, especially if they’re managed by small government bodies). In this instance, we just log the exception, since that’s enough to tell us what happened.

Inside of our try block, we have Nightmare finally do some actual work. We use the await keyword to ensure the promise returned by Nightmare is resolved before continuing, then tell it to go to our starting point, wait for the first element with the .bodylinkcopy class to appear, then click that element. That will take us to the “detailed search” page, which we need for entering the property title number.

Next we need to enter our property title number value into the correct search field. Add the following after the last try/catch block, inside our getAddress() function:

  // Type the title number into the appropriate box; click submit
  try {
    await nightmare
      .wait('input[name="titleNo"]')
      .type('input[name="titleNo"]', id)
      .click('input[value="Search »"]');
  } catch(e) {
    console.error(e);
  }

Much like the last action, this awaits Nightmare to resolve the promise returned by the queue of actions we’ve given it. First we tell it to wait for the correct input field to render on the page. Then we tell it to type in our property title id, which is the sole argument to getAddress(). Finally, we click the submit button to initiate our search.

Okay, here’s where things get challenging. We’re now at a results page, and need to extract the proper data from the page. Add this new try/catch block after the last one:

  try {
    const result = await nightmare
      .wait('.w80p')
      .evaluate(() => {
        return [...document.querySelectorAll('.w80p')]
          .map(el => el.innerText);
      })
      .end();

    return { id, address: result[0], lease: result[1] };
  } catch(e) {
    console.error(e);
    return undefined;
  }

Here we wait until a particular element has rendered on the page, then run a function on that page. This is where it gets weird — the JavaScript code in the .evaluate() callback is run in the execution context of the browser. That means it runs in the same environment the page’s own JavaScript runs in — the code running in the browser otherwise has no knowledge of your Nightmare code. You can even inject entire JS/CSS files into the page if you so desire. What we do is use document.querySelectorAll to retrieve all of the oddly-named .w80p elements, of which there are two: the address and the lease type, both of which are useful pieces of data. Because document.querySelectorAll returns a NodeList, which doesn’t have Array.prototype.map available to it, I use the ES6 spread operator inside of an array literal to convert it to one. I then use .map() to return an array containing the inner text of each element. This array is what gets assigned to result when the sequence resolves.

If you supply a value after the callback function in .evaluate(), you can pass in arbitrary variables like this:

.evaluate(outer => {
  console.log(`This is from the outer scope: ${outer}`);
}, variableFromOuterScope);

Again, the Electron instance Nightmare is running isn’t really aware of your Node script, so any variables defined elsewhere in your scope won’t otherwise be available inside .evaluate(). That’s admittedly sort of antithetical to how scopes normally work in JavaScript — just remember that the scope inside the .evaluate() callback is entirely separate from that of the rest of the script.

After our .evaluate() call, we call .end() to close the browser window and free up memory. Assuming everything worked well, the results from .evaluate() will be assigned to result and we can then return that as part of an object.
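
As mentioned above, Nightmare can also inject whole local files into the page via its .inject() method, which takes a type ('js' or 'css') and a file path. Here’s a quick fragment for illustration only, to go inside an async function like getAddress(); the file paths are hypothetical:

// Hypothetical paths; neither file is part of this tutorial
await nightmare
  .goto(START)
  .inject('js', './helpers/in-page-helpers.js')  // Inject a local JavaScript file
  .inject('css', './helpers/scraper-styles.css') // Or a local stylesheet
  .evaluate(() => document.title);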

Let’s try it with just one land registry number. Put the following after our function:

getAddress(numbers[0])
  .then(a => console.dir(a))
  .catch(e => console.error(e));

Remember, getAddress() is an async function, which means it returns a promise. Promises all have a .then() method that is called once the promise has resolved, and a .catch() method for handling errors. Because we can only use await inside an async function (which the main script is not), we use normal promise handling here, just echoing the result out to our console. Broadly speaking, .then() is the non-async equivalent of await, and .catch() is what you use instead of try/catch outside an async function. They ultimately work a bit differently, but that explanation will suffice for now.
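
To make that equivalence concrete, here’s the same call written both ways (the second wraps the await in a throwaway async function, sometimes called an async IIFE):

// Promise style, as above
getAddress(numbers[0])
  .then(a => console.dir(a))
  .catch(e => console.error(e));

// async/await style, wrapped in an immediately-invoked async function
(async () => {
  try {
    const a = await getAddress(numbers[0]);
    console.dir(a);
  } catch (e) {
    console.error(e);
  }
})();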

Time to try running our scraper:

$ node --harmony index.js

Cool! That seems like usable data!

Time to do things as a sequential loop. Here we encounter another slight difficulty with promises, which is the fact that they aren’t blocking. This is actually a really good thing, but in this particular instance we don’t want to fry the Land Registry’s servers by hitting them with all of our requests simultaneously. Luckily, we can also use promises to manage this for us:

const series = numbers.reduce(async (queue, number) => {
  const dataArray = await queue;
  dataArray.push(await getAddress(number));
  return dataArray;
}, Promise.resolve([]));

This is actually really cool, let me explain it:

numbers is an array of values. Because of that, we can call Array.prototype.reduce() on it. What happens here is we create a promise chain that initially resolves to an empty array, which we gradually add to as each of our getAddress() operations completes. Effectively, we’re adding a new operation to our queue of promises for each element in the array, then resolving them one after another.
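
If the reduce() version makes your head hurt, an equivalent and arguably more readable way to run the requests one at a time is a plain for…of loop inside an async function. This is just a sketch, not what the rest of the tutorial uses:

const scrapeAll = async () => {
  const results = [];
  for (const number of numbers) {
    // Each await finishes before the next request starts
    results.push(await getAddress(number));
  }
  return results;
};

Either way, you end up with a promise that resolves to an array of results.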

If you try to log the resulting series variable to console, you’ll just get a pending promise. That’s because console.log() runs immediately after the .reduce() operation, before any of the promises have resolved. Instead, what we need to do is call .then() on our series variable so we get our array of results once the queue finishes.

series
  .then(data => {
    const csvData = csvFormat(data.filter(i => i));
    writeFileSync('./output.csv', csvData, { encoding: 'utf8' });
  })
  .catch(e => console.error(e));

Here we first filter out any undefined array elements (caused by a failed iteration of our scraper) using an identity function, and then use csvFormat() to reformat our array as a CSV string so we can write it out, which we do with writeFileSync(). Run our scraper again with:

$ node --harmony index.js

…And you’ll see Nightmare slice through that list of values like it’s hot butter!

You’ll notice at least one of those values doesn’t actually work — yet, because we’ve gracefully handled exceptions, Nightmare waits for 30 seconds (the default .wait() timeout) and then moves on to the next request without breaking a sweat. We could, however, get it to do something else, like retrying the request at a different URL, if we needed to improve the accuracy of our scraper.
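
For example, a simple (and entirely hypothetical) way to add retries would be to wrap getAddress() in a helper that tries a couple of times before giving up; nothing below is part of the tutorial code:

// Hypothetical helper, not part of the tutorial code
const getAddressWithRetry = async (id, attempts = 2) => {
  for (let i = 1; i <= attempts; i++) {
    const result = await getAddress(id);
    if (result) return result; // getAddress() returns undefined on failure
    console.log(`Attempt ${i} of ${attempts} failed for ${id}`);
  }
  return undefined;
};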

After a few minutes of working through the list, the script will end and you’ll have a shiny new .csv file containing the addresses corresponding to each property title reference number! Tada! 🎉

What’s the next step? That depends entirely on what sort of investigation you’re doing. One option that might be fun given our new dataset is to geolocate it using Google Fusion Tables. Here’s what that looks like:

Here are the locations mapped in Google Fusion Tables. All I had to do was set the address column as a location, which I then geocoded using File » “Geocode…”. You can view the finished Fusion Table here.

Excepting the one point that somehow geolocated in India, it seems like most of the locations we mapped are near Southampton.

That’s all there is to it. Nightmare can do a fair bit more — for instance, it can listen to events and react accordingly, making it possible to scrape things like websocket-based chat systems, and it can even take screenshots while it works (or as part of exception handling, letting you see exactly where a scraper task failed). Read the documentation for full details.

Other notes:

  • Nightmare’s .wait() method is really key to doing things intelligently, particularly after clicking a link or navigating in some way. It can take a CSS selector, a number of milliseconds or a function as an argument (that last option working like .evaluate(), as per the explanation earlier; see the snippet after this list).
  • In this example, we wait until everything is done and then pass that to a single function to do all the writing. In reality, this is incredibly prone to error, as it means the entire queue must complete before anything is saved. In practice, you probably want logic that writes to disk after every completed operation and skips already-completed tasks if you need to restart the scraper (there’s a rough sketch of this after the list, too).
  • Set show to false to hide the Electron window. It can be kind of annoying keeping it open as it steals cursor focus and prevents you from doing anything else while the scraper runs, but having it open can be helpful when first writing your scraper as it lets you monitor what Nightmare is doing.
  • You can use Bluebird or some other promise implementation if you prefer, but I’ve generally not found it to be more worthwhile than the native ES6 implementation.
  • You can also use ES7 generators with Nightmare, but that’s way less fun than using async/await as detailed above.
  • Always consider the potential damage a scraper might do and balance that against the public interest. If you hit a website with 5000 simultaneous requests, it might appear to be a Denial-Of-Service (DoS) attack and you may have some ‘splainin’ to do.
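
Regarding the .wait() variants mentioned above, the three forms look like this (a fragment for illustration only; the predicate function runs in the page, like .evaluate()):

.wait('.bodylinkcopy:first-child') // Wait for a matching element to appear
.wait(2000) // Wait 2,000 milliseconds
.wait(() => document.readyState === 'complete') // Wait until this returns true in the page

And here’s a rough, hypothetical sketch of the write-as-you-go approach. It’s deliberately simplified (it appends one line of CSV per result rather than using csvFormat()), and none of these helpers exist in the tutorial code:

const { appendFileSync, existsSync, readFileSync } = require('fs');

// Skip numbers we've already scraped on a previous (possibly crashed) run
const done = existsSync('./output.csv')
  ? readFileSync('./output.csv', { encoding: 'utf8' })
      .split('\n')
      .map(line => line.split(',')[0])
  : [];

const scrapeRemaining = async () => {
  for (const number of numbers) {
    if (done.includes(number)) continue; // Already saved, skip it
    const result = await getAddress(number);
    if (!result) continue; // Failed request, nothing to write
    // Append one row per completed request so a crash loses nothing
    appendFileSync(
      './output.csv',
      `${result.id},"${result.address}","${result.lease}"\n`,
      { encoding: 'utf8' }
    );
  }
};

scrapeRemaining().catch(e => console.error(e));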

Ændrew Rininsland is the author of Data Visualization with D3.js, 2nd edition from Packt Books and a newsroom developer at the Financial Times.
He tweets as @aendrew.

Many thanks to Leila Haddou for suggesting this piece and reading an early draft, as well as providing the land registry title number data set. Thanks also to Matt Brennan for suggesting an improvement to the promise queue code. This tutorial was originally written for the Journocoders December meetup.

