All I wanted to do is scrape in JavaScript…

Edit: This isn’t a tutorial, just a tale of JavaScript tribulations. If you’re trying to scrape in JavaScript I’d recommend puppeteer.

All I wanted to do was learn some web scraping in JavaScript. A month or so and several iterations of my scrape script later, I’ve learned a few things. I’m recording them here, for myself and for anyone else who may travel down this path.

So what kind of data should I scrape? Fish data! I’m interested in fish, I work part time at an aquarium, and when I was a kid I used to breed platies, a common kind of livebearing beginner fish.

A female blue mickey mouse platy

So where to get data? Well, in my platy breeding days, the place to get high quality, unique breeds was aquabid.com. Cool thing, I just checked, it still is. Aquabid is basically eBay for fish. So I can scrape the tables there to get info on how much different kinds of fish cost, and then hopefully visualize that data across time.

First I tried Cheerio (jQuery on the backend) and request (an HTTP client package from npm) in Node. That was pretty easy to set up. So I scraped the aquabid pages for the current auctions. But that really wasn’t the data I wanted. Anyone can list any fish at any price, so asking prices say little about what fish actually sell for. What I really wanted was prices of closed auctions. Especially ones that had been sold.

Unfortunately, there isn’t a specific url for closed auctions per fish. To get info on actual sales, you must submit a form. I couldn’t see how to do this with my node/cheerio/request setup, since all it did was make a request to a given url and then parse the response.
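
Part of why the form matters: submitting it sends the selected fields along with the request, which a plain fetch of the page url does not do. Here is a rough sketch of that payload. The `category` and `DAYS` field names come from the selectors used in the Casper snippet further down; the `buildFormBody` helper and the `exampleCategory` value are made up for illustration, and whether the real form uses GET or POST is not confirmed here.

```javascript
// Build the URL-encoded body a browser would send when submitting a form.
// The values passed in below are placeholders, not real Aquabid category codes.
function buildFormBody(fields) {
  var params = new URLSearchParams();
  Object.keys(fields).forEach(function (name) {
    params.append(name, fields[name]);
  });
  return params.toString();
}

var body = buildFormBody({ category: 'exampleCategory', DAYS: '1' });
console.log(body); // category=exampleCategory&DAYS=1
```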

In order to interact with a webpage, clicking and submitting things, we need a scriptable browser, preferably a headless one (basically a browser without a window). This way we can program the browser, grab the info we need, and eventually save it to a database.

I originally chose Phantomjs, a headless WebKit browser that can be scripted in JavaScript, and, importantly, Casperjs, an abstraction layer on top of Phantom that helps write shorter, cleaner code for the kinds of clicking and navigation I’ll do. Also note the ghostly naming convention for these tools! Cute!

Given that I wanted to iterate through many kinds of fish, I made an array and used the casper.each method to move through each fish, setting the fish type and number of days in the past to view. Luckily, that submit form is present on each closed auction page, so I don’t have to navigate back to the base “closed auction” page to start on the next fish.

casper.each(fishArray, function(self, currentFish){
    /*
     * .evaluate is not async, so it must be wrapped in a .then
     * .then is STEP Async, they're executed one after the other
     */
    casper.then(function(){
        console.log("Getting: " + currentFish);
        // Change the drop down selections
        casper.evaluate(function(currentFish) {
            $('select[name="category"]').val(currentFish).change();
            $('select[name="DAYS"]').val('1').change();
        }, currentFish);
    });
    ...

I had been initially writing my results to a new file on my hard drive.

fs.write(pathToFolder+soldPath+fish+outputFormat, JSON.stringify(soldJSON, null, 4), 'w')

And everything was hunky dory. I even spun up a little node server to then display the sales data in a d3 chart.

But here’s the thing, physical drives aren’t cool anymore man. You gotta have a database in the Cloud mannnnnn. Plus also, I didn’t want to manage my own database, because I don’t know how. So I tried this thing called Firebase.

Firebase does a bunch of stuff, but for me, it handles storage and retrieval of my data. So at the end of my scrape, within Phantom, I send off a POST request to my Firebase url:

casper.then(function(){
    console.log("Sending to Firebase...");
    // Open the url for the database
    casper.thenOpen("https://aquascraper-data.firebaseio.com/test.json?auth="+deets.deets+"&debug=true",{
        method: "post",
        data: JSON.stringify(allFish),
        headers: {
            auth : "xxxxxx",
        },
        contentType : 'application/json',
        dataType: 'json',
    },function(response){
        casper.echo("POSTED TO Firebase: "+JSON.stringify(response));
    });
});

And then to retrieve my data, I grab a reference to my public url and interact with Firebase through the provided library. Managing a private database needs authentication, and that requires running a server. For now, I can just read my data straight from my database.

Storage view for Firebase

Data is structured as a big JSON object. Each nonsense text block is a key, with a value of my submitted data for that push to Firebase. I will restructure this later, so that each key represents an individual day.

Great! Wasn’t the easiest trip to datatown, I had to ask some of the Firebase people on slack how to get it to play nice with Phantom, but hey I got there!

So now, to make this a regular scrape, without me having to lift a finger, I wanted to put my script on Heroku. Cool cool, I’ve used Heroku before I can just pop it in there… oh wait… This isn’t a Node.js app, it’s a Phantom one.

Hmm ok, so there are these things called buildpacks that define what kind of app Heroku needs to run. There are official ones for Node, Ruby, Scala etc., and you can also make your own or add on to existing ones. Luckily, there’s already one for Phantom and Casper. Great, I can just pop it in there!

Hmm, for some reason my app keeps crashing… I didn’t have this issue on my local version. Let me check the logs…Memory exhaustion???? What? Why?

Turns out, Phantomjs has this classsssssssssic memory leak issue, where it does not free memory when closing a page. This is actually a QtWebKit issue. But check this legendary Github thread:

And this one where the guy just screams at the Phantom maintainer.

https://github.com/ariya/phantomjs/issues/14143

So great. Phantomjs leaks a bunch of memory on every page open. I open about 40 pages, which eventually pushes RAM usage way past Heroku’s limit of 512 MB. I just hadn’t noticed when running my script locally. Windows isn’t the best way to profile memory usage, but there’s over 1 GB being used here, just to load some pages, scrape them and store a few KB of data! That’s crazy.

Phantom being a fat leaker

The eloquent Suhar777777 mentioned SlimerJS (Slimer, being one of the ghosts from Ghostbusters) though. So that’s what I investigated next.

SlimerJS is another scriptable web browser, this one running on Gecko (the engine that powers Firefox), and unlike Phantomjs it is not headless. This means a little browser window pops up when you run your script. Kinda cool to see actually! There are a few more differences between the two, but luckily, Casperjs is compatible with both. A few small tweaks had to be made to get the script running, but after that:

casperjs scrape.js --engine=slimerjs

And we’re back on our feet again! This time with a much more reasonable memory consumption, hovering around 300 MB. Well under the 512 MB max for Heroku.

Ok, got to put it on Heroku. Now for that buildpack issue again… There are a few SlimerJS buildpacks, but they all use the deprecated buildpack-multi buildpack. Heroku now natively supports multiple buildpacks, but here’s the thing… I don’t know how to make a buildpack. After trying three different buildpacks, and never getting one to successfully compile on Heroku, I gave up and went home for the day.

Having noted that Node can control Phantom, I set off down another rabbit hole. Let’s see if I can get Node to run Phantom and then place both of those scripts in Heroku. Node can pass in the current fish I’m getting the data for, start and scrape with Phantom and then store the stdout as the data. Something like:

var exec = require('child_process').exec;

// Collected results, keyed by fish
var allAuctions = {};

var promises = fishArray.map(function(currentFish) {
    return new Promise(function(resolve, reject) {
        exec('casperjs turnitoffandon.js',
            {
                // env replaces the child's whole environment, so merge in
                // process.env to keep PATH and the rest
                env: Object.assign({}, process.env, {
                    'currentFish': currentFish
                }),
            },
            function(err, stdout, stderr) {
                if (err) {
                    console.log('ERROR. err was:' + err);
                    console.log('ERROR. stderr was:' + stderr);
                    return reject(err);
                } else {
                    console.log('success stdout was:' + stdout);
                    allAuctions[currentFish] = JSON.parse(stdout);
                    //console.log(allAuctions);
                    resolve();
                }
            }
        );
    });
});

Here we’re using child_process’s exec function, which lets you run any other command, in any language or runtime. Like Phantom! We pass in the fish as an environment variable, and then handle the result in a callback. exec itself doesn’t return a promise, so each call is wrapped in one, and I wait until all of the Phantom/Casper scripts have returned before sending the data off to Firebase. Like so:
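
One detail of the `env` option that is easy to miss: it replaces the child’s environment rather than adding to it, so variables like `PATH` are not inherited unless they are merged in. Below is a small sketch of the difference. It uses `execFileSync` and the current Node binary as a stand-in for the Casper script, and the `describeEnv` helper, the `PARENT_MARKER` variable and the fish name are all made up for illustration.

```javascript
var execFileSync = require('child_process').execFileSync;

// Child script: report which variables it can see.
var childSource =
  'console.log(JSON.stringify({' +
  ' fish: process.env.currentFish || null,' +
  ' marker: process.env.PARENT_MARKER || null }))';

// Hypothetical helper: run the child with the given environment.
function describeEnv(env) {
  var out = execFileSync(process.execPath, ['-e', childSource], { env: env });
  return JSON.parse(out.toString());
}

// A variable set in the parent, standing in for PATH and friends.
process.env.PARENT_MARKER = 'from-parent';

// Passing only our own variable: the parent's variables are gone.
var replaced = describeEnv({ currentFish: 'platy' });

// Merging with process.env keeps them.
var merged = describeEnv(
  Object.assign({}, process.env, { currentFish: 'platy' })
);

console.log(replaced); // { fish: 'platy', marker: null }
console.log(merged);   // { fish: 'platy', marker: 'from-parent' }
```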

var request = require('request');

Promise.all(promises)
    .then(function() {
        console.log("allAuctions are done." + JSON.stringify(allAuctions));
        console.log("Sending to Firebase...");
        var options = {
            method: 'post',
            body: allAuctions,
            json: true,
            url: "https://aquascraper-data.firebaseio.com/test.json?auth="+deets+"&debug=true"
        };
        request(options, function (err, res, body) {
            if (err) {
                console.error('error posting json: ', err);
                throw err;
            }
            var headers = res.headers;
            var statusCode = res.statusCode;
            console.log('headers: ', headers);
            console.log('statusCode: ', statusCode);
            console.log('body: ', body);
        });
    })
    .catch(function(err) {
        // One of the Casper scripts failed, so nothing gets posted
        console.error('scrape failed: ', err);
    });
Unfortunately, as is the nature of Node, this is all asynchronous, which means this spawns all 40ish Casper scripts at once. This actually slowed my computer down, and was definitely not going to fly on Heroku either.

So what’s the solution? Go against the Bible of Node, and make these calls synchronous! Using execSync within a forEach loop, slowly turning Phantom off and on, storing each execSync’s return buffer in a variable, then finally sending a post request from Node. This made for some pretty sloppy code, fighting against the asynchronous nature of Node. But I did finally get it working.
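
The synchronous version might look roughly like the sketch below. It is not the actual aquascraper code: the `scrapeOne` helper and the fish names are hypothetical, and the current Node binary stands in for `casperjs turnitoffandon.js` so the example runs anywhere. `execFileSync` is used instead of `execSync` only to avoid shell quoting in the stand-in command. Each call blocks until the child exits, so only one child process is alive at a time.

```javascript
var execFileSync = require('child_process').execFileSync;

var fishArray = ['platy', 'guppy', 'molly']; // placeholder names

// Stand-in for the Casper script: prints a JSON result for one fish.
var childSource =
  'console.log(JSON.stringify({ fish: process.env.currentFish, sold: [] }))';

// Hypothetical helper: run one scrape and return its parsed stdout.
function scrapeOne(currentFish) {
  var stdout = execFileSync(process.execPath, ['-e', childSource], {
    // Merge with process.env so the child keeps PATH and friends.
    env: Object.assign({}, process.env, { currentFish: currentFish })
  });
  return JSON.parse(stdout.toString());
}

var allAuctions = {};
fishArray.forEach(function (currentFish) {
  // Blocks here until this child has exited, then moves to the next fish.
  allAuctions[currentFish] = scrapeOne(currentFish);
});

console.log(Object.keys(allAuctions)); // [ 'platy', 'guppy', 'molly' ]
```

The trade-off is the one described above: the whole Node process sits idle while each child runs, which is fine for a scheduled script with nothing else to do.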

Now it’s time to put it up on Heroku. Unfortunately, I ran into the buildpack problem again. This time, I’m trying to have 2 runtimes, plus the Casper library on top of Phantom. Here’s what that looked like:

The set of buildpacks I used to (almost) run a Node controlled casper script

Again, this seems sloppy and has too many moving parts. But I could run each individually: casperjs --version, phantomjs --version, and node -v all returned their versions. Unfortunately, running the node script would throw this error for not being able to find casperjs.

Error: Command failed: casperjs turnitoffandon.js
ERROR: stderr was:/bin/sh 1: casperjs: not found

I actually never solved that issue, and there’s an outstanding Stack Overflow question regarding it. I’m sure it’s just a matter of me not understanding the Heroku file system.

Just by happenstance, I stumbled into this StackOverflow thread. It mentions that turning image loading on prevents some of the leak. Previously I had been making sure image loading was disabled. You know, to save memory. Apparently I was wrong. Suddenly, by turning on the loading of images, I was saving hundreds of megabytes of leakage.

pageSettings: {
    // NO WAIT, DO LOAD IMAGES, for some reason, this prevents a worse memory leak
    loadImages: true,
    loadPlugins: false,
},

Now my original phantom script hovers around 300 MB and runs faster than any other iteration. Hooray! Now I can pop this back onto Heroku without memory issues.

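Now to schedule this puppy! Heroku has a nice scheduler add-on that will run your script regularly. Mine is set for 7:00 UTC, which is midnight on the Pacific coast during daylight saving time (7:00 UTC = 00:00 PDT; in winter it lands at 23:00 PST), and it sends my data to Firebase every day.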
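
Since the scheduler works in UTC, it is worth checking what a given UTC hour means on the Pacific coast in summer and in winter. Here is a quick sketch using `Intl.DateTimeFormat`. The `pacificHour` helper and the two sample dates are arbitrary choices for illustration.

```javascript
// Return the hour (0-23) in America/Los_Angeles for a given instant.
function pacificHour(date) {
  var parts = new Intl.DateTimeFormat('en-US', {
    timeZone: 'America/Los_Angeles',
    hour: 'numeric',
    hourCycle: 'h23'
  }).formatToParts(date);
  var hourPart = parts.filter(function (p) { return p.type === 'hour'; })[0];
  // % 24 guards against runtimes that report midnight as 24.
  return Number(hourPart.value) % 24;
}

var summerRun = new Date(Date.UTC(2017, 6, 15, 7, 0)); // 15 July, 07:00 UTC
var winterRun = new Date(Date.UTC(2017, 0, 15, 7, 0)); // 15 January, 07:00 UTC

console.log(pacificHour(summerRun)); // 0  (midnight, daylight time)
console.log(pacificHour(winterRun)); // 23 (11 pm the previous day, standard time)
```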

Now we’re on the Scheduler!

I’ve still got to define exactly the structure of the data on Firebase. For now I’ve just been sending each day, but I can probably structure it so that each month is its own path. That would be slightly easier to look at. And then eventually I’ll make a d3 powered site, displaying the fish sales data in a fun data visualization project!
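
One way to get a path per day or per month is to write to a date-based url with PUT instead of POST: with the Firebase REST API, POST creates a new child under an auto-generated key, while PUT writes to exactly the path given. Here is a minimal sketch of building such a url. The `sales/YYYY/MM/DD` layout, the `buildDayUrl` helper and the example database name are assumptions for illustration, not the structure the aquascraper ended up with.

```javascript
// Hypothetical helper: build a Firebase REST url for one day's scrape.
// baseUrl is the database root; authToken is the database secret.
function buildDayUrl(baseUrl, date, authToken) {
  var year = date.getUTCFullYear();
  // getUTCMonth() is zero-based, so add 1; pad both parts to two digits.
  var month = ('0' + (date.getUTCMonth() + 1)).slice(-2);
  var day = ('0' + date.getUTCDate()).slice(-2);
  var path = ['sales', year, month, day].join('/');
  return baseUrl + '/' + path + '.json?auth=' + encodeURIComponent(authToken);
}

var url = buildDayUrl(
  'https://example-db.firebaseio.com',
  new Date(Date.UTC(2017, 2, 5)),
  'SECRET'
);
console.log(url);
// https://example-db.firebaseio.com/sales/2017/03/05.json?auth=SECRET
```

Sending the day’s data to that url with `method: 'put'` means a repeated run on the same day overwrites the earlier write instead of adding another entry.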

The aquascraper can be viewed/downloaded/forked at https://github.com/nodes777/aquascraper

If you have any hints about getting Slimerjs running on Heroku, or about the Node controlled, missing Casperjs path on Heroku, I’m all ears!

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

EDIT: I had one final issue with this project. When running the script on Heroku Scheduler, I would wake up in the morning to dozens of scrape results being posted on my Firebase. Looking at the Heroku logs:

Script runs fine, exits and then…. crashes??

A quick Google will tell you ON WINDOWS SOMETIMES PHANTOMJS JUST DOESN’T EXIT. What??? Why?? How?? This is the most baffling issue I’ve run into yet. How do you put something like this out? It literally never ends, just eating away memory.

People have suggested a bunch of weird workarounds for this. For example, using casper.bypass(999) to bypass all the steps in the script, forcing an exit. But the golden goose award goes to maciejjankowski for cracking this one with the change your graphics card solution.

I don’t know how or why this works. Don’t ask me.

Luckily I’m a quick thinking jack rabbit, and before I snapped my laptop in half to swap out the graphics card, I looked up what kind of servers Heroku uses. Linux! Of course! Saved!

But not. I still had the issue of my script exiting with a successful 0 code, and then crashing for some unexplained reason. And as per the Heroku docs, anytime a dyno crashes, Heroku will try to restart it… again and again. Since my script was not actually crashing, every restart ran the whole scrape again and made another post, until the scheduler got too tired. So why was I getting a crash even though I was reporting a successful 0 exit code?

The answer lay with another floating Stack Overflow question, drifting by in another language I didn’t understand. So the scheduler started a worker I had defined in my Heroku Procfile, and the Procfile worker needs a dyno to stay alive, and since it must stay alive, it cannot die. Therefore, Phantom would just hang out long after my script had finished, until eventually Heroku told it to go home, and then when Phantom did, Heroku counted it as a crash.

Look, I don’t understand it. All I know is I removed my Procfile (the only reason I made one was cuz I thought you had to; I thought those were the Heroku rules), and now the scheduler is running the script and putting Heroku to bed properly.

State changed from up to COMPLETE

Seems like you’re still just kind of screwed if you’re trying to do this on Windows though. Change your graphics card kids!