Data Scraping AJAXed Pages with Phantom and Node
Not every web page comes loaded with its full content. In fact, it's becoming more and more common to send just the framework of a page on first load, display a barebones version of the site, and then make another request to the server for the actual data. That's great for perceived performance, but if you need to load the page programmatically, you tend to end up with only the placeholder data.
I ran into this problem when I was trying to find the tournament host name for a challonge.com tournament. The API somehow didn’t have that field, so I needed to parse it from the tournament page. “Easy enough,” I thought, “I’ll just install node-fetch and grab the HTML that way!” Little did I know that the host name was AJAXed in after page load! It took a few twists and turns to figure out how to grab the site data only AFTER the full load had completed, and I’m here today to share it in case anyone else needs to get some data from a page that relies on post-initial-load server calls. (My post about that project is here, if you’re interested.)
Let’s get started!
Phantom of the Server
PhantomJS is a fantastic command-line WebKit renderer. The only catch is that if you get it from their website, you have to install it yourself, make sure it's on your PATH, and run it through its own CLI. Things can get hairy there, and it's not particularly easy to integrate with your server, so let's skip all that mess with a handy wrapper library that takes care of installation for us. Specifically, the phantom wrapper.
Let’s start a new project in an empty folder. From that folder, run:
npm install --save phantom
Now we have access to PhantomJS’s full suite of tools from within our project. Easy! For comparison’s sake, let’s also install node-fetch to run both side-by-side.
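node-fetch installs the same way:

```shell
npm install --save node-fetch
```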
Let’s hop into a simple index.js file and test it out.
What we're doing here is a quick comparison (by string length) of the full page HTML returned from node-fetch vs. phantom. I'm using the URL of a Super Smash Brothers Melee tournament that I placed terribly in. Feel free to swap in your own URL; I noticed in testing that Google's homepage AJAXes in a lot of content post-load.
If we’ve succeeded in waiting for the full content to load with phantom, the returned string should be significantly longer. Let’s test it! Run:
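From the project folder:

```shell
node index.js
```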
Give it a few seconds to run, and it will print both lengths.
There you have it! We've managed to get the data from the full page, and thanks to the --load-images=no flag we included, we managed to skip a lot of the overhead that can come with server-side rendering. From here, it's easy to parse out the data that you need from the page.
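For instance, pulling out the host name (the field that started this whole adventure) could be as simple as a regex over the rendered HTML. The markup here is hypothetical; inspect the real page to find the element that actually holds your data.

```javascript
// Hypothetical markup: ...Hosted by <a href="/users/gabe">gabe</a>...
// Returns the link text after "Hosted by", or null if it isn't found.
function extractHostName(html) {
  const match = html.match(/Hosted by\s*<a[^>]*>([^<]+)<\/a>/);
  return match ? match[1] : null;
}

extractHostName('Hosted by <a href="/users/gabe">gabe</a>'); // 'gabe'
```

For anything more involved than a single field, a proper HTML parser beats a regex, but for one known element this gets the job done.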
You could also use Phantom to grab screenshots of web pages, measure load times, and collect a lot of other fun metrics. You can see how simple it would be to build your own automatic site-tracking tool like tlapse. Now that you have full server-side programmatic freedom, you can do anything!
Have fun :)