How to scrape the web with PhantomJS

Web scraping is a valuable tool in any web developer’s toolbox. Not every website has an easily accessible API that puts its valuable data at your fingertips. In these cases, it is immensely empowering to be able to grab the data yourself and create your own custom API.

In a lot of cases, a website can be scraped with mikeal’s request module. But more dynamic websites may not hand over the data you are after so easily. Some websites have scripts that wait for certain browser events before the page loads its full contents. For example, a website may use jQuery and have a $(document).on('ready') listener. In these cases, you need a more powerful tool that can wait for those events itself, so you can be sure you scrape everything you need. In this blog post, we will use PhantomJS as that tool.
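To make the problem concrete, here is a minimal sketch of what a plain HTTP client sees on such a page. The page source, the element id "prices", and the helper innerHtmlOf are all hypothetical, made up for illustration. The jQuery code is only text inside the response, so nothing ever runs it and the list stays empty.

```javascript
// Hypothetical page source, as a plain HTTP client would receive it.
// The script is just text here: no browser is present to execute it.
const rawHtml = [
  '<html><body>',
  '<ul id="prices"></ul>',
  '<script>',
  "$(document).on('ready', function () { $('#prices').append('<li>42</li>'); });",
  '</script>',
  '</body></html>'
].join('\n');

// Hypothetical helper: return the inner HTML of the element with the given id.
// A naive regex is enough for this illustration; it is not a real HTML parser.
function innerHtmlOf(html, id) {
  const pattern = new RegExp(
    '<(\\w+)[^>]*id="' + id + '"[^>]*>([\\s\\S]*?)</\\1>'
  );
  const match = html.match(pattern);
  return match ? match[2] : null;
}

// The list is empty in the raw HTML because the ready handler never ran.
const scraped = innerHtmlOf(rawHtml, 'prices');
console.log(JSON.stringify(scraped)); // prints ""
```

A headless browser would load the same source, run the script, and fire the ready event, so the list would be populated by the time you read the page contents.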

First off, you must install PhantomJS:

npm install -g phantomjs

PhantomJS is not just a simple npm package; it is an actual headless browser. So in some instances, a simple npm install won’t be enough. If you are having trouble with the code below, you might also want to try installing PhantomJS with Homebrew:

brew update
brew install phantomjs

With PhantomJS installed, we are ready to look at the code you will use to grab the full HTML contents of the website you want to scrape, and to discuss why it works.
