How to scrape the web with PhantomJS
Web scraping is a valuable tool in any web developer’s toolbox. Not every website has an easily accessible API that puts its valuable data at your fingertips. In these cases, it is immensely empowering to be able to grab the data yourself and create your own custom API.
In many cases, a website can be scraped with mikeal's request module. But more dynamic websites will not hand over their data so easily. Some sites run scripts that wait for certain browser events before the page fully renders its contents. For example, a website may use jQuery and only populate the page from a $(document).ready() listener. In these cases, you need a more powerful tool that can wait for those events itself, so you can be sure you are able to scrape everything you need. For the purposes of this blog post, we will be using PhantomJS as that tool.
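To make the problem concrete, here is the kind of client-side code that defeats a plain HTTP request. This is an illustrative sketch, not taken from any particular site: the `#listings` element and its content are invented for the example. The list item only exists after jQuery fires its ready event, so the raw response body a request-style scraper sees will never contain it.

```javascript
// Illustrative page script: this content is injected into the DOM
// only after the ready event fires, so fetching the raw HTML over
// HTTP misses it entirely -- only a real (headless) browser that
// executes this script will ever see the <li>.
$(document).ready(function () {
  $('#listings').append('<li>Added after the ready event fires</li>');
});
```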
First off, you must install PhantomJS:
npm install -g phantomjs
PhantomJS is not just a simple npm package; it is an actual headless browser, so in some instances a simple npm install won't be enough. If you are having trouble with the code below, you might also want to try installing PhantomJS with Homebrew:
brew install phantomjs
With PhantomJS installed, we are ready to look at the code you will use to grab the full HTML contents of the website you're seeking to scrape, and to discuss why it works.
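As a preview, the shape of a PhantomJS scrape script looks roughly like this. This is a minimal sketch, assuming the page finishes rendering within a fixed delay; the URL and the 2000 ms wait are illustrative choices, not values from this post. Save it as something like scrape.js and run it with `phantomjs scrape.js`.

```javascript
// Minimal PhantomJS sketch: load a page, let its client-side
// scripts run, then print the fully rendered HTML.
// Run with the phantomjs binary, not node.
var page = require('webpage').create();

// "http://example.com" is a placeholder for the site you want to scrape.
page.open('http://example.com', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
    return;
  }
  // Wait briefly so client-side scripts (e.g. jQuery ready
  // handlers) have a chance to run before we read the DOM.
  window.setTimeout(function () {
    // page.content holds the rendered HTML, not just the raw response.
    console.log(page.content);
    phantom.exit();
  }, 2000); // 2000 ms is an assumption; tune it for your target site.
});
```

A fixed timeout is the bluntest way to wait; later you may prefer polling the page for a specific element before reading page.content.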