Scrape any Website/Service/API with a single SQL Select Statement

Mike Parsons
HackerNoon.com
4 min readJan 18, 2017

--

I love SQL and it never ceases to amaze me what can be accomplished via the power of SQL syntax. In this example, we are going to create a simple wrapper that treats any Web page or HTTP endpoint as a SQL table that can be queried via a single SQL select statement. Furthermore we will demonstrate how we can takes the results of our query and export it into a variety of useful formats : (JSON,TEXT, TAB/CSV Delimited, EXCEL, etc).

This example leverages one of my favourite applications, PhantomJS.

PhantomJS is a headless WebKit scriptable with a JavaScript API that enables such functionality as:

HEADLESS WEBSITE TESTING

Run functional tests with frameworks such as Jasmine, QUnit, Mocha, Capybara, WebDriver, and many others.Learn more

SCREEN CAPTURE

Programmatically capture web contents, including SVG and Canvas. Create web site screenshots with thumbnail preview. Learn more

PAGE AUTOMATION

Access and manipulate webpages with the standard DOM API, or with usual libraries like jQuery.
Learn more

NETWORK MONITORING

Monitor page loading and export as standard HAR files. Automate performance analysis using YSlow and Jenkins. Learn more

In conjunction with PhantomJS, we are also going to leverage the awesome JavaScript SQL library- ALASQL from:

AlaSQL — ‘à la SQL’ — is a lightweight client-side in-memory SQL database designed to work in browser and Node.js.

AlaSQL was written with pure JavaScript and does not use browser WebSQL database.

AlaSQL is fully functional compact sql server with JOINs, GROUPs, UNIONs, ANY, ALL, IN, subqueries and very limited transactions support.

AlaSQL supports ROLLUP(), CUBE() and GROUPING SETS() functions

AlaSQL works with all modern versions of browsers (Chrome, Firefox, IE, Safari), Node.js, and mobile iOS and Android.

AlaSQL is fast, because it uses some optimization methods.

The code to perform our task is as follows:

To make this work, you first need to install PhantomJS. The instructions are here:

Once you’ve successfully downloaded PhantomJS, create a new text file called scrapepage.js using the above code. From your command prompt, you can then start using the application by simply typing in a command like the following:

this should produce a list something like the following:

or you could try something like the following:

Notice that you can take advantage of most of the options of ALASQL, described here:

You can obviously pipe the results of your query into a file as follows:

For example, let’s create an Excel spreadsheet by typing:

As you can see, you can pretty much takes the contents of any webpage and convert it into a useful format by executing a single command!

Each row in the documents table represents a single node in the HTML Dom (Document Object Model) Tree of the web page. To see all the fields on a particular node, open up the development tools in the browser and take a peek at the nodes in the console :

To access child properties on the nodes, use the syntax described here:

For example, to access properties on the style object, you could specify:

To call methods,

This article just scratches the surface of what is possible using PhantomJS in conjunction with ALASQL. I highly recommend reading the documentation on both products to master the art of web page scrapping.

Please recommend this article if you liked it!

Hacker Noon is how hackers start their afternoons. We’re a part of the @AMI family. We are now accepting submissions and happy to discuss advertising & sponsorship opportunities.

If you enjoyed this story, we recommend reading our latest tech stories and trending tech stories. Until next time, don’t take the realities of the world for granted!

--

--