How to crawl a website

What is web crawling?

Web crawling means programmatically fetching information from websites (their pages or APIs), copying it, and saving it. Well-known web services built on web crawling include Money Forward, Indeed, and iQON.

Money Forward retrieves customers' bank account information and visualizes all deposits and withdrawals, Indeed collects job postings from various corporate websites and publishes them for people looking to change jobs, and iQON gathers product information from more than 100 e-commerce sites and makes it public.

Web crawling methods in Node.js

Since I use Node.js, I will write about how to crawl with Node.js. I recommend Node.js because everything can be written in JavaScript and it is easy to get started with.

Web crawling approaches can be roughly divided into two: crawling with a browser that has no GUI (a headless browser), and crawling by hitting a site's API directly. Below I explain both approaches, including their advantages and disadvantages.

Headless Browser — Spookyjs

A headless browser is a browser without a GUI. Spookyjs is what lets Node.js drive such a browser. Spookyjs internally uses Casperjs and Phantomjs. The relationship between these three libraries is as follows.

Phantomjs: the underlying technology behind Spookyjs; Phantomjs implements the headless (GUI-less) browser itself.
Casperjs: A library that makes Phantomjs easier to use.
Spookyjs: A library that makes Casperjs available from Node.js.

Advantages of Spookyjs

  1. Suitable for crawling sites that require login (see the sketch after this list).
  2. Sites that load data asynchronously can be crawled easily, without having to handle the asynchrony yourself.
  3. It is easy to capture a screenshot of the whole page.
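
Below is a minimal sketch of what such a crawl looks like with Spookyjs. The login URL, form selector, credentials, and the .dashboard selector are made-up placeholders; only the Spooky and Casperjs calls themselves (start, then, fill, waitForSelector, capture, run) come from those libraries.

    var Spooky = require('spooky');

    var spooky = new Spooky({
      child: { transport: 'http' },
      casper: { logLevel: 'error', verbose: false }
    }, function (err) {
      if (err) { throw err; }

      // Hypothetical login page, used only for illustration.
      spooky.start('https://example.com/login');

      spooky.then(function () {
        // This function is serialized and runs inside Casperjs,
        // so it cannot reference variables from the Node.js scope.
        this.fill('form#login', { email: 'user@example.com', password: 'secret' }, true);
      });

      spooky.then(function () {
        // Wait until the post-login page has rendered, then save a
        // screenshot of the whole page.
        this.waitForSelector('.dashboard', function () {
          this.capture('dashboard.png');
        });
      });

      spooky.run();
    });

    spooky.on('error', function (e) { console.error(e); });

Note that every step runs inside the Casperjs child process rather than in Node.js itself; this is the unusual scoping mentioned in the disadvantages below.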

Disadvantages of Spookyjs

  1. Processing is slow because pages are actually rendered.
  2. It is not well suited to sites with infinite scrolling, because you cannot fetch only the additionally loaded parts.
  3. Because of the processing speed, it is not suitable for crawling a whole site exhaustively.
  4. It takes time to get used to, because the scoping rules when writing code are unusual.
  5. It is hard at the beginning if you do not know tricks such as waiting for the page to finish rendering before acting on it.

Hitting the API directly

This approach is exactly what it sounds like: you work out the structure of the website's API yourself, fetch the HTML or JSON, and parse it. I often use superagent as the module for hitting the API. superagent is simple and easy to use, so I recommend it.
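
As a minimal sketch, fetching JSON with superagent looks like this. The endpoint and query parameters are made-up examples; in practice you find the real API by watching the network requests the site makes in your browser's developer tools.

    var superagent = require('superagent');

    superagent
      .get('https://example.com/api/items')   // hypothetical endpoint
      .query({ page: 1 })
      .set('Accept', 'application/json')
      .end(function (err, res) {
        if (err) { return console.error(err); }
        // res.body is already parsed when the response is JSON.
        console.log(res.body);
      });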

Advantages of API

  1. Lightweight, simple and easy to handle.
  2. Processing is faster than with Spookyjs because no HTML rendering is involved.
  3. Even on sites with infinite scrolling, you can hit the API used for the additional loads and fetch only the newly added data (see the sketch after this list).
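
For example, an infinite-scroll page usually requests the next chunk of data from the same API each time you scroll, so you can call that API yourself page by page. The endpoint, the page parameter, and the shape of the response below are assumptions for illustration.

    var superagent = require('superagent');

    function fetchPage(page, results, done) {
      superagent
        .get('https://example.com/api/feed')  // hypothetical paging API
        .query({ page: page })
        .end(function (err, res) {
          if (err) { return done(err); }

          var items = res.body.items || [];   // assumed response shape
          results = results.concat(items);

          // Stop when the API returns an empty page; otherwise fetch the next one.
          if (items.length === 0) { return done(null, results); }
          fetchPage(page + 1, results, done);
        });
    }

    fetchPage(1, [], function (err, allItems) {
      if (err) { return console.error(err); }
      console.log('fetched ' + allItems.length + ' items');
    });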

Disadvantages of API

  1. It cannot be used on sites that require login.

Summary

Spookyjs is the easier choice for websites that require login.
When crawling a large number of pages, it is faster to hit the API directly to get the site's information.
Before you start crawling, it is worth deciding which of the two approaches fits better, keeping the advantages and disadvantages above in mind.

In my own case, I hit the API directly for sites that do not require login, and I reluctantly use Spookyjs only for sites that do.

I hope this serves as a useful reference.