Develop a Large-scale Concurrent URLs Scraper in Go

Rain Wu
Random Life Journal
4 min read · Nov 20, 2020

This semester I took a course on network data mining that included a crawler-related project with plenty of room for technical play, and I think it is worth an article to share.


The students were asked to collect as many unique URLs as possible using web-scraping techniques, with no restriction on the tools. For this kind of IO-bound task, the performance bottleneck of waiting for network communication deserves special attention (especially with some poorly performing websites).

Architecture

The classic Broker-Worker model was the first thing that came to mind. I did not want to make the project too complicated at the beginning, and this simple model is enough.

There are also some sidecar components such as a logger and a config loader, but they are not the protagonists here, so I am not planning to focus on them.

I used dependency injection in the implementation, because resource initialization costs such as memory allocation and connection authentication only need to be paid once; the initialized resources can then be shared by every component that needs them.
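As a rough sketch of what this looks like (all names and values are illustrative, not taken from my actual code), the snippet below builds the expensive resources once in main and hands them to the broker and workers that depend on them:

```go
package main

import (
	"net/http"
	"time"
)

// The broker and workers receive their dependencies instead of creating
// them, so expensive initialization happens exactly once in main.
type Broker struct {
	jobs chan string // queue of pending URL jobs
}

type Worker struct {
	client *http.Client // shared, reusable HTTP client
	broker *Broker
}

func main() {
	// Initialize shared resources a single time...
	client := &http.Client{Timeout: 10 * time.Second}
	broker := &Broker{jobs: make(chan string, 10000)}

	// ...then inject them into every worker that needs them.
	workers := make([]*Worker, 0, 8)
	for i := 0; i < 8; i++ {
		workers = append(workers, &Worker{client: client, broker: broker})
	}
	_ = workers // the workers would be started as goroutines here
}
```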

Broker

If we treat each new URL we scrape from a website as a job for sending a request later, the broker acts as a job dispatcher: it maintains a queue that accepts newly reported jobs and lets workers fetch the pending ones.

First of all, you must pay attention to the capacity of the job queue. The workers will report plenty of new jobs based on the href attributes of the <a> tags in the HTML of the current page, so the queue length grows very fast and will run out of memory if left uncontrolled.
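A minimal way to keep the queue under control, assuming a simple channel-based broker (the drop-on-full policy below is just one possible choice), is a buffered channel with a fixed capacity:

```go
package scraper

// Broker dispatches URL jobs through a bounded queue so that the flood of
// new links reported by workers cannot exhaust memory.
type Broker struct {
	pending chan string
}

func NewBroker(capacity int) *Broker {
	return &Broker{pending: make(chan string, capacity)}
}

// Report accepts a newly discovered URL. When the queue is full the job is
// dropped instead of letting the backlog grow without bound.
func (b *Broker) Report(url string) bool {
	select {
	case b.pending <- url:
		return true
	default:
		return false // queue full: shed load
	}
}

// Fetch hands a pending job to a worker, blocking until one is available.
func (b *Broker) Fetch() string {
	return <-b.pending
}
```

Blocking the reporter or spilling the overflow into the database would be reasonable alternatives to dropping jobs.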


Another topic is screening for the right and suitable jobs; basically I followed the principles below:

  • Unique
    A duplicated URL should neither be added to the records nor assigned as a pending job. To achieve this, in addition to checking the database every time a new URL is reported, I also maintain an LRU cache to improve efficiency.
  • Discard heavy resources
    The full name of URL is Uniform Resource Locator; as the name suggests, it helps users access resources on the internet. Heavy files such as images, sounds, videos, and special-format documents spend a lot of time on packet transmission, so I filter out URL jobs with specific suffixes.
  • Discard low-value URLs
    Query parameters are often used to select a particular subset of a result set, but the more parameters a URL carries, the fewer results we obtain. It is also very likely that the same base URL is hit with different queries at high frequency, so I filter these out.

These three principles try to ensure that every URL job that actually triggers a network request yields a certain harvest without taking too much time; a minimal sketch of such a filter follows.
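The sketch below illustrates the three principles; a plain map guarded by a mutex stands in for the LRU cache plus database check, and the suffix list is only an example:

```go
package scraper

import (
	"net/url"
	"path"
	"strings"
	"sync"
)

// Suffixes of heavy resources we do not want to download.
var skippedExt = map[string]bool{
	".jpg": true, ".png": true, ".gif": true,
	".mp3": true, ".mp4": true, ".pdf": true, ".zip": true,
}

// Filter applies the three screening principles. A map guarded by a mutex
// stands in for the LRU cache and database lookup described above.
type Filter struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewFilter() *Filter {
	return &Filter{seen: make(map[string]bool)}
}

func (f *Filter) Accept(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}

	// Discard low-value URLs that only differ by query parameters.
	if u.RawQuery != "" {
		return false
	}

	// Discard heavy resources by file extension.
	if skippedExt[strings.ToLower(path.Ext(u.Path))] {
		return false
	}

	// Keep only unique URLs.
	f.mu.Lock()
	defer f.mu.Unlock()
	if f.seen[raw] {
		return false
	}
	f.seen[raw] = true
	return true
}
```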

Workers

Workers are the main performers of the crawler tasks: they ask the broker for pending jobs and report back whenever a new URL is found. Although Go's native concurrency and the GPM scheduler already keep the CPUs busy while goroutines wait on IO, we still need to pay attention to the details of the network requests.

Using multiple workers at the same time is a great strategy to boost the rate of firing requests, but it is better to reuse them, because repeating the whole initialization process for every job hurts performance.
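Here is one way to express such a reusable worker pool, building on the hypothetical Broker and Filter sketches above; the regexp-based link extraction is a deliberately naive stand-in for a proper HTML parser:

```go
package scraper

import (
	"io"
	"net/http"
	"regexp"
)

// Worker reuses one HTTP client for all of its requests instead of paying
// the setup cost (TLS handshakes, connection pool, ...) on every job.
type Worker struct {
	client *http.Client
	broker *Broker
	filter *Filter
}

// Run loops forever: fetch a pending job, request the page, and report the
// links found back to the broker.
func (w *Worker) Run() {
	for {
		job := w.broker.Fetch()

		resp, err := w.client.Get(job)
		if err != nil {
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()

		for _, link := range extractLinks(body) {
			if w.filter.Accept(link) {
				w.broker.Report(link)
			}
		}
	}
}

// StartWorkers launches a fixed pool of reusable workers.
func StartWorkers(n int, broker *Broker, client *http.Client, filter *Filter) {
	for i := 0; i < n; i++ {
		w := &Worker{client: client, broker: broker, filter: filter}
		go w.Run()
	}
}

// extractLinks pulls href attributes of <a> tags with a crude regexp;
// a real crawler would use an HTML parser instead.
var hrefRe = regexp.MustCompile(`href="(http[^"]+)"`)

func extractLinks(body []byte) []string {
	var links []string
	for _, m := range hrefRe.FindAllSubmatch(body, -1) {
		links = append(links, string(m[1]))
	}
	return links
}
```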

In the GPM scheduler, the OS threads (the "M" part) handle system calls, file access, and network requests, which raises the question of who gets the right to use the underlying hardware resources.


Limiting the number of OS threads running Go code to the number of CPU cores can prevent some context switches; spending extra time transferring usage rights between workers is meaningless.
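In Go this boils down to GOMAXPROCS; note that since Go 1.5 it already defaults to the number of CPU cores, so setting it explicitly mostly documents the intent:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Since Go 1.5 GOMAXPROCS defaults to the number of CPU cores, but
	// setting it explicitly documents the intent and guards against an
	// environment variable overriding it.
	runtime.GOMAXPROCS(runtime.NumCPU())
	fmt.Println("Go code will run on at most", runtime.GOMAXPROCS(0), "OS threads")
}
```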

Many websites take a long time to respond, either because their performance is poor or because my requests were judged to be a malicious attack and throttled. This problem improved after I specified a timeout limit.
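Setting the timeout on the shared http.Client is enough for this, since it covers connecting, redirects, and reading the response body; the 10-second value below is just an example, not the limit I actually used:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// A client-wide timeout prevents a slow or throttling site from
	// stalling a worker forever.
	client := &http.Client{Timeout: 10 * time.Second}

	resp, err := client.Get("https://example.com")
	if err != nil {
		fmt.Println("request failed or timed out:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```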

Database connection

As described in the PostgreSQL documentation, the server accepts at most max_connections concurrent connections (100 by default), and a few of them (superuser_reserved_connections, 3 by default) are reserved for the administrator. If our application tries to open more than its share, a "too many connections" error will be raised, so the connection pool has to be capped below the limit.
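With database/sql this means capping the connection pool below the server limit; the driver and the exact numbers below are assumptions for illustration:

```go
package main

import (
	"database/sql"
	"time"

	_ "github.com/lib/pq" // assumed PostgreSQL driver; pgx works similarly
)

func main() {
	db, err := sql.Open("postgres", "postgres://scraper:secret@localhost:5432/urls?sslmode=disable")
	if err != nil {
		panic(err)
	}
	defer db.Close()

	// Keep the pool well below the server's max_connections so that other
	// clients and the superuser still have connections available.
	db.SetMaxOpenConns(90)
	db.SetMaxIdleConns(45)
	db.SetConnMaxLifetime(30 * time.Minute)
}
```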

I hope the tips and tricks above can help you with your URL-scraping tasks in the future :)


Rain Wu is a software engineer specializing in distributed systems and cloud services, who desires to realize various imaginations of future life through technology.