Web Scraping with Concurrency in Golang
Hi everyone, my name is Bismo, though sometimes people call me “Momo.” I work as a Backend Engineer at Zeals, where I mostly take care of RPA project microservices. In this article, I want to share something related to our project: how to do Web Scraping with concurrency using Golang!
Maybe you’re wondering, “why do we need concurrency?” Sometimes, in a single request, we need to populate data from multiple pages. Normally, scraping becomes a queue: each page must finish before the next one starts, and we have to deal with that performance issue. With concurrency, however, it becomes possible to scrape multiple pages at the same time.
To give some context to what we should do, here are several points that might be helpful to understand:
Data Flow Design
As mentioned in the introduction, the common scraping method for multiple pages requires you to wait for each page to finish before moving on to the next. Here is the Data Flow describing how common scraping works.
However, with concurrency, the Web Scraper can scrape multiple pages at the same time within a single request. The Data Flow below shows how this differs from common scraping.
There are many possible use cases for concurrency when scraping. Since I can’t share the real use case from our project for confidentiality reasons, let’s instead fetch historical exchange rates for a specific currency.
The website https://www.x-rates.com/ is ideal for this example because it doesn’t have an API, and each request for history is limited to a specific date.
That means getting a month of historical data requires 30 requests, and the loading time grows with the number of requests. In this example, we will fetch the historical exchange rate between two currencies within a date range.
Then, in the next section, we will compare the speed of the non-concurrent and concurrent versions.
The complete code is quite large, so please check this repository for the details; in this article we will pick out only the important parts.
GitHub - moemoe89/go-currency-history: Repository for my Medium article.
Basically, to get the content of a page with Golang, we can simply use the standard net/http package to send the request.
To easily read HTML tags by selectors, we can use a Go library called goquery, which brings jQuery-like syntax and features to Go. Installation details and usage examples can be found directly in its repository:
GitHub - PuerkitoBio/goquery: A little like that j-thing, only in Go.
First, create a reusable function that fetches the page content and parses it into an HTML document using the goquery package. To use this function, just pass parameters such as the target URL and method type, plus others if needed (headers, form-data, cookies, etc).
Next, we will parse the currency value. The original target URL will be https://www.x-rates.com/historical/?from=IDR&amount=1&date=2022-03-19, so our service will have parameters like from, to, and date.
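For instance, the target URL for a given date can be built with the standard net/url package (the query parameter names from, amount, and date come from the x-rates URL above; the helper name is mine):

```go
package main

import (
	"fmt"
	"net/url"
)

// buildHistoricalURL builds the x-rates historical page URL for a
// given base currency and date, mirroring the query string shown above.
func buildHistoricalURL(from, date string) string {
	u, _ := url.Parse("https://www.x-rates.com/historical/")

	q := u.Query()
	q.Set("from", from)
	q.Set("amount", "1")
	q.Set("date", date)

	// Encode sorts the keys alphabetically: amount, date, from.
	u.RawQuery = q.Encode()
	return u.String()
}

func main() {
	fmt.Println(buildHistoricalURL("IDR", "2022-03-19"))
	// https://www.x-rates.com/historical/?amount=1&date=2022-03-19&from=IDR
}
```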
The function below is the final scraping code that gets the currency based on those parameters. Because it only scrapes a specific date, we need another function to iterate over the time-range parameters.
After that, we implement the concurrency part: we iterate over the time range and use an errgroup, because we need to handle errors from the goroutines. The final code looks like this:
In the benchmark section, we will compare the performance with and without concurrency. The code without concurrency looks like this:
That’s all the important code needed to explain the concept of Web Scraping with concurrency in Golang. The details can be found in the repository mentioned in the previous section.
Finally, we are on to the interesting part which is benchmarking!
The scenario is to test multiple requests with different numbers of date queries (1, 2, 5, 10, 20, and 30), both without and with concurrency. Simulating the benchmark is basically just running the service and calling the currency-history endpoint; the number of queries corresponds to the number of days between the start and end date.
For example, with 10 queries:
FYI, I'm using this PC specification when doing the test.
- Mac mini (M1, 2020)
- Chip Apple M1
- Memory 16 GB
- macOS Monterey Version 12.3
- Internet speed 42.53 (Download) 15.34 (Upload) 29ms (Ping)
- Internet region: Indonesia
And here is the benchmark result with a line chart:
As predicted, the response time without concurrency increases consistently with the number of requests. With concurrency, the response time also increases, but not significantly.
Note: The response time may also depend on the website’s condition, traffic, internet speed, region, etc.
From working on Web Scraping in my project at work, I can say the Golang implementation is simple and powerful. If you don’t need to manipulate the DOM and purely want to extract data, with several pages needing to be scraped at the same time, I recommend trying this Data Flow design!
Note: Some websites implement rate limiting; in that case, the number of concurrent requests should be kept under the rate limit.
Thank you! I hope this article is useful for you!