Increase your scraping speed with Go and Colly! — The Basics
Let’s scrape Amazon to see how fast this can be. But first, let’s learn about the basics.
In this article, we’ll explore the power of Go(lang). We’ll see how to create a scraper able to get basic data about products on Amazon.
The goal of this scraper will be to fetch an Amazon result page, loop through the different articles, parse the data we need, go to the next page, write the results in a CSV file and… repeat.
To do this, we’ll use a library called Colly. Colly is a scraping framework written in Go. It’s lightweight but offers a lot of functionality out of the box, such as parallel scraping and proxy switching.
This article will cover the basics of the Colly framework. You can find the follow-up article here:
Increase your scraping speed with Go and Colly! — Advanced Part
Let’s unleash the power of Go and Colly and see how fast we can scrape Amazon’s product list.
Let’s inspect Amazon to determine the CSS selectors
From this page, we would like to extract the name, the rating (stars), and the price of each product appearing on the results page.
We can notice that all the pieces of information we need for each product are in this area:
With the help of the Google Chrome Inspector, we can determine that the CSS selector for those elements is “div.a-section.a-spacing-medium”. Now, we just have to determine the selectors for the name, the stars, and the price. All of those can be found thanks to the inspector. Here are the results:
Price: span.a-price > span.a-offscreen
Those selectors are not perfect: we will see later that we’ll encounter some edge cases where we’ll need to format the values we extracted. But for now, we can work with that.
The selector of the results list itself is “div.s-result-list.s-search-results.sg-row”. So the logic for our scraper will be: “For each product in the results list, fetch its name, stars, and price”
We’ll also handle the pagination in another section. For now, we can just see that the URL of the results page looks like this
In our case:
It is now time to implement what we found out in Go with the help of Colly.
Go & Colly implementation
Let’s create our main.go file. In Colly, you first need to create a Collector. A Collector gives you access to methods that let you register callback functions, which are triggered when a certain event happens. Creating a Collector takes a single call to colly.NewCollector().
You can find the list of methods that accept a callback function in the Colly documentation.
To give it a try, let’s use the OnRequest method. It takes a function as an argument, and that function is called before every request. In our case, the callback writes the URL we’re visiting to the console.
If you try to run our program right now, it will, unfortunately, start and stop instantly. The reason is simple: we need to give it a URL to visit. For this, you just have to use the Visit method of our Collector.
Now if you run this code with go run main.go, you should see the URL being visited printed in your console.
Time to parse that HTML!
Now that we know how to request Amazon’s results page, let’s do something with the HTML we get.
If we look at the methods that our Collector provides, OnHTML is probably the one we need. It takes a selector as the first argument and a callback function as the second. It seems reasonable to pass the results list selector we determined previously as the first argument.
The callback function gives us access to an HTMLElement. This element is what the selector we provided as the first argument matched.
We will use the ForEach method provided by the HTMLElement type to loop through the products in the search results list.
The callback function passed to the ForEach method gives us access to each product one by one. From there, we can read the values we want with the CSS selectors we found in the first part. For example, the product’s name can be read by passing its selector to the ChildText method of the HTMLElement.
We print every product name we get. If you run your code now, the result should look like this:
Product Name: Super Smash Bros. Ultimate
Product Name: New Super Mario Bros. U Deluxe - Nintendo Switch
Product Name: Accessories kit for Nintendo Switch, VOKOO Steering
Product Name: AmazonBasics Car Charger for Nintendo Switch
We could use the same method for the stars and the prices. But as I mentioned in the first part of the article, you’ll probably run into some formatting issues. For example, instead of getting 299.00 for the price, you might get something like $299.00$480.00. This happens because the CSS selector we provided returns several prices for one article when, for example, that article is on sale.
As for the stars, the selector we provided returns something like “4.5 out of 5 stars”. From this result, our goal is to extract the first three characters.
To fix our price and star problems, I created two small helper functions that format the results the way we want. I won’t go through them in detail since that would be outside the scope of this article.
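The author’s helpers are not reproduced here, so the following is only a sketch of what two such functions might look like. The names formatPrice and formatStars are assumptions, as is the choice to keep the first price when several are concatenated, and the prices are assumed to use the dollar sign.

```go
package main

import (
	"fmt"
	"strings"
)

// formatPrice keeps the first price of a string such as "$299.00$480.00"
// and drops the dollar sign. It returns "" when no price is found.
// Assumption: prices are prefixed with "$".
func formatPrice(raw string) string {
	for _, part := range strings.Split(raw, "$") {
		if p := strings.TrimSpace(part); p != "" {
			return p
		}
	}
	return ""
}

// formatStars keeps the first three characters of a string such as
// "4.5 out of 5 stars". Shorter inputs are returned unchanged.
// Assumption: the rating is plain ASCII, so byte slicing is safe.
func formatStars(raw string) string {
	raw = strings.TrimSpace(raw)
	if len(raw) < 3 {
		return raw
	}
	return raw[:3]
}

func main() {
	fmt.Println(formatPrice("$299.00$480.00"))     // 299.00
	fmt.Println(formatStars("4.5 out of 5 stars")) // 4.5
}
```

Both functions return an empty or short string instead of panicking when a product has no price or rating, which can happen on a results page.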
Our main.go then applies those two functions to the stars and the price of each product.
If you run the program now, the results would look like this:
Product Name: Nintendo Switch - Gray Joy-Con
Product Name: Nintendo Switch Console w/ Mario Kart 8 Deluxe
Product Name: Lego Star Wars Skywalker Saga - PlayStation 4 Standard Edition
In this article, we covered the basics of Go and Colly by fetching data from Amazon. You can clone the full project from here. There are still a lot of things that can be improved, such as handling pagination, using different User-Agents, sending concurrent requests, and more. Those topics will be covered in the next article. I’ll post the link here once it is released.
I hope you enjoyed this article even though I’m not using Python. I chose Go because I see good potential for web scraping in this language, even though there isn’t a lot of documentation about it yet.
One more thing: I’ve been using Go for about a year, so I’m not an expert yet. If you see things I could improve, don’t hesitate to let me know. Thank you for reading my article!