Increase your scraping speed with Go and Colly! — Advanced Part

Jérôme Mottet
Jun 16, 2019 · 7 min read

Let’s unleash the power of Go and Colly to see how we can scrape Amazon’s product list.

Introduction

In this article, I'll show you how to improve the project we started by adding features such as a random User-Agent, a proxy switcher, pagination handling, random delays between requests, and parallel scraping.

These techniques serve two goals. First, they speed up the harvesting of the information we need. Second, they help us avoid getting blocked by the platform we're extracting data from: some websites will block you if they notice you're sending too many requests. To be clear, the goal here is not to flood them with requests, but to avoid getting blocked while extracting the data we need at a reasonable speed.

The Implementation

Randomize the User-Agent

Why do we need to randomize it?
The User-Agent needs to be randomized so that our script is harder to detect by the source we're getting data from (Amazon in our case). For instance, if the people working at Amazon notice that a lot of requests carry the same User-Agent string, they could use that information to block you.

The solution
Lucky for us, Colly provides a package called extensions. As you can see in the documentation, it contains a function called RandomUserAgent, which simply takes our Collector as a parameter.
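
The embedded snippet isn't reproduced in this extract; assuming the Collector is the one we created previously, the whole change is a single call (a minimal sketch):

```go
package main

import (
	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	c := colly.NewCollector()

	// Assign a new random User-Agent string to every request this Collector sends.
	extensions.RandomUserAgent(c)
}
```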

And that's it! With this call in place, Colly will now generate a new User-Agent string before every request.

You can also register the following callback with the OnRequest method:
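
That gist isn't embedded here either; a minimal sketch of such a check might look like this:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/extensions"
)

func main() {
	c := colly.NewCollector()
	extensions.RandomUserAgent(c)

	// Print the User-Agent header attached to each outgoing request,
	// so we can check that it changes from one request to the next.
	c.OnRequest(func(r *colly.Request) {
		fmt.Println("User-Agent:", r.Headers.Get("User-Agent"))
	})

	// Issue a couple of requests to compare the printed User-Agent strings.
	c.Visit("https://www.amazon.com/s?k=nintendo+switch&page=1")
	c.Visit("https://www.amazon.com/s?k=nintendo+switch&page=2")
}
```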

That way, our program prints the User-Agent string it uses before sending each request, and we can make sure that the function provided by the extensions package works.

Pagination

If we take a look at the results page, we notice that the pagination can be changed via the URL:

https://www.amazon.com/s?k=nintendo+switch&page=1

We also observe that Amazon doesn't allow us to go over page 20.

With that information, we can determine that all the result pages can be accessed by modifying the c.Visit(url) call in our current code.
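
The modified snippet isn't embedded in this extract, but the idea is to wrap the existing c.Visit call in a loop over the page numbers; a minimal sketch (fmt needs to be imported):

```go
// Visit every result page from 1 to 20 instead of only the first one.
for page := 1; page <= 20; page++ {
	url := fmt.Sprintf("https://www.amazon.com/s?k=nintendo+switch&page=%d", page)
	c.Visit(url)
}
```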

Thanks to a for loop, we're now sending requests to all the pages from 1 to 20. This allows us to collect information about many more products.

Parallelism
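
The gist that belonged here isn't included in this extract; a minimal sketch of the same idea, reusing the search URL from above:

```go
package main

import (
	"fmt"

	"github.com/gocolly/colly"
)

func main() {
	// Async(true) lets Colly send requests concurrently instead of
	// waiting for each response before starting the next one.
	c := colly.NewCollector(
		colly.Async(true),
	)

	for page := 1; page <= 20; page++ {
		c.Visit(fmt.Sprintf("https://www.amazon.com/s?k=nintendo+switch&page=%d", page))
	}

	// Block until every request launched above has finished.
	c.Wait()
}
```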

This option basically tells Colly: “You don't have to wait for a request to finish before starting the next one.” The c.Wait() at the end makes the program wait until all the concurrent requests are done.

If you run this piece of code now, you'll see that it is much faster than our first try. The console output will be a bit messy because multiple requests print data at the same time, but the whole process should take about one second, quite an improvement over the 30 seconds of our first try!

Random delays between every request
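
The snippet that belonged here isn't included in this extract; a sketch of such a limit rule, added right after the Collector is created (the Parallelism value of 4 is only an example, and the time package must be imported):

```go
// Apply a rule to every domain this Collector visits: at most 4 requests
// in flight at once, and a random extra delay of up to 2 seconds per request.
c.Limit(&colly.LimitRule{
	DomainGlob:  "*",
	Parallelism: 4,
	RandomDelay: 2 * time.Second,
})
```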

Here you can see that I added a random delay of 2 seconds. It means that a random delay of up to 2 seconds will be added between requests.

As you can see, I also added a Parallelism rule. It determines the maximum number of requests that can be executed at the same time.

If we run our program now, we can see that it runs a bit slower than before. This is due to the rules we just set. We need to strike a balance between the scraping speed we want and the risk of getting blocked by the target website.

Proxy Switcher

The solution
We will use proxies to achieve this. We send our requests to a proxy instead of directly to Amazon, and the proxy forwards them to the target website. That way, in Amazon's logs, the requests will appear to come from the proxy's IP address and not ours.

(Image: proxy server explanation, from Wikipedia)

Of course, the idea is to use multiple proxies in order to spread our requests across them. You can easily find lists of free proxies by searching on Google. The issue with those is that they can be extremely slow. If speed matters to you, a list of private or semi-private proxies could be a better choice. Here is the implementation with Colly (and free proxies I found online):
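
The embedded gist isn't reproduced in this extract, and the proxy addresses below are placeholders rather than the ones from the original article; a minimal sketch:

```go
package main

import (
	"log"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/proxy"
)

func main() {
	c := colly.NewCollector()

	// Rotate through the listed proxies in round-robin order, one per request.
	// These addresses are placeholders; replace them with working proxies.
	rp, err := proxy.RoundRobinProxySwitcher(
		"http://203.0.113.10:8080",
		"http://203.0.113.11:3128",
	)
	if err != nil {
		log.Fatal(err)
	}
	c.SetProxyFunc(rp)
}
```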

At the time you’re reading this article, the two proxies I used might not work anymore. Feel free to change them.

We're making use of the proxy package from Colly. This package contains the function RoundRobinProxySwitcher, which takes strings containing the protocol, the address, and the port of each proxy as arguments. We then pass the proxySwitcher to the Collector with the SetProxyFunc method. Once this is done, Colly will send the requests through the proxies, selecting a different proxy before each new request.

Write the result in a CSV file

First, we create a file called amazon_products.csv. We then create a writer that will be used to save the data we fetch from Amazon into that file. Finally, we write the first row of the CSV file, defining the column titles.
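
The gist isn't embedded in this extract; a sketch of that setup, where the column titles are assumptions based on the product fields we scraped earlier:

```go
package main

import (
	"encoding/csv"
	"log"
	"os"
)

func main() {
	// Create (or truncate) the output file.
	file, err := os.Create("amazon_products.csv")
	if err != nil {
		log.Fatal(err)
	}
	defer file.Close()

	// The csv.Writer buffers rows in memory; Flush makes sure they reach the file.
	writer := csv.NewWriter(file)
	defer writer.Flush()

	// First row: the column titles. These names are illustrative.
	writer.Write([]string{"Name", "Stars", "Price"})
}
```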

Then, in the callback function that we pass to the ForEach method, instead of printing the results we get, we'll write them to the CSV file. Like this:
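
The original callback isn't embedded here either; a sketch of the idea, with placeholder CSS selectors standing in for the ones used earlier in the project, and writer being the csv.Writer created above:

```go
// The selectors below are placeholders; keep the ones from the existing code.
c.OnHTML("div.s-result-list", func(e *colly.HTMLElement) {
	e.ForEach("div.s-result-item", func(_ int, item *colly.HTMLElement) {
		name := item.ChildText("span.a-text-normal")
		stars := item.ChildText("span.a-icon-alt")
		price := item.ChildText("span.a-offscreen")

		// Write one CSV row per product instead of printing it to the console.
		writer.Write([]string{name, stars, price})
	})
})
```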

Here are the results we get once we run the program now: you should have a new file in the working folder, which you can open with Excel (or a similar program) to check its content.

Conclusion

Disclaimer: Use the knowledge you’ve gained with this article wisely. Don’t send a huge number of requests to a website in a short amount of time. In the best case, they could just block you. In the worst, you could have problems with the law.

Thank you for reading my article. I hope it was useful to you. If you couldn't follow along with the code, you can find the full project in this GitHub repository.

Happy scraping!
