Boost your scraping with asynchronous programming

Raffaello Ippolito
4 min read · Jul 27, 2023


Web scraping is a very useful practice indeed: as discussed in my previous article, it can be used to build entire datasets for visualization, analysis, or anything else. In most cases we will want to collect a large amount of data, and to do this we will need to visit many web pages.
Loading a single web page is not a particularly complex operation, especially if you use appropriate frameworks that are much lighter than a full browser. Still, it is not instantaneous, and visiting thousands of web pages can take a very long time. Asynchronous programming can be the solution to this problem.

For the purposes of this article I will use as an example a project that I recently did as an exercise.
The goal of the project is to build a database containing all the attributes of every player in the football video game FIFA 23. To do this, we will collect information from the website fifaratings.com. Taking a look at the site, one can easily see that each player has a dedicated page with all the information about him. Those pages are our targets.

Let’s start building the application following the traditional approach and try to process players from only two teams and take a look at the performance:

Time taken for scraping:       194.7440423965454
Time taken for CSV writing: 0.0010013580322265625
Total execution time: 194.74504375457764

Okay, the script took 194 seconds to retrieve information on 48 football players; that’s just over 3 minutes. Not bad, right? Well, not really. As long as there are only 48 players we can wait 3 minutes, but we want them all! There are 18,657 football players on this site, which means that to reach our goal we will have to visit just as many web pages: quite a job. Suppose it takes the same amount of time for each player and scale the numbers up: for 18,657 players the estimated time is about 75,695 seconds. That’s roughly 21 hours! Do I really have to wait that long?
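To make the comparison concrete, here is a minimal sketch of what the sequential approach looks like, using only the standard library. The URL list, the `<h1>` regex, and the idea that the player name sits in the page heading are illustrative assumptions, not the real fifaratings.com markup or my actual project code:

```python
import re
import urllib.request

def parse_player_name(html: str) -> str:
    # Crude extraction of the page's <h1> as the player name
    # (a real scraper would use a proper HTML parser).
    match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S)
    return match.group(1).strip() if match else ""

def scrape_sequential(urls: list[str]) -> list[str]:
    names = []
    for url in urls:  # each request must finish before the next one starts
        with urllib.request.urlopen(url, timeout=10) as resp:
            names.append(parse_player_name(resp.read().decode()))
    return names
```

The loop is the bottleneck: the program spends almost all of its time idle, waiting for each server response before it even starts the next request.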

The process is quite simple: retrieve the list of URLs of the individual players’ pages, then visit them one by one. At this point a question arises: “Why do I have to wait until I have retrieved all the information about Kylian Mbappé before I start looking for the information about Erling Haaland?” You don’t have to! With an asynchronous approach we can have many requests in flight at the same time instead of running them sequentially, so the program spends its time doing useful work instead of idly waiting for each server to respond.

Let’s rebuild our application with asynchronous programming and take a look at performance by retrieving player info from the usual two teams:

Time taken for scraping:       16.944170713424683
Time taken for CSV writing: 0.0010013580322265625
Total execution time: 16.94517207145691

Well, I would say the situation has definitely improved: we have reduced the execution time from 194 seconds to about 17, more than 10 times faster!
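The asynchronous version can be sketched as follows. This is an assumption about the general shape, not my actual project code: in practice a library like aiohttp is the usual choice for async HTTP, but to keep the sketch dependency-free I wrap the blocking urllib call in `asyncio.to_thread`:

```python
import asyncio
import urllib.request

async def fetch(url: str) -> str:
    # urllib is blocking, so hand the call off to a worker thread;
    # with aiohttp you would await a non-blocking request directly.
    return await asyncio.to_thread(
        lambda: urllib.request.urlopen(url, timeout=10).read().decode()
    )

async def scrape_all(urls: list[str]) -> list[str]:
    # Start every fetch at once; gather waits until all of them have
    # finished and returns the pages in the same order as the input urls.
    return await asyncio.gather(*(fetch(url) for url in urls))

# pages = asyncio.run(scrape_all(player_urls))
```

The key point is `asyncio.gather`: all the requests overlap, so the total time is driven by the slowest response rather than the sum of all of them.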

Before we run the program and build our database, there is one more thing to pay attention to. Many sites have systems in place that prevent users from running too many requests at once, to avoid overloading their servers. In this project I wanted to play it safe, so I decided to run only 50 requests at a time and wait a second after one batch completes before starting the next. If you want, you can push harder by running more requests at a time, or even by starting the next batch without waiting for the previous one to complete, pausing just long enough not to clog up the site. Let’s run our scraper and take a look at the performance.

Time taken for scraping:       3507.0338554
Time taken for CSV writing: 0.15390189999925497
Total execution time: 3507.1877572999992

At the end of the day, the script was able to retrieve information on all 18,657 players in just over 58 minutes. There is still room for improvement, such as the tweaks mentioned just above, but as far as I am concerned I am very happy with this result.
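The batching scheme described above (50 requests at a time, with a one-second pause between batches) can be sketched like this. The `fetch` coroutine is passed in as a parameter and the numbers are the ones from this project; an `asyncio.Semaphore` capping concurrency would be a common alternative design:

```python
import asyncio

async def scrape_in_batches(urls, fetch, batch_size=50, pause=1.0):
    # Run one batch of fetches concurrently, wait for all of them to
    # finish, sleep briefly to be polite to the server, then move on
    # to the next batch.
    results = []
    for i in range(0, len(urls), batch_size):
        batch = urls[i:i + batch_size]
        results += await asyncio.gather(*(fetch(url) for url in batch))
        if i + batch_size < len(urls):  # no pause after the last batch
            await asyncio.sleep(pause)
    return results
```

A semaphore-based design would keep exactly 50 requests in flight at all times instead of waiting for each whole batch, which is one of the possible improvements mentioned above.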

I hope this reading has stimulated you to think outside the box and made you appreciate the power of a different approach than the traditional one. The concepts seen in this article can be applied to any process in which the operations being performed are not dependent on each other. We will see this concept again in future articles.


Italian software developer and data analytics student. Graduated in Mathematics for Engineering with a thesis on Big Data and Image Processing.