How to speed up your Python web scraper by using multiprocessing

Adnan Siddiqi · Published in Python Pandemonium · 4 min read · Dec 14, 2016

In earlier posts, here and here, I discussed how to write a scraper and how to make it secure and foolproof. Those things are good to implement, but they are not enough to make the scraper fast and efficient.

In this post, I am going to show how changing just a few lines of code can speed up your web scraper many times over. Keep reading!

If you remember that post, I scraped the detail page of OLX. Usually, you end up on such a page after going through a listing of entries. First, I will write a script without multiprocessing and we will see why it is not good enough, and then a scraper with multiprocessing.

OK, the goal is to access the listing page and fetch all the detail-page URLs from it. For the sake of simplicity, I am not covering pagination. So, let’s get into the code.

Here’s the gist that accesses the listing from the URL and then parses and fetches the information for each entry.

I have divided the script into two functions: get_listing(), which accesses the listing page, parses it, saves the links in a list, and returns that list; and parse(url), which takes an individual URL, parses the info, and returns a comma-delimited string.
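The embedded gist does not appear in this version of the post, so here is a minimal sketch of what those two functions could look like. The function names match the ones used in the post, but the listing URL, the selectors, and the extracted fields are assumptions for illustration only.

import time

import requests
from bs4 import BeautifulSoup

LISTING_URL = 'https://www.olx.com.pk/cars/'  # assumed listing URL
HEADERS = {'User-Agent': 'Mozilla/5.0'}

def get_listing():
    """Fetch the listing page and return the detail-page URLs as a list."""
    response = requests.get(LISTING_URL, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    # Hypothetical selector for each entry's anchor tag.
    return [a['href'] for a in soup.select('a.detail-link')]

def parse(url):
    """Fetch one detail page and return its info as a comma-delimited string."""
    time.sleep(2)  # the 2-second delay mentioned later in the post
    response = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(response.text, 'html.parser')
    title = soup.select_one('h1').get_text(strip=True)
    price = soup.select_one('.price').get_text(strip=True)
    return '{},{},{}'.format(title, price, url)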

I then call get_listing() to get the list of links, use a mighty list comprehension to get a list of parsed entries, and save ALL the info into a CSV file.
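Again as a sketch (cars.csv is my own placeholder filename, not from the original gist), the sequential driver boils down to:

cars_links = get_listing()
records = [parse(url) for url in cars_links]

with open('cars.csv', 'w') as f:
    f.write('\n'.join(records))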

Then I executed the script using the time command:

Adnans-MBP:~ AdnanAhmad$ time python listing_seq.py

which calculates the time a process takes. On my computer it returns:

real 5m49.168s
user 0m2.876s
sys 0m0.198s

Hmm... around 6 minutes for 50 records. There’s a 2-second delay in each iteration, which accounts for 50 × 2 = 100 seconds of that, so the scraping itself still takes a bit over 4 minutes.

Now I am going to change a few lines and make it run in parallel.

Keep reading!

The first change is importing from a new Python module, multiprocessing:

from multiprocessing import Pool

From the documentation:

multiprocessing is a package that supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows.

So unlike threads, which all share a single interpreter and its Global Interpreter Lock, each subprocess gets its own interpreter and its own lock, so the work can truly run in parallel. It reminds me a bit of MapReduce, though it is obviously not the same thing. Now note the following lines:

p = Pool(10)  # Pool tells how many subprocesses run at a time
records = p.map(parse, cars_links)  # parse() each link, 10 in parallel
p.terminate()  # terminate the worker processes
p.join()       # wait for them to exit

Pool plays an important role here: it tells how many subprocesses should be spawned at a time. Here I passed 10, which means 10 URLs will be processed at the same time.

In the second line, the first argument to map() is the function that will be run in the worker processes and the second argument is the list of links. In our case there are 50 links, so there will be 5 batches: 10 URLs are accessed and parsed in one go, and the results come back as a list.

The third line terminates the worker processes: on *nix it sends SIGTERM, and on Windows it uses TerminateProcess().

The last line, join(), in simple words waits for the workers to exit, which avoids zombie processes and ends everything gracefully.

If you don’t use terminate() and join(), the only issue you’d have is many zombie or defunct processes occupying your machine for no reason. I’m sure you definitely don’t want that.
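Putting it all together, here is a sketch of the parallel driver under the same assumptions as the earlier snippets. The if __name__ == '__main__': guard is my addition; multiprocessing needs it on platforms that spawn new processes rather than forking (Windows, and macOS on recent Python versions).

from multiprocessing import Pool

if __name__ == '__main__':
    cars_links = get_listing()

    p = Pool(10)
    records = p.map(parse, cars_links)
    p.terminate()
    p.join()

    with open('cars.csv', 'w') as f:
        f.write('\n'.join(records))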

Or you can use a context manager to make it even simpler:

with Pool(10) as p:
    records = p.map(parse, cars_links)
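On exit, the with block terminates the pool for you, so the explicit terminate() and join() calls disappear. A sketch of the same driver in that style, with the same assumed guard and filename:

if __name__ == '__main__':
    cars_links = get_listing()

    with Pool(10) as p:
        records = p.map(parse, cars_links)

    with open('cars.csv', 'w') as f:
        f.write('\n'.join(records))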

Alright, I ran this script:

Adnans-MBP:~ AdnanAhmad$ time python list_parallel.py

And the time it took:

real 0m22.884s
user 0m2.748s
sys 0m0.363s

Here the same 2-second delay applies, but since the URLs are processed in parallel, the whole run took around 22 seconds. I then reduced the pool size to 5, and it’s still quite good:

real 0m43.695s
user 0m2.829s
sys 0m0.336s
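How big the pool should be depends on the work: for an I/O-bound scraper like this, the bottleneck is waiting on HTTP responses (and how much load the target site tolerates) rather than CPU. If you want a starting point instead of a hard-coded 10 or 5, one common heuristic, my suggestion rather than something from the original post, is to derive it from the CPU count:

from multiprocessing import Pool, cpu_count

pool_size = cpu_count() * 2  # heuristic starting point for I/O-bound work

with Pool(pool_size) as p:
    records = p.map(parse, cars_links)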

I hope you’ll be using multiprocessing to speed up your next web scrapers. Give your feedback in the comments and let everyone know how it could be made even better than this one. Thanks.

Writing scrapers is an interesting journey, but you can hit a wall if the site blocks your IP. As an individual, you can’t afford expensive proxies either. Scraper API provides an affordable and easy-to-use API that lets you scrape websites without any hassle. You do not need to worry about getting blocked, because Scraper API uses proxies to access websites by default. On top of that, you do not need to worry about Selenium either, since Scraper API provides a headless browser too. I have also written a post about how to use it.

Click here to sign up with my referral link or enter promo code adnan10 and you will get a 10% discount. In case you do not get the discount, just let me know via email on my site and I’ll be sure to help you out.

As usual, the code is available on GitHub.

The original version of this post is available here.
