Data scraping time comparison: Python multiprocessing vs multithreading

Animesh Singh
4 min read · May 1, 2022


Scrapy is the most widely used Python framework for data scraping, and the main reason is its speed: it is well optimised and specifically designed to handle many requests and parse the responses as quickly and efficiently as possible.

But what if someone wants fast data scraping without Scrapy or another dedicated scraping framework? Is it possible to manually optimise and configure a scraper so that it performs like Scrapy?

The answer is yes, if that person knows about multiprocessing and multithreading. Multiprocessing lets you run work in separate processes that execute truly in parallel (bypassing the GIL) and use all of your CPU cores. Multithreading, on the other hand, runs threads that share the same data space as the main thread, so they can share information and communicate with each other more easily than separate processes can.

To know more: https://towardsdatascience.com/multithreading-vs-multiprocessing-in-python-3afeb73e105f
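To see that difference in isolation before we get to scraping, here is a minimal sketch (not part of the article's comparison) that times the same pure-Python, CPU-bound computation under a process pool and a thread pool: the process pool can use several cores, while the threads are serialised by the GIL.

import time
from multiprocessing import Pool
from concurrent.futures import ThreadPoolExecutor

def burn_cpu(n):
    # purely CPU-bound work: no I/O, so the GIL is the limiting factor for threads
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [2_000_000] * 8

    start = time.time()
    with Pool() as p:
        p.map(burn_cpu, work)
    print("processes:", round(time.time() - start, 2), "sec")

    start = time.time()
    with ThreadPoolExecutor() as ex:
        list(ex.map(burn_cpu, work))
    print("threads:", round(time.time() - start, 2), "sec")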

In this article:

  • We are going to scrape the HTML of more than 150 links taken from a list page and store the pages on disk.
  • We will do this with both multithreading and multiprocessing, and then compare the two approaches on speed.

The base URL for this article is https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

From this webpage, we will extract the href of every element matching the selector ".flagicon+ a". Each href is simply a link to a country's demographics page, for example https://en.wikipedia.org/wiki/Demographics_of_India. We will then dump the HTML of each of these pages into a file.
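As a quick, condensed preview of that selector step (the full scripts below do the same thing inside get_countries), the extraction can be sketched like this:

import requests
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

# ".flagicon+ a": the <a> element immediately following each flag icon
links = ["https://en.wikipedia.org" + a.get("href") for a in soup.select(".flagicon+ a")]
print(len(links), links[:3])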

Without using multiprocessing or multithreading:

import requests
from bs4 import BeautifulSoup
import time

def get_countries():
    main_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    all_countries = []
    response = requests.get(main_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_cntry = soup.select(".flagicon+ a")
    for link in all_cntry:
        # hrefs are relative (/wiki/...), so prepend the domain
        cntry_link = "https://en.wikipedia.org" + link.get("href")
        all_countries.append(cntry_link)
    return all_countries

def growth_rate(url):
    # download the page and dump its HTML into <last-path-segment>.html
    response = requests.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)
    print('.', end='', flush=True)

if __name__ == '__main__':
    start_time = time.time()
    url_list = get_countries()
    for url in url_list:
        growth_rate(url)
    end_time = time.time() - start_time
    print(end_time)

Execution time: 105.06 sec

Code Guide: The get_countries function returns a list of URLs of the demographics pages of 241 countries. The growth_rate function accepts a url parameter and dumps the entire HTML of that URL into an HTML file. In this version, we simply use a for loop to run the scraping task one URL at a time.

Multiprocessing:

import requests
from bs4 import BeautifulSoup
import time
from multiprocessing import Pool, cpu_count

def get_countries():
    main_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    all_countries = []
    response = requests.get(main_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_cntry = soup.select(".flagicon+ a")
    for link in all_cntry:
        cntry_link = "https://en.wikipedia.org" + link.get("href")
        all_countries.append(cntry_link)
    return all_countries

def growth_rate(url):
    response = requests.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)
    print('.', end='', flush=True)

def multiprocessing(growth_rate, url_list):
    # one worker process per CPU core
    with Pool(cpu_count()) as p:
        p.map(growth_rate, url_list)

if __name__ == '__main__':
    start_time = time.time()
    url_list = get_countries()
    multiprocessing(growth_rate, url_list)
    end_time = time.time() - start_time
    print(end_time)

Execution time: 29.82 sec

Code Guide: Line 4 imports the required module. The multiprocessing function accepts the URL list and growth_rate as parameters; we create a Pool with cpu_count() worker processes and call its map method, which takes a function and an iterable and applies the function to every item, distributing the work across the pool.
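For reference, a tiny standalone illustration (not from the article) of how Pool.map behaves: it applies the function to each item of the iterable across the worker processes and returns the results in input order, which is why passing growth_rate and the URL list is all that is needed.

from multiprocessing import Pool, cpu_count

def square(n):
    return n * n

if __name__ == '__main__':
    # one worker process per CPU core, same as in the scraper above
    with Pool(cpu_count()) as p:
        print(p.map(square, [1, 2, 3, 4]))   # -> [1, 4, 9, 16]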

Multithreading:

import requests
from bs4 import BeautifulSoup
import time
from concurrent.futures import ThreadPoolExecutor

def get_countries():
    main_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    all_countries = []
    response = requests.get(main_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_cntry = soup.select(".flagicon+ a")
    for link in all_cntry:
        cntry_link = "https://en.wikipedia.org" + link.get("href")
        all_countries.append(cntry_link)
    return all_countries

def growth_rate(url):
    response = requests.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)
    print('.', end='', flush=True)

def multithreading(growth_rate, url_list):
    # default worker count is chosen by the system
    with ThreadPoolExecutor() as p:
        p.map(growth_rate, url_list)

if __name__ == '__main__':
    start_time = time.time()
    url_list = get_countries()
    multithreading(growth_rate, url_list)
    end_time = time.time() - start_time
    print(end_time)

Execution time: 7.05 sec

Code Guide: Line 4 imports the required module. The multithreading function accepts the URL list and growth_rate as parameters; we create a ThreadPoolExecutor and call its map method, which takes a function and an iterable and runs the function over every item using the pool's threads.

We can also pass the max_workers parameter to ThreadPoolExecutor to set the number of threads manually, but I left that to the default here for the sake of an even comparison.
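For completeness, a hedged sketch of setting the thread count explicitly; max_workers=16 is an arbitrary illustrative value, and on Python 3.8+ the default is min(32, os.cpu_count() + 4).

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # illustrative stand-in for growth_rate
    return requests.get(url).status_code

if __name__ == '__main__':
    urls = ["https://en.wikipedia.org/wiki/Demographics_of_India"] * 4
    with ThreadPoolExecutor(max_workers=16) as p:   # explicit thread count
        print(list(p.map(fetch, urls)))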

Result:

  • Without multiprocessing or multithreading: 105.06 sec
  • Using multiprocessing: 29.82 sec
  • Using multithreading: 7.05 sec

Conclusion:

Multithreading is clearly the fastest of the three approaches here, and the plain sequential method is the slowest. That makes sense for this workload: the task is I/O-bound (mostly waiting on network responses), so the GIL is not the bottleneck, and threads are cheaper to start and coordinate than separate processes. One thing to keep in mind is that none of these versions reuses sessions; using a requests.Session instead of issuing each request from scratch keeps connections alive and would make the process noticeably faster.
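As a rough sketch of that idea (assuming the same get_countries and growth_rate setup as above, with a couple of hard-coded URLs standing in for the real list), the sequential version with a reused requests.Session would look something like this:

import requests

def growth_rate(session, url):
    # same as the article's growth_rate, but reusing the given session's connection
    response = session.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)

if __name__ == '__main__':
    # in the article this list would come from get_countries()
    urls = [
        "https://en.wikipedia.org/wiki/Demographics_of_India",
        "https://en.wikipedia.org/wiki/Demographics_of_China",
    ]
    with requests.Session() as session:   # keep-alive connection pool reused across requests
        for url in urls:
            growth_rate(session, url)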

That's all for this article. If you enjoyed reading it, you can follow me for more updates.

Thanks!
