Data scraping time comparison: Python multiprocessing vs multithreading

Animesh Singh
4 min read · May 1, 2022


Scrapy is the most widely used Python framework for data scraping, and the main reason is its speed: it is well optimised and specifically designed to handle many requests and parse the responses as quickly and efficiently as possible.

But what if someone wants fast data scraping without Scrapy or another dedicated scraping framework? Is it possible to manually optimise and configure a scraper so that it performs like Scrapy?

The answer is yes, if that person knows about multiprocessing and multithreading. Multiprocessing lets you run work in separate processes that execute truly in parallel (bypassing the GIL) and use all of your CPU cores. Multithreading, on the other hand, runs threads that share the same data space as the main thread, so they can share information and communicate with each other more easily than separate processes can.

To know more: https://towardsdatascience.com/multithreading-vs-multiprocessing-in-python-3afeb73e105f
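To see that difference in isolation before we get to scraping, here is a minimal sketch (not part of the article's comparison) that times the same pure-Python, CPU-bound computation under a process pool and a thread pool: the process pool can use several cores, while the threads are serialised by the GIL.

import time
from multiprocessing import Pool
from concurrent.futures import ThreadPoolExecutor

def burn_cpu(n):
    # purely CPU-bound work: no I/O, so the GIL is the limiting factor for threads
    return sum(i * i for i in range(n))

if __name__ == '__main__':
    work = [2_000_000] * 8

    start = time.time()
    with Pool() as p:
        p.map(burn_cpu, work)
    print("processes:", round(time.time() - start, 2), "sec")

    start = time.time()
    with ThreadPoolExecutor() as ex:
        list(ex.map(burn_cpu, work))
    print("threads:", round(time.time() - start, 2), "sec")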

In this article:

  • We are going to scrape the HTML of more than 150 links taken from a list page and store the pages on disk.
  • We will do this with both multithreading and multiprocessing, and then compare the two approaches on speed.

The base URL for this article is https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population

From this webpage, we will extract the href of every element matching the selector ".flagicon+ a". Each href is simply a link to a country's demographics page, for example https://en.wikipedia.org/wiki/Demographics_of_India. We will then dump the HTML of each of these pages into a file.
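As a quick, condensed preview of that selector step (the full scripts below do the same thing inside get_countries), the extraction can be sketched like this:

import requests
from bs4 import BeautifulSoup

base_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
soup = BeautifulSoup(requests.get(base_url).text, 'html.parser')

# ".flagicon+ a": the <a> element immediately following each flag icon
links = ["https://en.wikipedia.org" + a.get("href") for a in soup.select(".flagicon+ a")]
print(len(links), links[:3])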

Without using multiprocessing or multithreading:

import requests
from bs4 import BeautifulSoup
import time

def get_countries():
    main_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    all_countries = []
    response = requests.get(main_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_cntry = soup.select(".flagicon+ a")
    for link in all_cntry:
        # hrefs are relative (/wiki/...), so prepend the domain
        cntry_link = "https://en.wikipedia.org" + link.get("href")
        all_countries.append(cntry_link)
    return all_countries

def growth_rate(url):
    # download the page and dump its HTML into <last-path-segment>.html
    response = requests.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)
    print('.', end='', flush=True)

if __name__ == '__main__':
    start_time = time.time()
    url_list = get_countries()
    for url in url_list:
        growth_rate(url)
    end_time = time.time() - start_time
    print(end_time)

Execution time: 105.06 sec

Code Guide: The get_countries function returns a list of URLs of the demographics pages of 241 countries. The growth_rate function accepts a url parameter and dumps the entire HTML of that URL into an HTML file. In this version, we simply use a for loop to run the scraping task one URL at a time.

Multiprocessing:

import requests
from bs4 import BeautifulSoup
import time
from multiprocessing import Pool, cpu_count

def get_countries():
    main_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    all_countries = []
    response = requests.get(main_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_cntry = soup.select(".flagicon+ a")
    for link in all_cntry:
        cntry_link = "https://en.wikipedia.org" + link.get("href")
        all_countries.append(cntry_link)
    return all_countries

def growth_rate(url):
    response = requests.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)
    print('.', end='', flush=True)

def multiprocessing(growth_rate, url_list):
    # one worker process per CPU core
    with Pool(cpu_count()) as p:
        p.map(growth_rate, url_list)

if __name__ == '__main__':
    start_time = time.time()
    url_list = get_countries()
    multiprocessing(growth_rate, url_list)
    end_time = time.time() - start_time
    print(end_time)

Execution time: 29.82 sec

Code Guide: Line 4 imports the required module. The multiprocessing function accepts the URL list and growth_rate as parameters; we create a Pool with cpu_count() worker processes and call its map method, which takes a function and an iterable and applies the function to every item, distributing the work across the pool.
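For reference, a tiny standalone illustration (not from the article) of how Pool.map behaves: it applies the function to each item of the iterable across the worker processes and returns the results in input order, which is why passing growth_rate and the URL list is all that is needed.

from multiprocessing import Pool, cpu_count

def square(n):
    return n * n

if __name__ == '__main__':
    # one worker process per CPU core, same as in the scraper above
    with Pool(cpu_count()) as p:
        print(p.map(square, [1, 2, 3, 4]))   # -> [1, 4, 9, 16]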

Multithreading:

import requests
from bs4 import BeautifulSoup
import time
from concurrent.futures import ThreadPoolExecutor

def get_countries():
    main_url = "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
    all_countries = []
    response = requests.get(main_url)
    soup = BeautifulSoup(response.text, 'html.parser')
    all_cntry = soup.select(".flagicon+ a")
    for link in all_cntry:
        cntry_link = "https://en.wikipedia.org" + link.get("href")
        all_countries.append(cntry_link)
    return all_countries

def growth_rate(url):
    response = requests.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)
    print('.', end='', flush=True)

def multithreading(growth_rate, url_list):
    # default worker count is chosen by the system
    with ThreadPoolExecutor() as p:
        p.map(growth_rate, url_list)

if __name__ == '__main__':
    start_time = time.time()
    url_list = get_countries()
    multithreading(growth_rate, url_list)
    end_time = time.time() - start_time
    print(end_time)

Execution time: 7.05 sec

Code Guide: Line 4 imports the required module. The multithreading function accepts the URL list and growth_rate as parameters; we create a ThreadPoolExecutor and call its map method, which takes a function and an iterable and runs the function over every item using the pool's threads.

We can also pass the max_workers parameter to ThreadPoolExecutor to set the number of threads manually, but I left that to the default here for the sake of an even comparison.
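For completeness, a hedged sketch of setting the thread count explicitly; max_workers=16 is an arbitrary illustrative value, and on Python 3.8+ the default is min(32, os.cpu_count() + 4).

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # illustrative stand-in for growth_rate
    return requests.get(url).status_code

if __name__ == '__main__':
    urls = ["https://en.wikipedia.org/wiki/Demographics_of_India"] * 4
    with ThreadPoolExecutor(max_workers=16) as p:   # explicit thread count
        print(list(p.map(fetch, urls)))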

Result:

  • Without multiprocessing or multithreading: 105.06 sec
  • Using multiprocessing: 29.82 sec
  • Using multithreading: 7.05 sec

Conclusion:

Multithreading is clearly the fastest of the three approaches here, and the plain sequential method is the slowest. That makes sense for this workload: the task is I/O-bound (mostly waiting on network responses), so the GIL is not the bottleneck, and threads are cheaper to start and coordinate than separate processes. One thing to keep in mind is that none of these versions reuses sessions; using a requests.Session instead of issuing each request from scratch keeps connections alive and would make the process noticeably faster.
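As a rough sketch of that idea (assuming the same get_countries and growth_rate setup as above, with a couple of hard-coded URLs standing in for the real list), the sequential version with a reused requests.Session would look something like this:

import requests

def growth_rate(session, url):
    # same as the article's growth_rate, but reusing the given session's connection
    response = session.get(url)
    file_name = url.split("/")[-1]
    with open(file_name + ".html", 'wb') as f:
        f.write(response.content)

if __name__ == '__main__':
    # in the article this list would come from get_countries()
    urls = [
        "https://en.wikipedia.org/wiki/Demographics_of_India",
        "https://en.wikipedia.org/wiki/Demographics_of_China",
    ]
    with requests.Session() as session:   # keep-alive connection pool reused across requests
        for url in urls:
            growth_rate(session, url)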

That's all for this article. If you enjoyed reading it, you can follow me for more updates.

Thanks!
