How to improve your web scraping process by 20 times in Python!

Daniel Ngo
5 min read · Aug 25, 2021


Web scraping

As a data analyst, web scraping is a skill I need in order to get the data I want to analyze and to find insights. As a result, I have been looking for ways to speed up my web scraping for a while, and after some time I found a big improvement through one simple technique: multiprocessing!

First, let me show you a simple web scraping script I created to find the long-term debt (Nợ dài hạn) for companies in Vietnam:

import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = {}
data["CongTy"] = []
data["Year"] = []
data["NoDaiHan"] = []

listYears = ['2013']
listCompany = ["AAA", "AAM", "X18", "HPG", "FLC", "QBS", "NKG", "VHM"]

start = time.time()
for year in listYears:
    print(year)
    for company in listCompany:
        print(company)
        data["CongTy"].append(company)
        data["Year"].append(year)
        page = requests.get('https://s.cafef.vn/bao-cao-tai-chinh/' + company + '/BSheet/' + year + '/0/0/0/bao-cao-tai-chinh-cong-ty-co-phan-nhua-an-phat-xanh.chn')
        soup = BeautifulSoup(page.content, 'html.parser')
        # Get the balance-sheet table of the page
        allField = soup.find_all('table', id="tableContent")
        if len(allField) > 0:
            full_data = allField[0].find_all('td', class_="b_r_c")
            for i in range(len(full_data)):
                # "2. Nợ dài hạn" is the long-term debt row
                if full_data[i].getText().strip() == "2. Nợ dài hạn":
                    nodaihan = full_data[i + 4].getText().strip()
                    if len(nodaihan) == 0:
                        print("Empty")
                        data["NoDaiHan"].append(-1)
                    else:
                        print("----NoDaiHan-----", nodaihan)
                        data["NoDaiHan"].append(nodaihan)

database = pd.DataFrame.from_dict(data)
print(database)
end = time.time()
print("The time is: ", end - start)

Here, it took about 16 seconds to scrape 8 companies for one year. Here is the output of the code:

2013 AAA ----NoDaiHan----- 83,082,921,652 
AAM ----NoDaiHan----- 3,093,285,071
X18 Empty
HPG ----NoDaiHan----- 2,346,896,440,179
FLC ----NoDaiHan----- 127,449,555,740
QBS ----NoDaiHan----- 589,000,000
NKG ----NoDaiHan----- 579,385,070,260
VHM Empty
CongTy Year NoDaiHan
0 AAA 2013 83,082,921,652
1 AAM 2013 3,093,285,071
2 X18 2013 -1
3 HPG 2013 2,346,896,440,179
4 FLC 2013 127,449,555,740
5 QBS 2013 589,000,000
6 NKG 2013 579,385,070,260
7 VHM 2013 -1
The time is: 16.470054388046265

While this is certainly much faster than visiting every single website by hand to get the information, the speed is still not great. Imagine having to do that for 1,700 companies over 12 years! It would take a long time to collect the data, then to clean the format and analyze the information. The whole process might take weeks or months to complete if we need a large chunk of data.

After a long period of researching and trying out new ideas, I found a technique that makes the web scraping process much faster: multiprocessing.

So, what is multiprocessing?

Multiprocessing refers to the ability of a system to run more than one process at the same time. Applications in a multiprocessing system are broken into smaller routines that run independently. The operating system allocates these processes across the available processors, improving the performance of the system.

Why multiprocessing?

Imagine you are the CEO of a company. If you work alone, you will need to do everything yourself: preparing the paperwork, planning the workflow, finishing each task, and so on.

However, working alone takes a lot of time. The more tasks you must handle at once, the harder it becomes to keep track of them all. Delegating the work to several employees who work in parallel gets everything done much faster, and that is exactly what multiprocessing lets our program do. Therefore, it is extremely useful to take full advantage of this powerful tool.

So, how to use multiprocessing?

First, we need to import the multiprocessing module:

import multiprocessing

To create a process, we create an object of the Process class. It takes the following arguments:

  • target: the function to be executed by the process
  • args: the arguments to be passed to the target function

p1 = multiprocessing.Process(target=A, args=(B,))

After that, we need to start the process:

p1.start()

Once the process starts, the current program also keeps running its own code. To wait for the process to finish before moving on, we use the join method:

p1.join()
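
To see these three steps together, here is a minimal, self-contained sketch (the greet function is a toy example of my own, just for illustration):

import multiprocessing

def greet(name):
    # A toy target function, standing in for any real work
    print("Hello,", name)

if __name__ == "__main__":
    # target is the function to run; args holds its arguments
    p1 = multiprocessing.Process(target=greet, args=("world",))
    p1.start()  # launch the child process
    p1.join()   # wait for it to finish before moving on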

So, how can we use this to increase our crawling speed? We can simply put our web scraping code into a function, and then run that function in a separate process. With multiprocessing, we can speed things up considerably!

Here is how I implemented my multiprocessing crawler:

import time
import multiprocessing
import pandas as pd
import requests
from bs4 import BeautifulSoup

listCompany = ["AAA", "AAM", "X18", "HPG", "FLC", "QBS", "NKG", "VHM"]

data = {}
data["CongTy"] = []
data["Year"] = []
data["NoDaiHan"] = []

def crawl(companies, year):
    strYear = str(year)
    print(strYear)
    for cong_ty in companies:
        print(cong_ty)
        data["CongTy"].append(cong_ty)
        data["Year"].append(strYear)
        page = requests.get('https://s.cafef.vn/bao-cao-tai-chinh/' + cong_ty + '/BSheet/' + strYear + '/0/0/0/bao-cao-tai-chinh-cong-ty-co-phan-nhua-an-phat-xanh.chn')
        soup = BeautifulSoup(page.content, 'html.parser')
        # Get the balance-sheet table of the page
        tenMienAll = soup.find_all('table', id="tableContent")
        if len(tenMienAll) > 0:
            full_dulieu = tenMienAll[0].find_all('td', class_="b_r_c")
            for i in range(len(full_dulieu)):
                # "2. Nợ dài hạn" is the long-term debt row
                if full_dulieu[i].getText().strip() == "2. Nợ dài hạn":
                    nodaihan = full_dulieu[i + 4].getText().strip()
                    if len(nodaihan) == 0:
                        print("Empty")
                        data["NoDaiHan"].append(-1)
                    else:
                        print("----NoDaiHan-----", nodaihan)
                        data["NoDaiHan"].append(nodaihan)
    # The child process builds and prints the DataFrame itself,
    # since it does not share memory with the parent
    database = pd.DataFrame.from_dict(data)
    print(database)

if __name__ == "__main__":
    # creating the process
    start = time.time()
    p1 = multiprocessing.Process(target=crawl, args=(listCompany, 2013))
    # starting the process
    p1.start()
    # wait until the process is finished
    p1.join()
    end = time.time()
    print('=====================TAKES========================= ')
    print(end - start)

Here is the output of the process:

2013 AAA ----NoDaiHan----- 83,082,921,652 
2013 AAM ----NoDaiHan----- 3,093,285,071
2013 X18 Empty
2013 HPG ----NoDaiHan----- 2,346,896,440,179
2013 FLC ----NoDaiHan----- 127,449,555,740
2013 QBS ----NoDaiHan----- 589,000,000
2013 NKG ----NoDaiHan----- 579,385,070,260
2013 VHM Empty
CongTy Year NoDaiHan
0 AAA 2013 83,082,921,652
1 AAM 2013 3,093,285,071
2 X18 2013 -1
3 HPG 2013 2,346,896,440,179
4 FLC 2013 127,449,555,740
5 QBS 2013 589,000,000
6 NKG 2013 579,385,070,260
7 VHM 2013 -1
=====================TAKES=========================
8.48030161857605

Using just one extra process, the time it takes drops by half, from 16 seconds to around 8 seconds. We can make the process even faster by using more than one process. For example, on a typical computer we can run 8 processes at a time, which greatly reduces the time it takes to complete the task.
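
As a rough sketch of that idea (not from the original code), here is one way to spread the companies across a pool of 8 workers with multiprocessing.Pool; the crawl_one helper is my own illustration. Note that separate processes do not share the data dictionary, so each worker returns its own row and the parent assembles the DataFrame:

import time
import multiprocessing
import pandas as pd
import requests
from bs4 import BeautifulSoup

listCompany = ["AAA", "AAM", "X18", "HPG", "FLC", "QBS", "NKG", "VHM"]

def crawl_one(cong_ty, year="2013"):
    # Scrape one company and return one row; processes do not share
    # memory, so we return the data instead of appending to a global dict
    page = requests.get('https://s.cafef.vn/bao-cao-tai-chinh/' + cong_ty + '/BSheet/' + year + '/0/0/0/bao-cao-tai-chinh-cong-ty-co-phan-nhua-an-phat-xanh.chn')
    soup = BeautifulSoup(page.content, 'html.parser')
    tables = soup.find_all('table', id="tableContent")
    nodaihan = -1
    if len(tables) > 0:
        cells = tables[0].find_all('td', class_="b_r_c")
        for i in range(len(cells)):
            if cells[i].getText().strip() == "2. Nợ dài hạn":
                value = cells[i + 4].getText().strip()
                if len(value) > 0:
                    nodaihan = value
    return {"CongTy": cong_ty, "Year": year, "NoDaiHan": nodaihan}

if __name__ == "__main__":
    start = time.time()
    # 8 worker processes scrape the companies in parallel
    with multiprocessing.Pool(processes=8) as pool:
        rows = pool.map(crawl_one, listCompany)
    database = pd.DataFrame(rows)
    print(database)
    print("Takes:", time.time() - start)

With one worker per company, the total time approaches the time of the slowest single request instead of the sum of all of them.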

So that is a small article discussing a way to improve your web scraping. I hope you enjoyed it! Thank you so much!
