How to improve your web scraping process by 20 times in Python!

Daniel Ngo
5 min read · Aug 25, 2021


Web scraping

As a data analyst, web scraping is a skill I need in order to get the data I want to analyze and to find insights. As a result, I have been looking for ways to speed up my web scraping for a while, and after some time I found a big improvement through one simple technique: multiprocessing!

First, let me show you a simple web scraping script I created to find the long-term debt (Nợ dài hạn) for companies in Vietnam:

import time
import pandas as pd
import requests
from bs4 import BeautifulSoup

data = {}
data["CongTy"] = []
data["Year"] = []
data["NoDaiHan"] = []

listYears = ['2013']
listCompany = ["AAA", "AAM", "X18", "HPG", "FLC", "QBS", "NKG", "VHM"]

start = time.time()
for year in listYears:
    print(year)
    for company in listCompany:
        print(company)
        data["CongTy"].append(company)
        data["Year"].append(year)
        page = requests.get('https://s.cafef.vn/bao-cao-tai-chinh/' + company + '/BSheet/' + year + '/0/0/0/bao-cao-tai-chinh-cong-ty-co-phan-nhua-an-phat-xanh.chn')
        soup = BeautifulSoup(page.content, 'html.parser')
        # Get the balance-sheet table of the page
        allField = soup.find_all('table', id="tableContent")
        if len(allField) > 0:
            full_data = allField[0].find_all('td', class_="b_r_c")
            for i in range(len(full_data)):
                # "2. Nợ dài hạn" is the long-term debt row
                if full_data[i].getText().strip() == "2. Nợ dài hạn":
                    nodaihan = full_data[i + 4].getText().strip()
                    if len(nodaihan) == 0:
                        print("Empty")
                        data["NoDaiHan"].append(-1)
                    else:
                        print("----NoDaiHan-----", nodaihan)
                        data["NoDaiHan"].append(nodaihan)

database = pd.DataFrame.from_dict(data)
print(database)
end = time.time()
print("The time is: ", end - start)

Here, it took about 16 seconds to scrape 8 companies for one year. Here is the output of the code:

2013 AAA ----NoDaiHan----- 83,082,921,652 
AAM ----NoDaiHan----- 3,093,285,071
X18 Empty
HPG ----NoDaiHan----- 2,346,896,440,179
FLC ----NoDaiHan----- 127,449,555,740
QBS ----NoDaiHan----- 589,000,000
NKG ----NoDaiHan----- 579,385,070,260
VHM Empty
CongTy Year NoDaiHan
0 AAA 2013 83,082,921,652
1 AAM 2013 3,093,285,071
2 X18 2013 -1
3 HPG 2013 2,346,896,440,179
4 FLC 2013 127,449,555,740
5 QBS 2013 589,000,000
6 NKG 2013 579,385,070,260
7 VHM 2013 -1
The time is: 16.470054388046265

While this is certainly much faster than visiting every single website by hand to get the information, the speed is still not great. Imagine having to do that for 1,700 companies over 12 years! It would take a long time to collect the data, then to clean the format and analyze the information. The whole process might take weeks or months to complete if we need a large chunk of data.

After a long period of researching and trying out new ideas, I found a technique that makes the web scraping process much faster: multiprocessing.

So, what is multiprocessing?

Multiprocessing refers to the ability of a system to run more than one process at the same time. Applications in a multiprocessing system are broken into smaller routines that run independently. The operating system allocates these processes across the available processors, improving the performance of the system.

Why multiprocessing?

Imagine you are the CEO of a company. If you work alone, you will need to do everything yourself: preparing the paperwork, planning the workflow, finishing each task, and so on.

However, working alone takes a lot of time. The more tasks you must handle at once, the harder it becomes to keep track of them all. Delegating the work to several employees who work in parallel gets everything done much faster, and that is exactly what multiprocessing lets our program do. Therefore, it is extremely useful to take full advantage of this powerful tool.

So, how to use multiprocessing?

First, we need to import the multiprocessing module:

import multiprocessing

To create a process, we create an object of the Process class. It takes the following arguments:

  • target: the function to be executed by the process
  • args: the arguments to be passed to the target function

p1 = multiprocessing.Process(target=A, args=(B,))

After that, we need to start the process:

p1.start()

Once the process starts, the current program also keeps running its own code. To wait for the process to finish before moving on, we use the join method:

p1.join()
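
To see these three steps together, here is a minimal, self-contained sketch (the greet function is a toy example of my own, just for illustration):

import multiprocessing

def greet(name):
    # A toy target function, standing in for any real work
    print("Hello,", name)

if __name__ == "__main__":
    # target is the function to run; args holds its arguments
    p1 = multiprocessing.Process(target=greet, args=("world",))
    p1.start()  # launch the child process
    p1.join()   # wait for it to finish before moving on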

So, how can we use this to increase our crawling speed? We can simply put our web scraping code into a function, and then run that function in a separate process. With multiprocessing, we can speed things up considerably!

Here is how I implemented my multiprocessing crawler:

import time
import multiprocessing
import pandas as pd
import requests
from bs4 import BeautifulSoup

listCompany = ["AAA", "AAM", "X18", "HPG", "FLC", "QBS", "NKG", "VHM"]

data = {}
data["CongTy"] = []
data["Year"] = []
data["NoDaiHan"] = []

def crawl(companies, year):
    strYear = str(year)
    print(strYear)
    for cong_ty in companies:
        print(cong_ty)
        data["CongTy"].append(cong_ty)
        data["Year"].append(strYear)
        page = requests.get('https://s.cafef.vn/bao-cao-tai-chinh/' + cong_ty + '/BSheet/' + strYear + '/0/0/0/bao-cao-tai-chinh-cong-ty-co-phan-nhua-an-phat-xanh.chn')
        soup = BeautifulSoup(page.content, 'html.parser')
        # Get the balance-sheet table of the page
        tenMienAll = soup.find_all('table', id="tableContent")
        if len(tenMienAll) > 0:
            full_dulieu = tenMienAll[0].find_all('td', class_="b_r_c")
            for i in range(len(full_dulieu)):
                # "2. Nợ dài hạn" is the long-term debt row
                if full_dulieu[i].getText().strip() == "2. Nợ dài hạn":
                    nodaihan = full_dulieu[i + 4].getText().strip()
                    if len(nodaihan) == 0:
                        print("Empty")
                        data["NoDaiHan"].append(-1)
                    else:
                        print("----NoDaiHan-----", nodaihan)
                        data["NoDaiHan"].append(nodaihan)
    # The child process builds and prints the DataFrame itself,
    # since it does not share memory with the parent
    database = pd.DataFrame.from_dict(data)
    print(database)

if __name__ == "__main__":
    # creating the process
    start = time.time()
    p1 = multiprocessing.Process(target=crawl, args=(listCompany, 2013))
    # starting the process
    p1.start()
    # wait until the process is finished
    p1.join()
    end = time.time()
    print('=====================TAKES========================= ')
    print(end - start)

Here is the output of the process:

2013 AAA ----NoDaiHan----- 83,082,921,652 
2013 AAM ----NoDaiHan----- 3,093,285,071
2013 X18 Empty
2013 HPG ----NoDaiHan----- 2,346,896,440,179
2013 FLC ----NoDaiHan----- 127,449,555,740
2013 QBS ----NoDaiHan----- 589,000,000
2013 NKG ----NoDaiHan----- 579,385,070,260
2013 VHM Empty
CongTy Year NoDaiHan
0 AAA 2013 83,082,921,652
1 AAM 2013 3,093,285,071
2 X18 2013 -1
3 HPG 2013 2,346,896,440,179
4 FLC 2013 127,449,555,740
5 QBS 2013 589,000,000
6 NKG 2013 579,385,070,260
7 VHM 2013 -1
=====================TAKES=========================
8.48030161857605

Using just one extra process, the time it takes drops by half, from 16 seconds to around 8 seconds. We can make the process even faster by using more than one process. For example, on a typical computer we can run 8 processes at a time, which greatly reduces the time it takes to complete the task.
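
As a rough sketch of that idea (not from the original code), here is one way to spread the companies across a pool of 8 workers with multiprocessing.Pool; the crawl_one helper is my own illustration. Note that separate processes do not share the data dictionary, so each worker returns its own row and the parent assembles the DataFrame:

import time
import multiprocessing
import pandas as pd
import requests
from bs4 import BeautifulSoup

listCompany = ["AAA", "AAM", "X18", "HPG", "FLC", "QBS", "NKG", "VHM"]

def crawl_one(cong_ty, year="2013"):
    # Scrape one company and return one row; processes do not share
    # memory, so we return the data instead of appending to a global dict
    page = requests.get('https://s.cafef.vn/bao-cao-tai-chinh/' + cong_ty + '/BSheet/' + year + '/0/0/0/bao-cao-tai-chinh-cong-ty-co-phan-nhua-an-phat-xanh.chn')
    soup = BeautifulSoup(page.content, 'html.parser')
    tables = soup.find_all('table', id="tableContent")
    nodaihan = -1
    if len(tables) > 0:
        cells = tables[0].find_all('td', class_="b_r_c")
        for i in range(len(cells)):
            if cells[i].getText().strip() == "2. Nợ dài hạn":
                value = cells[i + 4].getText().strip()
                if len(value) > 0:
                    nodaihan = value
    return {"CongTy": cong_ty, "Year": year, "NoDaiHan": nodaihan}

if __name__ == "__main__":
    start = time.time()
    # 8 worker processes scrape the companies in parallel
    with multiprocessing.Pool(processes=8) as pool:
        rows = pool.map(crawl_one, listCompany)
    database = pd.DataFrame(rows)
    print(database)
    print("Takes:", time.time() - start)

With one worker per company, the total time approaches the time of the slowest single request instead of the sum of all of them.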

So that is a small article discussing a way to improve your web scraping. I hope you enjoyed it! Thank you so much!
