How to Build a Web Scraping Application Using BeautifulSoup and Flask — Part I

https://hamzic.files.wordpress.com/2016/12/web-data-scraping.png

Web scraping is a term for the various methods used to extract or fetch data from a website. The data may be product data, weather data, auction data, etc. Data obtained this way can be exported into various formats such as JSON or a spreadsheet, or stored in a database.

Case Study
In this article we will build an application to retrieve the best-selling product data from one of the largest e-commerce sites in Indonesia, namely www.bukalapak.com. From that site we will take each item's name, price, image, link, and reviews. After that we will expose the data through a RESTful API.

This article is divided into two parts: the first explains how to scrape data using BeautifulSoup and save it into a list, and the second explains how to serve the data through a RESTful API.

Preparation
There are many tools we can use to build this web scraping application, but since I'm very familiar with Python, I'm using the BeautifulSoup Python library to build it.

To start this project it is a good idea to set up a virtual environment, so that the libraries we use don't conflict with those of other applications. Creating a virtual environment is very easy: just type the command python3 -m virtualenv .env in the Linux terminal. Make sure python3 and pip are properly installed; you can search the internet for instructions on installing them.

python3 -m virtualenv .env 
Using base prefix '/usr'
New python executable in /home/handry/.env/bin/python3
Also creating executable in /home/handry/.env/bin/python
Installing setuptools, pip, wheel...done.

The above command creates a virtual environment in the .env folder containing the Python core and all its supporting libraries. To activate the virtual environment, use the command:

source .env/bin/activate
(.env) handry@sangmadesu:~$

If successful, the terminal will display the virtual environment name at the beginning of the prompt, as shown above.

Install the required libraries
The next step after creating and activating the virtual environment is to install the supporting libraries. In this project I use BeautifulSoup to do the scraping, requests to fetch the URLs, and Flask as the web framework. All of them can be installed with one command:

pip install beautifulsoup4 Flask requests

Let’s Start
In this case we want to retrieve the best-selling product data from bukalapak.com, so let's visit it.

The first thing to know in scraping is the target link, and we have it: https://www.bukalapak.com/promo/serba-serbu-produk-terlaris-bukalapak?from=old-popular-section-3

After that we explore the HTML structure and the class attributes it uses. To do this we can use DevTools in Google Chrome. I won't explain how to use DevTools in this article; you can find guides for it online.

Devtools Overview

After examining the whole structure (a tiring job), we find a few things:
1. All products are in <div class="product-card">
2. The product name and link are in <a class="product__name">
3. The product image is in <img class="product-media__img">
4. The product price is in <span class="amount">
5. The product reviews are in <a class="review__aggregate">

Knowing this, we can take the data from the HTML elements above.
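
Putting those findings together, a single product card looks roughly like this (a simplified sketch reconstructed from the class names above; the real page has many more wrapper elements and attributes, and the values shown are placeholders):

<!-- simplified sketch, not the real markup -->
<div class="product-card">
  <img class="product-media__img" src="...">
  <a class="product__name" href="/p/...">Product name</a>
  <span class="amount">150.000</span>
  <a class="review__aggregate">25 reviews</a>
</div>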

Let’s Code
To start writing the code, you can use any text editor you like. I usually use PyCharm or Sublime Text, because I think both are powerful.

You can see all the code below.

All the code is available on my GitHub.

First we import the libraries we installed, then we create a function called ‘scrape’ that will hold the whole scraping process of our application.

from bs4 import BeautifulSoup
import requests

def scrape():
    ...

Then we create a variable ‘l’ containing an empty list to hold the final result.

    l = []
    for page in range(1, 4):  # pages 1, 2 and 3
        base_url = 'https://www.bukalapak.com/promo/serba-serbu-produk-terlaris-bukalapak?from=old-popular-section-3&page=' + str(page)

In the code above we loop over the page numbers so that each link carries its page number. If we debug by adding print(base_url), we get results like this:

https://www.bukalapak.com/promo/serba-serbu-produk-terlaris-bukalapak?from=old-popular-section-3&page=1
https://www.bukalapak.com/promo/serba-serbu-produk-terlaris-bukalapak?from=old-popular-section-3&page=2
https://www.bukalapak.com/promo/serba-serbu-produk-terlaris-bukalapak?from=old-popular-section-3&page=3

After that we make a request to the link and parse the HTML using requests and BeautifulSoup. In the ‘all_product’ variable we store the result of searching for every div tag that has the class="product-card" attribute on the page.

        r = requests.get(base_url)
        soup = BeautifulSoup(r.text, "html.parser")
        all_product = soup.find_all('div', class_="product-card")

After getting all the products we want, we iterate again to take each product's name, price, image, link, and reviews. But first we must define a variable ‘d’ of dict type to hold the data of each item.

        for item in all_product:
            d = {}  # holds the data of one product

Then we start extracting the product name, image, link, price, and reviews.

            product_name = item.find("a", {"class": "product__name"})
            product_link = 'https://www.bukalapak.com' + str(product_name.get('href'))
            product_name = product_name.text.replace('\n', "").strip()
            d['product_link'] = product_link
            d['product_name'] = product_name

The same approach is repeated to retrieve the other data; you just replace the tag name and its class attribute name. You can see the full code on my GitHub.
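
For example, the image, price, and reviews could be taken in the same way (a sketch following the pattern above; the dict keys and the None checks are my own additions, since some cards may not contain every element):

            # image, price and reviews, using the classes found with DevTools;
            # key names are illustrative, guard against missing elements
            product_image = item.find("img", {"class": "product-media__img"})
            d['product_image'] = product_image.get('src') if product_image else None

            product_price = item.find("span", {"class": "amount"})
            d['product_price'] = product_price.text.strip() if product_price else None

            product_review = item.find("a", {"class": "review__aggregate"})
            d['product_review'] = product_review.text.strip() if product_review else None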

And finally we append each data item into the variable ‘l’ using the append() method and return it with return l.

            l.append(d)

    return l

Then we call the function using this code.

if __name__ == "__main__":
    print(scrape())

If the code works, we will get data like this.

Result of data scraping
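
The screenshot of the output is not reproduced here, but the result is a list of dicts, roughly of this shape (all values below are placeholders for illustration):

[
    {
        'product_name': '...',
        'product_link': 'https://www.bukalapak.com/p/...',
        'product_price': '...',
        ...
    },
    ...
]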

Done… we have now successfully scraped data from a website and saved it into a list. In the second article I will write about how to present the data as JSON through a RESTful API using the Flask web framework.
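
As a tiny preview of that second part, the scrape() result could be exposed as JSON with Flask along these lines (a minimal sketch; the '/products' route name is my own choice, not necessarily what the final code uses):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/products')
def products():
    # run the scraper and send the list of dicts back as JSON
    return jsonify(scrape())

if __name__ == "__main__":
    app.run()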

If you want to see all the code, you can visit my GitHub. So… wait for my story in the next article, guys :)

Thanks.

Part II https://medium.com/@wahyudihandry/how-to-build-web-scraping-application-using-beautifulsoup-and-flask-part-2-99070aebf586
