Multiple Page Shopee Web Scraping Using Selenium and Python (November 2022)

Bimo Widyatamoko
6 min read · Nov 29, 2022


Photo by Sai Kiran Anagani on Unsplash

Nowadays web scraping is a very common task for many purposes, such as marketing, competitor analysis, or plain research. Basically, it is just a way to “copy and paste” a lot of what is on your screen in an automated way, so you don’t have to do the labour-intensive work yourself.

Before we start, I want to introduce Selenium and explain why I chose it as my tool of choice. Here are some of Selenium’s features:

  • Easy to read and code
  • Open source
  • Fast in execution
  • Can run tests across different browsers
  • Automates the browser easily
  • Beginner friendly

Selenium is a Python library that makes it easy to scrape dynamic web pages. It is also used for web automation and testing; scraping data from the web is only a small part of what it can do.

In this article I will scrape the shopee.co.id website, paginate through the search results, and export the data to CSV.

Some of the data that I will retrieve:

  1. Title (product title)
  2. Price (normal or ranged prices)
  3. Sales (sales amount)
  4. Link (product link)

Shopee search page

Installation:

pip install selenium

Here are the libraries I used in the project:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
import time
import bs4
import pandas as pd

WebDriver is the automation tool used to drive web browsers for testing web applications. It lets you interact with the various components of a web page: entering user input, clicking buttons, executing JavaScript, and more. Selenium supplies web drivers for all the browsers it supports, but I find that Selenium works best with Chrome.

You need a web driver before you can run Selenium. Visit the Selenium documentation, choose your browser, then download the latest stable version of its web driver. Once it has downloaded, move the web driver to a convenient location, or for simplicity simply move it to the folder you are working in.
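
If you are on Selenium 4, the driver path is passed in through a Service object instead of the older executable_path argument used further below (which still works, with a deprecation warning, on early Selenium 4 and on Selenium 3.x). A minimal sketch, assuming chromedriver sits in the folder you are working in:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

service = Service('./chromedriver')          # path to the downloaded web driver
driver = webdriver.Chrome(service=service)   # options can be passed here as well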

Define some input:

cari= input("keyword pencarian:") #search keyword
hal= input("berapa halaman:") #how many pages to scrap
nama_file= input("nama file(sertakan.csv):") #output filename

Some options that make the driver easier to work with:

#webdriver option
opt= webdriver.ChromeOptions()
opt.add_argument('--no-sandbox') #Disables the sandbox for all process types that are normally sandboxed. Meant to be used as a browser-level switch for testing purposes only.
opt.add_argument('--headless') #Run in headless mode, i.e., without a UI or display server dependencies.
opt.add_argument('--disable-notifications')#Disables the Web Notification and the Push APIs.

Here is an article where you can discover more about WebDriver options and switches.

Basic code:

driver = webdriver.Chrome(executable_path='driver path',options=opt)#setting up webdriver and option
driver.get('https://shopee.co.id/') #target website
time.sleep(1) #pauses to fully load the page and prevents the web from detecting us as a bot

Do a search:

#search
search = driver.find_element(By.XPATH,'//*[@id="main"]/div/header/div[2]/div/div[1]/div[1]/div/form/input')
search.send_keys(cari)
search.send_keys(Keys.ENTER)
time.sleep(5)

First we have to find the required element, namely the search bar. Selenium offers several ways to locate elements on a web page, such as XPATH, CSS selectors, class name, or tag name.

We use send_keys to type the keyword entered earlier and Keys.ENTER to run the search.
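
For illustration, here are a few other ways you could locate the same search bar; the class name below is an assumption for illustration only, so always confirm it with inspect element first:

# the class name 'shopee-searchbar-input__input' is an assumption, check it with inspect element
search = driver.find_element(By.CSS_SELECTOR, 'input.shopee-searchbar-input__input')
search = driver.find_element(By.CLASS_NAME, 'shopee-searchbar-input__input')
search = driver.find_element(By.TAG_NAME, 'input')  # only safe if the page has a single <input>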

Make preparations for fetching data from the page:

# zoom out
driver.execute_script("document.body.style.zoom='10%'")
time.sleep(2)

# blank var to contain the page data
data=str()

Because Shopee only renders products as you scroll, we can work around this by zooming out with execute_script, a feature that lets us run JavaScript manually.
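
If zooming out is not enough on your machine, another option (not used in this article) is to scroll the page down in steps with the same execute_script feature; the step count and pause length below are arbitrary choices:

# scroll down in ten steps so lazily loaded products get a chance to render
for _ in range(10):
    driver.execute_script("window.scrollBy(0, document.body.scrollHeight / 10);")
    time.sleep(0.5)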

We also have to set up an empty variable to hold all the data from the pages we will fetch later. Notice that I used str(), because the page sources we will be appending to it are strings as well.

Collect data across multiple pages:

# pagination
for k in range(int(hal)):
    # fetch all the data on the page
    data += driver.page_source
    time.sleep(5)

    # navigate to the next page
    next_button = driver.find_element(By.CSS_SELECTOR, 'button.shopee-icon-button.shopee-icon-button--right')
    driver.execute_script("arguments[0].click();", next_button)
    time.sleep(5)

driver.close()

The method I use is to grab the entire page with driver.page_source and keep appending it to data as each new page loads. The for-loop lets us repeat this for the number of pages given as input.

This time I’m using a CSS selector to find the next-page button. In this section I can’t use the .click() method directly, for a few reasons (an explicit-wait alternative is sketched after this list):

  1. The HTML content is rendered with the button disabled.
  2. The Selenium WebDriver script runs before the JavaScript onload event has been triggered (or finished executing), so button.click() would land on a disabled element and nothing would happen.
  3. The JavaScript onload event then triggers (or finishes executing) and the JavaScript enables the button.
  4. Looking at the page, I couldn’t figure out why my code wasn’t working, because the button appeared to be enabled on inspection, and if I clicked it manually, it worked.
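
As an aside, the more common way to handle this kind of timing problem is an explicit wait; a minimal sketch (not the approach used in this article) would look like this:

# wait until the next-page button is actually clickable before clicking it
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)  # wait up to 10 seconds
next_button = wait.until(EC.element_to_be_clickable(
    (By.CSS_SELECTOR, 'button.shopee-icon-button.shopee-icon-button--right')))
next_button.click()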

Creating yet another empty variable:

# empty variables to concatenate data
data_dict_list = []

Pull the data using BeautifulSoup:

# parse the data for all products
soup = bs4.BeautifulSoup(data, 'html.parser')
all_product = soup.find_all('div', {'class': "col-xs-2-4 shopee-search-item-result__item"})

In this section I use BeautifulSoup to parse the HTML data, which simply arranges it so that it can be navigated and read easily.

Using find_all, I can find every <div> with the class name shown in the code above (to find the class name you can use inspect element), similar to using find_elements in Selenium.

Inspect element to get the class information
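
As a quick optional sanity check, you can confirm how many product cards were collected before parsing them any further:

# find_all returns a list of Tag objects, one per product card
print(len(all_product))           # total number of products collected
print(all_product[0].prettify())  # peek at the first card's HTML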

Enter the data for each product into the list:

# organize and tidy up the data for each product
for product in all_product:
    title_element = product.find('div', {'class': 'ie3A+n bM+7UW Cve6sh'})
    title_text = title_element.text

    price_element = product.find('div', {'class': 'hpDKMN'})
    price_text = price_element.text

    sales_element = product.find('div', {'class': 'r6HknA uEPGHT'})
    if sales_element is None:
        sales_text = None
    else:
        sales_text = sales_element.text

    product_link_element = product.find('a')
    product_link = product_link_element.get('href')

    # append data in dict()
    data_dict = dict()
    data_dict['title'] = title_text
    data_dict['price'] = price_text
    data_dict['sales'] = sales_text
    data_dict['link'] = product_link
    data_dict_list.append(data_dict)

In this section we loop over every product in all_product. We use find because each product contains only one element of each class holding the data we are looking for, and we use .text to retrieve the text itself.

Pay attention to the sales section: I use an if-else statement to avoid errors, because not all products have sales data.

Then we put all the data for each product into a dict and append it to data_dict_list.

Finally, export to csv:

#convert it into dataframe using pandas   
data_df = pd.DataFrame(data_dict_list)
#convert into csv
data_df.to_csv(nama_file,index=False,sep=';')

Before the data can be converted to CSV, it has to be converted into a DataFrame first.

DataFrame output

Then we can write it out as CSV so that it can be processed more easily in Excel or other tools.

Using Excel for a more familiar way of crunching the numbers

So this is one way to do web scraping that works for many types of websites. From here you can adapt it to your own needs and to whatever data you want to retrieve.

The next project is to make the scraping results directly usable: cleaning the data with regular expressions and filling in the blanks, so that statistics can be run right away and interesting conclusions drawn.
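
As a preview, a minimal sketch of that kind of cleaning might look like the following; the file name is a placeholder and the exact price/sales formats are assumptions, so adjust them to your own output:

import re
import pandas as pd

# read the exported file back in ('hasil.csv' is a placeholder name)
df = pd.read_csv('hasil.csv', sep=';')

# "Rp125.000" or "Rp10.000 - Rp20.000" -> 125000 / 10000 (keep the lower bound)
df['price_clean'] = (
    df['price'].astype(str)
    .apply(lambda s: int(re.sub(r'\D', '', s.split('-')[0]) or 0))
)

# "1,2RB Terjual" / "500 Terjual" -> 1200 / 500; missing sales become 0
def parse_sales(s):
    if pd.isna(s):
        return 0
    s = str(s).upper().replace(',', '.')
    match = re.search(r'([\d.]+)\s*(RB)?', s)
    if not match:
        return 0
    value = float(match.group(1))
    return int(value * 1000) if match.group(2) else int(value)

df['sales_clean'] = df['sales'].apply(parse_sales)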

The complete code is available on my GitHub, in the test1.py file; it can be run directly.

Terima kasih dan selamat mencoba! (Thanks and good luck!)
