Web Scraping with Python in Indonesian E-Commerce (Tokopedia) Part 1: Getting the Data

Yohan Ardiansyah
6 min read · Jul 4, 2022


In 2006, British mathematician and Tesco marketing mastermind Clive Humby shouted from the rooftops, “Data is the new oil”.

And today we can see that data has become a powerful force that can influence the direction of the world. It can guide the next action a business needs to take, increase sales by recommending products that match a customer’s taste, power Artificial Intelligence that reduces manual work, and more.

In this article, we will study how to get data from an existing website, a practice usually called web scraping. As a case study, we will use Tokopedia, an Indonesian e-commerce platform.

The first step in web scraping is deciding what data we want to get. In this case, I want shoe (sepatu) data, sorted by reviews (ulasan).

Then let’s look at the website. A web page is built with a markup language named HTML, and we can get the data we need by searching through the page’s HTML thoroughly.

First, let’s open Tokopedia’s page at https://www.tokopedia.com.

Then let’s search for shoes in the search bar. “Shoes” in Indonesian is “sepatu”, so I will type the word “sepatu” into the search bar.

By default the results are sorted by relevance, so let’s sort by reviews instead by changing the “Urutkan” (sort) dropdown to “Ulasan” (reviews).

Then let’s examine the HTML by opening the browser’s inspect element tool and pointing at a product’s card.

We can see that the card has a class named css-y5gcsw. Then inside the card, we can see some information about the product.

I’m interested in the product’s name, price, city, and image URL, so let’s look at the HTML elements that hold this data.

We can see that we can get the name from the css-1b6t4dn class, the price from the css-1ksb19c class, the city from the css-1kdc32b class, and the image from the css-1c345mg class.
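To make this concrete, here is a simplified, hypothetical sketch of what one card’s markup might look like (the real page has far more wrappers and attributes, and these hashed class names tend to change whenever Tokopedia redeploys its styles, so always re-check them in the inspector):

<div class="css-y5gcsw">
  <img class="css-1c345mg" src="https://images.tokopedia.net/..." />
  <div class="css-1b6t4dn">Sepatu Sneakers Pria ...</div>
  <div class="css-1ksb19c">Rp150.000</div>
  <span class="css-1kdc32b">Jakarta Pusat</span>
</div>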

After recognizing the HTML of the page, then let’s create a script to get the data from the page.

Because Tokopedia builds its website with a JavaScript framework, the product data is rendered in the browser, so we will use a browser automation library named Selenium to get the data from the HTML. But of course, you need to install the library first, and we need a browser too. You can follow the installation of Selenium at this link, and don’t forget to use a Python virtual environment for this project. For the browser itself, we will use Firefox for the automation process.
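For reference, a typical setup on Linux/macOS looks something like this (assuming Python 3 is installed, and that geckodriver, which Selenium needs in order to drive Firefox, is on your PATH):

python3 -m venv venv
source venv/bin/activate
pip install selenium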

Then, let’s create a file named scraper.py as a place for the Scraper class to reside.

Let’s create a class named Scraper that will have the responsibility of getting the data from the website. In this class, let’s create a property named driver that will hold a Selenium WebDriver. The WebDriver is the object Selenium uses to create a session with the browser and communicate with it: if the driver commands the browser to open a certain page, the page will be opened in the browser. To create a WebDriver connected to Firefox, we instantiate the Firefox class from the webdriver module.

from selenium import webdriver

class Scraper:
    def __init__(self):
        # Start a Selenium-controlled Firefox session
        self.driver = webdriver.Firefox()

Then let’s create a function named get_data() to get the data from the website. For this purpose, we need the URL of the results page. If we look at the website again, we can see the URL is https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu.
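As an aside, the interesting parts of that query string are q (the search keyword) and ob=5 (which appears to correspond to the “Ulasan” sort option we picked). If you prefer to build the URL programmatically, a small sketch using Python’s standard library:

from urllib.parse import urlencode

params = {
    'st': 'product',
    'q': 'sepatu',   # the search keyword
    'ob': 5,         # appears to correspond to the "Ulasan" (review) sort
}
url = 'https://www.tokopedia.com/search?' + urlencode(params)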

Let’s make the driver command the browser to open that URL by calling the driver.get(url) method.

    def get_data(self):
        self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')

Then let’s create a counter for the current results page and a list to hold all the data.

        counter_page = 0
        datas = []

We will get the data up to page 10. On each page, we will make the driver command the browser to scroll to the bottom, because the page will not load its products unless we scroll through it. When I checked, the page is around 6,500 pixels tall, so we will scroll 500 pixels at a time. Between scrolls we wait 0.1 seconds so we don’t put too much load on the server at once.

        while counter_page < 10:
            # Scroll through the page so the lazy-loaded products render
            for _ in range(0, 6500, 500):
                time.sleep(0.1)
                self.driver.execute_script("window.scrollBy(0,500)")

After the scrolling loop, we get the card elements, iterate over all of them, extract the name, price, city, and image data, and finally append each record to the datas list.

            elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
            for element in elements:
                img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
                name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
                price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
                city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text

                datas.append({
                    'img': img,
                    'name': name,
                    'price': price,
                    'city': city
                })

After we get all the data from the current page, we can go to the next page by making the driver click the next page button. If we check the page’s HTML, we find that each page button has css-1ix4b60-unf-pagination-item as its class, and we can specify which button to click using the counter variable.

            counter_page += 1
            # Only click through while there are more pages to scrape;
            # without this guard we would click page 11 after scraping page 10.
            if counter_page < 10:
                next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
                next_page.click()
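One caveat: after the click, the browser needs a moment to load the next page before find_elements will see the new cards. The sleeps in the scroll loop usually cover this, but if you run into stale or missing elements, an explicit wait is more robust. A minimal sketch using Selenium’s WebDriverWait (the 10-second timeout is my own arbitrary choice):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait until a card from the old page goes stale (navigation happened),
# then until at least one card from the new page is present.
WebDriverWait(self.driver, 10).until(EC.staleness_of(elements[0]))
WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'css-y5gcsw'))
)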

And finally, return the data as the function’s return value.

        return datas

For the overall code, here is the complete scraper.py.

from selenium.webdriver.common.by import By
from selenium import webdriver
import time

class Scraper:
    def __init__(self):
        # Start a Selenium-controlled Firefox session
        self.driver = webdriver.Firefox()

    def get_data(self):
        self.driver.get('https://www.tokopedia.com/search?navsource=&ob=5&srp_component_id=02.01.00.00&srp_page_id=&srp_page_title=&st=product&q=sepatu')

        counter_page = 0
        datas = []

        while counter_page < 10:
            # Scroll through the page so the lazy-loaded products render
            for _ in range(0, 6500, 500):
                time.sleep(0.1)
                self.driver.execute_script("window.scrollBy(0,500)")

            # Collect every product card and extract its fields
            elements = self.driver.find_elements(by=By.CLASS_NAME, value='css-y5gcsw')
            for element in elements:
                img = element.find_element(by=By.CLASS_NAME, value='css-1c345mg').get_attribute('src')
                name = element.find_element(by=By.CLASS_NAME, value='css-1b6t4dn').text
                price = element.find_element(by=By.CLASS_NAME, value='css-1ksb19c').text
                city = element.find_element(by=By.CLASS_NAME, value='css-1kdc32b').text

                datas.append({
                    'img': img,
                    'name': name,
                    'price': price,
                    'city': city
                })

            counter_page += 1
            # Only click through while there are more pages to scrape
            if counter_page < 10:
                next_page = self.driver.find_element(by=By.XPATH, value="//button[@class='css-1ix4b60-unf-pagination-item' and text()='" + str(counter_page + 1) + "']")
                next_page.click()

        return datas
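One thing the script does not do yet is shut the browser down, so Firefox stays open after get_data() returns. If you want the scraper to clean up after itself, you could add a small method like this (the close() name is my own addition, not part of the original code):

class Scraper:
    # ... __init__ and get_data as above ...

    def close(self):
        # End the Selenium session and close the Firefox window
        self.driver.quit()

Calling scraper.close() at the end of main.py would then shut Firefox once the printing is done.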

And then let’s create a file named “main.py” to check the class functionality. Fill the file with this code.

from scraper import Scraper

if __name__ == "__main__":
    scraper = Scraper()

    datas = scraper.get_data()

    # Print every scraped product with a running index
    for index, data in enumerate(datas, start=1):
        print(
            index,
            data['name'],
            data['img'],
            data['price'],
            data['city']
        )

If we run the file with python main.py, Firefox will open and the browser will automatically navigate as the driver instructs it in our code. Then we can see the result in the terminal.

We can see that we got 700 products from the shoe search pages!

In the next step, we will try to present the data in a format other than printing directly to the terminal.

You can check the overall steps on my blog with this link, or the next step on Medium with this link.

If you want to discuss something just contact me on my LinkedIn.

Thank you very much and goodbye.
