Web scraping E-commerce sites to compare prices with Python — Part 1

Wilson Wong
Jun 27, 2019 · 7 min read


I’ve been told frequently that between the two major e-commerce platforms in Malaysia (Lazada and Shopee), one is generally cheaper and attracts bargain hunters while the other generally caters to the less price sensitive.

Well, I’ve decided to find out myself… in a battle of the e-commerce platforms!

To do so I’ll be writing a Python script using Selenium and the Chrome web driver to automate the scraping process and build our dataset. Here, we will be scraping for the following:

  • Product name; and
  • Product price

I will then conduct some basic data analysis using Pandas on the dataset we have scraped. As part of this exercise, some data cleaning will also be required and at the end of the exercise I will be presenting the price comparison on a simple visual chart using Matplotlib and Seaborn.

Between the two platforms, I’ve found the Shopee website more difficult to scrape for data for a couple of reasons: (1) it contains annoying popup boxes which appear when entering the page; and (2) the website class elements are not as well defined (some elements have multiple classes).

For this reason we will start with scraping the Lazada website first, and then we will deal with Shopee in Part 2!

First, we import the necessary packages:

# Web scraping
from selenium import webdriver
from selenium.common.exceptions import *

# Data manipulation
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

We then initialize the global variables, in this case:

  1. the path of the Chrome webdriver;
  2. the website url; and
  3. the item we want to search for.

webdriver_path = 'C://Users//me//chromedriver.exe' # Enter the file directory of the Chromedriver
Lazada_url = 'https://www.lazada.com.my'
search_item = 'Nescafe Gold refill 170g' # Chose this because I often search for coffee!

Next, we will fire up the Chrome browser. We will do so with some custom options:

# Select custom Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('start-maximized')
options.add_argument('disable-infobars')
options.add_argument('--disable-extensions')

# Open the Chrome browser
browser = webdriver.Chrome(webdriver_path, options=options)
browser.get(Lazada_url)

A little bit about the options. The '--headless' argument allows you to run the script with the browser operating in the background. Normally I would recommend leaving this argument out of your Chrome options, so that you can see the automation in action and identify bugs more easily. The downside to this is that it's less efficient, of course!

The other arguments, 'start-maximized', 'disable-infobars' and '--disable-extensions', are added to ensure smoother operation of the browser (extensions that interfere with webpages can especially derail the automation process).

Running this short block of code will open the browser.

Once the browser is opened, we will need to automate the searching of the item. The Selenium tool allows you to find browser HTML elements using various methods including the id, class, CSS selectors, and also XPath which is an XML path expression.
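
For illustration, a few of these locator methods look like this (the selector values below are placeholders for illustration, not actual Lazada elements):

# Different ways of locating elements with the Selenium 3-style API used in this post.
# The selector values are placeholders for illustration only.
element_by_id    = browser.find_element_by_id('some-id')                      # by id attribute
element_by_class = browser.find_element_by_class_name('some-class')           # by a single class name
element_by_css   = browser.find_element_by_css_selector('div.card a')         # by CSS selector
element_by_xpath = browser.find_element_by_xpath('//button[@type="submit"]')  # by XPath expression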

But how do you identify which elements to find? An easy way to do this is to use Chrome’s very own inspect tool:

You can open the inspect tool with CTRL+SHIFT+I. Use the element selector (circled in red) to hover over the elements you want to find. Here we can see that the search bar has an id of 'q' (seen within the red box).

search_bar = browser.find_element_by_id('q')
search_bar.send_keys(search_item).submit()

If you did not use the '--headless' argument in your Chrome options, you will see the browser open nicely, the keywords being typed out and the search submitted. Looks like we have 9 items here!

Okay so that’s the easy part. Now comes the part which can be challenging, and even more so if you’re trying to scrape from the Shopee website!

To figure out how you would scrape the item names and prices from Lazada, imagine how you would do it manually. You might do this:

  1. Copy each item name and its price onto a spreadsheet table;
  2. Go to the next page and repeat the first step until you've reached the last page.

That’s exactly how we would do it as well in this automation process! To do so, we will need to find the elements containing the item names and prices, and also the next page button.

Using the same Chrome inspect tool, we can see that the product titles and prices have the class names ‘c16H9d’ and ‘c13VH6’ respectively. It’s important to check that the same class names apply to all the items on the page, in order to ensure successful scraping of all the items on the page.

item_titles = browser.find_elements_by_class_name('c16H9d')
item_prices = browser.find_elements_by_class_name('c13VH6')
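
Before unpacking the results, a quick sanity check (not part of the original flow, just a simple confirmation that both class names matched one element per product) can be done as follows:

# Quick sanity check: both lists should have one entry per product on the page
print(len(item_titles), len(item_prices))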

Next, we unpack the item_titles and item_prices variables onto lists:

# Initialize empty lists
titles_list = []
prices_list = []

# Loop over the item_titles and item_prices
for title in item_titles:
    titles_list.append(title.text)

for price in item_prices:
    prices_list.append(price.text)

Printing both lists shows the following output:

[‘NESCAFE GOLD Refill 170g x2 packs’, ‘NESCAFE GOLD Original Refill Pack 170g’, ‘Nescafe Gold Refill Pack 170g’, ‘NESCAFE GOLD Refill 170g’, ‘NESCAFE GOLD REFILL 170g’, ‘NESCAFE GOLD Refill 170g’, ‘Nescafe Gold Refill 170g’, ‘[EXPIRY 09/2020] NESCAFE Gold Refill Pack 170g x 2 — NEW PACKAGING!’, ‘NESCAFE GOLD Refill 170g’] [‘RM55.00’, ‘RM22.50’, ‘RM26.76’, ‘RM25.99’, ‘RM21.90’, ‘RM27.50’, ‘RM21.88’, ‘RM27.00’, ‘RM26.76’, ‘RM23.00’, ‘RM46.50’, ‘RM57.30’, ‘RM28.88’]

Once we’re done scraping from this page, let’s move on to the next page. Again here we will use the find_element method, but this time using XPath. Using XPath is necessary here because the next page button has two classes, and the find_element_by_class_name method only finds elements from a single class.

It's also important to note that we need to tell the browser what to do if the next page button is disabled (that is, if the results fit on a single page or if we've reached the last page of results).

try:
    browser.find_element_by_xpath('//*[@class="ant-pagination-next" and not(@aria-disabled)]').click()
except NoSuchElementException:
    browser.quit()

Here we’ve instructed the browser to close if the button is disabled. If it’s not disabled, it will proceed to the next page and we will then need to repeat the scraping process.
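
Putting the pieces together, a minimal sketch of that scrape-and-paginate loop might look like this (assuming the same class names and pagination XPath as above, with a simple fixed delay for page loads; this is a sketch, not the exact final script):

# Minimal sketch of a scrape-and-paginate loop
import time

titles_list = []
prices_list = []

while True:
    # Scrape the item names and prices on the current page
    for title in browser.find_elements_by_class_name('c16H9d'):
        titles_list.append(title.text)
    for price in browser.find_elements_by_class_name('c13VH6'):
        prices_list.append(price.text)

    # Move to the next page, or stop once the next button is disabled or missing
    try:
        browser.find_element_by_xpath('//*[@class="ant-pagination-next" and not(@aria-disabled)]').click()
        time.sleep(3)  # give the next page some time to load
    except NoSuchElementException:
        browser.quit()
        break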

Luckily for us here, the item I searched for has only 9 items which are all displayed on one page. So that’s the end of our scraping process!

We now begin to analyze the data we’ve scraped using Pandas. We start by converting the two lists into a dataframe:

dfL = pd.DataFrame(zip(titles_list, prices_list), columns=['ItemName', 'Price'])

Printing the dataframe shows that the scraping exercise was successful:

We got all 9 items scraped. But some cleaning is required!

While the dataset looks good, it isn't very clean. If you print the dataframe's information using the Pandas .info() method, it shows that the Price column is a string object rather than a float. This is expected, since each entry in the Price column contains the currency symbol 'RM' (Malaysian Ringgit). However, as long as the Price column is not an integer or float type, we will not be able to extract any statistics from it.
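
For reference, the check described above is simply:

# Inspect the dataframe - the Price column shows up as 'object' (string), not float
dfL.info()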

We will therefore need to remove the currency symbol and convert the entire column into a float type with the following method:

dfL['Price'] = dfL['Price'].str.replace('RM', '').astype(float)
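
With the column now numeric, summary statistics can be pulled directly, for example:

# Basic statistics are now possible on the numeric Price column
print(dfL['Price'].describe())  # count, mean, std, min, quartiles, max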

Awesome! However, there’s still some more cleaning to be done. You might have noticed an anomaly in the dataset. One of the items is actually a twin pack, which we will need to remove from our dataset.

Data cleaning is essential for any sort of data analysis and here we will weed out entries that we don’t want with the following:

# This removes any entry with 'x2' in its title
dfL = dfL[dfL['ItemName'].str.contains('x2') == False]

Although unnecessary here, you may also want to ensure that the items that appear are the items we specifically searched for. Sometimes other related products may appear in your search list, especially if your search term isn’t specific enough.

For example, if we had searched ‘nescafe gold refill’ instead of ‘nescafe gold refill 170g’, 117 items would have appeared instead of just the 9 we scraped earlier. The additional items aren’t the refill packs we were searching for, but rather capsule filter cups instead.

It's always important to be specific with your search terms! Just by being more specific, you can save a lot of time on data cleaning.

Nonetheless, it doesn’t hurt to filter your dataset again with your search term:

dfL = dfL[dfL['ItemName'].str.contains('170g') == True]

As a final touch, we will also create a column 'Platform' and assign 'Lazada' to each of the entries here. This is done so that we can group the entries by platform (Lazada and Shopee) when we later conduct the price comparison between the two.

dfL['Platform'] = 'Lazada'

Et voila! Our dataset is finally clean and ready!
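
To give a sense of how the 'Platform' column will be used later, here is a rough sketch of how the two dataframes could be stacked in Part 2 (dfS is just a hypothetical name for the cleaned Shopee dataframe, which we haven't built yet):

# Hypothetical sketch for Part 2: dfS would be the cleaned Shopee dataframe,
# built the same way with 'ItemName', 'Price' and 'Platform' columns.
# df = pd.concat([dfL, dfS], ignore_index=True)     # stack both platforms
# df.groupby('Platform')['Price'].describe()        # compare price statistics side by side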

Now it’s time to visualize our data, with Matplotlib and Seaborn. We will be using a box plot, as it uniquely represents all the following key statistical features (also known as the five number summary) in one chart:

  • Lowest price
  • Highest price
  • Median price
  • 25th and 75th percentile price
# Plot the chart
sns.set()
_ = sns.boxplot(x='Platform', y='Price', data=dfL)
_ = plt.title('Comparison of Nescafe Gold Refill 170g prices between e-commerce platforms in Malaysia')
_ = plt.ylabel('Price (RM)')
_ = plt.xlabel('E-commerce Platform')

# Show the plot
plt.show()

Each box represents a Platform, and the y-axis shows the price range. Here we will only have one box, as we have yet to scrape and analyze any data from the Shopee website.

Seaborn's visualizations generally look better than Matplotlib's.

We can see that the prices of the item range between RM21–28, with the median price in between RM27–28. We can also see that the box has short ‘whiskers’, indicating that the prices are relatively consistent without any significant outliers. For more about interpreting box plots, here is a great summary!

And that’s it for the Lazada website! In Part 2, I will walk through the specific challenges when scraping the Shopee website and then we will plot another box plot for Shopee prices to complete our comparison!
