Web scraping E-commerce sites to compare prices with Python — Part 2

Wilson Wong
6 min read · Oct 28, 2019

--

In Part 1 of this two-part series on web scraping e-commerce sites for price comparison, we explored the use of the Selenium python package to automate the process of scraping product names and prices from the Lazada website.

In Part 2, we will continue the scraping exercise, this time on the Shopee website. Rather than repeat the steps articulated in Part 1, I will focus on the specific challenges the Shopee website posed, and along the way I will also introduce an alternative to Selenium which worked better.

Let’s get straight into it.

Scraping the Shopee website wasn't as simple with the Selenium tool, and I've highlighted four additional complexities the Shopee website had that the Lazada website didn't:

  1. Popup alerts (additional complexity = low)

The first issue we encounter is the popup alerts that appear when you conduct the search:

We could automate the clicking away of popup boxes with Selenium with the following script:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(browser, 20).until(EC.element_to_be_clickable(
    (By.XPATH, "//div[@class='shopee-modal__container']//button[text()='English']"))).click()

2. Multiple prices for the same item (additional complexity = low)

We also find that sometimes in the Shopee search results a single item may have two different price figures with the same class name. The different prices reflect a price range where the item has a volume discount:

The class selector is pretty unreliable here!

With Selenium, we can target the exact figure we want by using an XPath selector that picks out only the second span element, which holds the first of the two figures:

product_prices = browser.find_elements_by_xpath('//a/div/div[2]/div[2]/div[@*]/span[2]')

3. Search returns 50 items per page but only 15 were scraped (additional complexity = high)

The Shopee website is a dynamic website, where the page elements appear dynamically only when scrolling down the page. This isn’t unusual as it allows the page to load quicker without having to immediately load all the elements (Facebook also operates the same way).

Not what I was hoping for. Also notice the empty strings in the first list. We’ll get to that later.

Nonetheless, this requires us to automate the scrolling to the bottom of the page just as you would do manually, with short wait times for all the page elements to appear.

Selenium also allows for automation of browser scrolling, but the script for this particular automation can be lengthy: you need to replicate the manual process of scrolling down a little, waiting a few seconds for the page elements to appear, and repeating until you reach the end of the page.

We can write the script as follows:

import time

scroll_pause_time = 1

while True:
    last_height = browser.execute_script("return document.body.scrollHeight")

    # Scroll down a little, then wait for the new page elements to appear
    browser.execute_script("window.scrollTo(0, window.scrollY + 500);")
    time.sleep(scroll_pause_time)
    new_height = browser.execute_script("return document.body.scrollHeight")

    if new_height == last_height:
        # Height unchanged: scroll and check once more to confirm we are at the bottom
        browser.execute_script("window.scrollTo(0, window.scrollY + 500);")
        time.sleep(scroll_pause_time)
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
    else:
        last_height = new_height
        continue

Already we can see that the code is getting much more complex, and the automation process is also getting slower with the additional pause times.

All 50 elements scraped, finally!

4. The item name elements cannot be selected

As you can see earlier, the item names are not selectable, even though the elements can be identified with either the class or XPath selectors and can be seen with the Chrome inspect tool. Because of this, the elements returned by find_elements yield empty strings instead of the item names.

Here, admittedly, I hit a stumbling block. A closer inspection revealed that the CSS property user-select is set to none, meaning that users cannot select the text.

We'd need to write some JavaScript code to manipulate the CSS property, and JavaScript is a language I'm currently very unfamiliar with.
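For the curious, here is a rough sketch of what such a workaround might look like: reading each element's textContent through the driver's JavaScript executor instead of relying on Selenium's .text property. The XPath below is illustrative rather than the exact selector for Shopee's markup:

# Sketch only: reading textContent via JavaScript sidesteps the user-select: none restriction
name_elements = browser.find_elements_by_xpath('//div[@data-sqe="name"]')  # illustrative XPath
product_names = [
    browser.execute_script("return arguments[0].textContent;", el)
    for el in name_elements
]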

Fortunately, I found an easier way to scrape the Shopee website: using Shopee's API to query the search results.

I was incredibly lucky to come across this on the web. Not all sites will have (or will share) an API. As Shopee allows you to use their API to fetch product details directly, it is much easier to do this than to automate the scraping process with Selenium, and it takes only the following short block of code:

import requests

Shopee_url = 'https://shopee.com.my'
keyword_search = 'Nescafe Gold refill 170g'

headers = {
    'User-Agent': 'Chrome',
    'Referer': '{}/search?keyword={}'.format(Shopee_url, keyword_search)
}
url = ('https://shopee.com.my/api/v2/search_items/?by=relevancy'
       '&keyword={}&limit=100&newest=0&order=desc&page_type=search').format(keyword_search)

# Shopee API request
r = requests.get(url, headers=headers).json()

# Shopee scraping script
titles_list = []
prices_list = []
for item in r['items']:
    titles_list.append(item['name'])
    prices_list.append(item['price_min'])

Next, we will create a pandas dataframe to organize all this data:

import pandas as pd

dfS = pd.DataFrame(zip(titles_list, prices_list), columns=['ItemName', 'Price'])

Printing the output of the dataframe produces the following result:

Notice that the dataset contains a few other random items. Shopee includes these as advertisements within the search results, which is odd because they are completely unrelated to the search!

As with the Lazada dataset, we will also need to do some cleaning on this dataset. The main things we need to do are the following:

  1. Transform the price column from integer type into a two decimal float type
  2. Remove unrelated entries from the dataset (I’m looking for coffee, not collagen eye masks!)
  3. Remove the twin packs
import re

# Shopee's API returns prices multiplied by 100,000, so convert the column to a float in RM
dfS['Price'] = dfS['Price'] / 100000

# Remove false entries i.e. those which are not actually Nescafe Gold Refill 170g
dfS = dfS[dfS['ItemName'].str.contains('170g') == True]  # Poor search function Shopee!!!

# Some of the items are actually x2 packs. Remove them too
dfS = dfS[dfS['ItemName'].str.contains(r'[2x\s]{3}|twin', flags=re.IGNORECASE, regex=True) == False]

Now let's combine the Lazada and Shopee datasets! We do this using the pandas concat function:

# Add a ['Platform'] column for each platform
dfL['Platform'] = 'Lazada'
dfS['Platform'] = 'Shopee'

# Concatenate the dataframes
df = pd.concat([dfL, dfS])

Now we finally get to compare between the two platforms. We can print the dataframe statistical features using the describe method:

print(df.groupby(['Platform']).describe())

We will plot the data using the same box plot we created in Part 1:

import seaborn as sns
import matplotlib.pyplot as plt

sns.set()
_ = sns.boxplot(x='Platform', y='Price', data=df)
_ = sns.boxplot(x='Platform', y='Price', data=df) if False else _  # no-op guard removed below
_ = plt.title('Comparison of Nescafe Gold Refill 170g prices between e-commerce platforms in Malaysia')
_ = plt.ylabel('Price (RM)')
_ = plt.xlabel('E-commerce Platform')

# Show the plot
plt.show()
Nescafe Gold refill 170g is indeed cheaper on Shopee

And there you have it! Based on this single-item comparison, it does seem that Shopee is the cheaper platform (and it also had more listings for the item).

A few notes before I finish off:

a) It's quite useful to conduct a price comparison between different time periods, so as to analyze the price trend of a particular item. To do this we can add a datetime column and save the dataframe to a csv file.

# Add a timestamp column (do this before concatenating so df carries it too)
dfL['datetime'] = pd.Timestamp.today()
dfS['datetime'] = pd.Timestamp.today()

# Save the dataframe to a csv file (replace ':' so the timestamp makes a valid filename)
timestamp = str(pd.Timestamp.today()).replace(":", ".")
df.to_csv('PriceComparison_{}.csv'.format(timestamp))

b) Although you can scrape other items by simply changing the keyword_search variable, you may need to clean the dataset differently from the example shown here (see the sketch after these notes).

c) This example involves a small dataset, so the scraping and cleaning exercise was relatively quick.
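As a rough illustration of note b), the API call from earlier could be wrapped into a small helper that takes the search keyword as a parameter. This is only a sketch: the function name, the example keyword and the return format are my own choices, and the endpoint and headers are simply reused from the code above.

def scrape_shopee(keyword):
    """Query the Shopee search API for a keyword and return a dataframe (sketch only)."""
    headers = {
        'User-Agent': 'Chrome',
        'Referer': 'https://shopee.com.my/search?keyword={}'.format(keyword)
    }
    url = ('https://shopee.com.my/api/v2/search_items/?by=relevancy'
           '&keyword={}&limit=100&newest=0&order=desc&page_type=search').format(keyword)
    r = requests.get(url, headers=headers).json()
    # Convert prices to RM, as in the cleaning step above
    return pd.DataFrame(
        [(item['name'], item['price_min'] / 100000) for item in r['items']],
        columns=['ItemName', 'Price'])

# Hypothetical usage with a different product (clean the result as appropriate for that item)
# df_milo = scrape_shopee('Milo refill 1kg')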

--

Wilson Wong

An aspiring polymath interested in data science, automation and artificial intelligence. Didn’t turn out as intended after graduating from law school in 2009.