Building a Simple Content-Based Recommendation System for eCommerce: Part I - Web Scraping

Joe Wilson
5 min read · Sep 9, 2021

By: Joe Wilson, a data science and machine learning enthusiast.

This tutorial is a two-part series covering web scraping and building a simple content-based machine learning recommendation system using product metadata.

Okay, let's say you want to purchase an item from a website. You enter the name of the item and, bam, other items are suggested to you. Take another scenario: you watch a movie on Netflix and, bam, other movies are suggested based on the movie you previously watched. Sound familiar? This isn't magic! The suggestions are made by a machine learning recommendation system running in the backend of the website or streaming service you're using.

The overarching goal of the tutorial is to walk through how I built a simple machine learning content-based product recommendation system. Part I (which we are dealing with here) covers web scraping for data collection. Building the recommendation system will be covered in Part II.

Before going any further, let me explain a few things:

A recommendation system predicts products a user (customer) might be interested in based on explicit and implicit data collected from the user. Explicit data are provided directly by users (for example, rating a product). The problem with collecting explicit data is that not many customers rate a product after using it (how often do you rate a movie on Netflix or a product after ordering from Amazon?). Implicit data are collected by observing users' behavior, such as time spent viewing a particular item or bookmarking the item. The goal of collecting implicit data is to make an inference from the observed behavior and predict the rating the user would assign to an item.
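To make the idea of inferring a rating from implicit signals concrete, here is a purely illustrative sketch. The dwell-time threshold, the bookmark bonus, and the 0-5 scale below are all invented for demonstration; real systems learn such mappings from data rather than hard-coding them:

```python
# Illustrative only: infer a rough 0-5 preference score from implicit signals.
# The scoring rule and thresholds are made up for demonstration.

def implied_score(seconds_viewed, bookmarked):
    """Map observed behavior (dwell time, bookmarking) to a 0-5 score."""
    score = min(seconds_viewed / 60, 4)   # up to 4 points for dwell time
    if bookmarked:
        score += 1                        # bookmarking adds a point
    return round(min(score, 5), 1)

print(implied_score(30, False))   # brief glance -> 0.5
print(implied_score(300, True))   # long view plus bookmark -> 5.0
```

A real recommender would treat such inferred scores the same way it treats explicit ratings when building its user-item matrix.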

There are different kinds of recommendation systems, including content-based, collaborative filtering, and hybrid systems. To learn more about each kind, click here.

This tutorial (Part I) is focused on web scraping for data collection, so let's go straight into that. In Part II, we will take a deep dive into the different kinds of recommendation systems, including the concepts applied in building a machine learning content-based recommendation system.

The data used for the entire tutorial were collected from The Whisky Exchange website. On the website (see picture below), there is a wide variety of whiskies. For the sake of this tutorial, the data are scraped from 6 pages of Japanese whiskies on www.thewhiskyexchange.com.

There are 6 features scraped from the Japanese whisky section of the website:

  1. Whiskey Name
  2. Price
  3. Brand
  4. Rating
  5. About (Description)
  6. Customer’s Review

The data were scraped from the website, then written and saved into a comma-separated values (CSV) file using Python in a Jupyter Notebook. Knowing how to write data into a CSV file in Python is especially important for web scraping tasks. See the code below for scraping the data and saving it into a CSV file.

First, there are key libraries in Python that need to be loaded (if you don't have the libraries, install them first for the code to work).

# Importing Key Libraries
from bs4 import BeautifulSoup
import requests
import pandas as pd
import csv
import matplotlib.pyplot as plt
%matplotlib inline

Now that the libraries are loaded, it's time to write the Python code to scrape the data from www.thewhiskyexchange.com and write the data into a CSV file.

baseurl = "https://www.thewhiskyexchange.com"

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}

It is good practice to use a user-agent in web scraping. Including a user-agent in the header of a request makes it easier for the server (the website you are requesting information from) to identify your client and respond appropriately. Without a proper user-agent in your request header, some websites will deny your request. You can learn more about user-agents, and find the appropriate one for your device, by clicking here.
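To see how the header travels with a request without actually hitting the network, you can build a prepared request with the requests library and inspect it (the URL here is just the site's base address; nothing is sent):

```python
import requests

# A browser-like User-Agent header for the request.
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_5) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) '
                         'Chrome/92.0.4515.107 Safari/537.36'}

# Build (but do not send) the request, then confirm the header is attached.
req = requests.Request('GET', 'https://www.thewhiskyexchange.com', headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'][:11])  # -> Mozilla/5.0
```

When you later call `requests.get(url, headers=headers)`, this same header is sent along with every request.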

Before beginning the web scraping task, it is good practice to inspect the pages of the website you want to scrape. Inspecting the pages allows you to identify where content is located in the HTML. To do this:

  1. If you're using a mouse, right-click on the page you want to collect the data from and select Inspect.
  2. Pay keen attention to things like "href", "div", "class", etc. (notice them in the code provided as well). There are many resources out there explaining in detail what each of these means.
  3. You should see something like the image below while inspecting a web page. You can apply this concept to any website you're interested in scraping.
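To see how those class names and href attributes map onto BeautifulSoup calls, here is a minimal, self-contained sketch. The HTML fragment is a made-up stand-in that imitates the structure of the product grid we inspect on the real site:

```python
from bs4 import BeautifulSoup

# A made-up HTML fragment imitating the product grid's structure.
html = '''
<ul>
  <li class="product-grid__item">
    <a class="product-card" href="/p/1234/example-whisky">Example Whisky</a>
  </li>
</ul>
'''

soup = BeautifulSoup(html, 'html.parser')
item = soup.find('li', {'class': 'product-grid__item'})       # locate by tag + class
link = item.find('a', {'class': 'product-card'}).get('href')  # read the href attribute
print(link)  # -> /p/1234/example-whisky
```

This is exactly the pattern the full scraping code below applies, page by page, to the real site.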

To begin, create empty lists to hold the data scraped from the website.

productlinks = []   # links to individual product pages
t = {}
data = []           # one dictionary per product
c = 0               # counter for progress printing

See the rest of the complete code for the web scraping task:

# Collect the link to every product across the 6 listing pages
for x in range(1, 7):
    k = requests.get('https://www.thewhiskyexchange.com/c/35/japanese-whisky?pg={}&psize=24&sort=pasc'.format(x), headers=headers).text
    soup = BeautifulSoup(k, 'html.parser')
    productlist = soup.find_all('li', {'class': 'product-grid__item'})

    for product in productlist:
        link = product.find('a', {'class': 'product-card'}).get('href')
        productlinks.append(baseurl + link)

# Open the CSV file once, before the scraping loop, so rows are not overwritten
csv_file = open('JapaneseWhisky_9.csv', 'w', newline='')
csv_writer = csv.writer(csv_file)
csv_writer.writerow(['name', 'price', 'brand', 'rating', 'about', 'review'])

for link in productlinks:
    f = requests.get(link, headers=headers).text
    soup = BeautifulSoup(f, 'html.parser')

    try:
        price = soup.find('p', {'class': 'product-action__price'}).text.replace('\n', '')
    except AttributeError:
        price = 'no price'

    try:
        about = soup.find('div', {'class': 'product-main__description'}).text.replace('\n', '')
    except AttributeError:
        about = 'no about'

    try:
        rating = soup.find('div', {'class': 'review-overview'}).text.replace('\n', '')
    except AttributeError:
        rating = 'no rating'

    try:
        name = soup.find('h1', {'class': 'product-main__name'}).text.replace('\n', '')
    except AttributeError:
        name = 'no name'

    try:
        review = soup.find('p', {'class': 'review-list__copy'}).text.replace('\n', '')
    except AttributeError:
        review = 'no review'

    try:
        brand = soup.find('ul', {'class': 'product-main__meta'}).text.replace('\n', '')
    except AttributeError:
        brand = 'no brand'

    whisky = {'name': name, 'price': price, 'brand': brand,
              'rating': rating, 'about': about, 'review': review}
    data.append(whisky)

    c = c + 1
    print('completed', c)
    csv_writer.writerow([name, price, brand, rating, about, review])

csv_file.close()

Two primary things happen in the code above (although each line of code is doing something):

  1. The data scraped from the website are stored in a Python list called "data."
  2. The data scraped from the website are also written to a CSV file on your local device (if you run the code) named "JapaneseWhisky_9.csv."

The line of code below converts the scraped data (the "data" list, not the CSV file) into a pandas DataFrame for analysis. The CSV file is already saved locally on your machine.

df = pd.DataFrame(data)

You can also load the CSV file (JapaneseWhisky_9.csv) into a pandas DataFrame using the line of code below:

df = pd.read_csv('JapaneseWhisky_9.csv')
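One optional follow-up worth knowing: the scraped price comes back as text (for example "£49.95", or "no price" when missing), so before any numeric analysis you will want to strip the currency symbol and convert the column. A hedged sketch, using a tiny hand-made DataFrame with invented values in place of the real file:

```python
import pandas as pd

# Stand-in rows shaped like the scraped data (values are invented).
df = pd.DataFrame({'name': ['Whisky A', 'Whisky B'],
                   'price': ['£49.95', 'no price']})

# Strip the currency symbol and coerce unparseable values to NaN.
df['price_num'] = pd.to_numeric(df['price'].str.replace('£', '', regex=False),
                                errors='coerce')
print(df['price_num'].tolist())  # -> [49.95, nan]
```

Using `errors='coerce'` means the 'no price' placeholders become NaN instead of raising an error, which makes them easy to filter out later.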

Finally, let’s check out the first fifteen rows of the data we just scraped from the website using the code below:

df.head(15)

You should see something that looks like the picture below.

This is how you scrape data from a website. Part II of the tutorial will focus on performing some exploratory data analysis (EDA) and applying some concepts from natural language processing (NLP) before building the actual recommendation system. I hope this was helpful, especially for those who are completely new to web scraping.


Joe Wilson

Joe is a data science and machine learning enthusiast and has cross-cutting expertise in international development, health economics and entrepreneurship.