From Web Scraping to Linear Regression Model

Fatma Zehra Sarı
İstanbul Data Science Academy
10 min read · Sep 6, 2022

The internet is full of data waiting to be harvested. One of the main tools for obtaining the specific data we want is the Beautiful Soup library. In this article I will explain the basic steps to extract data from our target website using the “requests” and “Beautiful Soup” libraries, build a basic machine learning model, and check the model’s accuracy.

Step 0: URL and HTML elements

Before starting to scrape, we need prior knowledge of the website’s structure, which is usually defined with HTML (HyperText Markup Language). HTML is a “markup” language because it uses a set of symbols to annotate text, images, and other content for display in a web browser. These symbols help us find where the data are. The “HyperText” in HTML refers to the links that connect web pages to one another.

I chose kitapyurdu.com, a popular online bookstore in Türkiye, as the website to focus on. The site has a search section for finding books and several sub-categories such as best sellers and new releases. We will extract the data from the annual best-sellers list in the literary-fiction (edebiyat) and non-literary sub-categories, which span 22 and 21 pages, respectively. If we look at the URLs of the first, second, third, etc. pages in the literary-fiction category, we can see which parts of the URL stay constant and which change. The only part that changes from page to page on kitapyurdu.com is the page number.

URL of the first page: https://www.kitapyurdu.com/index.php?route=product/best_sellers&page=1&list_id=18&filter_in_stock=1&filter_in_stock=1

URL of the second page: https://www.kitapyurdu.com/index.php?route=product/best_sellers&page=2&list_id=18&filter_in_stock=1&filter_in_stock=1

We’ll use this information to access all pages of the aforementioned sub-categories. Now let’s look at the HTML code that creates the web page. We can inspect a website using Developer Tools, which can be opened by right-clicking on the page and selecting the Inspect option. When we look at the Elements tab in the developer tools, we see a structure of clickable HTML elements.

HTML elements of the webpage

An HTML element starts with <tagname> and ends with </tagname>. Some of the basic tags are:

From https://www.codebrainer.com/blog/top-10-html-tags

After this brief introduction, we can start to scrape our website.

Step 1: Web Scraping via Beautiful Soup

To get the website’s HTML code into our Python script, we use the requests library. Before starting web scraping, we download and import this library along with other libraries we’ll employ.

import requests
from bs4 import BeautifulSoup as bts
import pandas as pd
import numpy as np
import re                      # used below to locate attribute labels such as "Yayın Tarihi"
from datetime import datetime  # used later to compute the time since publication

Now, we’ll define a function “getAndParseURL(url)” that takes the URL of the webpage and issues an HTTP GET request to the given URL. It retrieves the HTML data that the server sends back and stores that data in the “result” object.

The “result” object contains HTML elements and attributes scattered around. We can parse it to make the data more accessible using the Beautiful Soup (“bts”) library: “bts” takes the HTML content we scraped as its input and returns parsed HTML content.

def getAndParseURL(url):
    result = requests.get(url, headers={"User-Agent": "Chrome/103.0.0.0"})  # a browser-like User-Agent string (e.g. "Mozilla/5.0 ... Safari/537.36") also works
    soup = bts(result.text, "html.parser")
    return soup

The next step is accessing the URL of each book on each page. The URL of a book is located under a <div> with a class attribute whose value is “cover”. The <a> tag under this element defines a hyperlink that contains the book’s URL.

href attribute: the URL of the page the link goes to.

For each of the 22 pages (the literary-fiction sub-category of best sellers), we’ll use the “getAndParseURL(page)” function to get the parsed HTML content of the page and “findAll()” to access the links of the books inside it.
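The loop below iterates over a “pages” list, which is not shown in the article; a minimal sketch of how it can be built from the URL pattern we saw earlier (the variable name and string formatting are my own choices) is:

# Build the URLs of the 22 literary-fiction best-seller pages; only the "page"
# query parameter changes from one page to the next.
base_url = ("https://www.kitapyurdu.com/index.php?route=product/best_sellers"
            "&page={}&list_id=18&filter_in_stock=1")
pages = [base_url.format(i) for i in range(1, 23)]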

products = []
for page in pages:
    html = getAndParseURL(page)
    for product in html.findAll("div", {"class": "cover"}):
        products.append(product.a.get("href"))

print(products[:25])
Links of the first 25 books under the literary-fiction sub-category

We will employ “find()” to access the information inside other HTML tags. We can get the text in HTML code by writing “.text”, and remove the empty space at the beginning and at the end of the string with “.strip()”.

# First, parse the detail page of a single book (here, the first link we collected)
html = getAndParseURL(products[0])

name = html.find("h1",{"class":"pr_header__heading"}).text.strip()
print("Name of the Book: ", name)
author = html.find("a",{"class":"pr_producers__link"}).text.strip()
print("Author: ", author)
publisher = html.find("div",{"class":"pr_producers__publisher"}).text.strip()
print("Publisher: ", publisher)
release_date = html.find("div",{"class":"attributes"}).find(text=re.compile("Yayın Tarihi")).findNext().text.strip().replace(".","/")
print("Release Date: ", release_date)
purchase_info = html.find("div",{"class":"purchase-info"}).text.strip().replace(".","")
print("Purchase Info: ", purchase_info)
page_num = html.find("div",{"class":"attributes"}).find(text=re.compile("Sayfa Sayısı:")).findNext().text.strip()
print("Number of Pages: ", int(page_num))
cover = html.find("div",{"class":"attributes"}).find(text=re.compile("Cilt Tipi:")).findNext().text.strip()
print("Type of the Book Cover: ", cover)
paper_type = html.find("div",{"class":"attributes"}).find(text=re.compile("Kağıt Cinsi:")).findNext().text.strip()
print("Type of the Paper: ", paper_type)
rating = float(html.find("meta", itemprop="ratingValue").attrs["content"].replace(".",""))
print("Rate: ", rating)
rating_count = int(html.find("meta", itemprop="ratingCount").attrs["content"].replace(".",""))
print("Rating Count: ", rating_count)
review_count = int(html.find("meta", itemprop="reviewCount").attrs["content"].replace(".",""))
print("Review Count: ", review_count)
fav_count = int(html.find("span",{"id":"favorite-count"}).text.strip().replace(".",""))
print("Fav Count: ", fav_count)
to_read_list = html.find("li",{"class":"readlists__item"}).text.strip("\n").strip().replace(".","")
print("To-Read List Count: ", to_read_list)
price = float(html.find("div",{"class":"price__item"}).text.strip().replace(",","."))
print("Price: ", price)
discount = html.find("p",{"class":"info-text"}).text.replace(",",".")
print("Discount: ", discount)
manufacturer_price = float(html.find("span",{"class":"pr_price__strikeout-list"}).text.strip().replace(",","."))
print("Manufacturer Price: ", manufacturer_price)
# "features" holds the attribute records collected for every book while looping over
# "products", and "df_columns" holds the matching column names (both built beforehand)
df = pd.DataFrame.from_records(features, columns=df_columns)
df.to_pickle("./kitap_yurdu_literary.pkl")
Attributes we extracted for one of the books

We will perform the same steps for the non-literary sub-category of best sellers. Then we’ll append the second dataframe to the first one. In the regression model we’ll construct, we plan to predict the price of the books, so we need to put the features we gathered into a form that can be used as input for the model.
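One way to combine the two dataframes is with pandas’ concat; this is only a sketch, and the non-literary pickle file name below is an assumption:

# Load both pickled dataframes and stack them into the single dataframe "ky" used below
literary = pd.read_pickle("./kitap_yurdu_literary.pkl")
non_literary = pd.read_pickle("./kitap_yurdu_nonliterary.pkl")  # assumed file name
ky = pd.concat([literary, non_literary], ignore_index=True)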

Step 2: Feature Engineering

As we can see, we could not obtain plain numbers for the “purchase_info” and “to_read_list” attributes. To solve this, we can write a simple function that splits the scraped string into a list and finds the number among the list’s elements.

def extract_number(purchase_info):  # the function name is my own choice
    # return the first token in the scraped string that parses as an integer
    for i in purchase_info.split():
        try:
            purchase_num = int(i)
            return purchase_num
        except ValueError:
            pass
The first 5 rows of the dataframe

After obtaining the attributes for each of the books under the “literary-fiction” and “non-literary” sub-categories, we construct a dataframe to hold all the data and save it as a pickle object using the “.to_pickle()” function of the pandas library.

The “release_date” attribute is not very useful in this form. We can use it to obtain the number of days passed since the release date of the book. To do that, we’ll convert the type of release date from string to datetime object. Then we’ll obtain today’s date in the same format and find the difference between them.

today = datetime.now().date()

def time_passed(release_date):
    rd = datetime.strptime(release_date, "%d/%m/%Y").date()
    time_since_publ = (today - rd).days
    return time_since_publ
The first 5 rows of the dataframe with “Time Passed Since Publication” column

Different types of covers and papers may affect the price of a book. There are 2 types of covers and 5 types of paper in our dataframe.

Types of cover -> Karton Kapak: softcover & Ciltli: hardcover

Types of paper -> Kitap Kağıdı: uncoated book paper & 3. Hm. Kağıt: 3rd-grade pulp paper & 2. Hm. Kağıt: 2nd-grade pulp paper & 1. Hm. Kağıt: 1st-grade pulp paper & Kuşe Kağıt: coated paper

We can use label encoding to convert the cover and paper types into a numeric, machine-readable form. First we create dictionaries that match each cover or paper type with a number, then map the corresponding columns of our dataframe with these dictionaries. Note that the numbers are not assigned randomly; they are based on the quality of the cover or paper type.

book_cover_dict = {'Karton Kapak': 1, 'Ciltli': 2}
paper_dict = {'Kitap Kağıdı': 2,
              '2. Hm. Kağıt': 2,
              '1. Hm. Kağıt': 3,
              '3. Hm. Kağıt': 1,
              'Kuşe Kağıt': 4}

ky["Cover of Book"] = ky["Cover of Book"].map(book_cover_dict)
ky['Paper Type'] = ky['Paper Type'].map(paper_dict)
The first 5 rows of the dataframe after label encoding step

Before constructing the machine learning model, we’ll drop duplicate rows and the first 3 non-numeric columns from the dataframe.
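A minimal sketch of that clean-up, assuming the first three columns are the non-numeric name, author, and publisher fields:

# Drop exact duplicate rows, then drop the first three (non-numeric) columns
ky = ky.drop_duplicates().reset_index(drop=True)
ky = ky.drop(columns=ky.columns[:3])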

The first 5 rows of the dataframe after removing the first 3 non-numeric columns

We also need to check for empty rows in the columns before the next steps. We can either remove the empty rows or fill them with the mean, median, etc. of that specific column.

Summary of the dataframe

I prefer to remove the 10 empty rows of “Number of Pages” and the 2 empty rows of “Time Passed Since Publication”.
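A short sketch of this step, assuming the column labels shown in the dataframe summary:

# Drop rows with a missing "Number of Pages" or "Time Passed Since Publication"
ky = ky.dropna(subset=["Number of Pages", "Time Passed Since Publication"]).reset_index(drop=True)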

Dataframe after removal of empty rows

To fill the empty rows of the “Manufacturer Price” column, we can use numbers drawn from a uniform distribution over the [0, mean + standard deviation) interval of that column.

def mean_std_filling(column_name):
    mean = column_name.mean()
    std = column_name.std()
    is_null = column_name.isna().sum()
    print('Mean:', mean, 'Std:', std, 'Null:', is_null)
    # draw one value from [0, mean + std) for every missing entry
    rand_float = np.random.uniform(0, mean + std, size=is_null)
    print('Numbers:', rand_float[:25])
    column_name[np.isnan(column_name)] = rand_float
    return column_name.astype(float)

ky["Manufacturer Price"] = mean_std_filling(ky["Manufacturer Price"])  # column label assumed from the text

For filling the one empty row of the “Paper Type” column, we can use “2”, which corresponds to standard book paper.
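For example, using the label-encoded “Paper Type” column from the dictionary above:

# Fill the single missing paper-type value with 2, i.e. standard book paper ("Kitap Kağıdı")
ky["Paper Type"] = ky["Paper Type"].fillna(2)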

The last version of the dataframe after feature engineering steps

Step 3: Building a linear regression model

We start by splitting the dataframe into train, validation, and test sets. We use the train set for the training step, then check the accuracy of the model with the validation and test sets. Scikit-learn is a well-known Python library for building machine learning models; we’ll use it both to split the dataset and to build our model.

from sklearn.model_selection import train_test_split

# ky_lr: the feature-engineered dataframe prepared in Step 2
X = ky_lr.loc[:, ky_lr.columns != "Price"]
y = ky_lr.Price

# Train/Test split
X_train, x_test, Y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Train/Validation split
x_train, x_cv, y_train, y_cv = train_test_split(X_train, Y_train, test_size=0.2, random_state=42)
Sizes of the datasets

After importing the library, let’s train our model with the training set and check its accuracy by calculating the R² and Mean Squared Error metrics.
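The training code itself is not shown here, so the following is only a sketch built on the splits defined above (the metric choices follow the text; the variable names are my own):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Fit an ordinary least-squares model on the training split
lr = LinearRegression()
lr.fit(x_train, y_train)

# Evaluate on the validation split
val_pred = lr.predict(x_cv)
print("Validation R2 :", r2_score(y_cv, val_pred))
print("Validation MSE:", mean_squared_error(y_cv, val_pred))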

R² shows the goodness of fit. In our case, R² is suspiciously high, which suggests there may be an overfitting problem. Let’s check the distribution of 6 columns of the dataframe with the Seaborn library.

Distribution of the dataframe’s columns

We can see that all 6 columns exhibit a right-skewed distribution. We’ll use the upper whisker values of the columns to remove the outliers and try to obtain a distribution closer to normal. (The upper quartile is the value below which 75% of the data points fall when arranged in increasing order; the upper whisker extends 1.5 interquartile ranges above it.)

def extract_whiskers(data):
    median_value = np.median(data)            # Median
    upper_quartile = np.percentile(data, 75)  # 75%
    lower_quartile = np.percentile(data, 25)  # 25%
    iqr = upper_quartile - lower_quartile     # Interquartile range
    upper = data[data <= upper_quartile + 1.5 * iqr].max()
    print("Upper Whisker:", upper)            # Max value inside the upper whisker
    # print("Lower Whisker:", data[data >= lower_quartile - 1.5 * iqr].min())  # Min
    return upper
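A hypothetical way to use this helper for the outlier removal, assuming the skewed columns are the four count-like ones named below:

# Keep only the rows whose values fall below the upper whisker in each skewed column
for col in ["Purchase Info", "Rating Count", "Review Count", "Fav Count"]:
    upper = extract_whiskers(ky_lr[col])
    ky_lr = ky_lr[ky_lr[col] <= upper]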
Distribution of the dataframe’s columns after removal of outliers

We can use a log transformation to make skewed data approximately normal. We’ll employ the NumPy library to create log-transformed versions of the Purchase Info, Rating Count, Review Count, and Fav Count columns.
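A minimal sketch of that transformation; np.log1p is used here so zero counts don’t break the log, and the “ Log” suffix matches the column names referenced below:

# Add log-transformed versions of the heavily skewed count columns
for col in ["Purchase Info", "Rating Count", "Review Count", "Fav Count"]:
    ky_lr[col + " Log"] = np.log1p(ky_lr[col])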

Log-transformed columns (features)

After the transformation step, let’s check the correlation between the features in our dataframe. If two features are highly correlated, a change in one is accompanied by a change in the other, which makes the model’s coefficients unstable. We need to identify highly correlated features and remove them before the model training step.

Correlation heatmap of the features after log transformation

To determine the features that show high correlation with other features:

corr_matrix = ky_lr.corr().abs()

# Select the upper triangle of the correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find features with a correlation greater than 0.85
to_drop = [column for column in upper.columns if any(upper[column] > 0.85)]
Features that show high correlation with other features

Let’s drop the “Manufacturer Price”, “Review Count Log” and “Rating Count Log” columns, and train our model again, then check the accuracy for both validation and test sets.
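A sketch of this final round, reusing the split and model code from above (the exact code is not shown in the article):

# Drop the highly correlated features identified above
ky_lr = ky_lr.drop(columns=["Manufacturer Price", "Review Count Log", "Rating Count Log"])

# Re-split and retrain on the reduced feature set
X = ky_lr.loc[:, ky_lr.columns != "Price"]
y = ky_lr.Price
X_train, x_test, Y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train, x_cv, y_train, y_cv = train_test_split(X_train, Y_train, test_size=0.2, random_state=42)

lr = LinearRegression()
lr.fit(x_train, y_train)
print("Validation R2:", lr.score(x_cv, y_cv))
print("Test R2:", lr.score(x_test, y_test))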

We can say that we have overcome the overfitting problem: the accuracy on the validation set is reasonable, and the accuracy on our holdout (test) set is high enough to continue with this model, so the model generalizes well. As a final step, let’s find out how much each feature contributes to the model’s prediction.
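One hypothetical way to inspect those contributions is to look at the fitted coefficients, ranked by absolute size:

# Rank features by the absolute size of their fitted coefficients
coefficients = pd.Series(lr.coef_, index=x_train.columns)
print(coefficients.reindex(coefficients.abs().sort_values(ascending=False).index))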

“Type of the Book Cover” is the feature that has the most impact on the “Price” prediction.
