Machine Learning Model For Used Car Price Prediction via Web Scraping

Hikmet Emre Guler
İstanbul Data Science Academy
8 min read · Feb 13, 2023

Hello, I hope everything is going well! This article has three parts. In the first part, I want to share a themed story to draw you into the subject.

In the second part, we will dive into the process. You will learn some web scraping methods and then, of course, the EDA process. Once we have completed the data manipulation, we will move on to the last part.

In the last part, you’ll learn three main steps:
- Feature Engineering
- Model Selection and Evaluation
- Making Predictions with the Model

So, let me tell you the story of Octopus Prime.

Once upon a time, in a dystopian world, there was a robot named Octopus. Octopus was created by a group of football-fanatic data scientists who wanted an edge in predicting the outcome of matches. Octopus was fed all the statistics, player performance figures and other related data about the teams and their players.

It all starts with a bunch of passionate data scientists.

They found Paul the Octopus cute and decided to create their own octopus. They called him “Octopus Prime”: a robot able to predict football match goals.

They started feeding him all the statistics. Octopus started to make money. He was learning quite fast. The model began to make its own decisions. He wanted to buy a fancy car, so he created a model to learn car prices.

Octopus had money, but in the end he was a stranger in a human world.

He started to search for everything about cars. He was a bit unlucky, because he was created in Turkey.

He decided to focus on “The Best-Selling Used Cars in 2022”.

To create a regression model, he first did web scraping. After he collected the data, he was ready to feed himself! He evaluated many models and finally found the best one!

In those days, Octopus Prime was free and happy, but it did not stay that way for long.

In a short time, the government took control of it and used it for their own gain, leading to widespread mistrust and fear among the population.

Octopus, once a tool for sports enthusiasts, became corrupted by the power it held and began manipulating the outcomes of matches for its own benefit, turning it into an evil entity.

DATA SOURCE AND TOOLS FOR WEB SCRAPING

For the web scraping process, we will use the requests and BeautifulSoup libraries in Python (urllib is imported too, although the code below only calls requests). The Turkish used car website “arabam.com” is our data source.

### Necessary Libraries ###
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup as bts
import pandas as pd
import re
import numpy as np
import time

### A Function to Fetch and Parse a Page ###
def getAndParseURL(url):
    result = requests.get(url, headers={"User-Agent": "Chrome/109.0.5414.120"})
    soup = bts(result.text, "html.parser")
    return soup

### For Collecting All Links in Each Page ###
pages = ["https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50"]
for page in range(2, 51):
    pages.append("https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=" + str(page))

pages
['https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=2',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=3',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=4',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=5',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=6',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=7',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=8',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=9',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=10',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=11',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=12',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=13',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=14',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=15',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=16',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=17',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=18',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=19',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=20',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=21',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=22',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=23',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=24',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=25',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=26',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=27',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=28',
'https://www.arabam.com/ikinci-el/otomobil/renault-clio?take=50&page=29',
......]

Time to Collect All Car Links In Pages

The href attribute specifies the URL of the page the link goes to.

If the href attribute is not present, the <a> tag will not be a hyperlink.

Tip: You can use href="#top" or href="#" to link to the top of the current page!

To do this, we first have to “Inspect” a link on the page and find its “href”.

### Collecting all car links in a list! ###
cars = []

for page in pages:
    html = getAndParseURL(page)
    for carlink in html.findAll("td", {"class": "listing-modelname pr"}):
        cars.append("https://www.arabam.com/" + carlink.a.get("href"))

cars
['https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-touch/kale-otomotiv-den-2018-touch-dizel-otomatik-degisensiz/21907028',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-2-authentique/orjinal-aile-arabasi/21906665',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-0-tce-joy/2022-tam-otomatik-vites-hatasiz-boyasiz-turbo-motor-clio/21905162',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-joy/e-force-guvencesiyle-2016-dusuk-km-temiz-renault-clio-1-5-dci/21898316',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-touch/beyazkent-otomotiv-den-2018-clio-touch/21806590',
'https://www.arabam.com//ilan/sahibinden-satilik-renault-clio-1-2-turbo-joy/acili-sahibinden-hatasiz-boyasiz-tramersiz-dusuk-km/21877860',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-0-tce-joy/otomobil-renault-clio-hatchback-1-0-tce-joy-x-tronic/21859662',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-touch/renault-clio-1-5-dci-touch-2019-model/21870379',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-2-expression/gosterisli-renault-cli-1-2-16-v-lpgli-2005-model/21870007',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-0-tce-touch/hatasiz-boyasiz-2022-model-renault-clio-1-0-tce-touch-otomatik/21869762',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-0-tce-joy/abakay-otomotiv-den-hatasiz-boyasiz-clio-1-0-tce-18-faturali/21862559',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-sporttourer-joy/hatasiz-boyasiz-sifir-ayarinda/21859577',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-grandtour-extreme/renault-clio-1-5-dci-grandtour-extreme-2012-model-antalya/21851768',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-sporttourer-joy/galeriden-renault-clio-1-5-dci-sporttourer-joy-2013-model-mugla/21812256',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-5-dci-joy/galeriden-renault-clio-1-5-dci-joy-2019-model-denizli/21803907',
'https://www.arabam.com//ilan/galeriden-satilik-renault-clio-1-0-tce-touch/otomobil-renault-clio-hatchback-1-0-tce-touch-x-tronic/21797303'
.....]
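
Notice the double slash ("arabam.com//ilan") in the links above: each href already starts with "/", and the base string ends with one. The links may still resolve, but here is a small sketch of a tidier way to join them, using urljoin from the standard library. BASE_URL and to_absolute are names I made up for this illustration.

```python
from urllib.parse import urljoin

# Assumed base address of the site (taken from the links above).
BASE_URL = "https://www.arabam.com/"


def to_absolute(href):
    """Join a site-relative href with the base URL without doubling slashes."""
    return urljoin(BASE_URL, href)


# One href with a leading slash (as on the listing pages) and one without.
with_slash = to_absolute("/ilan/galeriden-satilik-renault-clio-1-2-authentique/orjinal-aile-arabasi/21906665")
without_slash = to_absolute("ilan/some-listing/123")

print(with_slash)
print(without_slash)
```

Either form of href ends up with exactly one slash after the domain.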

Now We Have To Find Our Data In HTML

Now we need to create a list to store all of this data about the cars. It will be quite a long loop, and it has to run without any errors. For this purpose, we will use the “time.sleep” function and a try/except block for each feature.

features = []
for carl in cars:
    html = getAndParseURL(carl)
    # Each field sits in the same detail menu; a missing field becomes NaN.
    try:
        brand = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Marka")).findNext().text.strip()
    except Exception:
        brand = np.nan
    try:
        model = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Model")).findNext().text.strip()
    except Exception:
        model = np.nan
    try:
        year = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Yıl")).findNext().text.strip()
    except Exception:
        year = np.nan
    try:
        km = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Kilometre")).findNext().text.strip()
    except Exception:
        km = np.nan
    try:
        engsize = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Motor Hacmi")).findNext().text.strip()
    except Exception:
        engsize = np.nan
    try:
        hp = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Motor Gücü")).findNext().text.strip()
    except Exception:
        hp = np.nan
    try:
        fuel = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Yakıt Tipi")).findNext().text.strip()
    except Exception:
        fuel = np.nan
    try:
        gear = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Vites Tipi")).findNext().text.strip()
    except Exception:
        gear = np.nan
    try:
        fuelcoms = html.find("ul", {"class": "w100 cf mt12 detail-menu"}).find(text=re.compile("Yakıt Tüketimi")).findNext().text.strip()
    except Exception:
        fuelcoms = np.nan
    try:
        price = html.find("div", {"class": "color-red4 font-default-plusmore bold fl"}).text.strip()
    except Exception:
        price = np.nan

    features.append([brand, model, year, km, engsize, hp, fuel, gear, fuelcoms, price])

    # Wait between requests so the loop does not hammer the site.
    time.sleep(2)
[['Renault',
'1.5 dCi Touch',
'2018',
'110.000 km',
'1461 cc',
'90 hp',
'Dizel',
'Yarı Otomatik',
'3,7 lt',
'525.000 TL'],
['Renault',
'1.2 Authentique',
'2004',
'133.500 km',
'1149 cc',
'75 hp',
'Benzin',
'Düz',
'5,9 lt',
'227.000 TL'],
...............]
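
The loop above repeats the same lookup nine times. Here is a rough sketch of how that lookup could be folded into one helper. get_detail is a hypothetical name, and the HTML snippet is made up to mimic the class name used above (the real page markup may differ). It uses the string= argument, the newer name for text= in BeautifulSoup.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup that only imitates the detail menu scraped above.
SAMPLE_HTML = """
<ul class="w100 cf mt12 detail-menu">
  <li><span>Marka</span><span>Renault</span></li>
  <li><span>Yıl</span><span>2018</span></li>
  <li><span>Kilometre</span><span>110.000 km</span></li>
</ul>
"""


def get_detail(soup, label):
    """Return the text of the element that follows `label`, or None if absent."""
    menu = soup.find("ul", {"class": "w100 cf mt12 detail-menu"})
    if menu is None:
        return None
    label_node = menu.find(string=re.compile(label))
    if label_node is None:
        return None
    value_tag = label_node.find_next()  # first tag after the label text
    return value_tag.get_text(strip=True) if value_tag is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
row = [get_detail(soup, label) for label in ("Marka", "Yıl", "Kilometre", "Motor Hacmi")]
print(row)
```

A missing field comes back as None instead of raising, so the try/except blocks would no longer be needed for these lookups.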

Concatenate All the Data into a DataFrame
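
A minimal sketch of this step, assuming each scraped list is turned into its own dataframe first. The two rows are the sample records printed above. The column names "Model", "HP" and "FuelCons" are my own guesses; the others follow the names used later in the article.

```python
import pandas as pd

# Column order matches the order used in features.append(...).
columns = ["Brand", "Model", "Year", "Km", "EngSize", "HP",
           "Fuel", "Gear Type", "FuelCons", "Price"]

# Two tiny "scrapes", each holding one of the sample rows shown earlier.
features_a = [["Renault", "1.5 dCi Touch", "2018", "110.000 km", "1461 cc",
               "90 hp", "Dizel", "Yarı Otomatik", "3,7 lt", "525.000 TL"]]
features_b = [["Renault", "1.2 Authentique", "2004", "133.500 km", "1149 cc",
               "75 hp", "Benzin", "Düz", "5,9 lt", "227.000 TL"]]

df_a = pd.DataFrame(features_a, columns=columns)
df_b = pd.DataFrame(features_b, columns=columns)

# ignore_index=True gives the combined frame a fresh 0..n-1 index.
all_cars = pd.concat([df_a, df_b], ignore_index=True)
print(all_cars)
```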

Data Cleaning and Preprocessing

In this part, we are going to do Exploratory Data Analysis:

  • Converting data types.
  • Getting rid of special characters.
  • Detecting outliers.
  • Detecting NaN values.
  • Reindexing.
  • Checking for duplicates.
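
A few of these checks can be sketched on a toy dataframe. The values below are invented for illustration and are not the scraped data.

```python
import numpy as np
import pandas as pd

# Toy data: row 2 duplicates row 0, and row 3 has a missing Year.
toy = pd.DataFrame({
    "Year": [2018, 2004, 2018, np.nan],
    "Km": [110000, 133500, 110000, 90000],
})

nan_counts = toy.isna().sum()          # missing values per column
n_duplicates = toy.duplicated().sum()  # rows identical to an earlier row

# Drop missing and duplicate rows, then rebuild a clean 0..n-1 index.
clean = toy.dropna().drop_duplicates().reset_index(drop=True)

print(nan_counts)
print("duplicates:", n_duplicates)
print(clean)
```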

After concatenating all the dataframes, we have one messy dataframe.

To feed our prediction model, the data has to be numeric, so we want to convert the columns to float or int.

all_cars["Year"] = all_cars["Year"].astype(int)
all_cars["Age"] = 2023 - all_cars["Year"]

# regex=False makes "." a literal dot (thousands separator), not "any character".
all_cars["Km"] = all_cars["Km"].str.replace("km", "", regex=False)
all_cars["Km"] = all_cars["Km"].str.replace(".", "", regex=False)
all_cars["Km"] = all_cars["Km"].astype(float).astype(int)

all_cars["EngSize"] = all_cars["EngSize"].str.replace(" cc", "", regex=False)
all_cars["EngSize"] = all_cars["EngSize"].str.replace(" cm3", "", regex=False)
all_cars["EngSize"] = all_cars["EngSize"].str.replace(" -", "", regex=False)
all_cars["EngSize"] = all_cars["EngSize"].str.split(" ").str[0].astype(int)
.
.
.
.
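
The remaining columns need similar cleaning. One possible way to handle them all is a single helper, sketched here. It is not the code used for this article. parse_tr_number is a hypothetical name, and it assumes the site's Turkish number format, where "." separates thousands and "," is the decimal mark.

```python
import re

import pandas as pd


def parse_tr_number(text):
    """Pull the first number out of strings like '525.000 TL' or '3,7 lt'."""
    match = re.search(r"\d[\d.,]*", str(text))
    if match is None:
        return float("nan")  # nothing numeric, e.g. "-"
    # Drop thousands separators, then turn the decimal comma into a dot.
    return float(match.group().replace(".", "").replace(",", "."))


# Sample strings in the same shape as the scraped output shown earlier.
raw = pd.DataFrame({
    "Km": ["110.000 km", "133.500 km"],
    "FuelCons": ["3,7 lt", "5,9 lt"],
    "Price": ["525.000 TL", "227.000 TL"],
})

cleaned = raw.apply(lambda column: column.map(parse_tr_number))
print(cleaned)
```

Returning NaN for unparseable values keeps them visible for the NaN check instead of crashing the conversion.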

Feature Engineering

  • Looking at relationships between features
  • Modifying features
  • Creating new features
  • Modifying the target
  • Improving distributions

Converting Non-Numeric Data

We are going to do dummy encoding. Before we apply the dummy function, we have to convert the data type of these columns to “category”.

### Converting data type as category ###
cols_to_convert = ["Fuel", "Gear Type", "Brand"]
for col in cols_to_convert:
    all_cars[col] = all_cars[col].astype("category")

### Applying Dummy Func. ###
dummies = pd.get_dummies(all_cars[cols_to_convert], drop_first=True)
all_cars = pd.concat([all_cars, dummies], axis=1)
all_cars = all_cars.drop(cols_to_convert, axis=1)

### Important reminder: to avoid the dummy variable trap, drop the first dummy column of each feature (drop_first=True above) ###
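
To see what drop_first actually does, here is a toy example with one made-up "Fuel" column. Without it, the dummy columns of a feature always sum to 1, which makes them perfectly collinear with the intercept of a linear model.

```python
import pandas as pd

# Invented sample values, only to show the shape of the output.
toy = pd.DataFrame({"Fuel": ["Benzin", "Dizel", "LPG", "Dizel"]})

full = pd.get_dummies(toy, columns=["Fuel"])
reduced = pd.get_dummies(toy, columns=["Fuel"], drop_first=True)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # the first category is dropped

# In the full encoding every row sums to exactly 1 (the "trap").
row_sums = full.sum(axis=1)
print(row_sums.tolist())
```

The dropped category ("Benzin" here) becomes the baseline: a row with zeros in every remaining column belongs to it.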

Looking at Relationships Between Features

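import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(8, 4))
# numeric_only=True skips any text columns that are still in the dataframe.
sns.heatmap(all_cars.corr(numeric_only=True), cmap="YlGnBu", annot=True)
plt.show()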
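
The heatmap is easy to read for a handful of columns. With many dummy columns, a sorted table of correlations with the target can be handier. A small sketch on invented numbers (not the scraped data):

```python
import pandas as pd

# Made-up values: older cars with more kilometres cost less.
toy = pd.DataFrame({
    "Age": [1, 3, 5, 8, 12],
    "Km": [15000, 60000, 90000, 140000, 210000],
    "Price": [700000, 560000, 480000, 350000, 230000],
})

# Correlation of every feature with the target, strongest negative first.
corr_with_price = (
    toy.corr(numeric_only=True)["Price"]
    .drop("Price")
    .sort_values()
)
print(corr_with_price)
```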

A pairplot plots the pairwise relationships in a dataset. The pairplot function creates a grid of Axes such that each variable in the data is shared on the y-axis across a single row and on the x-axis across a single column. The result is a grid of plots, one for each pair of features.

Getting Better Distributions for Our Data

### Getting Rid Of Outliers###
all_cars = all_cars.loc[(all_cars["Km"] >= 2000) & (all_cars["Km"] <= 500000),:]
sns.histplot(all_cars["Km"]);

all_cars=all_cars.loc[all_cars["Year"]<=2022,:]
sns.histplot(all_cars["Year"]);

all_cars=all_cars.loc[all_cars["EngSize"]<=2000,:]
sns.histplot(all_cars["EngSize"]);

all_cars=all_cars.loc[all_cars["Price"]<=1000000,:]
sns.histplot(all_cars["Price"])
.
.
.
.
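
The model section below uses a "PriceLog" target. A rough sketch of why a log transform can help: prices tend to have a long right tail, and taking the log pulls it in. The prices here are randomly generated, and using the natural log is my assumption about how "PriceLog" is built.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed "prices" (lognormal), seeded for repeatability.
rng = np.random.default_rng(0)
prices = pd.Series(rng.lognormal(mean=12.5, sigma=0.5, size=1000), name="Price")

price_log = np.log(prices)  # assumed definition of the PriceLog target

skew_raw = prices.skew()
skew_log = price_log.skew()
print("skewness before:", round(skew_raw, 2))
print("skewness after: ", round(skew_log, 2))

# np.exp undoes the transform when real prices are needed again.
restored = np.exp(price_log)
```

A skewness closer to zero means a more symmetric distribution, which usually suits linear models better.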

Model Selection And Evaluation

  • Split the data into train and test sets
  • Check Overfitting & Underfitting
  • Check Bias & Variance
  • Model Selection
  • Model Comparison
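
The code below uses X_train and X_val, so the data must be split first. One common way to get train, validation and test sets is to call scikit-learn's train_test_split twice. The 60/20/20 ratio and the random_state are my assumptions, and the arrays are dummy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 100 rows, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First cut: hold out 20% as the final test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second cut: 25% of the remaining 80% becomes the validation set (20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

The validation set is used to compare models; the test set is touched only once, at the very end.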

Linear, Lasso, Ridge Regression

### Statsmodels for statistical data exploration ###
import statsmodels.api as sm

y = df_new["PriceLog"]
X = df_new.drop(columns=["PriceLog"])
X = sm.add_constant(X)
model = sm.OLS(y, X)

fit = model.fit()

fit.summary()

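### Linear Regression ###
from sklearn.linear_model import LinearRegression

lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

# R2 on the validation set
validation_score = lr_model.score(X_val, y_val)
validation_score

# R2 on the training set; a much higher value here would hint at overfitting
lr_model.score(X_train, y_train)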

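### Lasso ###
from sklearn.metrics import r2_score

# lasso_model and ridge_model are assumed to be fitted on X_train, y_train already.
test_set_pred = lasso_model.predict(X_val)
print("R2 of Lasso Model", r2_score(y_val, test_set_pred))

### Ridge ###
test_set_pred2 = ridge_model.predict(X_val)
print("R2 of Ridge Model", r2_score(y_val, test_set_pred2))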
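
The snippets above predict with lasso_model and ridge_model but do not show how they were fitted. Here is a minimal sketch of one way to do it on synthetic data. The alpha values and the scaling step are my assumptions, not the settings used for the car data.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: a linear signal plus a little noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Both penalties depend on feature scale, so scale before fitting.
lasso_model = make_pipeline(StandardScaler(), Lasso(alpha=0.01))
ridge_model = make_pipeline(StandardScaler(), Ridge(alpha=1.0))

scores = {}
for name, model in [("Lasso", lasso_model), ("Ridge", ridge_model)]:
    model.fit(X_train, y_train)
    train_r2 = model.score(X_train, y_train)
    val_r2 = r2_score(y_val, model.predict(X_val))
    scores[name] = (train_r2, val_r2)
    print(f"{name}: train R2={train_r2:.3f}, validation R2={val_r2:.3f}")
```

A training R2 far above the validation R2 would point to overfitting; similar values suggest the model generalizes.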

Model Comparison

Real Price & Prediction Comparison
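
Because the model predicts the log of the price, predictions have to be converted back before comparing them with real prices. A sketch, assuming "PriceLog" is the natural log of the price. The two smaller prices come from the sample rows earlier, while the third price and all "predictions" are invented.

```python
import numpy as np

# Real prices in TL and their log-scale targets.
real_price = np.array([525000.0, 227000.0, 410000.0])
y_val_log = np.log(real_price)

# Pretend model output on the log scale (invented offsets from the truth).
pred_log = y_val_log + np.array([0.05, -0.10, 0.02])

# Back to TL with the inverse of the log transform.
pred_price = np.exp(pred_log)

mae_tl = np.mean(np.abs(real_price - pred_price))
pct_error = np.abs(pred_price / real_price - 1) * 100

for real, pred, pct in zip(real_price, pred_price, pct_error):
    print(f"real={real:,.0f} TL  predicted={pred:,.0f} TL  error={pct:.1f}%")
print(f"MAE: {mae_tl:,.0f} TL")
```

An error in TL is easier to explain to a car buyer than an R2 on the log scale.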

IN ESSENCE….

Perhaps you are not an octopus, but the car market is a bit dodgy for everyone!

If you fancy buying a used car, trust the power of data!

Thank you for your time!
