First Steps to Web Scraping, Regression and Machine Learning

Salim Kılınç
14 min read · Sep 1, 2023


Hello everybody.

We completed our second project as Team-2 in the Data Science Bootcamp of Istanbul Data Science Academy. In this article, I would like to introduce you to this project, in which I took my first steps into Web Scraping, Regression and Machine Learning.

In this project, which we simply call ‘Used Cars Price Prediction’, we were asked to develop a machine learning model using linear regression, after performing exploratory data analysis and feature engineering on data we obtained by web scraping. We decided to scrape our data from arabam.com, a Turkish used car sales website.

Web Scraping

Tools

We used Jupyter Notebook as our workspace, Requests and BeautifulSoup to scrape our data from the website, and Numpy and Pandas to transform the scraped data into a data frame.
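
For reference, here is the import block we assume at the top of the scraping notebook; the aliases (bts, np, pd) match the snippets that follow.

import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as bts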

Preliminaries

First of all, we created the getAndParseURL function that we will use to send requests to the website in the following steps.

def getAndParseURL(url):
    result = requests.get(url, headers={"User-Agent": "Chrome/115.0.5790.170"})
    soup = bts(result.text, 'html.parser')
    return soup

Then we put together the links of the listing pages from which we will collect the individual advertisement links.

pages = ["https://www.arabam.com/ikinci-el/otomobil?take=50"]
for page in range(2, 51):
    pages.append("https://www.arabam.com/ikinci-el/otomobil?take=50&page=" + str(page))
pages

Next, we gathered all the advertisement links listed on each of the pages collected above, so that we can then send a request to each of these links in a for loop.

links = []
for page in pages:
    html = getAndParseURL(page)
    for rslt in html.findAll("a", {"class": "smallest-text-minus ovh"}):
        links.append('https://www.arabam.com/' + rslt.get('href'))
links

Scraping

And now, it’s time for web scraping. Our for loop extracts each car feature from each advert and appends them to the result list, which we finally convert into a data frame. If the loop cannot access the data for a feature, it assigns NaN to that variable.

result = []
for rslt in links:
    html = getAndParseURL(rslt)
    try:
        price = html.find("div", {"class":"product-properties"}).find(string=re.compile("Fiyat")).findNext().text.replace('TL', '').replace('.', '').strip()
    except:
        price = np.nan
    try:
        make = html.find("div", {"class":"product-properties"}).find(string=re.compile("Marka")).findNext().text.strip()
    except:
        make = np.nan
    ...
    ...
    ...

Since we have 2500 links in our links list and 22 features that we want to scrape from each link, we expect 2500 rows and 22 columns in our data frame.

    ...
    ...
    ...
    try:
        painted_changed = html.find("div", {"class":"product-properties"}).find(string=re.compile("Boya-değişen")).findNext().text.strip()
    except:
        painted_changed = np.nan

    result.append([price, make, series, model, year, km, transmission, fuel, body_type, warranty, from_, wheel_drive, cylinder_number, torque, engine_capacity, engine_power, max_power, min_power, acceleration, max_speed, average_fuel_consumption, painted_changed])

columns = ['price_try', 'make', 'series', 'model', 'year', 'km', 'transmission', 'fuel', 'body_type', 'warranty', 'from_', 'wheel_drive', 'cylinder_number', 'torque_nm', 'engine_capacity_cc', 'engine_power_hp', 'max_power_rpm', 'min_power_rpm', 'acceleration_0to100_sec', 'max_speed_kmh', 'average_fuel_consumption_lt', 'painted_changed']
df = pd.DataFrame.from_records(result, columns=columns)

And here is what our data frame looks like.

Finally, we transfer the last 1000 rows of our data frame to a new data frame for the prediction phase of our machine learning model and save both data frames as csv files.

train = df.iloc[:1500]
test = df.iloc[1500:]
train.to_csv('train_arabam.csv')
test.to_csv('test_arabam.csv')

EDA & Feature Engineering

Tools

We used Jupyter Notebook as our workspace, Numpy and Pandas to clean and edit our data, seaborn to visualise our data, and statsmodels to statistically review our data.

Cleaning and Editing the Numerical Data

After importing the datasets that we previously saved as csv files, we want to take a look at all our columns and the correlation values between our numeric columns. However, we see that some columns we expect to contain numeric values have the object data type, which is not what we want.
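
A minimal sketch of this first inspection, assuming the file names from the scraping step and that the default index was written to the csv files:

import numpy as np
import pandas as pd

train = pd.read_csv('train_arabam.csv', index_col=0)
test = pd.read_csv('test_arabam.csv', index_col=0)

train.info()                   # column names and data types
train.corr(numeric_only=True)  # correlations between the numeric columns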

Let’s take a look at the unique values of the engine_capacity_cc column.

To convert the values contained in this column to the integer data type, we will need to make a few edits. Firstly, we extract only the numeric parts of all values using pandas’ str.extract() method.

train['engine_capacity_cc'] = train['engine_capacity_cc'].str.extract('(\d+)')

When we try to convert the data type of the column to integer using pandas’ astype() method, we encounter an error, because there are null values in our column and they cannot be converted to the integer data type. After a brief examination of the other columns, we decide that it makes the most sense to use the model column to fill in the null values in the engine_capacity_cc column, so we manually write a dictionary and use pandas’ map() function to fill them in.

engine_capacity_dict = {
    '1.2 Twinport Enjoy': '1200',
    'Ultimate': '1520',
    '1.4 Authentique': '1400',
    ...
    ...
    ...
    '1.6 TDCi Titanium': '1600',
    '4S Performance Plus': '1520'
}
train['engine_capacity_cc'] = train['engine_capacity_cc'].fillna(train['model'].map(engine_capacity_dict))

Now we can see that all values in our column are of integer data type.

Let’s take a look at the unique values of another column, cylinder_number.

A little research on the Internet reveals that the number of cylinders is positively correlated with engine capacity. So we write a for loop to fill the null values in our column accordingly.

cylinder_list = []
for index, row in train.iterrows():
    if row['engine_capacity_cc'] >= 2997:
        cylinder_list.append(8)
    elif row['engine_capacity_cc'] >= 1991:
        cylinder_list.append(6)
    elif row['engine_capacity_cc'] >= 1984:
        cylinder_list.append(5)
    elif row['engine_capacity_cc'] >= 998:
        cylinder_list.append(4)
    elif row['engine_capacity_cc'] >= 898:
        cylinder_list.append(3)
    else:
        cylinder_list.append(row['cylinder_number'])
cylinder_series = pd.Series(cylinder_list)
train['cylinder_number'] = train['cylinder_number'].fillna(cylinder_series)

Then we apply the astype() function and convert all values in our column to integer data type.
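
That conversion is a one-liner (assuming the fill above left no nulls behind):

train['cylinder_number'] = train['cylinder_number'].astype(int)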

We didn’t need to edit the year and price_try columns much, except to convert their values to integer data type.

train['year'] = train['year'].str.extract('(\d+)')
train['year'] = train['year'].astype(int)
train['price_try'] = train['price_try'].str.extract('(\d+)')
train['price_try'] = train['price_try'].astype(int)

Since the other columns containing numeric values also have some correlation with each other, we filled their null values using similar logic. I won’t explain each of them here, so as not to prolong the article and bore you; you can find the GitHub link at the end of the article, where you can check out my full notebook.

Cleaning and Editing the Categorical Data

Now it’s time to organize our columns with categorical data since we have plenty of them.

Let’s start with the make column. Brand is an important feature in determining car prices. However, since there are so many brands in our dataset, we grouped the rare ones under ‘Other’ to avoid having too many columns when we create the dummy variables.

make_counts = train['make'].value_counts()
other_makes = list(make_counts[make_counts <= 30].index)
train['make'] = train['make'].replace(other_makes, 'Other')

We grouped some values under ‘Other’ in the series column as well, but we made one more adjustment to show which series belongs to which brand. We also translated some Turkish values into English.

series_counts = train['series'].value_counts()
other_series = list(series_counts[series_counts < 20].index)
train['series'] = train['series'].replace(other_series, 'Other')

series_list = []
for index, row in train.iterrows():
    if row['series'] == 'Other':
        series_list.append('Other ' + row['make'])
    else:
        series_list.append(row['series'])
train['series'] = series_list

train['series'] = train['series'].apply(lambda x: x.replace("3 Serisi", "3 Series"))
train['series'] = train['series'].apply(lambda x: x.replace("5 Serisi", "5 Series"))

The model column, which we used to fill the null values in the engine_capacity_cc column, also has too many unique categorical values. This would cause us to run into the ‘too many columns’ problem again when creating the dummy variables. Also, since much of the information in the other columns is related to the values in this column, we no longer need it.

train.drop(columns='model', inplace=True)

We encountered similar problems with the other columns containing categorical values and handled them with similar logic: in some we translated Turkish values into English, in some we grouped rare values under ‘Other’, and in some we filled the null values with the most common value because we could not establish a relationship with any other column. So, again, I won’t describe it all here, but you can take a look at my notebook in its entirety if you wish.

Finally, we dropped duplicate rows from our dataset and saved it as a csv file to use in the feature engineering phase. We performed similar operations on the test dataset, with no difference worth mentioning here, except that we dropped the price_try column of the test dataset, which we will use in the prediction phase.

train.drop_duplicates(inplace=True)
train = train.reset_index(drop=True)
train.to_csv('arabam_train.csv')
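
For completeness, a minimal sketch of the corresponding test-side steps, assuming the test set went through the same cleaning (the exact code is in the notebook):

test.drop(columns='price_try', inplace=True)  # the target is not needed for the prediction phase
test.drop_duplicates(inplace=True)
test = test.reset_index(drop=True)
test.to_csv('arabam_test.csv')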

Here is what we expect to see now.

Feature Engineering

Let’s proceed to feature engineering, where we will take our analysis and coding skills to another level. After importing our datasets from the arabam_train.csv and arabam_test.csv files, we trained a simple linear regression model with the price_try column as the target and the year, km and engine_capacity_cc columns as features. As expected, we got a very low r2 score; we still had a lot of work ahead of us.
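
A minimal sketch of that baseline model, assuming the column names above and scikit-learn’s LinearRegression (the notebook may differ in the details):

import pandas as pd
from sklearn.linear_model import LinearRegression

train = pd.read_csv('arabam_train.csv', index_col=0)  # assuming the default index was written when saving
test = pd.read_csv('arabam_test.csv', index_col=0)

X = train[['year', 'km', 'engine_capacity_cc']]
y = train['price_try']

baseline = LinearRegression().fit(X, y)
print("Baseline r2:", baseline.score(X, y))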

We start by looking at the distribution of the year column. It’s not perfect, but it doesn’t look bad either.

For ease of scaling, we convert the year to age. We see the same graph inverted.

train['year'] = 2023 - train['year']
test['year'] = 2023 - test['year']

Now let’s look at the boxplots to see if there are any outliers.

(Boxplots of the year column for the train and test datasets.)

There are a few outliers, but few enough that we can afford to remove them. To see the limits of the outliers, we define a function that computes the whiskers from the 75th percentile (upper quartile) and the 25th percentile (lower quartile).

def extract_whiskers(data, whisker=1.5):
    median_value = np.median(data)
    upper_quartile = np.percentile(data, 75)
    lower_quartile = np.percentile(data, 25)

    iqr = upper_quartile - lower_quartile

    upper_whisker = data[data <= upper_quartile + whisker * iqr].max()
    lower_whisker = data[data >= lower_quartile - whisker * iqr].min()

    print("Upper Whisker:", upper_whisker)
    print("Lower Whisker:", lower_whisker)

When we apply the function to the year column, we see that the upper whisker is 32 and the lower whisker is 0. We therefore remove the rows whose age is greater than 32: the train dataset decreases from 1499 to 1479 rows and the test dataset from 999 to 989, which is not a big loss for us. The filtering step is sketched below, and the boxplots and distributions look much better afterwards.
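
A minimal sketch of that filtering, assuming the whisker value of 32 found above (remember that the year column now holds ages):

train = train[train['year'] <= 32].reset_index(drop=True)
test = test[test['year'] <= 32].reset_index(drop=True)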

(Boxplots of the year column for the train and test datasets after removing the outliers.)

Now let’s take a look at the distribution of the values in the price_try column, which is only in our train dataset. Here we see a positive skew.

We take the logarithm of all values to reduce the skewness a bit. Now there is a negative skew, but at least it is better than before.

train['price_try_log'] = np.log(train['price_try'])
train.drop(columns='price_try', inplace=True)

We applied similar operations to the other columns containing numeric values, such as trimming values beyond the whiskers and taking logarithms. We applied both operations to some columns where necessary and left others untouched, because trimming to the whiskers would have removed some unique values entirely. As always, you can check out my full notebook for more.

Now let’s take another look at the correlation heatmap of our columns containing numerical values.
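
A heatmap like this can be drawn with seaborn roughly as follows (figure size and styling are our own choices, not necessarily those of the notebook):

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(12, 8))
sns.heatmap(train.corr(numeric_only=True), annot=True, fmt=".2f", cmap="coolwarm")
plt.show()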

It is a much larger map than the initial one, but we have some problems here. Some features have almost no impact on our target, and some have too little impact to be useful. Some features are also strongly positively or negatively correlated with each other, which would cause a multicollinearity problem. We have to say goodbye to all the columns with these problems.

datasets = [train, test]
for dataset in datasets:
    dataset.drop(columns=['min_power_rpm_log', 'min_power_rpm',
                          'max_power_rpm_log', 'max_power_rpm',
                          'cylinder_number', 'engine_capacity_cc',
                          'engine_capacity_cc_log', 'torque_nm',
                          'engine_power_hp', 'acceleration_0to100_sec',
                          'average_fuel_consumption_lt'], inplace=True)

It looks better now.

Let’s look at the same information from a different perspective.

Now let’s take a look at the OLS regression results we generated using statsmodels. What is important for us at this stage is that the R-squared and Adj. R-squared scores are close to each other and high enough. In addition, low p-values indicate that the corresponding features do not affect the target by chance.
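
A minimal sketch of how such a summary can be produced with statsmodels, assuming price_try_log as the target and the remaining numeric columns as features:

import statsmodels.api as sm

features = train.drop(columns='price_try_log').select_dtypes('number')
ols_model = sm.OLS(train['price_try_log'], sm.add_constant(features)).fit()
print(ols_model.summary())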

Now it’s time to convert the categorical data, which we cannot use directly in machine learning modeling, into numerical data. We will use label encoding for the categorical columns whose values have a natural order or hierarchy. Let’s look at the transmission column as an example.

transmission_dict = {
    'Manual': '0',
    'Automatic': '2',
    'Semiautomatic': '1'
}
train['transmission'] = train.transmission.map(transmission_dict)
train['transmission'] = train['transmission'].astype(int)

We used label encoding for most of our categorical features on both datasets. For make, series and body_type features, it made more sense to create dummy variables using one-hot encoding.

train = pd.get_dummies(train, columns=['make', 'series', 'body_type'], drop_first=True)
train = train.reset_index(drop=True)

test = pd.get_dummies(test, columns=['make', 'series', 'body_type'], drop_first=True)
test = test.reset_index(drop=True)

Now let’s take another look at our correlation heatmap.

There are too many features here to review and use; this is not what we expected. After a little investigation, we see that the make features have almost no effect on the target and also cause multicollinearity problems because they are highly correlated with the series features. Therefore, it is time to get rid of the make columns. We also drop the fuel column, as it is correlated with some features and has little impact on the target.

datasets = [train, test]
for dataset in datasets:
    dataset.drop(columns=['make_BMW', 'make_Fiat', 'make_Ford', 'make_Honda',
                          'make_Hyundai', 'make_Mercedes - Benz', 'make_Opel',
                          'make_Other', 'make_Peugeot', 'make_Renault',
                          'make_Seat', 'make_Tofaş', 'make_Toyota',
                          'make_Volkswagen', 'fuel'], inplace=True)

Let’s take one last look at our correlation heatmap. It’s not perfect, but it seems more useful now.

Modelling

Tools

We used Jupyter Notebook as our workspace, Numpy and Pandas to organise our data, matplotlib to visualise our data, and scikit-learn in different ways for splitting, training, scaling, regularisation, testing, cross-validation, and prediction.
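
For reference, the imports we assume for the snippets in this section (some of them are also repeated inline below):

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import RobustScaler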

Basic Linear Regression and Scaling

First of all, we split our data into 60% for training, 20% for validation and 20% for testing.

X = train.drop(columns='price_try_log')
y = train.price_try_log

X_train, x_test, Y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
x_train, x_cv, y_train, y_cv = train_test_split(X_train, Y_train, test_size=0.25, random_state=42)

Then we build and train a simple linear regression model.

lreg = LinearRegression()
lreg.fit(x_train,y_train)
pred = lreg.predict(x_train)
mse = np.mean((pred - y_train)**2)

print("Train Score: ", lreg.score(x_train, y_train))
print("MSE: ", mse)
lreg = LinearRegression()
lreg.fit(x_train,y_train)
pred = lreg.predict(x_cv)
mse = np.mean((pred - y_cv)**2)

print("Validation Score: ", lreg.score(x_cv, y_cv))
print("MSE: ", mse)

We then scale our data with RobustScaler.

from sklearn.preprocessing import RobustScaler
lreg2 = LinearRegression()
robust_scale = RobustScaler()
x_train_rs = robust_scale.fit_transform(x_train.values)
x_cv_rs = robust_scale.transform(x_cv.values)  # transform only: the scaler is fitted on the training data
lreg2.fit(x_train_rs, y_train)
pred = lreg2.predict(x_cv_rs)
mse = np.mean((pred - y_cv)**2)

print("Validation Score: ", lreg2.score(x_cv_rs, y_cv))
print("MSE: ", mse)

Before and after scaling our data we got an r2 score of around 0.91 with a basic linear regression model, which is quite good. At first glance, scaling may seem ineffective, but we will see its effect at the regularisation and cross-validation stages. And here are our model coefficients; some of them are a bit high, but not too bad.
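
A quick way to inspect those coefficients, pairing them with the feature names (a sketch, not necessarily the notebook’s exact code):

coefs = pd.Series(lreg2.coef_, index=x_train.columns)
print(coefs.sort_values(ascending=False))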

Regularisation

First we try using Ridge, a popular regularization technique.

from sklearn.linear_model import Ridge
ridgeReg = Ridge(alpha=10)
ridgeReg.fit(x_train_rs,y_train)
pred = ridgeReg.predict(x_cv_rs)
mse = np.mean((pred - y_cv)**2)

print("Validation Score: ", ridgeReg.score(x_cv_rs, y_cv))
print("MSE: ", mse)

Then we set up a for loop to determine the alpha value that will give us the best results.

from sklearn.metrics import r2_score
alphalist = 10**(np.linspace(0, 2, 200))
err_vec_val = np.zeros(len(alphalist))
err_vec_train = np.zeros(len(alphalist))

for index, curr_alpha in enumerate(alphalist):
    ridge = Ridge(alpha=curr_alpha)
    ridge.fit(x_train_rs, y_train)
    val_set_pred = ridge.predict(x_cv_rs)
    err_vec_val[index] = r2_score(y_cv, val_set_pred)

plt.plot(alphalist, err_vec_val);

In our model using the Ridge technique, we get the highest r2 score of 0.90 with an alpha value of 1. Also with Lasso, another popular technique, the highest r2 score we can achieve is 0.91.
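
The Lasso experiment can be sketched in the same way (the alpha value here is illustrative; in practice we scanned a range of values, as with Ridge):

from sklearn.linear_model import Lasso

lassoReg = Lasso(alpha=0.001)  # illustrative alpha, not necessarily the best one
lassoReg.fit(x_train_rs, y_train)
pred = lassoReg.predict(x_cv_rs)
mse = np.mean((pred - y_cv)**2)

print("Validation Score: ", lassoReg.score(x_cv_rs, y_cv))
print("MSE: ", mse)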

Testing

Although we set aside a separate test dataset during the web scraping phase, that dataset has no prices and is reserved for the prediction stage. Here we use the 20% test split we created from the train dataset: we train our model on the remaining 80% and evaluate it on this split, and again we get a nice r2 score of 0.91.

lreg3 = LinearRegression()
X_train_rs = robust_scale.fit_transform(X_train.values)
x_test_rs = robust_scale.transform(x_test.values)  # transform only: the scaler is fitted on the training data
lreg3.fit(X_train_rs, Y_train)
pred = lreg3.predict(x_test_rs)
mse = np.mean((pred - y_test)**2)

print("Test Score: ", lreg3.score(x_test_rs, y_test))
print("MSE: ", mse)

Cross-Validation

And now we are about to implement cross-validation, one of the most important stages of the machine learning modeling process. Recall that before splitting our train dataset into 60%, 20% and 20%, we had first split it into 80% and 20%. Here we perform cross-validation by dividing that 80% portion into 10 folds, separately for the linear regression, Ridge and Lasso models.

from sklearn.model_selection import cross_val_score
lr = LinearRegression()
lr_cv = cross_val_score(lr, X_train, Y_train, cv=10, scoring='r2')
ridge = Ridge(alpha=1)
X_train_scaled = robust_scale.fit_transform(X_train.values)
ridge_cv = cross_val_score(ridge, X_train_scaled, Y_train, cv=10, scoring='r2')
lasso = Lasso(alpha=0)
X_train_scaled = robust_scale.fit_transform(X_train.values)
lasso_cv = cross_val_score(lasso, X_train_scaled, Y_train, cv=10, scoring='r2')

Here are the means and standard deviations of our r2 scores after cross-validation.
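
These summary numbers come from the score arrays returned above, roughly like this:

for name, scores in [('Linear Regression', lr_cv), ('Ridge', ridge_cv), ('Lasso', lasso_cv)]:
    print(f"{name}: mean r2 = {scores.mean():.3f}, std = {scores.std():.3f}")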

Prediction

And now the last stage of the modeling process: we can finally use our test dataset for our car price predictions. Since we took the logarithm of the values in the price_try column in the feature engineering phase, we now need to convert the predictions back with the exponential function.

test_rs = robust_scale.transform(test.values)  # transform with the scaler fitted on the training data
np.exp(lreg3.predict(test_rs))

Here are some examples of our model’s predictions.

Finally, we wanted to see the predictions of our model on a data frame.
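
A simple way to do that, assuming the predictions above (the exact presentation in the notebook may differ):

predicted_prices = np.exp(lreg3.predict(test_rs))
results = test.copy()
results['predicted_price_try'] = predicted_prices.round(0)
results.head(10)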

Conclusion

In this project, where I took my first steps into machine learning, what made me happiest was not achieving r2 scores of around 0.85–0.90, but being able to apply what we learned in the course. What excited me most was realising that, if I can do these things at this level, there is no limit to what I can do as I learn more.

Thanks to Everybody

Thank you all for sparing your valuable time to read my article.

Please visit my GitHub repository for additional sources related to our project such as project notebooks and csv files: https://github.com/salimkilinc/istdsa_project02
