Car Price Prediction Using Regression—Developing And Deploying A Machine Learning Web App

Kamen Damov
9 min read · Jul 16, 2022


Introduction

If I asked someone what the biggest factors in determining the price of a car are, I’m positive the answer would be its mileage and the year it was produced. That said, is it possible to build a model that predicts the price of a used car without these two highly correlated attributes? This article walks through the development of a machine learning web app that predicts the price of a used car without the two most obvious attributes.

Data

Here is the link to the Kaggle page of the dataset: https://www.kaggle.com/datasets/hellbuoy/car-price-prediction.

Data exploration

I will only document the relevant findings from the data exploration here. If you would like to see the full exploratory data analysis, the notebook is in the GitHub repository (linked below).

First, let’s look at the head of the dataframe and its columns:
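A minimal loading/inspection snippet (a sketch; it assumes the CSV from Kaggle is saved as CarPrice_Assignment.csv):

#Load the dataset and look at the first rows
import pandas as pd
df = pd.read_csv('CarPrice_Assignment.csv')
print(df.head())
print(df.columns)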

Let’s see which numerical attributes are the most (and least) correlated with the price by creating a correlation heatmap:
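One way to produce that heatmap (a sketch; it assumes seaborn and matplotlib are available):

#Correlation heatmap of the numerical attributes
import seaborn as sns
import matplotlib.pyplot as plt
corr = df.select_dtypes(include='number').corr()
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.show()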

As we can see, some attributes are barely correlated with the price variable (with r < 0.15). Let’s drop them.
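Rather than hard-coding the column names, here is a small sketch that drops them programmatically using the 0.15 threshold mentioned above:

#Drop numerical attributes with |r| < 0.15 against the price
corr_price = df.select_dtypes(include='number').corr()['price']
weak_cols = corr_price[corr_price.abs() < 0.15].index.tolist()
df = df.drop(columns=weak_cols)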

Feature engineering

The car name column includes both the brand and the model. Let’s split the column and keep the brand only:

df['CarName'] = df['CarName'].apply(lambda x: x.split()[0])

Output:

We see there are some typos; let’s fix them:

df['CarName'] = df['CarName'].str.lower()
df['CarName'] = df['CarName'].replace({'vw':'volkswagen', 'vokswagen':'volkswagen', 'toyouta':'toyota', 'maxda':'mazda', 'porcshce':'porsche'})

One hot encoding

To fit data to any machine learning model, we have to have a uniform input. In other words, we can’t have categorical attributes mixed in with continuous attributes. The categorical columns need to be converted into dummy columns. The gist of this manipulation is to produce one column per unique value in a categorical column. For each row, the column matching that row’s category gets a 1 and all the other dummy columns get a 0:
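As a tiny illustration (a toy frame, not the project data):

#Toy example of one hot encoding
toy = pd.DataFrame({'fueltype': ['gas', 'diesel', 'gas']})
print(pd.get_dummies(toy))
#Produces one indicator column per value: fueltype_diesel and fueltype_gas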

Let’s do this. First, let’s split our columns into two groups: the categorical columns and the numerical ones. Here’s how it’s done:

categorical = []
numerical = []
for col in df.columns:
    if df[col].dtypes == 'object':
        categorical.append(col)
    else:
        numerical.append(col)

We can now create dummy columns from the columns in the list named “categorical”, and concatenate them with the numerical columns.

#Create dummies, and numerical features
x1 = pd.get_dummies(df[categorical], drop_first = False)
x2 = df[numerical]
X = pd.concat([x2,x1], axis = 1)
X.drop('price', axis = 1, inplace = True)

Super! By dropping the price attribute, we now have our X data that is ready to be fitted to the model.

Before proceeding to the model training and testing, let’s see the distribution of our y value, the price.

sns.histplot(data=df, x="price", kde=True)

The distribution is skewed. This is unsettling: since this is a regression problem, and given the shape of the distribution, we risk getting negative values as an output if we leave the y values untouched. The distribution is skewed because large prices occur with low frequency. If we apply the logarithm to the y data, the larger data points are pulled down closer to the mean, while the smaller data points are barely affected. The shape will then resemble a normal distribution.

Let’s implement this

#Fix the skewed distribution
y = df['price']
y = np.log(y)
y = y.values
sns.histplot(x=y, kde=True)

Better! Given that we applied the logarithm to the y values, we will need to apply the exponential function to convert back to the correct unit once the model outputs a prediction. We’re ready to get to the machine learning part of the project.
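Concretely, once a model is trained on log-prices, getting a dollar amount back is a single line (illustrative; `model` and `X_new` stand for a fitted model and an encoded input):

predicted_price = np.exp(model.predict(X_new))  #convert log-price back to dollars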

Machine Learning

Train/test split and fitting

Let’s split our training and testing data:

#Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.333, random_state=1)

Before fitting any model, let’s scale our data so that all features are on comparable scales:

#Scale
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()

#Work with NumPy arrays so the positional slicing below works
X_train = X_train.values.astype(float)
X_test = X_test.values.astype(float)

X_train[:, :(len(x1.columns))] = sc.fit_transform(X_train[:, :(len(x1.columns))])
X_test[:, :(len(x1.columns))] = sc.transform(X_test[:, :(len(x1.columns))])

Model exploration

Given that our dataset is small (around 250 data points), my intuition is that a Lasso or Ridge regression will give the best results. The reasoning: with so little training data, the line of best fit can easily be overfitted during training, meaning it fits the training data well (low bias) but predicts poorly on new data (high variance). By using a Ridge or Lasso regression, we “trade” some bias to reduce the variance of the model. In other words, we make the fit less sensitive to the training data: we voluntarily accept a higher error during training so that the model performs better on unseen data and therefore produces better predictions in the long run. Here’s a video for a sneak peek of the math behind these models (link); the penalized objectives are also written out right after the list below. That being said, we will still test out 4 models and see which one performs best.

  1. Lasso Regression
  2. Ridge Regression
  3. Random Forest Regressor
  4. XGBoost Regressor
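For reference, and up to scaling conventions, both regularized models minimize the residual sum of squares (RSS) plus a penalty on the coefficients β, controlled by the strength parameter α; only the penalty differs:

Ridge: minimize RSS + α · Σ βj²
Lasso: minimize RSS + α · Σ |βj|

The squared penalty shrinks coefficients smoothly, while the absolute-value penalty can drive some of them exactly to zero.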

Let’s import them from scikit-learn:

#Regression Models
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb

Pipeline

We will create a pipeline to test and tune the models. Here are the steps:

  1. Test out each model with randomly chosen parameters.
  2. Tune hyperparameters of each model, using a grid search.
  3. Re-test model with the best parameters found.

Let’s import grid search from scikit-learn:

from sklearn.model_selection import GridSearchCV

Now, let’s create a function that prints out the metrics (R-squared, root mean squared error, and mean absolute error) for each fitted model:

#Metrics used to compare the models
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def metrics(model):
    res_r2 = []
    res_RMSE = []
    res_MAE = []
    model.fit(X_train, y_train)
    Y_pred = model.predict(X_test)
    #Compute r squared
    r2 = round(r2_score(y_test, Y_pred), 4)
    print('R2_Score: ', r2)
    res_r2.append(r2)
    #Compute root mean squared error (back on the original price scale)
    rmse = round(mean_squared_error(np.exp(y_test), np.exp(Y_pred), squared=False), 2)
    print("RMSE: ", rmse)
    res_RMSE.append(rmse)
    #Compute mean absolute error (back on the original price scale)
    mae = round(mean_absolute_error(np.exp(y_test), np.exp(Y_pred)), 2)
    print("MAE: ", mae)
    res_MAE.append(mae)

Here’s the pipeline that was mentioned above. I will deconstruct the steps of the pipeline as we go:

models = {
    'rfr': RandomForestRegressor(bootstrap=False, max_depth=15, max_features='sqrt', min_samples_split=2, n_estimators=100),
    'lasso': Lasso(alpha=0.005, fit_intercept=True),
    'ridge': Ridge(alpha=10, fit_intercept=True),
    'xgb': xgb.XGBRegressor(bootstrap=True, max_depth=2, max_features='auto', min_sample_split=2, n_estimators=100)
}
for mod in models:
    if mod == 'rfr' or mod == 'xgb':
        print('Untuned metrics for: ', mod)
        metrics(models[mod])
        print('\n')
        print('Starting grid search for: ', mod)
        params = {
            "n_estimators": [10, 100, 1000, 2000, 4000, 6000],
            "max_features": ["auto", "sqrt", "log2"],
            "max_depth": [2, 4, 8, 12, 15],
            "min_samples_split": [2, 4, 8],
            "bootstrap": [True, False],
        }
        if mod == 'rfr':
            rfr = RandomForestRegressor()
            grid = GridSearchCV(rfr, params, verbose=5, cv=2)
            grid.fit(X_train, y_train)
            print("Best score: ", grid.best_score_)
            print("Best params: ", grid.best_params_)
        else:
            xgboost = xgb.XGBRegressor()
            grid = GridSearchCV(xgboost, params, verbose=5, cv=2)
            grid.fit(X_train, y_train)
            print("Best score: ", grid.best_score_)
            print("Best params: ", grid.best_params_)
    else:
        print('Untuned metrics for: ', mod)
        metrics(models[mod])
        print('\n')
        print('Starting grid search for: ', mod)
        params = {
            "alpha": [0.005, 0.05, 0.1, 1, 10, 100, 290, 500],
            "fit_intercept": [True, False]
        }
        if mod == 'lasso':
            lasso = Lasso()
            grid = GridSearchCV(lasso, params, verbose=5, cv=2)
            grid.fit(X_train, y_train)
            print("Best score: ", grid.best_score_)
            print("Best params: ", grid.best_params_)
        else:
            ridge = Ridge()
            grid = GridSearchCV(ridge, params, verbose=5, cv=2)
            grid.fit(X_train, y_train)
            print("Best score: ", grid.best_score_)
            print("Best params: ", grid.best_params_)

Here are the results of the models with the initial, untuned parameters:

With the untuned parameters, we see pretty equivalent r2 scores, but the Lasso has the smallest error. So far, Lasso wins.

Let’s see the results of the grid search to find the best params for each model:

Now let’s apply these parameters to each model, and see the results:

Even though tuning improved every model’s metrics, the Lasso regression is still the one with the best results. Let’s pickle it and use it in our app.

import pickle
lasso_reg = Lasso(alpha=0.005, fit_intercept=True)
lasso_reg.fit(X_train, y_train)  #fit before pickling
pickle.dump(lasso_reg, open('model.pkl', 'wb'))

App Development

We are going to use the gradio library to develop the web app. Let’s walk through how the app should work:

  1. User inputs data in the form
  2. The data is processed (one hot encoding)
  3. The data is fed to the model
  4. The predicted price is presented to the user

First, let’s import the original dataset and the dataset with the dummy variables, and keep their respective columns. The idea is to build a one-row dataframe from the user’s input, one hot encode it, and align it with the columns the model was trained on.

#Columns of the df
df = pd.read_csv('df_columns')
df.drop(['Unnamed: 0','price'], axis = 1, inplace=True)
cols = df.columns
#Dummy columns of the dummy df
dummy = pd.read_csv('dummy_df')
dummy.drop('Unnamed: 0', axis = 1, inplace=True)
cols_to_use = dummy.columns

Now let’s create the values we will put as options (for the categorical attributes) in the application.

#Create the values in the app
#Capitalizing the first letter of the car brands
cars = df['CarName'].unique().tolist()
carNameCap = []
for col in cars:
    carNameCap.append(col.capitalize())
#For fuel
fuel = df['fueltype'].unique().tolist()
fuelCap = []
for fu in fuel:
    fuelCap.append(fu.capitalize())
#For carbody, engine type, fuel systems
carb = df['carbody'].unique().tolist()
engtype = df['enginetype'].unique().tolist()
fuelsys = df['fuelsystem'].unique().tolist()

Super! These attributes will be added to the choices parameter when developing the drop-down feature.

As mentioned above, we will need to process the data, feed it to the model, and return the predicted price. Let’s define a function to do that:

#Function to process the user's data and feed it to the model
def transform(data):
    #Scaler for the new data
    sc = StandardScaler()

    #Import the model
    model = pickle.load(open('model.pkl', 'rb'))

    #Dataframe with the new data
    new_df = pd.DataFrame([data], columns=cols)
    #Splitting categorical vs numerical columns
    cat = []
    num = []
    for col in new_df.columns:
        if new_df[col].dtypes == 'object':
            cat.append(col)
        else:
            num.append(col)
    #Creating the values to feed the model
    x1_new = pd.get_dummies(new_df[cat], drop_first=False)
    x2_new = new_df[num]

    X_new = pd.concat([x2_new, x1_new], axis=1)
    final_df = pd.DataFrame(columns=cols_to_use)
    final_df = pd.concat([final_df, X_new])
    final_df = final_df.fillna(0)
    X_new = final_df.values
    X_new[:, :(len(x1_new.columns))] = sc.fit_transform(X_new[:, :(len(x1_new.columns))])
    output = model.predict(X_new)
    return "The price of the car is " + str(round(np.exp(output)[0], 2)) + "$"

Now let’s create the elements in our gradio web app. We will have drop-downs or checkboxes for the categorical values, and sliders for the numerical values.

Here’s an example of a categorical input, and a numerical input:

#Categorical
car = gr.Dropdown(label = "Car brand", choices=carNameCap)
#Numerical
curbweight = gr.Slider(label = "Weight of the car (in pounds)", minimum = 500, maximum = 6000)

Now, let’s put everything together in the interface:
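Here is a minimal sketch of what that can look like (only two of the inputs are shown; the real app passes every dropdown and slider defined above, and the title text is an assumption):

#Assemble the gradio interface (sketch)
app = gr.Interface(
    fn=lambda *vals: transform(list(vals)),  #gradio passes one argument per input; collect them into the list expected by transform
    inputs=[car, curbweight],                #plus the other categorical and numerical components
    outputs="text",                          #the predicted price string returned by transform
    title="Car price prediction"
)
app.launch()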

Super! The app is ready for deployment!

Deployment

The app will be deployed on Heroku. Here are the 3 files you need to add to your repository for it to work:

Procfile:

web: source setup.sh && python WebApp.py

setup.sh

export GRADIO_SERVER_NAME=0.0.0.0
export GRADIO_SERVER_PORT="$PORT"

requirements.txt

numpy
pandas
scikit-learn
gradio
Flask
argparse
gunicorn
rq

Now let’s test the app!

Looks good! Test it out for yourself: link.

Conclusion

We managed to produce a fairly accurate model without the most popular attributes: the year produced and the mileage. An interesting next step would be to train a model with the same attributes plus those two, and compare the predictions.

Thank you for reading!
