Car Price Prediction with Machine Learning Models (Part 1)

David Obembe · Published in Analytics Vidhya · 7 min read · Jul 12, 2020

In the past couple of years, machine learning has found its way into virtually every technological innovation. From the fairly simple classification of emails as spam or ham to complex applications such as object detection and self-driving cars, the field is developing at a very fast pace. With the availability of enormous amounts of data and ever faster processors, it is safe to say there is still a long way to go.

One common application of machine learning is making predictions from a set of given features. The model learns from labelled data and is then used to make predictions on fresh, unseen data. This type of learning is classified as supervised learning.

Supervised learning can be further divided into two types: classification and regression. Classification problems answer a yes/no or "which category?" question, while regression problems answer "how much?". Predicting the price of a car is therefore a regression problem.


In this article, I will share a step-by-step approach to building a machine learning model that predicts the price of a car based on its mileage, years of usage, transmission type, number of previous owners and several other features. I evaluated the performance of 5 ML algorithms (Linear Regression, Lasso, Ridge, Decision Tree and Random Forest Regressor) on the data.

This post is split into two parts. This first part covers the data cleaning and EDA, while in Part 2 we will preprocess the data and apply the machine learning algorithms. By the end of both posts, you will understand the framework behind virtually all machine learning models and, if you are a newbie, perhaps be able to build your own. This is what the workflow looks like:

  1. Data cleaning and munging
  2. Exploratory Data Analysis (EDA)
  3. Feature Engineering
  4. Model evaluation
  5. Hyperparameter tuning
  6. Model selection
  7. Conclusion and ideas for future work

Without further ado, let’s jump into it.

Data Cleaning and Munging

The data was obtained from Kaggle and can be downloaded here. I started by importing the necessary libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
from sklearn.preprocessing import RobustScaler
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn import metrics
import pickle
sns.set()

After that, I read the data and printed the first 5 samples. The column "Unnamed: 0" would not be needed in our analysis, so it was removed from the dataframe. For our purposes, the car's brand was enough; the full model name was not needed. So I split each entry on whitespace and kept the first token (converted to upper case).

# Read the data
train = pd.read_csv(r"C:\Users\wale obembe\Downloads\Compressed\245550_518431_bundle_archive\train-data.csv")
# Print the first 5 rows of the dataframe
print(train.head())
# Drop the 'Unnamed: 0' column
data = train.drop('Unnamed: 0', axis=1)
# Select only the brand of the car and not the model
data['Name'] = data['Name'].apply(lambda x: x.split(' ')[0].upper())
data.head()

The Year column indicated the year in which the car was bought. For easier interpretation, I subtracted it from the current year so that the column now represents how many years the car has been in use.

# Change the Year column to the years of usage
data['Year'] = dt.date.today().year - data['Year']

You will observe that some columns, such as "Mileage" and "Power", had their units alongside the numbers. For our analysis we need numerical values only, so I again split on whitespace, kept the first token and converted it to a numeric value.

# Select the first word of the Mileage column
data['Mileage'] =data['Mileage'].apply(lambda x: str(x).split(' ')[0])
# Change 'nan' string to real missing values
data['Mileage'] = [x if x != 'nan' else np.nan for x in data['Mileage']]
# Convert the datatype to floating numbers
data['Mileage'] = pd.to_numeric(data['Mileage'])
# Select the first word of the Power column
data['Power'] = data['Power'].apply(lambda x: str(x).split(' ')[0])
# Change 'null' string to real missing values
data['Power'] = [np.nan if x=='null' else x for x in data['Power']]
data['Power'] = [np.nan if x=='nan' else x for x in data['Power']]
# Convert the datatype to floating numbers
data['Power'] = pd.to_numeric(data['Power'])
print(data['Mileage'].dtype, data['Power'].dtype)

Now that they are floating-point numbers, I checked for missing values, and some were indeed present.

# Check for null values
data.isnull().sum()

I had also observed that some entries were literal 'nan' or 'null' strings, which are technically not missing values even though they are meant to be; these strings were converted to real missing values in the cleaning steps above. Next, the missing values in the numerical columns were replaced with the median of each column, while rows with missing values in the 'Engine' column were simply dropped, since it is a categorical feature and cannot be imputed with a median. I used the median rather than the mean because I later realised the data had a lot of outliers, which pulled the mean up to an unrepresentatively large value.

# Replace missing values with median value of the column
mileage_median = data['Mileage'].median()
data['Mileage'] = data['Mileage'].fillna(mileage_median)
power_median = data['Power'].median()
data['Power'] = data['Power'].fillna(power_median)
seat_median = data['Seats'].median()
data['Seats'] = data['Seats'].fillna(seat_median)
# Drop the remaining rows with missing value
data.dropna(axis=0, inplace=True)
# Check for missing values
data.isnull().sum().any()
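To see why the median was preferred over the mean, one can compare the two for the heavily skewed columns. This is just a quick sketch, assuming the same Kaggle column names used above; with strong outliers the mean comes out much larger than the median.

# Quick skew check: with strong outliers the mean is pulled well above the median
for col in ['Kilometers_Driven', 'Power', 'Price']:
    print(f"{col} -> mean: {data[col].mean():.2f}, median: {data[col].median():.2f}")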

Exploratory Data Analysis

First, I classified each feature as either categorical or numerical, and then further split the numerical features into discrete and continuous features.

# Classify the non-numerical (categorical) features
cat_features = [x for x in data.columns if data[x].dtype == 'O']
# Classify the numerical features
num_features = [x for x in data.columns if data[x].dtype != 'O']
# Numerical features with fewer than 25 unique values are treated as discrete
discrete_features = [x for x in num_features if len(data[x].unique()) < 25]
# The remaining numerical features are continuous
continuous_features = [x for x in num_features if x not in discrete_features]
# Check them out
print(f"Categorical features: {cat_features}\nNumerical features: {num_features}"
      f"\nDiscrete features: {discrete_features}\nContinuous features: {continuous_features}")

I then carried out some data visualization to better understand the data. First, I checked how many times each brand was bought using seaborn's countplot.

# Create a figure
plt.figure(figsize=(8, 8))
# Mean price per brand (its index also fixes the brand order on the x-axis)
brand_mean_price = data.groupby('Name')['Price'].mean()
# Count the number of times each brand was bought and plot the graph
count = sns.countplot(x=cat_features[0], data=data, order=brand_mean_price.index)
count.set_xticklabels(count.get_xticklabels(), rotation='vertical')
# Plot the mean price of each brand on the same axes
price = sns.lineplot(x=brand_mean_price.index, y=brand_mean_price.values);

Maruti and Hyundai had the most sales, while Bentley, Lamborghini and a host of others had the fewest buyers. Intuitively, one would suspect that luxury cars are far more expensive, which would explain their relatively low sales. To check this claim, I plotted the mean price of each brand as a line plot. Lamborghini and Bentley were the two most expensive brands.
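As a quick sanity check on that claim, the sales counts and mean price per brand can also be inspected directly. This is a sketch using the dataframe built above, so the exact ordering depends on the copy of the data you download.

# Listings per brand, most common first
print(data['Name'].value_counts().head())
# Mean price per brand, most expensive first
print(data.groupby('Name')['Price'].mean().sort_values(ascending=False).head())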

Furthermore, I checked how each feature affects the price. The bar plots show that cars used for 1 to 4 years, cars driven less than 33,963 km, and cars with just one previous owner are priced higher than the other categories. Also, diesel and electric cars are more expensive than petrol cars and cars running on other fuel types, and cars with automatic transmission command a much higher price than manual-transmission vehicles.

# Create a figure with 6 subplots
fig, ax = plt.subplots(2, 3, figsize=(15, 10))
fig.subplots_adjust(hspace=0.5)
# Plot each feature against the Price
a = sns.barplot(x=data.columns[1], y='Price', data=data, ax=ax[0][0])
a.set_xticklabels(a.get_xticklabels(), rotation='vertical')
# Bin the age of the car into quartiles
b = sns.barplot(x=pd.qcut(data[data.columns[2]], 4), y='Price', data=data, ax=ax[0][1])
b.set_xticklabels(['1 - 4', '4 - 6', '6 - 9', '9 - 22'], rotation='vertical')
b.set_xlabel('Age of Car')
# Bin the kilometres driven into quartiles
c = sns.barplot(x=pd.qcut(data[data.columns[3]], 4), y='Price', data=data, ax=ax[0][2])
c.set_xticklabels(['171 - 33965', '33966 - 53000', '53000 - 73000', '73000 - 6500000'], rotation='vertical')
d = sns.barplot(x=data.columns[4], y='Price', data=data, ax=ax[1][0])
d.set_xticklabels(d.get_xticklabels(), rotation='vertical')
e = sns.barplot(x=data.columns[5], y='Price', data=data, ax=ax[1][1])
e.set_xticklabels(e.get_xticklabels(), rotation='vertical')
f = sns.barplot(x=data.columns[6], y='Price', data=data, ax=ax[1][2])
f.set_xticklabels(f.get_xticklabels(), rotation='vertical');

Finally, it was important to check for outliers in the data. I did this by plotting a boxplot of each continuous feature with seaborn (log-transformed so the spread is easier to see). The circular dots beyond the top and bottom whiskers indicate the presence of outliers: extremely high or low values whose presence can severely distort an analysis. The outliers in the data will be dealt with shortly.

data1 = data.copy()
# Create a figure with 4 subplots
fig, ax = plt.subplots(2, 2, figsize=(16, 8))
# Create a boxplot for each continuous feature (log scale)
box1 = sns.boxplot(y=np.log(data1[continuous_features[0]]), ax=ax[0][0])
box2 = sns.boxplot(y=np.log(data1[continuous_features[1]]), ax=ax[0][1])
box3 = sns.boxplot(y=np.log(data1[continuous_features[2]]), ax=ax[1][0])
box4 = sns.boxplot(y=np.log(data1[continuous_features[3]]), ax=ax[1][1]);
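The dots beyond the whiskers can also be counted numerically. This is not part of the original workflow, just a sketch of the standard 1.5 × IQR rule that boxplots use, applied here to the untransformed continuous features built earlier.

# Count values outside the 1.5 * IQR whiskers for each continuous feature
for col in continuous_features:
    q1, q3 = data[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((data[col] < lower) | (data[col] > upper)).sum()
    print(f"{col}: {n_out} potential outliers")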

I'll stop here for now. Watch out for Part 2, where I will pick up with feature engineering and conclude the project. Thanks for reading!
