Predicting Selling Price of Car using Machine Learning

Jaysri Saravanan
6 min read · Feb 18, 2022


Hi everyone. The price of a used car can be affected by various factors like the brand of the car, mileage, number of owners, fuel type, etc. In this article, I am going to explain how we can predict the selling price of a car using machine learning.

Introduction

In this project, we are going to predict the selling price of a used car based on various features like car brand, showroom price etc.

I collected the dataset for this project from Kaggle. For the dataset, visit https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho?select=car+data.csv

About the dataset

This dataset was provided by cardekho.com. You can download it from kaggle.com.

This dataset contains information about used cars. The columns that are given in the dataset are:

  • Car_Name
  • Year
  • Selling_Price
  • Present_Price
  • Kms_Driven
  • Fuel_Type
  • Seller_Type
  • Transmission
  • Owner

1. Importing libraries and dataset

Let’s start by importing the necessary libraries for this project and loading the dataset.
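Here is a minimal sketch of this step. The file name ‘car data.csv’ matches the Kaggle download; adjust the path if yours differs.

```python
# Core libraries for data handling and visualization.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the CarDekho dataset downloaded from Kaggle.
data = pd.read_csv('car data.csv')
data.head()
```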

2. Exploratory Data Analysis

After importing the dataset, we have to analyze it to discover patterns in the data. This gives us a statistical summary of the dataset, and I also used some graphical representations.
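A few standard pandas calls cover most of this step:

```python
# Shape, column types and a statistical summary of the numeric columns.
print(data.shape)       # (number of rows, number of columns)
data.info()             # column types and non-null counts
print(data.describe())  # count, mean, std, min, quartiles, max
```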

This dataset contains details about 301 cars and 9 features.

Handling Null values:

data.isnull().sum() gives the number of null values in each column of the dataset.

There are no null values in our dataset.

3. Feature Engineering

In this step, the raw data will be manipulated and transformed into features. It will improve the performance of the machine learning model.

The first thing we are going to do is drop the unwanted features from the dataset, so the ‘Car_Name’ column is dropped.

Now, I add a new column called ‘No_of_years’, which is calculated by subtracting ‘Year’ from the current year. This column tells us how many years the car has been used.

Now, we can drop the ‘Year’ and ‘current_year’ columns from the dataset.
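A minimal sketch of these feature engineering steps; hard-coding the current year as 2022 (the article’s publication year) is my assumption.

```python
# Drop the column we do not need for modelling.
data.drop(['Car_Name'], axis=1, inplace=True)

# Age of the car: current year minus the year of purchase.
data['current_year'] = 2022
data['No_of_years'] = data['current_year'] - data['Year']

# 'Year' and the helper 'current_year' column are no longer needed.
data.drop(['Year', 'current_year'], axis=1, inplace=True)
```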

Handling Categorical data:

Categorical data present in the dataset are-

  • Seller_Type[‘Dealer’, ‘Individual’]
  • Transmission[‘Manual’, ‘Automatic’]
  • Owner[0, 1, 3]
  • Fuel_Type[‘Petrol’, ‘Diesel’, ‘CNG’]

One-Hot encoding

To handle categorical data, we are going to perform one hot encoding. In one-hot encoding, the categorical data are converted into binary vector representation.

To perform one hot encoding, pd.get_dummies() is used.

The first dummy variable is dropped to avoid the dummy variable trap. The dummy variable trap occurs when two or more dummy variables created by one-hot encoding are highly correlated, which means the individual effect of each dummy variable on the prediction cannot be interpreted well because of multicollinearity.
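In pandas this is a one-liner; drop_first=True removes the first dummy of each category to avoid the trap described above.

```python
# One-hot encode all remaining object (string) columns:
# Fuel_Type, Seller_Type and Transmission.
data = pd.get_dummies(data, drop_first=True)
data.head()
```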

4. Data Visualization

In this step, we are going to visualize the correlation of the features in the dataset.

Correlation indicates how changes in one variable relate to changes in another. The correlation matrix shows which variables have a high or low correlation with one another.

The sns.heatmap() function makes this task easier.
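A sketch of the heatmap call; the figure size and colour map are my own choices.

```python
# Annotated heatmap of the pairwise correlations between all features.
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(), annot=True, cmap='RdYlGn')
plt.show()
```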

Feature Importance

Feature importance is a technique of assigning scores to the input features of a model, indicating how much each feature contributes to the predictions.

Here I used an ExtraTreesRegressor to assign an importance score to each feature, and I visualized the scores with a bar chart.
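A sketch of that step, assuming the ensemble ExtraTreesRegressor from scikit-learn and ‘Selling_Price’ as the target column:

```python
from sklearn.ensemble import ExtraTreesRegressor

# Separate the features from the target we want to predict.
X = data.drop(['Selling_Price'], axis=1)
y = data['Selling_Price']

# Fit the tree ensemble and read off its importance scores.
model = ExtraTreesRegressor()
model.fit(X, y)

feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(10).plot(kind='barh')
plt.show()
```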

It is clear from the bar chart that the feature ‘Present_Price’ has more importance than other features.

5. Training and Test Set

Before performing feature selection, we have to split the dataset into a training set and a test set; if feature selection were done first, information from the test set would leak into the model. The training set is used to train the model and the test set is used for evaluation.
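The split itself is one call; the 80/20 ratio and random_state are my assumptions, since the article does not state them.

```python
from sklearn.model_selection import train_test_split

# Hold out 20% of the data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
```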

6. Model training

This is a regression problem. Here, I used a RandomForestRegressor on the training set for model training. We can fit any other regression algorithm as well and choose the model with the best accuracy.
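Fitting the baseline model is straightforward:

```python
from sklearn.ensemble import RandomForestRegressor

# Baseline random forest with default hyperparameters.
rf = RandomForestRegressor()
rf.fit(X_train, y_train)
```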

Hyperparameter Tuning

Then I used RandomizedSearchCV for hyperparameter tuning. This will improve the performance of the model.

Hyperparameters are parameters that cannot be learned directly from the regular training process. RandomizedSearchCV samples points from the grid at random to find the best set of hyperparameters.

Now, we can fit the model to the data.
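A sketch of the randomized search; the parameter grid below is illustrative, not the exact grid used in the article.

```python
from sklearn.model_selection import RandomizedSearchCV

# Candidate values for the most influential random forest hyperparameters.
param_grid = {
    'n_estimators': [100, 300, 500, 700, 900, 1100],
    'max_features': ['sqrt', 'log2'],
    'max_depth': [5, 10, 15, 20, 25, 30],
    'min_samples_split': [2, 5, 10, 15],
    'min_samples_leaf': [1, 2, 5, 10],
}

# Sample 10 random combinations and keep the one with the lowest MSE.
rf_random = RandomizedSearchCV(
    estimator=RandomForestRegressor(),
    param_distributions=param_grid,
    n_iter=10,
    scoring='neg_mean_squared_error',
    cv=5,
    random_state=42,
    n_jobs=-1,
)
rf_random.fit(X_train, y_train)
print(rf_random.best_params_)
```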

7. Evaluation of the model

In this part, we are going to evaluate the accuracy of the model. It is important to see how well the model is performing.

Now, we can evaluate the model with the test set. I used sns.distplot() to visualize the performance of the model.

And I also used scatter plot to see the accuracy of the model.
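A sketch of the evaluation step. Note that sns.distplot() is deprecated in recent seaborn releases; sns.histplot() or sns.displot() are the current replacements.

```python
from sklearn import metrics

# Predict on the held-out test set.
predictions = rf_random.predict(X_test)

# Standard regression error metrics.
print('MAE :', metrics.mean_absolute_error(y_test, predictions))
print('MSE :', metrics.mean_squared_error(y_test, predictions))
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

# Residuals roughly centred on zero indicate a good fit.
sns.distplot(y_test - predictions)
plt.show()

# Predicted vs. actual selling price.
plt.scatter(y_test, predictions)
plt.xlabel('Actual Selling Price')
plt.ylabel('Predicted Selling Price')
plt.show()
```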

The data points lie close to the best-fit line, roughly along a straight line, which indicates the predictions match the actual prices well.

8. Saving the model

Now, we can save the model as a pickle file. Pickle is a useful Python tool for saving models, and it allows us to save, share, and reuse the machine learning model.
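Saving the tuned model takes only a couple of lines; the file name is my own choice.

```python
import pickle

# Serialize the tuned model so the web app can load it later.
with open('random_forest_regression_model.pkl', 'wb') as f:
    pickle.dump(rf_random, f)
```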

9. Model Deployment

Deploying the model can be done by creating a user interface. Here, I used Flask to serve an HTML page for the app. Flask is a web framework written in Python.
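A minimal Flask sketch of what such an app could look like. The template name, form handling and output format are my assumptions; the actual app in the repository may differ.

```python
import pickle
import numpy as np
from flask import Flask, render_template, request

app = Flask(__name__)

# Load the model saved in the previous step.
model = pickle.load(open('random_forest_regression_model.pkl', 'rb'))

@app.route('/')
def home():
    # 'index.html' is an assumed template containing the input form.
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    # Assumes the form fields arrive in the same order as the training columns.
    features = [float(x) for x in request.form.values()]
    prediction = model.predict(np.array(features).reshape(1, -1))
    return render_template(
        'index.html',
        prediction_text=f'Estimated selling price: {prediction[0]:.2f}')

if __name__ == '__main__':
    app.run(debug=True)
```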

Deploying on Heroku:

Now we can upload our project to GitHub. Then create an account on Heroku, create a new app, connect your GitHub repository to it, and finally click Deploy.
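Heroku also needs a Procfile telling it how to start the app; the line below assumes the Flask app object is called app inside app.py and that gunicorn is listed in requirements.txt.

```
web: gunicorn app:app
```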

Heroku will then generate a link to your project.

Visit this link: http://car-sellinprice.herokuapp.com/

Yeah! Finally, the project is done.

Github link for this project: https://github.com/jaysri125278/car-price-prediction

I hope you liked this article on how to predict the price of used cars using machine learning. Visit my GitHub profile for more projects.

Visit my portfolio: https://jaysri125278.github.io/

Thank You!
