Car Price Prediction: End-to-End Machine Learning Web Application

Gaurav Shrivastav
Analytics Vidhya
7 min read · Jul 13, 2021


INTRODUCTION

Here, we use data from the car website CarDekho.com, which lists a wide variety of cars along with their selling price and present price. We can use this data to make sure we get a good deal on a used car. In particular, we can estimate how much one should pay for a specific type of car.

Motivation

Deciding whether a used car is worth the posted price when you see listings online can be difficult. Several factors, including mileage, make, model, year, etc. can influence the actual worth of a car. From the perspective of a seller, it is also a dilemma to price a used car appropriately. Based on existing data, the aim is to use machine learning algorithms to develop models for predicting used car prices.

About the dataset

This dataset contains information about used cars listed on www.cardekho.com. The data can be used for many purposes, such as price prediction to exemplify the use of linear regression in machine learning. The columns in the dataset are as follows:

  • Car_Name
  • Year
  • Selling_Price
  • Present_Price
  • Kms_Driven
  • Fuel_Type
  • Seller_Type
  • Transmission
  • Owner

Researchers often predict product prices from historical data. Pudaruth, for example, predicted the prices of second-hand cars in Mauritius using multiple linear regression, k-nearest neighbours, naïve Bayes and decision tree algorithms. A comparison of the results showed that the prices predicted by these techniques were closely comparable. However, the decision tree and naïve Bayes methods were found to be unable to classify and predict numeric values. Pudaruth's research also concluded that a limited number of instances in the data set does not offer high prediction accuracy.

METHODOLOGY

This research aims to develop a good regression model to offer accurate prediction of car price. In order to do this, we need some previous data of used cars for which we use price and some other standard attributes. Car price is considered as the dependent variable while other attributes as the independent variables.

Random Forest is an ensemble-learning regression model. As the name suggests, it combines multiple decision trees into an ensemble that collectively produces a prediction; for regression, the predictions of the individual trees are averaged. The benefit of this model is that the trees can be built in parallel and are relatively uncorrelated, so the errors of any one tree tend to be offset by the others. This uncorrelated behavior is partly ensured by Bootstrap Aggregation, or bagging, which provides the randomness required to produce robust and uncorrelated trees. This model was hence chosen as the bagging technique for the features in this dataset.

Random forest Regressor

Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

Correlation of each feature in the dataset:

Categorical variables represent types of data which may be divided into groups. Examples of categorical variables are race, sex, age group, and educational level.

Here we have 4 categorical features:

  • Transmission: [Manual, Automatic]
  • Seller_Type: [Dealer, Individual]
  • Fuel_Type: [Petrol, Diesel, CNG]
  • Owner: [0, 1, 3]

To calculate the vehicle age, we add a new column holding the difference between the current year and the vehicle's year.
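The age calculation can be sketched with pandas as follows. The rows below are illustrative toy values, and the column name Car_Age and the fixed REFERENCE_YEAR are assumptions made for this sketch, not names taken from the original notebook.

```python
import pandas as pd

# Toy rows that mimic some of the dataset's columns; values are illustrative only.
df = pd.DataFrame({
    "Car_Name": ["ritz", "sx4", "ciaz"],
    "Year": [2014, 2013, 2017],
    "Selling_Price": [3.35, 4.75, 7.25],
})

# Assumption: a fixed reference year keeps the result reproducible.
REFERENCE_YEAR = 2021

# Age is the difference between the reference year and the vehicle's year.
df["Car_Age"] = REFERENCE_YEAR - df["Year"]

# Once the age exists, the raw year carries no extra information.
df = df.drop(columns=["Year"])
print(df)
```

Using a fixed year rather than the system clock means the feature does not silently change when the notebook is re-run later.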

We can see here that these 4 features (Transmission, Seller_Type, Fuel_Type, and Owner) are categorical and affect the selling price of the car.

The present price of a car directly influences the selling price prediction. The two are highly correlated, and in this data they increase together.

Car age has a negative effect: the selling price decreases for an older car.
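One way to check relationships like these numerically is a correlation matrix. This is a minimal sketch on a few made-up rows, with Car_Age as an assumed name for the age column; on the real dataset the same call would be run on the full DataFrame.

```python
import pandas as pd

# Illustrative toy values only; they are not rows copied from the real dataset.
df = pd.DataFrame({
    "Selling_Price": [3.35, 4.75, 7.25, 2.85, 4.60],
    "Present_Price": [5.59, 9.54, 9.85, 4.15, 6.87],
    "Car_Age": [7, 8, 4, 10, 7],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()

# Correlation of every column with the target, strongest first.
print(corr["Selling_Price"].sort_values(ascending=False))
```

A positive value for Present_Price and a negative one for Car_Age would match the observations above.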

Data Preprocessing

In any machine learning process, data preprocessing is the step in which the data is transformed, or encoded, so that the machine can easily parse it. In other words, the features of the data can then be easily interpreted by the algorithm.

A feature is an individual measurable property or characteristic of a phenomenon being observed.

One-hot encoding is a preprocessing step applied to categorical data that converts it into a binary vector representation for use in machine learning algorithms.

To produce an actual dummy encoding (k-1 columns for k categories) from a DataFrame with pandas' get_dummies, we need to pass drop_first=True.
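As a small sketch of this encoding, pandas' get_dummies can be applied to a toy DataFrame with three of the categorical columns. The rows are illustrative only.

```python
import pandas as pd

# Toy rows using the categorical columns and categories listed earlier.
df = pd.DataFrame({
    "Fuel_Type": ["Petrol", "Diesel", "CNG", "Petrol"],
    "Seller_Type": ["Dealer", "Individual", "Dealer", "Dealer"],
    "Transmission": ["Manual", "Manual", "Automatic", "Manual"],
})

# drop_first=True drops the first (alphabetically sorted) category of each
# column, so k categories become k-1 indicator columns.
encoded = pd.get_dummies(df, drop_first=True)
print(encoded.columns.tolist())
```

The dropped category becomes the implicit baseline: a row with zeros in both Fuel_Type columns is a CNG car.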

Feature Importance

Here, we use ExtraTreesRegressor to get the importance of each feature in the dataset.

Visual representation of Feature Importance

Here we can see that Present Price is more important than the other features.
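A feature-importance computation of this kind might look like the sketch below. It assumes scikit-learn's ensemble ExtraTreesRegressor is the estimator meant here, and it runs on synthetic data whose target was deliberately constructed to depend mostly on Present_Price, so the ranking it prints illustrates the mechanics and is not a result from the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

# Synthetic features (assumption: Car_Age is the derived age column).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Present_Price": rng.uniform(1, 20, n),
    "Kms_Driven": rng.uniform(1_000, 100_000, n),
    "Car_Age": rng.integers(1, 15, n),
})

# Synthetic target built mainly from Present_Price, plus a little noise.
y = 0.7 * X["Present_Price"] - 0.2 * X["Car_Age"] + rng.normal(0, 0.3, n)

model = ExtraTreesRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# feature_importances_ sums to 1 across all features.
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```

With matplotlib installed, `importances.nlargest(5).plot(kind="barh")` would give a bar chart of the top features.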

Now split the data into training and test sets for building the model.
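The split can be sketched with scikit-learn's train_test_split. The 80/20 ratio and the random_state are assumptions chosen for illustration, and the arrays stand in for the real feature matrix and target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 20 samples with 2 features each, and one target per sample.
X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# Assumption: hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```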

Modeling

Here I used the Random Forest Regressor algorithm to predict the car price. It is based on decision trees.

Then I used hyperparameter tuning to improve the performance of the model, using RandomizedSearchCV.

In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process. By contrast, the values of other parameters (typically node weights) are learned.

rf_random = RandomizedSearchCV(estimator=rf, param_distributions=random_grid, scoring='neg_mean_squared_error', n_iter=10, cv=5, verbose=2, random_state=42, n_jobs=1)

Now fit the model on the training data. Training will take some time.
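Put together, the search and fit could look like the sketch below. The contents of random_grid are not given in this article, so the grid here is a hypothetical example, and make_regression supplies synthetic training data so the block runs on its own.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the real training split.
X_train, y_train = make_regression(
    n_samples=120, n_features=5, noise=5.0, random_state=0
)

# Hypothetical search space; the values are assumptions for illustration.
random_grid = {
    "n_estimators": [20, 50, 100],
    "max_features": ["sqrt", 1.0],
    "max_depth": [5, 10, 15, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

rf = RandomForestRegressor(random_state=42)
rf_random = RandomizedSearchCV(
    estimator=rf,
    param_distributions=random_grid,
    scoring="neg_mean_squared_error",  # higher (closer to 0) is better
    n_iter=10,   # number of random combinations to try
    cv=5,        # 5-fold cross-validation for each combination
    verbose=0,
    random_state=42,
    n_jobs=1,
)
rf_random.fit(X_train, y_train)

print(rf_random.best_params_)
print(rf_random.best_score_)
```

After fitting, rf_random.predict uses the best estimator found, refitted on the whole training set.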

Model Evaluation

Evaluating a model's performance is the best way to decide whether to select it, and it is an integral part of model development.

Mean Absolute Error (MAE): 0.9047490109890105
Mean Squared Error (MSE): 4.098584280417587
Root Mean Squared Error (RMSE): 2.024496055915542
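These three metrics can be computed with scikit-learn and NumPy. The sketch below uses small placeholder arrays in place of the real test targets and predictions, so its numbers will differ from those reported above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Placeholder values standing in for y_test and the model's predictions.
y_test = np.array([3.0, 5.0, 7.5, 2.0])
predictions = np.array([2.5, 5.0, 8.0, 4.0])

mae = mean_absolute_error(y_test, predictions)  # mean of |error|
mse = mean_squared_error(y_test, predictions)   # mean of error squared
rmse = np.sqrt(mse)                             # same units as the target

print("MAE:", mae)
print("MSE:", mse)
print("RMSE:", rmse)
```

RMSE is the square root of MSE, which is why the reported RMSE of about 2.02 corresponds to the reported MSE of about 4.10.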

Model Building with Flask API

To build the web app, first pickle the model that you created in Jupyter.
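Pickling can be sketched as below. The file name is a hypothetical choice, the tiny model is a stand-in for the tuned one, and the file is written to the system temp directory only to keep the example tidy.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny stand-in model; in the project this would be the tuned regressor.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([1.0, 2.0, 3.0, 4.0])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Hypothetical file name for the serialized model.
model_path = os.path.join(tempfile.gettempdir(), "random_forest_regression_model.pkl")

# Serialize the fitted model to disk.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# Load it back, as the web app would do at start-up.
with open(model_path, "rb") as f:
    loaded_model = pickle.load(f)

print(loaded_model.predict([[2.5]]))
```

A pickled scikit-learn model should be loaded with the same scikit-learn version that created it, which is one more reason to pin versions in requirements.txt.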

Create a new app.py file. Import the modules and libraries needed to deploy the model, and load the pickled model in app.py.
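A minimal app.py might be structured like the sketch below. It is not the code from the linked repository: the /predict route, the JSON field names and the two-feature stand-in model are all assumptions made so the example runs on its own, and a real app would load the pickled model and render an HTML form instead.

```python
import numpy as np
from flask import Flask, jsonify, request
from sklearn.ensemble import RandomForestRegressor

app = Flask(__name__)

# Stand-in model so the sketch is self-contained; the real app.py would
# load the pickled model with pickle.load instead of training here.
# Assumed feature order: [present_price, car_age].
_X = np.array([[5.0, 3], [9.0, 6], [4.0, 10], [7.0, 1]])
_y = np.array([3.5, 5.0, 1.5, 6.0])
model = RandomForestRegressor(n_estimators=10, random_state=0).fit(_X, _y)


@app.route("/predict", methods=["POST"])  # hypothetical route name
def predict():
    # Read the (hypothetical) JSON fields and build one feature row.
    data = request.get_json()
    features = [[float(data["present_price"]), int(data["car_age"])]]
    price = float(model.predict(features)[0])
    return jsonify({"predicted_price": round(price, 2)})


# Exercise the route with Flask's test client, without starting a server.
client = app.test_client()
response = client.post("/predict", json={"present_price": 6.0, "car_age": 4})
print(response.get_json())
```

To serve the app locally, the usual entry point is `app.run()` or the `flask run` command; on Heroku a production WSGI server would normally be used.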

For the web application, you need to create an HTML file for the structure of the website. You can add CSS and JS for styling and other behaviour.

For all the files and code, please visit this GitHub link: gaurav21s/CarPricePrediction

Deployment on Heroku

To deploy the model on Heroku, put your project on GitHub and create an account on Heroku. Select Python as the programming language while setting up the account.

Then create a new app and connect your GitHub repository to Heroku.

Now go to the Deploy section, connect the GitHub account, scroll down to Manual deploy and click Deploy.

For deploying any model, the most important file is requirements.txt, which lists all the libraries and dependencies. Be careful while creating this file. Use the line below to freeze (create) your requirements file at the end.

pip freeze > requirements.txt

Please use a new virtual environment for this project, so that pip freeze captures only the packages the project actually needs.

Hurray! You have now deployed the web app on Heroku.

Congrats! You have finished this project.

Thank you.

Please visit my LinkedIn profile: Gaurav Shrivastav
