Predicting Car Prices Using Machine Learning and Data Science

Using features such as MPG, model, year of manufacture and engine type, predict the price of cars with machine learning and data science.

Published in ODSCJournal · 11 min read · May 10, 2022

Most people are now familiar with the terms machine learning and data science. They have become so ubiquitous that many software engineers and web developers know them and use them in their day-to-day work. Applications range from self-driving cars and crop management to predicting manufacturing defects and healthcare. As a result, there has been a positive shift in how the market views and adopts machine learning and artificial intelligence.

Note: The Data was taken from https://www.kaggle.com/CooperUnion/cardataset

Data science is used to extract useful insights and help decision-makers choose the best direction for their companies. In car price prediction, manufacturers could use it to estimate prices for the new cars they produce, based on the market value of comparable cars. Pricing cars close to their market value could, in turn, lead to better growth and outcomes for car manufacturers.

It is now time to jump into the code and see how to use machine learning to predict the prices of cars.

Importing the libraries

Since we are about to use quite a few machine learning models, we import a lot of libraries. Using libraries makes things easier because we do not need to write their functionality from scratch. A library may not be installed in a particular environment, so make sure to install everything needed for the task at hand.

Pandas is a library used for data analysis. It is built on top of NumPy, and its plotting methods wrap Matplotlib. For instance, calling .plot() on a Pandas DataFrame produces a Matplotlib plot.

Seaborn provides a high-level interface that is used to draw informative statistical plots.

NumPy is used for performing a wide variety of mathematical operations on arrays and matrices. Its operations are vectorized, which makes computation fast and efficient.

Sklearn (Scikit-learn) is one of the most popular and useful Python libraries for machine learning. It provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. We use several Sklearn sub-packages in this project.

Missingno is a really handy library for plotting the null values present in our data. Large datasets usually come with missing values. Rather than hunting for those missing values manually, it is far more effective to plot them.

Warnings is a module from the Python standard library that is generally used to suppress warnings in our code. Though warnings are useful, they sometimes clutter the output and make the results harder to read.

Category_encoders is used for encoding categorical data into a numerical form that machine learning models can process. For more information about categorical encoders, feel free to take a look at the blog below.

An Easier Way to Encode Categorical Features | by Rebecca Vickery | Towards Data Science

%matplotlib inline is used to display plots inside the Jupyter notebook rather than in an external window.
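Since a library may be missing from a given environment, a quick check along the following lines can help before running the notebook. This is only a sketch: the list of import names is our assumption based on the libraries described above, and the helper `find_missing` is a hypothetical name, not something from the original code.

```python
from importlib.util import find_spec

# Import names of the libraries described above
# (note that scikit-learn is imported as "sklearn").
REQUIRED = [
    "pandas", "numpy", "seaborn", "matplotlib",
    "sklearn", "missingno", "category_encoders",
]


def find_missing(packages):
    """Return the names in `packages` that cannot be found in this environment."""
    return [name for name in packages if find_spec(name) is None]


missing = find_missing(REQUIRED)
if missing:
    print("Install these before running the notebook:", ", ".join(missing))
else:
    print("All required libraries are available.")
```

Anything reported as missing can then be installed with your package manager of choice (for example pip or conda).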

Exploratory Data Analysis (EDA)

It is now time to perform exploratory data analysis to get interesting insights from the data. Below are some plots that give us good insights into our data.

Car Companies with the total number of cars (Image by Author)

This plot shows that Chevrolet has the highest number of cars in the data. On the other hand, manufacturers such as Bugatti and Genesis have very few cars. The plot also shows the counts for the other car brands in our data.
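A count like the one in this plot can be sketched with pandas as follows. The tiny DataFrame here is made up for illustration and is not the real dataset; we only assume that each row is one car and that the manufacturer is stored in a ‘Make’ column.

```python
import pandas as pd

# Made-up sample: one row per car, with the manufacturer in the 'Make' column.
cars = pd.DataFrame(
    {"Make": ["Chevrolet", "Chevrolet", "Chevrolet", "Ford", "Ford", "Bugatti"]}
)

# Number of cars per manufacturer, sorted from most to least common.
cars_per_make = cars["Make"].value_counts()
print(cars_per_make)
```

Calling cars_per_make.plot(kind="bar") on the result would then draw a bar chart of these counts.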

Number of cars in different years (Image by Author)

As can be seen from the plot, most cars in our data are recent, and only a few are quite old. This is good, as the data reflects recent car prices rather than relying on significantly older data.

Total Number of Cars based on Size (Image by Author)

It is seen that compact cars outnumber those in the midsize and large segments. This reflects the real world, where most cars are compact or midsize and only a few models are large. Because the data is fairly representative of the real world, we can expect reasonable predictions on the test set.

Missingno is a useful plotting library for understanding how many values are missing in the data. We see that the ‘Market Category’ feature has a lot of missing values, as denoted by the white strips. Apart from this, there are also some missing values in ‘Engine HP’ and ‘Engine Cylinders’ that must be handled before giving the data to the machine learning models.
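The same information can also be read off as numbers. The following is a minimal sketch that uses plain pandas rather than Missingno, on a made-up DataFrame that only borrows the column names mentioned above.

```python
import numpy as np
import pandas as pd

# Made-up sample with gaps in the same columns the plot highlights.
cars = pd.DataFrame({
    "Engine HP": [300.0, np.nan, 150.0, 200.0],
    "Engine Cylinders": [6.0, 4.0, np.nan, 4.0],
    "Market Category": [None, "Luxury", None, None],
    "MSRP": [40000, 25000, 18000, 22000],
})

# Count the missing entries in each column, largest first.
missing_per_column = cars.isnull().sum().sort_values(ascending=False)
print(missing_per_column)
```

A table like this complements the plot: the plot shows where the gaps are, while the counts show how many rows are affected.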

Average Price of cars in different years (Image by Author)

It is important to group the information so that specific attributes and features can be extracted and analyzed. In our case, we grouped the cars by year of manufacture and then took the average price within each group. We see that as the years progress, the average price of cars in our data steadily increases.
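The grouping step described here can be sketched with a pandas groupby. The numbers are invented for illustration, and the column names ‘Year’ and ‘MSRP’ are assumptions about how the year and price are stored.

```python
import pandas as pd

# Made-up sample: model year and price of five cars.
cars = pd.DataFrame({
    "Year": [2015, 2015, 2016, 2016, 2017],
    "MSRP": [20000, 30000, 28000, 32000, 35000],
})

# Group the cars by year, then average the price within each group.
avg_price_by_year = cars.groupby("Year")["MSRP"].mean()
print(avg_price_by_year)
```

The same pattern works for any other grouping column, such as the transmission type or the manufacturer.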

Average Price of Cars with Different Transmission Types (Image by Author)

Cars whose transmission is both automatic and manual (automated manual) have considerably higher average prices than the others. We also know that manual cars usually cost less than their automatic counterparts, and that is reflected in our data as well.

Average Prices of Cars Based on Manufacturer (Image by Author)

Now comes a very interesting plot. It can clearly be seen that ‘Bugatti’ has a far higher average price than any other manufacturer. The brands that come closest to Bugatti's price range include Maybach, Rolls-Royce, and Lamborghini. Taking a look at the graph really helps us understand the price range of cars across manufacturers.

Heatmap of the Features (Image by Author)

The heatmap function from the Seaborn library gives us a color-coded view of the values. Depending on the palette chosen, higher values are shown in brighter colors or vice versa. Here we plot the correlation between the various features.

It is seen that ‘city mpg’ and ‘highway MPG’ are strongly correlated, with a correlation coefficient of 0.89. Similarly, the features ‘Engine Cylinders’ and ‘Engine HP’ are also positively correlated. The remaining feature pairs appear to be either negatively correlated or uncorrelated.

It is important to note that the correlation coefficient lies in the range -1 to 1. The stronger the positive correlation between two features, the closer the coefficient is to 1. The stronger the negative correlation, the closer the coefficient is to -1. A value near 0 indicates little or no linear relationship.
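The matrix behind such a heatmap can be computed with pandas. The sketch below uses invented values and only three of the column names mentioned in this article, so the coefficients will not match the 0.89 reported above.

```python
import pandas as pd

# Made-up sample of three numerical features.
cars = pd.DataFrame({
    "highway MPG": [30, 25, 35, 20, 28],
    "city mpg": [22, 18, 27, 14, 20],
    "Engine HP": [150, 250, 110, 400, 200],
})

# Pairwise Pearson correlation coefficients between all columns.
corr = cars.corr()
print(corr.round(2))
```

Passing this matrix to Seaborn, for example sns.heatmap(corr, annot=True), would render it as a colored grid. In this toy sample the two MPG columns move together (positive coefficient), while horsepower moves against them (negative coefficient).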

Data Preprocessing

Target encoding is a useful data preprocessing method in which each category of a categorical feature is replaced by a number derived from the target, typically the mean target value for that category, so that machine learning models can work with it.
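To make the idea concrete, here is a hand-rolled sketch of mean target encoding in plain pandas. It is not the implementation used by the category_encoders library (which may additionally smooth the estimates), and the data, the ‘Make_encoded’ column name, and the fallback to the overall mean are our own illustrative choices.

```python
import pandas as pd

# Made-up training and test rows.
train = pd.DataFrame({
    "Make": ["Ford", "Ford", "BMW", "BMW", "Bugatti"],
    "MSRP": [20000, 30000, 50000, 70000, 2000000],
})
test = pd.DataFrame({"Make": ["BMW", "Ford", "Tesla"]})

# Learn the mapping from the training rows only, so that no
# information about test prices leaks into the encoding.
make_means = train.groupby("Make")["MSRP"].mean()
global_mean = train["MSRP"].mean()

train["Make_encoded"] = train["Make"].map(make_means)
# A category never seen in training ("Tesla") falls back to the overall mean.
test["Make_encoded"] = test["Make"].map(make_means).fillna(global_mean)

print(train)
print(test)
```

Each manufacturer is now represented by a single number, which keeps the feature count low even when a column has many distinct categories.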

It is also time to create useful features that aid in predicting car prices. One such feature is ‘Years of Manufacture’. How long ago a car was manufactured is likely to influence its price, so this feature is created and added to the data.
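The exact definition of this feature is not spelled out here, so the following is only one plausible reading: the number of years between a reference year and the car's model year. The reference year and the ‘Year’ column name are assumptions made for this sketch.

```python
import pandas as pd

# Made-up sample of model years.
cars = pd.DataFrame({"Year": [2017, 2010, 1995]})

# Assumption: measure age relative to a fixed reference year.
REFERENCE_YEAR = 2022

# Derived feature: how many years ago each car was manufactured.
cars["Years of Manufacture"] = REFERENCE_YEAR - cars["Year"]
print(cars)
```

Using a fixed reference year (rather than today's date) keeps the feature reproducible when the notebook is rerun later.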

It is also important to divide the data into a training set and a test set. The training set is used to train the machine learning models under consideration. After training, we evaluate the models on the test set, for which the true output is already known. We compare each model's predictions for the test set with the known test outputs to measure its performance.
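A split of this kind can be sketched with Scikit-learn's train_test_split. The arrays below are placeholders, and the 80/20 ratio and random seed are assumptions for illustration, not necessarily the values used for the car data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, and one target per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; fix the seed so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Every row ends up in exactly one of the two sets, so the test set stays unseen during training.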

It is now time to perform one-hot encoding, where each categorical feature is converted into a set of binary (0/1) columns, one per category, that machine learning models can process. We take categorical features such as ‘Engine Fuel Type’ and ‘Driven Wheels’ and convert them to numerical features in this way.

We scale the features so that each has a mean of 0 and a standard deviation of 1. Bringing all features to the same scale helps models that are sensitive to feature magnitudes, such as the Support Vector and K-Neighbors regressors.
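Standardization of this kind can be sketched with Scikit-learn's StandardScaler. The numbers are placeholders (think of horsepower and cylinder count), not values from the dataset.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Placeholder features on very different scales.
X_train = np.array([[100.0, 4.0], [200.0, 6.0], [300.0, 8.0]])
X_test = np.array([[250.0, 6.0]])

# Learn the mean and standard deviation from the training set only,
# then apply the same transformation to the test set.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

Note that only the training columns are guaranteed to end up with mean 0 and standard deviation 1. The test set is transformed with the training statistics, so its values are merely on a comparable scale.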

Machine Learning Analysis

After successfully converting the values into numerical vectors, it is now time to use our machine learning models for predicting the prices of cars. We will start with a very simple model, Linear Regression, and gradually increase the complexity to see how this affects the mean squared error.

Linear Regression

We first create an instance of the Linear Regression model. We then fit it to our training data and use the test inputs to make predictions. Finally, we append the mean squared error and mean absolute error to lists so that those values can be plotted later.
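These steps can be sketched as follows. The data is synthetic (random features with a linear relationship to a made-up price), and the list names mse_scores and mae_scores are hypothetical stand-ins for whatever the original notebook uses.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Synthetic stand-in for the preprocessed car data (not the real dataset).
rng = np.random.default_rng(0)
true_coef = np.array([5000.0, -2000.0, 1000.0])
X_train = rng.normal(size=(80, 3))
X_test = rng.normal(size=(20, 3))
y_train = 30000 + X_train @ true_coef + rng.normal(scale=500, size=80)
y_test = 30000 + X_test @ true_coef + rng.normal(scale=500, size=20)

# Lists that collect the errors of every model we try.
mse_scores, mae_scores = [], []

# 1. Create the model, 2. fit it on the training set, 3. predict on the test set.
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 4. Store both error metrics for the later comparison plots.
mse_scores.append(mean_squared_error(y_test, y_pred))
mae_scores.append(mean_absolute_error(y_test, y_pred))

print(f"MSE: {mse_scores[-1]:.1f}, MAE: {mae_scores[-1]:.1f}")
```

The same four steps are repeated for every other model in this article; only the estimator class changes.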

It is now time to see how our model performs on the test set. One useful plot for evaluating a machine learning model is Seaborn's regplot, with the actual prices on the x-axis and the predicted prices on the y-axis. The closer the points lie to the line, the better the model is doing. If the points are widely scattered, the model is not doing well on the test set.

We see that the linear regression model is doing quite well in terms of its prediction of the prices. Let us also go over other models that might improve the results.

Support Vector Regressor

We now use the Support Vector Regressor to see how it performs on car price prediction.

This plot clearly indicates that the model is not performing better than the Linear Regression model, as the points are widely scattered around the line. Let us try other models and select the best one.

K-Neighbors Regressor

It is clearly seen that this model performs much better than the Support Vector Regressor, as the points are far less scattered. It also performs well compared to the Linear Regression model that we used earlier. Therefore, this model is a candidate for deployment, but we should test other models before reaching a conclusion.

PLS Regression

PLS Regression performs quite similarly to the K-Neighbors Regressor. Therefore, it would be good to compare the mean squared error and mean absolute error of the two models. This will be covered in the final part of this article. For now, brace yourself for the other machine learning models.

Decision Tree Regressor

Wow! Look at how close the predicted values are to the actual values. The points almost form a straight line, with only slight deviations. This model, as far as I can tell, is performing much better than the previous models that we have seen.

Gradient Boosted Decision Regressor

In the plot, there are some values that our model is not able to predict accurately. We see scatter here and there with the Gradient Boosting Decision Tree. In addition, it is worth noting that this model takes longer to train than a single Decision Tree. Therefore, for now we stick with the Decision Tree as our model for deployment.

MLP Regressor

The MLP Regressor also seems to be performing well, but not as well as the Decision Tree Regressor. After testing many machine learning models, we see that the Decision Tree Regressor performs the best on the test set. Let us now look at the summary of all the models and come to a conclusion.

Mean Absolute Error of Models

Based on the outcomes from all the models, it is evident that the Decision Tree Regressor has the lowest mean absolute error. The Support Vector Regressor has a very high mean absolute error, so it would not be appropriate to use that model for car price prediction. However, the same model might perform the best on other tasks. Therefore, we should explore many models before deploying one in production.

Mean Squared Error of Models

In terms of the Mean Squared Error, we see that the K-Neighbors Regressor performs well. But we stick with Mean Absolute Error as the metric for choosing our deployment model, because it is expressed in the same units as the price and is therefore easier to interpret. Thus, we would use the Decision Tree Regressor in production to predict the prices of various cars based on features such as MPG, cylinders, etc.

You can take a look at the full code, which is linked below.

Car Prices Predicting by Suhas Maddali

Conclusion

Finally, we have reached the end of the article. Hope you found this article helpful. I had to make it long so that each and every step is highlighted and clear for the readers. Feel free to share your thoughts and feedback. Thanks!

Below are some of the platforms where you can get to know my work and reach out.

LinkedIn: Suhas Maddali, Northeastern University, Data Science

GitHub: suhasmaddali (Suhas Maddali ) (github.com)

Medium: Suhas Maddali — Medium
