Car Price Prediction: A Beginner’s Approach to Machine Learning

Oyebamiji Micheal
6 min read · May 17, 2024

--

In this article, we will work through a simple machine learning problem together: predicting the prices of cars from features such as engine size, fuel system, and bore ratio. The purpose of this article is to introduce newcomers to what a typical machine learning workflow looks like. This article is not a tutorial on “what pandas is” or “how to get started with numpy”; readers are assumed to be familiar with the basics of pandas and related libraries.

Download Dataset

The first step is obtaining a dataset. There are numerous websites with open-source datasets, such as Kaggle, GitHub, UCI, and so on. The dataset used in this article is sourced from Kaggle, a website that hosts data science competitions and open-source data. The link to the dataset can be found here: car-price-prediction-hellbouy.

Data Exploration

Understanding the dataset we are working with is crucial in machine learning. Some common data exploration steps are listed below:

  • Checking the first five rows of the dataset: This is the first step in getting a glimpse of what our dataset looks like and the type of data present in each column. This step is particularly important so we do not work blindly.
Fig 1: The first five rows of the dataset
  • Data Size: Knowing the size of the dataset is important for a lot of reasons. For example, the size of our dataset influences the type of algorithm we choose.
Fig 2: The size of the dataset
  • Statistical Properties: The next step in our data exploration is to check some statistical properties of the dataset such as mean, standard deviation and percentiles.
Fig 3: Statistical distribution

These properties help us understand the distribution of our data and flag potential outliers. For example, the standard deviation of car prices is 7989 while the median is 10295. This suggests that a few cars are far more expensive than the rest, with prices differing significantly from the bulk of the data.

  • The last step is to check the information about each column. This is usually done using the pandas info() method.
Fig 4: Dataset information

This gives us an idea of the datatype of each column and whether any values are null.
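Under the hood, these exploration steps map to a handful of pandas calls. The sketch below runs them on a tiny synthetic frame standing in for the Kaggle data; the column names are illustrative, not the dataset’s exact schema.

```python
import pandas as pd

# A tiny synthetic frame standing in for the Kaggle car dataset;
# column names are illustrative, not the real dataset's full schema.
df = pd.DataFrame({
    "enginesize": [130, 152, 109, 136, 304],
    "horsepower": [111, 154, 102, 110, 184],
    "fueltype": ["gas", "gas", "diesel", "gas", "gas"],
    "price": [13495, 16500, 13950, 17450, 45400],
})

print(df.head())      # first five rows (Fig 1)
print(df.shape)       # (rows, columns) — dataset size (Fig 2)
print(df.describe())  # mean, std, percentiles (Fig 3)
df.info()             # dtypes and non-null counts (Fig 4)
```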

Exploratory Data Analysis

The best way to truly understand a dataset is to visualize it. Some practitioners skip this step, seeing it as boring or unimportant. However, it is essential for deciding what preprocessing is needed, such as outlier removal, data clipping, log transformation, etc.

Fig 5: Distribution of car prices

Fig 5 is a histogram overlaid with a density plot, used to analyze the distribution of car prices. The histogram reveals a concentration of cars in the lower price range, particularly below $15,000, indicating a skew towards more affordable options. The overlaid density plot emphasizes this skewness and provides a smooth representation of the data distribution.

Fig 6: Correlation between car prices and various car attributes

In the correlation plot in Fig 6, we observe a positive correlation between engine size and car price, suggesting that cars with larger engines tend to be more expensive. The same trend appears for car length, width, and horsepower. Notably, however, there is no observable correlation between a car’s height and its price.
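A correlation plot along the lines of Fig 6 can be drawn with DataFrame.corr() and seaborn’s heatmap. The numbers below are made up for illustration; the real values come from the Kaggle dataset:

```python
import matplotlib
matplotlib.use("Agg")  # headless-safe backend
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative stand-in values, not the real dataset.
df = pd.DataFrame({
    "enginesize": [130, 152, 109, 136, 304, 97],
    "carlength":  [168.8, 171.2, 176.6, 176.6, 192.7, 158.3],
    "carheight":  [48.8, 52.4, 54.3, 53.1, 55.5, 53.2],
    "price":      [13495, 16500, 13950, 17450, 45400, 7609],
})

# Pairwise Pearson correlations, rendered as an annotated heatmap.
corr = df.corr(numeric_only=True)
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.savefig("correlation_heatmap.png")

print(corr["price"].sort_values(ascending=False))
```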

Even though we know that our data is skewed from the descriptive statistics, visualizing this provides us with more information about the price concentration and potential outliers.

Data Preprocessing

In the Data preparation and preprocessing phase, the first step is to separate the features from the target variable. This is a crucial step in any machine learning workflow as it allows for the independent manipulation and analysis of the features and the target. The features are the variables that the model will learn from, while the target is the variable that the model will predict.

Fig 7: Separating features from target
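In pandas, this separation is a one-liner each way. A minimal sketch, using a stand-in frame with a handful of columns (the real dataset has many more):

```python
import pandas as pd

# Minimal stand-in frame for the car dataset.
df = pd.DataFrame({
    "enginesize": [130, 152, 109],
    "fueltype": ["gas", "gas", "diesel"],
    "price": [13495, 16500, 13950],
})

X = df.drop(columns=["price"])  # features the model learns from
y = df["price"]                 # target the model predicts
```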

The next step is to identify the numeric and categorical columns in the dataset. This is important because numeric and categorical data require different preprocessing techniques. For numeric data, scaling was applied using the StandardScaler() function. Scaling ensures that all numeric features have a similar scale, preventing any one feature from dominating others due to its scale. For categorical data, one-hot encoding was applied using the OneHotEncoder() function. One-hot encoding transforms categorical data into a format that can be understood by machine learning algorithms.

Fig 8: Scaling Numerical Columns and One Hot Encoding Categorical Columns

Finally, a pipeline was built for scaling and one-hot encoding using the Pipeline() function, and these transformations were applied to the appropriate columns using the ColumnTransformer() function. The fit_transform() method was then used to apply these transformations to the features.

Fig 9: Column Transformation using Column Transformer
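The scaling, encoding, and column-routing steps described above can be sketched with scikit-learn as follows; the frame and column names are stand-ins for the real features:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in features; the real dataset has many more columns.
X = pd.DataFrame({
    "enginesize": [130, 152, 109, 136],
    "horsepower": [111, 154, 102, 110],
    "fueltype": ["gas", "gas", "diesel", "gas"],
})

# Split columns by dtype so each group gets its own preprocessing.
numeric_cols = X.select_dtypes(include="number").columns.tolist()
categorical_cols = X.select_dtypes(exclude="number").columns.tolist()

numeric_pipe = Pipeline([("scaler", StandardScaler())])
categorical_pipe = Pipeline([("onehot", OneHotEncoder(handle_unknown="ignore"))])

# Route each column group through its pipeline.
preprocessor = ColumnTransformer([
    ("num", numeric_pipe, numeric_cols),
    ("cat", categorical_pipe, categorical_cols),
])

X_processed = preprocessor.fit_transform(X)
print(X_processed.shape)  # 2 scaled numeric + 2 one-hot columns -> (4, 4)
```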

We then partition the dataset into training and testing sets with 20% of the data reserved for testing. This process ensures that the model can be evaluated on unseen data, providing a measure of its ability to generalize to new data.

Fig 10: Data Partitioning
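The split itself is a single call to scikit-learn’s train_test_split; the arrays below are placeholders whose shapes are all that matter:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in arrays; only the shapes matter for the split.
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# Reserve 20% of the data for testing, as in the article.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (40, 2) (10, 2)
```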

Model Building

Fig 11: Model Building

Two machine learning models, Linear Regression and Random Forest, were trained on the data. The trained models were evaluated using root mean square error (RMSE) and mean absolute error (MAE).

Fig 12: Model Evaluation

The Linear Regression model had a high RMSE of 8759.66 and MAE of 6132.37, indicating a poor fit to the data. In contrast, the Random Forest model performed significantly better, with a much lower RMSE of 1835.40 and MAE of 1338.14, demonstrating superior predictive accuracy and fit. These metrics provide a quantitative measure of the models’ performance and their ability to accurately predict car prices.
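The training-and-evaluation loop can be sketched as below. Since the preprocessed car features are not reproduced here, the snippet fits both models to synthetic nonlinear data, where a Random Forest will typically also beat a plain linear fit:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data as a stand-in for the preprocessed car features.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = 1000 * X[:, 0] ** 2 + 500 * X[:, 1] + rng.normal(0, 500, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for name, model in [("Linear Regression", LinearRegression()),
                    ("Random Forest", RandomForestRegressor(random_state=42))]:
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    mae = mean_absolute_error(y_test, preds)
    results[name] = (rmse, mae)
    print(f"{name}: RMSE={rmse:.2f}, MAE={mae:.2f}")
```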

Conclusion

In this article, we demonstrated a beginner-friendly approach to machine learning by predicting car prices using features like engine size and fuel system. After obtaining the dataset from Kaggle, we explored its structure, visualized data distributions, and identified correlations. We then preprocessed the data by scaling numeric features and one-hot encoding categorical features. Finally, we built and evaluated two models: Linear Regression and Random Forest, with the latter showing superior predictive accuracy. This workflow provides a foundational understanding of key steps in a machine learning project.

The code used in this article can be found in this repository: Oyebamiji-Micheal/Car-Price-Prediction: A Beginner’s Approach to Machine Learning


Oyebamiji Micheal

Proffering solutions to real world problems using data science and machine learning along with advanced statistics, data structures and algorithms