Predicting Vehicle Mileage

A data science project using pandas, seaborn, and scikit-learn

Konrad Siebor
Analytics Vidhya
7 min read · Oct 7, 2019


From an early age, I have always been interested in the car industry. I used to pore over model stats and read car magazines cover to cover. In this project, I took that passion and applied it to a data analysis problem with significant real-world applicability.

The particular data set I worked with is made up of specifications for 398 unique vehicles, hosted on the UCI Machine Learning Repository and freely available on Kaggle. Each entry contains information such as cylinders, horsepower, acceleration, etc. Most crucially, each vehicle carries a label in the form of an mpg figure. This is the specific metric that I hoped to analyze and eventually predict.

To begin the data analysis, I first imported the .csv file into a Google Colab notebook and used pandas to convert it into a data frame. Here’s an initial look:

Data overview
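For reference, the import boils down to something like this (the exact file name is an assumption):

    import pandas as pd

    # Read the Kaggle CSV into a data frame (file name assumed)
    df = pd.read_csv('auto-mpg.csv')

    # First look at the data
    df.head()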

It seems there are seven quantitative variables and one categorical variable. The first column (disregarding the index) is the all-important mpg figure. Most of these features are fairly self-explanatory; the only ambiguous column is origin. What does this mean exactly? A quick dive into the Kaggle forums indicates that the labels 1, 2, and 3 correspond to America, Europe, and Asia respectively. With the features understood, the first step was data clean-up: I checked whether the data contained any null values.

Null value count
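The check itself is a pandas one-liner, roughly:

    # Count missing values in each column
    df.isnull().sum()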

Perfect! It seems as though I’ve hit the jackpot: a data set with no null values. Let’s try to do some further manipulation.

Attempted data manipulation

What’s this? It appears that the horsepower column consists of Object types rather than numbers. Even more worryingly, it appears that the creators of the data set used question marks (“?”) rather than null values to represent unknowns. Fortunately, a quick check indicates that only 6 rows have this problem, so I dropped them and continued with the analysis.
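A sketch of that clean-up, assuming the column is named horsepower:

    # Find the rows where horsepower is the '?' placeholder
    print((df['horsepower'] == '?').sum())  # 6 in this data set

    # Drop those rows and convert the column to a numeric type
    df = df[df['horsepower'] != '?'].copy()
    df['horsepower'] = df['horsepower'].astype(float)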

First off, I wanted an overall picture of the data, so I took the mean of each quantitative feature.

Means of quantitative features
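In pandas this is simply something like:

    # Mean of every numeric column
    df.mean(numeric_only=True)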

Interesting. The average model year is ’75, which makes sense considering this data set was compiled in ’83. The average mileage is also a very respectable (for the time) 23.45 mpg. Next, I was interested in extracting the specific makes represented in the data set, which would be perfect for a seaborn visualization.

Car manufacturer distribution
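One way to build this plot, assuming the make is simply the first word of the car name column:

    import seaborn as sns

    # Extract the manufacturer from the car name (column name assumed)
    df['make'] = df['car name'].str.split().str[0]

    # Bar chart of manufacturer counts, most common first
    sns.countplot(y='make', data=df, order=df['make'].value_counts().index)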

This data set definitely seems to be skewed towards domestic manufacturers, with the most popular being Ford and Chevy. Next, I dove deeper into seaborn’s powerful visualization capabilities. For example, it can compare two quantitative variables, such as horsepower and displacement.

horsepower vs displacement
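The jointplot call is roughly:

    # Joint distribution of horsepower and displacement,
    # with marginal histograms for each variable
    sns.jointplot(x='horsepower', y='displacement', data=df)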

The jointplot tells us two distinct things. First, horsepower is positively correlated with displacement; there really is no replacement for displacement. Second, there appears to be a large density of cars with approximately 50–100 horsepower and a displacement of 50–150 cubic inches. Let’s continue; seaborn can showcase even more complex relationships involving multiple variables.
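A plot along these lines can be produced with seaborn’s scatterplot, roughly:

    # mpg vs horsepower, with weight mapped to color
    # and cylinder count mapped to marker size
    sns.scatterplot(x='horsepower', y='mpg', hue='weight', size='cylinders', data=df)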

In the above scatter plot, mpg is plotted against horsepower. Furthermore, cars with a higher weight are represented with a darker color, while larger cylinder counts are represented by larger circles. The figure showcases some interesting trends. In general, as horsepower, weight, and cylinder count increase, mpg decreases, in a trend that appears to resemble an exponential decay.

With these data visualizations out of the way, I wanted to transition to predicting mpg using machine learning algorithms. It’s crucial to recognize that mpg is a continuous numerical variable. For that reason, something like a Logistic Regression or another classifier would not be appropriate, as they produce discrete output. My initial instinct was to implement a Multivariate Linear Regression, which would work nicely given how many quantitative features we have.

My machine learning implementations were conducted entirely through Python’s scikit-learn library. The first vital step was splitting the data set into training and test data. The regression would optimize and fine-tune its parameters on the training data; its performance would then be evaluated on test data that it had not seen before. Fortunately, scikit-learn’s train_test_split automates this process.

train_test_split
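In outline (the column names follow the descriptions above):

    from sklearn.model_selection import train_test_split

    # Features: everything except the label and the two non-quantitative columns
    X = df.drop(['mpg', 'car name', 'make'], axis=1)
    y = df['mpg']

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)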

There are several important things to notice in this line of code. First off, I needed to drop the ‘mpg’, ‘car name’, and ‘make’ features of the data set to make the X_train data set (the input during the training phase). This is because ‘mpg’ is the label itself, while the latter two features are categorical rather than quantitative. Next, I specified a 0.3 test size, which means that 70% of the data was used for training and 30% was used for testing. Finally, the random state was set to 42. This is crucial because 42, as we all know, is the answer to the ultimate question of life, the universe, and everything.

Next, I instantiated the model, used the .fit function to train it, and applied it to the test data. There are several ways to evaluate this model, but I started with two visualizations.
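A minimal sketch of the training and plotting steps:

    from sklearn.linear_model import LinearRegression
    import matplotlib.pyplot as plt

    # Train the regression and predict on the unseen test set
    lm = LinearRegression()
    lm.fit(X_train, y_train)
    predictions = lm.predict(X_test)

    # Predicted vs actual mpg
    plt.scatter(y_test, predictions)
    plt.xlabel('Actual mpg')
    plt.ylabel('Predicted mpg')
    plt.show()

    # Distribution of the residuals
    sns.histplot(y_test - predictions, kde=True)
    plt.show()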

The first scatter plot showcases the predicted mpg values plotted against the actual mpg values. Ideally, this would simply be a y = x line if the predictions were perfect. Although that’s not the case, it’s reassuring to see a line that at least comes close. The second figure is a distribution of the residuals. Statistical theory tells us that an effective model should have normally distributed residuals, a trend that can be seen in the graph.

Finally, there are several quantitative metrics that can be used to characterize the model. These include r², mean absolute error, mean squared error, and root mean squared error. All of them are included in scikit-learn’s metrics module.
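All four can be computed in a few lines:

    from sklearn import metrics
    import numpy as np

    print('r2:  ', metrics.r2_score(y_test, predictions))
    print('MAE: ', metrics.mean_absolute_error(y_test, predictions))
    print('MSE: ', metrics.mean_squared_error(y_test, predictions))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))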

The model had a respectable RMSE of 3.17 and an r² of 0.81; however, I was thirsty for a better model. I decided to move forward with a Multilayer Perceptron Regressor, a simple neural network model that outputs a continuous value. Once again, I went through the same process of splitting the data into training and test sets. There were two key additional steps I had to take, however, compared to the linear regression. First, I standardized the data so that all the quantitative variables were on the same scale of values, a factor that MLPs are sensitive to.
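One way to do this is with scikit-learn’s StandardScaler:

    from sklearn.preprocessing import StandardScaler

    # Fit the scaler on the training data only, then apply it to both sets
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)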

Second, the MLP Regressor offers a much wider range of model parameters to tinker with. Rather than manually trying different options such as the learning rate, the number of iterations, the number of hidden layers, etc., I decided to implement a technique called grid search, which automatically outputs the best combination from a specified set of parameter values called the parameter space.
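A sketch of the grid search; the specific candidate values below are assumptions, not the grid from the original run:

    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import GridSearchCV

    # Candidate values for the three hyper-parameters mentioned above
    param_grid = {
        'hidden_layer_sizes': [(50,), (100,), (50, 50)],
        'learning_rate_init': [0.001, 0.01, 0.1],
        'max_iter': [500, 1000, 2000],
    }

    # Exhaustively try every combination with cross-validation
    grid = GridSearchCV(MLPRegressor(random_state=42), param_grid, cv=5)
    grid.fit(X_train_scaled, y_train)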

After running for quite some time, the grid search eventually found the best combination of the three parameters.
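The winning values can be read off the fitted grid search object:

    # Best combination found by the search
    print(grid.best_params_)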

Finally, I was able to use this optimized regressor to predict mpg and analyze the predictive power of the model.
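Evaluating the tuned model follows the same pattern as before:

    # Predict with the best estimator found by the grid search
    best_mlp = grid.best_estimator_
    mlp_predictions = best_mlp.predict(X_test_scaled)

    print('r2:  ', metrics.r2_score(y_test, mlp_predictions))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, mlp_predictions)))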

Unsurprisingly, this more sophisticated model with fine-tuned hyper-parameters had a significantly higher r² value of almost 0.87 and a lower RMSE. Overall, I was fairly satisfied with the results, especially considering how relatively small the original data set was.

In the future, the accuracy of these models could be improved through further feature engineering. For example, I could use the horsepower and weight data to create an additional power-to-weight ratio column. Additionally, I could extend the grid search parameter space to include more of the hyperparameters in the default MLP Regressor implementation. Finally, I would be excited to see how my predictions would fare if I used a more powerful library, such as TensorFlow, as the machine learning backend.
