Predicting the Environmental Impact of Vehicles: A Statistical Analysis

19 min read · Feb 3, 2024

This study aims to develop a predictive model to determine the impact of vehicles on CO2 emissions and smog ratings. A dataset for 2022 containing information about various types of vehicles from different companies will be analyzed. Statistical techniques like regression analysis will be used to predict CO2 emissions, and classification techniques will be applied to predict smog ratings. The results will inform policymakers, decision-makers, and the general public about vehicle emissions, regulations, and environmentally friendly practices.

Research Questions:

Which factors contribute the most to CO2 emissions and smog ratings?
Is there a significant difference in CO2 emissions and smog ratings between different input variables?

Hypotheses:

H0: There is no significant difference in CO2 emissions and smog ratings between different input variables.
Ha: There is a significant difference in CO2 emissions and smog ratings between at least two types of input variables.

Question:

Do all companies produce vehicle classes with the same smog rating?
H0: There is no significant difference in the mean of CO2 emissions and smog ratings between different companies for the same vehicle types.
Ha: The mean of CO2 emissions and smog ratings differs between at least two companies per vehicle type.

Assumptions: The dataset may not represent the entire population of vehicles for 2022 and could be biased towards certain types of vehicles or manufacturers. Other factors, such as hybrid type, battery type, drive mode, drivetrain, and fuel type, may also contribute to the assessment of environmental impact. During the exploratory data analysis, rows are removed at random under certain conditions, which may slightly affect the results, but the overall conclusions of the project remain the same.

The dataset used for this project is the ‘2022 Fuel Consumption Ratings’ dataset. It is a public dataset available on Kaggle that contains information on fuel consumption ratings, CO2 emissions ratings, and other vehicle attributes for different models of cars sold in Canada for the model year 2022. The dataset consists of 946 observations and 15 variables. To read the CSV file and inspect the dataset, we use the following commands.

# Read the dataset
df <- read.csv('Data/MY2022 Fuel Consumption Ratings.csv')
# Inspect the dataset
str(df)
summary(df)
# Get the class of each column
sapply(df, class)

Together, str, summary, and sapply show the structure of the dataset, the data type of each column, and a statistical summary of each column. These insights help us perform data manipulation efficiently.

Dataset Preparation

In this section, we will delve into the essential steps of data preparation to ensure accuracy, consistency, and relevance, aligning the data with the specific requirements of our analysis. By meticulously attending to these tasks, we lay the foundation for a robust and comprehensive exploration of the data.

To explore the key aspects of data preparation in detail, we divide the section into four subsections:

Removing Irrelevant Columns
Check for NaN values
EDA
Train/Test Split

1. Removing Irrelevant Columns

Based on manual observation and domain knowledge, we remove certain columns from our dataset as they do not serve a significant purpose in our analysis. The columns we choose to remove are:

Model.Year:

We do not need the model year as all the models are from 2022.

Model:

We do not need the model, as our analysis is to assess vehicle types and companies. So, this does not align with the scope of our project.

Fuel.Consumption..City..L.100.km.

We do not need this column, as we will use the combined city and highway column instead (the combined rating, 55% city and 45% highway, is given in L/100 km).
Why? Our focus is not on the difference in fuel consumption between city and highway driving.

Fuel.Consumption.Hwy..L.100.km..

Fuel.Consumption.Comb..mpg..

Since we are using the combined rating (55% city, 45% highway) in L/100 km, we do not need this column, as it provides the same information in different units.

CO2.Rating

We are using regression analysis to predict CO2 emissions; the rating can be derived manually by defining ranges of CO2 emissions and assigning each range a class.

2. Check for NaN values

The next step in our analysis is to check whether the dataset has any missing values. Applying the is.na function shows that there are no missing values in the dataset.
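As a minimal sketch of this check, the snippet below counts missing values per column with is.na. The data frame df_demo and its values are made up for illustration (it deliberately contains one NA so the output is not all zeros); on the real data the same two calls would be applied to df.

```r
# Toy data frame standing in for df (hypothetical values, one NA on purpose)
df_demo <- data.frame(
  Make = c("Acura", "BMW", "Audi"),
  Cylinders = c(4, 6, NA)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(df_demo))
print(na_per_column)

# TRUE only when the whole data frame is free of missing values
no_missing <- !anyNA(df_demo)
print(no_missing)
```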

3. EDA

We will first check if the Make and Vehicle Type columns are balanced. Since we need to use our regression and classification results to represent the behavior of the entire vehicle population, we want to ensure that the data is representative of all the companies. For this, we use the group_by function to group our data based on each vehicle class and company make.

After plotting the scatter plot and calculating the quantile ranges of the number of observations per group, we found an IQR of 5.75, with a third quartile (Q3) of 8 and a median (Q2) of 4. To ensure that our analysis is based on sufficient data, we remove the groups with fewer observations than Q2 and limit the groups above Q3 to Q3 observations. This approach ensures that our data is not skewed by outliers and remains representative of all companies.
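The balancing step can be sketched roughly as follows. This is an illustration under assumptions, not the project's actual code: the data is simulated, the column name Vehicle.Class is assumed, base R's aggregate stands in for dplyr's group_by so the example has no dependencies, and capping a group by random sampling is one plausible reading of "limit the values above Q3 to Q3".

```r
set.seed(1)
# Simulated stand-in for the Make and vehicle class columns (hypothetical)
df_demo <- data.frame(
  Make = sample(c("Acura", "Audi", "BMW", "Ford"), 60, replace = TRUE),
  Vehicle.Class = sample(c("Compact", "Mid-size", "SUV: Small"), 60, replace = TRUE)
)

# Rows per (Make, Vehicle.Class) group: base R analogue of group_by() plus a count
group_counts <- aggregate(n ~ Make + Vehicle.Class,
                          data = transform(df_demo, n = 1), FUN = sum)

# Quartiles and IQR of the group sizes
q <- quantile(group_counts$n, probs = c(0.25, 0.5, 0.75))
iqr <- IQR(group_counts$n)

# Keep groups with at least the median (Q2) number of rows ...
keep <- group_counts[group_counts$n >= q[["50%"]], ]
# ... and cap larger groups at Q3 rows by random sampling (assumed strategy)
cap <- floor(q[["75%"]])
df_key <- paste(df_demo$Make, df_demo$Vehicle.Class)
keep_key <- paste(keep$Make, keep$Vehicle.Class)
idx <- unlist(lapply(keep_key, function(k) {
  rows <- which(df_key == k)
  if (length(rows) > cap) rows[sample.int(length(rows), cap)] else rows
}))
df_balanced <- df_demo[sort(idx), ]
```

Because the removal is random, rerunning without a fixed seed gives a slightly different subset, which matches the caveat stated in the assumptions.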

3.1 Categorical Columns

In this section we will visualize each categorical column to see the distribution and how it is associated with our target variables.

1. Transmission (Transmission and Gear):
The Transmission column combines the transmission type and the number of gears, so we can split it to reduce the number of levels in the categorical column and define an ordinal relationship between the gears. We split it into transmission type (Transmission) and gear (Gear). This reduces the unique values per column and lets us assess the significance of transmission type and gear separately. Before we perform the split, however, we visualize the Transmission column to find out whether there are any dominant classes. Upon plotting the histogram, the results are as follows:

From this, we can clearly observe that AS8 dominates the column and introduces a bias. Furthermore, the R-squared (R2) score comes out to 0.399, which means that about 40% of the variance in our regression variable can be explained by the transmission. To reduce this bias, we undersample the data by removing approximately 50 rows that have a transmission of AS8.

We now split the column and plot both resulting columns to see whether the data is balanced.
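A split of this kind could look like the sketch below, which separates the leading letters from the trailing digits with regular expressions. The example codes are illustrative, and coding a transmission without a gear count as 0 is an assumption (suggested by the 0 level that appears in the Gear column), not something the text spells out.

```r
# Example transmission codes (illustrative values)
transmission <- c("AS8", "M6", "AV", "AM7", "A10", "AV7")

# Transmission type: drop the trailing digits
trans_type <- sub("[0-9]+$", "", transmission)

# Gear: drop the leading letters; codes with no digits are coded as 0 (assumption)
gear_chr <- sub("^[A-Z]+", "", transmission)
gear_chr[gear_chr == ""] <- "0"
gear <- as.integer(gear_chr)

print(data.frame(transmission, trans_type, gear))
```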

Transmission:

After splitting the transmission type into a separate column, we get five types of transmissions, represented by the histogram below.

Although the distribution of classes in the column is still dominated by the AS category, we can move forward as each variable in the column has a significant frequency and an impact on the regression variable. The p-value of each variable on CO2.Emissions is represented below.

Furthermore, since our transmission column is a categorical column that has a nominal relationship, we need to transform our data appropriately for our model. For this purpose, we will perform dummy encoding to transform our data.
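One common way to dummy-encode a nominal column in R is model.matrix, sketched below on made-up values. The first factor level is dropped as the reference category; which reference level the project used is not stated, so treat that detail as an assumption.

```r
# Toy transmission types (hypothetical values)
df_demo <- data.frame(
  Transmission = factor(c("A", "AM", "AS", "AV", "M", "AS"))
)

# One 0/1 column per level except the reference level "A";
# [, -1] removes the intercept column that model.matrix adds
dummies <- model.matrix(~ Transmission, data = df_demo)[, -1, drop = FALSE]
print(dummies)
```

Note that lm and similar modelling functions apply this encoding to factor columns automatically, so an explicit step like this is mainly needed for models that expect a numeric matrix.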

Gear

For this column, we assign levels to each category, as the relationship between the values is ordinal. The categorical values present within the column are 0, 6, 7, 8, 9, 10. For this we use the as.factor function. Once this is done, we check the distribution of the column.

From the graph, we can see that Gear 1 and Gear 5 have very low frequencies. We remove them regardless of their association with the target variable: even if that association is strong, the model might fail to capture it from so few observations. Moving on, we check the association between the remaining levels and the target variable to assess how they influence our model.
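The level definition and the low-frequency filter can be sketched as below. The gear counts and the min_count threshold are hypothetical. The sketch also uses factor(..., ordered = TRUE) rather than as.factor, because as.factor on its own creates an unordered factor; that swap is a suggestion, not the original code.

```r
# Toy gear values with two rare levels (hypothetical counts)
gear <- c(rep(8, 20), rep(6, 12), rep(10, 9), rep(1, 1), rep(5, 2))

# Ordered factor so that 1 < 5 < 6 < 8 < 10 is encoded explicitly
gear_f <- factor(gear, levels = sort(unique(gear)), ordered = TRUE)

# Frequency of each level
freq <- table(gear_f)
print(freq)

# Keep only levels with at least min_count rows (threshold is an assumption)
min_count <- 5
keep_levels <- names(freq)[freq >= min_count]
gear_kept <- droplevels(gear_f[gear_f %in% keep_levels])
print(table(gear_kept))
```

The same pattern applies to the Cylinders and Fuel Type columns, where rare categories are dropped for the same reason.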

For the rest of the variables in the Gear column, we can observe that each has a strong association with our target variable.

2. Cylinders:
Since Cylinders is a categorical column whose values have an ordinal relationship, we use the as.factor function to define levels within the column. The unique categories within the column are 3, 4, 5, 6, 8, 10, 12. The distribution of each category is represented below.

Following the same principle applied to the Gear column, we remove the values that have a very low frequency, as our model might fail to capture their relationship with our regression variable. The association between the remaining categories and the target variable is represented below.

Each remaining category has a significant effect on our regression variable, and the R2 score is high, which suggests that this column is an important factor for our regression.

3. Fuel Type:
The fuel type column is a categorical column made up of four categories: X = regular gasoline, Z = premium gasoline, D = diesel, E = ethanol (E85). Their frequency is represented below.

Following the same approach as above, we remove categories D and E due to their low frequency and test how the remaining two categories are associated with the regression variable.

While both of these have a significant p-value, the R2 value is extremely low, indicating that this column explains very little of the variance in our regression variable. Furthermore, as this column is also a categorical variable with a nominal relationship, we transform it using dummy variables as well.

4. Smog Rating
Smog rating: the tailpipe emissions of smog-forming pollutants rated on a scale from 1 (worst) to 10 (best). Finally, we will visualize our classification label to judge how it is distributed in the dataset.

Having an imbalanced target variable in a classification problem is not uncommon and can be okay depending on the context and the objectives of the analysis. It is important to consider the potential impact on the performance of the machine learning model and to take appropriate measures to mitigate any biases or issues that may arise. In such cases, it may be necessary to use techniques like oversampling or undersampling to balance the classes or to use evaluation metrics that account for the class imbalance, such as the F1 score, precision, recall, or ROC-AUC score. So, if the model’s performance is not satisfactory, we will revisit this section and explore alternative approaches to improve the results.

3.2 Numerical Columns

In this section we will visualize each numerical column to identify patterns, outliers, missing values, and potential errors in the data.

1. Engine Size:
Engine size is a continuous variable that ranges from 1.4 to 6.4 L. Its distribution is represented below.

The histogram shows the distribution of the variable, while the density plot shows the distribution in the form of a continuous line. Additionally, we add a red line to the plot which represents a normal distribution with mean and standard deviation equal to those of the variable Engine.Size.L. Although the data is right-skewed, we do not remove outliers, as they represent real-world behavior and we want our model to be able to handle such scenarios. Upon further analyzing the outliers, we find that the observations with an engine size of 6, which are flagged as outliers, include multiple rows representing the same vehicle type and company. This indicates that these engine sizes are not anomalies or errors in the dataset; rather, they stand out as a result of insufficient data.
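A plot of this kind can be drawn with base R graphics, as in the sketch below. The engine_size vector is simulated from a right-skewed distribution purely for illustration, and the sample skewness at the end is an extra numeric check that is not part of the original analysis.

```r
set.seed(42)
# Simulated right-skewed stand-in for Engine.Size.L. (hypothetical data)
engine_size <- round(rgamma(300, shape = 6, rate = 2), 1)

# Histogram on the density scale so the curves can be overlaid
hist(engine_size, freq = FALSE, breaks = 20,
     main = "Engine size (simulated)", xlab = "Engine size (L)")

# Kernel density estimate of the data
lines(density(engine_size), lwd = 2)

# Normal curve with the same mean and standard deviation, in red
curve(dnorm(x, mean = mean(engine_size), sd = sd(engine_size)),
      add = TRUE, col = "red", lwd = 2)

# Sample skewness: positive values indicate a longer right tail
skew <- mean((engine_size - mean(engine_size))^3) / sd(engine_size)^3
print(skew)
```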

2. Fuel Consumption
This section represents the combined fuel consumption of each car in L/100 km. Fuel consumption is an important factor in calculating CO2 emissions because, for a given fuel, the amount of CO2 produced is directly proportional to the amount of fuel consumed: the more fuel consumed, the more CO2 emitted into the atmosphere. We test this further using a one-way ANOVA.

With a p-value well below 0.05 and an R2 value of 0.9992, we can conclude that fuel consumption is significant and is an important factor for our regression analysis.
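The p-value and R2 quoted above are the kind of numbers that come out of an ANOVA table and model summary for a single-predictor fit. The sketch below shows how they could be obtained; the data is simulated, the slope of 23.4 is a made-up illustrative value, and the variable names are hypothetical.

```r
set.seed(7)
# Simulated fuel consumption (L/100 km) and CO2 emissions (hypothetical data)
fuel <- runif(200, min = 4, max = 19.8)
co2 <- 23.4 * fuel + rnorm(200, sd = 5)

# Single-predictor linear model and its ANOVA table
fit <- lm(co2 ~ fuel)
anova_tab <- anova(fit)
print(anova_tab)

# p-value of the F test for fuel, and the share of variance explained
p_value <- anova_tab[["Pr(>F)"]][1]
r2 <- summary(fit)$r.squared
print(c(p_value = p_value, r2 = r2))
```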

Moreover, the range of the column is 4.0–19.8, and to assess the distribution, we visualize the data in the graph below.

From this, we can conclude that the data is approximately normally distributed. There is a slight tail beyond a value of roughly 16; as concluded in the previous subsection, this is a result of few observations being present in that range.

3. CO2 Emission
This section represents the distribution of the target regression variable. The range of this column is 94–465. We can observe the distribution of the column in the figure below.

Analyzing this, we can observe that the distribution is approximately normal, with a slight tail similar to the one in the distributions above. It is important to note that this tail might disappear in a different iteration of the project.

3.3 Correlation

Before we proceed to our regression and classification analysis, we want to see how our final variables are correlated with each other. To map this out, we build a heatmap. Note: Since a higher smog rating is better (1 = worst), a negative correlation with smog rating means that the variable is associated with more smog-forming emissions.

From here we can observe that Cylinders, Fuel Consumption, and Engine Size are highly correlated with each other and with our target regression variable. For our classification label, all these factors show a similar strength of correlation, roughly 0.49 to 0.60.
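A correlation heatmap of this sort can be sketched with cor and base R's heatmap. Everything below is simulated: the relationships between the columns are invented to mimic the pattern described, and the column names (including Fuel.Consumption.Comb and Smog.Rating) are assumptions about the cleaned data frame.

```r
set.seed(3)
n <- 150
engine <- runif(n, 1.4, 6.4)
# Simulated numeric columns with invented relationships (hypothetical data)
num <- data.frame(
  Engine.Size.L. = engine,
  Cylinders = round(2 + 1.2 * engine + rnorm(n, sd = 0.8)),
  Fuel.Consumption.Comb = 4 + 1.8 * engine + rnorm(n),
  Smog.Rating = pmin(7, pmax(1, round(8 - 0.8 * engine + rnorm(n))))
)
num$CO2.Emissions <- 23 * num$Fuel.Consumption.Comb + rnorm(n, sd = 6)

# Pearson correlation matrix
corr <- cor(num)
print(round(corr, 2))

# Heatmap without clustering or rescaling, so cells map directly to correlations
heatmap(corr, Rowv = NA, Colv = NA, symm = TRUE, scale = "none",
        margins = c(10, 10))
```

In this simulated example the smog rating row is negative against the size-related columns, which is the sign pattern the note above describes.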

Regression

We will perform regression analysis to develop a predictive model for CO2 emissions based on input factors such as engine size, cylinders, fuel consumption, gear, transmission, and fuel type. By doing so, we can train our model to accurately predict the environmental impact of each car based on its features. This will provide valuable insights into the influence of each factor on carbon emissions. To do this, we will use the following regression models:

Linear Regression
SVR
GBM

Before we conduct our regression analysis, we perform ANOVA testing to assess how the input variables are associated with our target variable.

ANOVA
The ANOVA test identifies how each input variable affects our CO2 emissions. The results of the ANOVA table are:

From the results presented above, we can conclude that Gear has no significant impact on our regression analysis. Furthermore, our model was showing a warning since two or more variables were highly correlated with each other, causing an issue of multicollinearity. In order to solve this problem, we removed the Cylinders column from our training and conducted our analysis. Note: We ran the regression multiple times using single and combination variables of Cylinders, Engine Size, and Fuel Consumption and found the best results with the combination Fuel Consumption and Engine Size.
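One common diagnostic for the multicollinearity described here is the variance inflation factor (VIF). The text does not say which check produced the warning, so the sketch below is only an illustration: it computes VIF by hand in base R on simulated predictors, with a hypothetical helper named vif_manual.

```r
set.seed(21)
n <- 200
engine <- runif(n, 1.4, 6.4)
# Simulated predictors; Cylinders is built to track engine size closely (hypothetical)
X <- data.frame(
  Engine.Size = engine,
  Cylinders = 2 + 1.2 * engine + rnorm(n, sd = 0.3),
  Fuel.Consumption = 4 + 1.8 * engine + rnorm(n)
)

# VIF for each predictor: regress it on the others, then 1 / (1 - R^2)
vif_manual <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

vif <- vif_manual(X)
print(round(vif, 2))
```

A frequently used rule of thumb treats values above roughly 5 to 10 as a sign of problematic collinearity; dropping one of the offending predictors, as done with Cylinders here, is one standard remedy.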

Models

To conduct the regression, we use Linear Regression, SVR, and GBM. The results of each regression are shown in the table below.

Based on the results, we choose Linear Regression as our best model. Since the train/test ratio is less than 1 and we have performed k-fold cross-validation, we conclude that the model is not overfitting and move forward with our analysis.
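A k-fold cross-validation of a linear model can be written in a few lines of base R, as sketched below. The data is simulated, k = 5 and RMSE as the error metric are assumptions, and the variable names are hypothetical; the point is only to show how training and held-out errors are compared fold by fold.

```r
set.seed(11)
n <- 200
# Simulated predictors and response (hypothetical data)
d <- data.frame(fuel = runif(n, 4, 19.8), engine = runif(n, 1.4, 6.4))
d$co2 <- 23 * d$fuel + 2 * d$engine + rnorm(n, sd = 5)

k <- 5
# Randomly assign each row to one of k folds of near-equal size
fold <- sample(rep(seq_len(k), length.out = n))

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# For each fold: fit on the other k - 1 folds, score on both parts
cv <- t(sapply(seq_len(k), function(i) {
  train <- d[fold != i, ]
  test <- d[fold == i, ]
  fit <- lm(co2 ~ fuel + engine, data = train)
  c(train = rmse(train$co2, fitted(fit)),
    test = rmse(test$co2, predict(fit, newdata = test)))
}))

cv_mean <- colMeans(cv)
print(cv)
print(cv_mean)
```

Training and held-out errors that stay close to each other across folds are the usual sign that a model is not overfitting.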

Confidence Interval
Calculating the confidence intervals of regression coefficients can strengthen our analysis in two ways. Firstly, a confidence interval gives us an idea of how precise our estimate is for each coefficient. A wider interval means that our estimate is less precise, and we should have less confidence in its accuracy. Conversely, a narrower interval indicates a more precise estimate. Secondly, the confidence interval helps us to identify statistically significant predictors. If the interval for a coefficient does not include zero, we can conclude that the predictor variable is significantly associated with the response variable at the chosen level of significance. We now calculate the confidence intervals for the coefficients of our final regression model.

Here, we can observe that apart from Transmission A, AM, and AV, none of the confidence intervals include zero, indicating that all the remaining predictor variables have a statistically significant association with the response variable. Fuel Consumption has the highest magnitude, while Engine Size also has a very narrow interval, indicating that both are significant factors, along with Fuel.TypeX, for predicting CO2 emissions using this model.
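For a fitted lm object, intervals like these come from confint. The sketch below uses simulated data with hypothetical variable names, including a pure-noise predictor to show what a non-informative coefficient can look like.

```r
set.seed(13)
n <- 200
# Simulated data; noise_var has no real effect on co2 (hypothetical)
d <- data.frame(
  fuel = runif(n, 4, 19.8),
  engine = runif(n, 1.4, 6.4),
  noise_var = rnorm(n)
)
d$co2 <- 23 * d$fuel + 2 * d$engine + rnorm(n, sd = 5)

fit <- lm(co2 ~ fuel + engine + noise_var, data = d)

# 95% confidence interval for every coefficient
ci <- confint(fit, level = 0.95)
print(ci)

# TRUE where the interval excludes zero (significant at the 5% level)
excludes_zero <- ci[, 1] > 0 | ci[, 2] < 0
print(excludes_zero)
```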

Classification

We will also conduct classification analysis to train a model that predicts the smog rating of a car based on its input factors. Smog rating is a useful metric for consumers as it provides an easy-to-understand categorization of a car’s environmental impact. By training our model on a dataset of cars with known smog ratings and their corresponding input factors, we aim to create a model that accurately predicts a car’s smog rating based on its features. This will enable consumers to make more informed decisions about the environmental impact of their car purchases. To perform classification, we will use the following classification models:

Logistic Regression
Decision Tree
Random Forest
Naive Bayes

Once we have selected our best-performing regression model, we compute the predicted CO2 emissions and add them to our classification training data. As with regression, we first perform ANOVA testing to assess the association between input and target variables.

ANOVA
We perform ANOVA testing using the multinom() regression function. The results of our analysis are shown below:

This table shows the results of a type II ANOVA (analysis of variance) test for each independent variable in a multinomial regression model with Smog Rating as the dependent variable. We can see that CO2 Emissions and Fuel Consumption are not significant for the target variable Smog Rating. Since our domain knowledge led us to expect different results, we do not remove these variables yet. When we run the logistic regression without these variables, the overall accuracy decreases by ~8%. Coming back to this, we conclude that these variables have high p-values because they are highly correlated with each other, as calculated in the correlation section, so each one masks the other's contribution. To solve this problem, we remove Fuel Consumption from the input variables and get the updated ANOVA table:

Now, we can observe that CO2 Emissions is significantly associated with our target variable, and Transmission now also shows an association. We move forward with these input variables.
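The significance of a single predictor in a multinomial model can be checked with a likelihood-ratio comparison of the model with and without that predictor, which is roughly what a type II test does for a main-effects model. The sketch below does this by hand with nnet::multinom on simulated data; the variable names, the three rating classes, and the standardized co2_z predictor are all assumptions for illustration, and the original analysis may have used a different function to build its table.

```r
library(nnet)

set.seed(5)
n <- 300
# Simulated predictors (hypothetical data); co2_z is standardized CO2
d <- data.frame(co2_z = rnorm(n), engine = runif(n, 1.4, 6.4))

# Three rating classes driven by co2_z: higher emissions, lower (worse) rating
score <- -2 * d$co2_z + rnorm(n)
d$smog <- cut(score, breaks = quantile(score, c(0, 1 / 3, 2 / 3, 1)),
              include.lowest = TRUE, labels = c("3", "5", "7"))

# Multinomial models with and without the predictor of interest
full <- multinom(smog ~ co2_z + engine, data = d, trace = FALSE)
reduced <- multinom(smog ~ engine, data = d, trace = FALSE)

# Likelihood-ratio test: drop in deviance against a chi-squared distribution
lr_stat <- reduced$deviance - full$deviance
lr_df <- full$edf - reduced$edf
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
print(c(lr_stat = lr_stat, df = lr_df, p_value = p_value))
```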

Models
To build an accurate classification model, we will take into account the input variables specified by ANOVA model and use the following classification models:

Logistic Regression

Decision Trees

Random Forest

Naive Bayes

The average results for each of the models are summarized in the table below:

Looking at the results, we can see that all four models (Logistic Regression, Decision Trees, Random Forest, and Naive Bayes) show varying levels of performance across different classes. For example, the sensitivity and precision values are quite different for each class within each model, indicating that the models handle different classes in different ways.

In general, the Random Forest model performs best overall, with the highest accuracy, sensitivity, and precision values. This model handles all classes relatively well, although the precision values for Class 3 and Class 5 could be improved. The Logistic Regression model also performs well, but with lower accuracy, sensitivity, and precision values than Random Forest. The Decision Trees and Naive Bayes models struggle more with classification, with lower accuracy and more NaN values for precision.

ROC
For our ideal model, we will now plot the ROC curve to assess how well the model fits each of our classes:

The ROC curve lies close to the top-left corner of the plot, which means the classifier has high sensitivity (ability to correctly identify positive cases) and specificity (ability to correctly identify negative cases), and therefore a high area under the curve (AUC). We can confirm this by examining the AUC values, which summarize the performance of our model.
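For a multi-class label, a per-class AUC is typically computed one-vs-rest: each class in turn is treated as the positive class. The sketch below computes it with the rank-based (Mann-Whitney) formula in base R. The function auc_binary, the probability matrix, and the true labels are all hypothetical; a dedicated ROC package may well have been used in the original analysis.

```r
# AUC from ranks: probability that a random positive outranks a random negative
auc_binary <- function(score, is_pos) {
  r <- rank(score)  # ties receive average ranks
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical predicted class probabilities (rows sum to 1) and true labels
probs <- matrix(c(0.7, 0.2, 0.1,
                  0.6, 0.3, 0.1,
                  0.2, 0.5, 0.3,
                  0.1, 0.3, 0.6,
                  0.2, 0.2, 0.6,
                  0.3, 0.3, 0.4),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("3", "5", "7")))
truth <- c("3", "3", "5", "7", "7", "5")

# One-vs-rest AUC for each class
auc_ovr <- sapply(colnames(probs), function(cl) {
  auc_binary(probs[, cl], truth == cl)
})
print(auc_ovr)
```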

Evaluation

In this section we will answer our second research question, Do all companies produce vehicle classes with the same smog rating?

We answer this question by grouping the average smog rating and CO2 emissions by vehicle class and assessing the results. Note that a small p-value (< 0.05) indicates strong evidence against the null hypothesis and suggests that there is a significant difference in the average smog rating and CO2 emissions between the companies within each vehicle class.

Smog Rating

CO2 Emission

From these results we can observe that the p-values for vehicle class and make are both less than 0.05, indicating that there is a difference between the average smog rating and CO2 emissions of each vehicle type per company. Based on the results, we reject the null hypothesis.
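A test with one p-value for vehicle class and one for make is what a two-factor ANOVA produces. The sketch below shows one way to obtain such a table with aov; the data is simulated with invented effects, the column names are assumptions, and the exact model formula used in the original analysis is not given in the text.

```r
set.seed(9)
# Balanced simulated design: 3 makes x 2 vehicle classes x 15 vehicles (hypothetical)
d <- expand.grid(Make = c("A", "B", "C"),
                 Vehicle.Class = c("Compact", "SUV"),
                 rep = 1:15)

# Invented additive effects of make and class on CO2 emissions
make_effect <- c(A = 0, B = 20, C = 40)
class_effect <- c(Compact = 0, SUV = 35)
d$CO2.Emissions <- 200 +
  make_effect[as.character(d$Make)] +
  class_effect[as.character(d$Vehicle.Class)] +
  rnorm(nrow(d), sd = 10)

# Two-factor ANOVA: does mean CO2 differ by class and by make?
fit <- aov(CO2.Emissions ~ Vehicle.Class + Make, data = d)
tab <- summary(fit)[[1]]
print(tab)

# p-values for the two factors, with padded row names trimmed
p_values <- setNames(tab[["Pr(>F)"]][1:2], trimws(rownames(tab))[1:2])
print(p_values)
```

The same call with Smog.Rating on the left-hand side would give the corresponding table for the smog rating.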

Analysis

Research Questions

Which factors contribute to the CO2 Emission and Smog Rating the most?

Null hypothesis (H0): There is no significant difference in CO2 Emission and Smog Rating between different input variables

Alternative hypothesis (Ha): There is a significant difference in CO2 Emission and Smog Rating between at least two types of input variables

From the results of the ANOVA tables presented above, we can conclude that for Regression: Fuel Consumption, Cylinders, and Engine Size have the most impact on CO2 Emissions. Similarly, for Classification: Fuel Combustion and CO2 Emissions have the most significant impact. These variables are singled out because, in each case, they have the lowest p-values in our ANOVA tests. So, we reject the null hypothesis.

Do all companies produce vehicle classes with the same smog rating?

a. Null hypothesis (H0): There is no significant difference in the mean of CO2 Emission and Smog Rating between different companies for the same vehicle types.

b. Alternative hypothesis (Ha): The mean of CO2 Emission and Smog Rating differs between at least two companies per vehicle type.

As per the results of our Evaluation section, we can reject the null hypothesis. Furthermore, we visualize our predicted data. Based on the accompanying figure, we can observe that, for the same vehicle types, companies do not have the same CO2 emissions or smog ratings.

Summary

Consumers can easily find information about their cars online, including the engine size, number of cylinders, fuel combustion type, transmission type, gears, and fuel type. Armed with this information, they can use the predictive model developed in this study to estimate the environmental impact of their vehicles. By doing so, consumers can make more informed decisions about their transportation choices and take steps to reduce their carbon footprint.

Furthermore, this study can be used to assess how different companies producing the same type of vehicle can produce cars with different CO2 emissions and smog ratings. By using this model, we can compare the environmental impact of cars produced by different companies and identify which ones are more environmentally friendly. Consumers can use this information to make more informed decisions when purchasing a vehicle, and companies can use it to improve their production processes and create more sustainable cars. The image attached below is an example of how this analysis can be conducted.

Conclusion

In conclusion, the model we developed to predict CO2 emissions and smog ratings can be a useful tool in making informed decisions about the environment. By identifying the top-performing vehicles in terms of emissions, we can encourage the use of those vehicles and discourage the use of the worst performers. This, in turn, can help to reduce the overall impact of transportation on the environment.

However, it is important to note that our model has certain limitations. For example, there are other factors besides fuel type that can contribute to the assessment of a vehicle’s environmental impact, such as hybrid type, battery type, drive mode, and drivetrain. Therefore, the results of this analysis should be taken as a starting point rather than a comprehensive evaluation.

Additionally, while our dataset contained a large number of observations, it was limited in certain respects. For example, after removing the low-frequency diesel and ethanol categories, our analysis covered only two fuel types (regular and premium gasoline), whereas there are others (such as electric and hydrogen fuel cell). Moreover, the distribution of companies in each vehicle type was not equal, which could skew the results. To improve our analysis, it would be beneficial to have a larger dataset that includes a wider range of fuel types and a more equal distribution of companies in each vehicle type.

Overall, while our model has limitations, it provides a useful framework for evaluating the environmental impact of different vehicles. By taking into account multiple factors, we can help to reduce the impact of transportation on the environment and make more informed decisions about the vehicles we use.

The R code for the project is available on my GitHub if you would like to follow the code step by step and offer insights, recommendations, or code improvements.