Week 3: Predicting CO2 emissions

Built a Machine Learning model that uses linear regression to predict the amount of pollution a car engine could emit

Letícia Gerola
Published in Joguei os Dados
6 min read · May 6, 2020


We’re on the third week of Pyrentena and time just flew by! After messing around with the avocado sales dataset, I decided to dig into a much more serious theme and try some Machine Learning predictions. This week’s dataset is available on Kaggle and contains data on cars and their CO2 emissions. We have features such as vehicle class, model, engine size, model year, cylinders… I have to say: cars ain’t really my biggest interest, but helping the environment definitely is! Machine Learning to help prevent environmental disasters is, for me, the coolest thing ever, and I dream of working on something related to it in the future. Let’s take a look at the dataset:

Graph 1: head()

This dataset contains 13 columns and 1067 rows. After doing some research, I found out that the power of the engine, named here ENGINESIZE, is the feature that has a direct impact on the levels of CO2 emissions. After importing some libraries, I started plotting some graphs to see what this relationship looked like and which ML model would work in this situation:

# importing the plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

# plotting graph: engine size x CO2 emissions
plt.figure(figsize=(13, 5))
sns.lineplot(x=df['ENGINESIZE'], y=df['CO2EMISSIONS'])
plt.xlabel('Motor engine')
plt.ylabel('CO2 emissions')
plt.show()
Graph 2: engine x CO2

The lineplot showed a positive relationship between the size/power of the engine and the carbon emissions. With some variation, I could say that the bigger the engine, the greater the levels of CO2 emitted. I already suspected this could be a nice dataset to try a linear regression model, and plotting this graph started to confirm it.
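A quick numeric companion to that visual check is the Pearson correlation coefficient. The article relies on the plot alone; the arrays below are made-up illustrative values, while on the real dataframe it would be df['ENGINESIZE'].corr(df['CO2EMISSIONS']):

```python
import numpy as np

# made-up engine sizes and CO2 values that trend upward, for illustration only;
# on the real data: df['ENGINESIZE'].corr(df['CO2EMISSIONS'])
engine = np.array([1.0, 1.6, 2.0, 3.0, 4.7])
co2 = np.array([150.0, 180.0, 200.0, 255.0, 320.0])

# Pearson r close to +1 means a strong positive linear relationship
r = np.corrcoef(engine, co2)[0, 1]
print(f'Pearson r = {r:.3f}')
```

A value of r near 1 is exactly the kind of relationship a simple linear regression can capture well.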

Splitting data to train the model

I needed more libraries to start building the model! After importing them, I assigned the features I was interested in to two variables:

# importing necessary libraries
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split

# features into variables
engine = df[['ENGINESIZE']]
co2 = df[['CO2EMISSIONS']]

I was ready to split the data into ‘train’ and ‘test’ sets, train my model, and see if it could predict well enough. To do that, I used train_test_split and plotted the correlation between these two variables in the TRAIN dataset:

# splitting data into train and test with train_test_split
engine_treino, engine_test, co2_treino, co2_test = train_test_split(engine, co2, test_size=0.2, random_state=42)

# plotting the correlation between features on the train dataset
plt.scatter(engine_treino, co2_treino, color='blue')
plt.xlabel('engine')
plt.ylabel('co2 emission')
plt.show()
Graph 3: corr engine x CO2

Plotting the correlation on the train dataset resulted in the graph above. Despite quite a bit of residual variance, a linear regression line could still be drawn, and that is what I did next:

Creating the model with the train dataset

The linear regression formula is Y = A + B*X. This means the model had to figure out the values of ‘A’ and ‘B’ in order to predict the CO2 emission ‘Y’ when given an engine size ‘X’. Time to build the model and train it to find these two coefficients:

# creating a linear regression model
# LinearRegression is a class from sklearn
modelo = linear_model.LinearRegression()

# linear regression formula: Y = A + B*X
# training the model to obtain the values of A and B (always fit on the train dataset)
modelo.fit(engine_treino, co2_treino)

# exhibiting the coefficients A and B that the model generated
print(f'(A) intercept: {modelo.intercept_} | (B) inclination: {modelo.coef_}')

# output
(A) intercept: [126.28970217] | (B) inclination: [[38.99297872]]

There! I have my A and B coefficients according to the model trained with the linear regression algorithm. It seemed like a good idea to plot the linear regression line before applying the model to the TEST dataset to check its accuracy:

Graph 4: LR on corr (train)
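Before moving on, the fitted line can be sanity-checked by hand by plugging the coefficients above into Y = A + B*X; the 2.0 L engine size is just an illustrative input:

```python
# coefficients reported by the trained model
A = 126.28970217  # intercept
B = 38.99297872   # slope ('inclination')

# Y = A + B*X for a hypothetical 2.0 L engine
engine_size = 2.0
co2_pred = A + B * engine_size
print(f'predicted CO2: {co2_pred:.2f}')  # predicted CO2: 204.28
```

In other words, every extra litre of engine size adds roughly 39 units of predicted CO2 emissions on top of the baseline intercept.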

Executing the model on the test dataset and evaluating the results

I decided to plot the ‘correlation graph’ between the two features on the test dataset to see how different it would be from graph 4:

# print linear regression line on our TEST dataset
plt.scatter(engine_test, co2_test, color='green')
plt.plot(engine_test, modelo.coef_[0][0]*engine_test + modelo.intercept_[0], '-r')
plt.ylabel('CO2 emissions')
plt.xlabel('Engine')
plt.show()
Graph 5: LR on corr (test)

You can see that the datapoints are a bit more spread out, but the relationship is still there. Time to evaluate our model! I decided to print the following metrics (hold on tight, statistics is coming!):

-Sum of squared errors (SSE): squares each residual and sums them, so positive and negative errors can’t cancel out.

-Mean squared error (MSE): the average of the squared errors.

-Root mean squared error (RMSE): the square root of MSE, back in the same units as Y.

-r2-score: the proportion of the variance of the variable Y that is explained by X.
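To make the relationships between those four metrics concrete, here they are computed by hand on a tiny made-up example (the numbers are illustrative, not from the dataset):

```python
import numpy as np

# three made-up true/predicted CO2 values, for illustration only
y_true = np.array([200.0, 250.0, 300.0])
y_pred = np.array([210.0, 240.0, 310.0])

residuals = y_pred - y_true                     # [10, -10, 10]
sse = np.sum(residuals ** 2)                    # 300.0: squared residuals summed
mse = sse / len(y_true)                         # 100.0: the average of SSE
rmse = np.sqrt(mse)                             # 10.0: back in the units of Y
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # 5000.0: total variance around the mean
r2 = 1 - sse / ss_tot                           # 0.94: share of variance explained
print(sse, mse, rmse, r2)
```

Notice how RMSE (10.0) reads directly as “the typical prediction is off by about 10 units of CO2”, which is why it is often the most intuitive of the error metrics.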

# numpy and sqrt are needed for the metrics below
import numpy as np
from math import sqrt

# making predictions on the TEST dataset
predictCO2 = modelo.predict(engine_test)

# showing metrics to check the accuracy of our model
# note: sklearn metrics expect (y_true, y_pred), in that order
print(f'Sum of squared error (SSE): {np.sum((predictCO2 - co2_test)**2)}')
print(f'Mean squared error (MSE): {mean_squared_error(co2_test, predictCO2)}')
print(f'Sqrt of mean squared error (RMSE): {sqrt(mean_squared_error(co2_test, predictCO2))}')
print(f'R2-score: {r2_score(co2_test, predictCO2)}')
# output

Sum of squared error (SSE): 210990.768215
Mean squared error (MSE): 985.9381692274999
Sqrt of mean squared error (RMSE): 31.399652374309813
R2-score: 0.6782015355440534

r2-score is the easiest one to understand when it comes to evaluating the model, but it is always good to calculate other statistical metrics too.
All of the metrics above help evaluate the accuracy of the model! r2, for instance, is 0.68: this means that our linear regression model (with the values of A and B given) is able to explain 68% of the variance in the cars’ CO2 emissions from their engine size. It is worth mentioning that a usual benchmark for this metric is 0.70, so this is quite a satisfactory result!

How this could help real life

This ML model could be of use to consumers shopping for their next car. Since you insist on driving, how about picking one with a lower environmental impact?! It could be deployed as a simple webpage (maybe using Flask?) where you type the cars you are interested in and the website returns the amount of pollution each vehicle emits.
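As a rough sketch of that idea, a minimal Flask endpoint could serve predictions from the trained coefficients; the route, the query parameter name, and the hard-coded A and B are my assumptions, not code from the article:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# coefficients from the trained model (A = intercept, B = slope);
# a real deployment would load the pickled scikit-learn model instead
A = 126.28970217
B = 38.99297872

@app.route('/predict')
def predict():
    # engine size arrives as a query parameter, e.g. /predict?engine=2.0
    engine = float(request.args.get('engine', 1.0))
    return jsonify({'engine_size': engine,
                    'co2_emission': round(A + B * engine, 2)})
```

A fuller version would map the car make and model the user types to its engine size before applying the formula.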

It could also offer some fun conversions of what this amount of CO2 really ‘means’, such as: ‘This car emits enough CO2 to fill 100 balloons per minute’, or maybe cross the data with another dataset on CO2 emissions and their capacity to melt polar ice caps. Can you imagine the impact it could have on a consumer if our deployed model could say: ‘This car melts 200 g of polar ice caps per minute’?! I would definitely have second thoughts about buying it, that is for sure.

Have another idea of how to deploy this ML model? Wanna help me make this happen? Please, message me and let’s get to work!

You can check the complete notebook with this solution on my Github.


Letícia Gerola
Joguei os Dados

Data scientist and journalist. Author of the Data Science blog ‘Joguei os Dados’.