Definition:
Prediction algorithms are used to forecast future events based on historical data.

Car Price Prediction Data:
Problem Statement:
A Chinese automobile company, Geely Auto, aspires to enter the US market by setting up a manufacturing unit there and producing cars locally to compete with its US and European counterparts. Specifically, it wants to understand the factors affecting the pricing of cars in the American market, since these may be very different from the Chinese market. The company wants to know:
Which variables are significant in predicting the price of a car, and how well those variables describe the price of a car.
Business Goal:
We are required to model the price of cars with the available independent variables so that the company can adjust the design of the cars, its business strategy, etc., to meet certain price levels based on the fitted model. Here we do this by applying a linear regression model.

Description of the Data set:
The data set contains information on the various factors influencing the price of a particular car. There are 26 columns/attributes in total, such as car name, fuel type, engine location, horsepower, peak rpm and city mpg, along with the output variable/attribute price. There are 205 observations in total.
Dataset:
To view and download the dataset and data dictionary click here.
Attribute Information:
1)Car_ID: Unique id of each observation (Integer)
2)Symboling: Assigned insurance risk rating. A value of +3 indicates that the auto is risky; -3 that it is probably pretty safe. (Categorical)
3)carCompany: Name of a car company (Categorical)
4)fueltype: Car fuel type i.e gas or diesel (Categorical)
5)aspiration: Aspiration used in a car (Categorical)
6)doornumber: Number of doors in a car (Categorical)
7)carbody: the body of the car (Categorical)
8)drivewheel: type of drive wheel (Categorical)
9)enginelocation: Location of a car engine (Categorical)
10)wheelbase: Wheelbase of a car (Numeric)
11)carlength: Length of the car (Numeric)
12)carwidth: Width of the car (Numeric)
13)carheight: height of car (Numeric)
14)curbweight: The weight of a car without occupants or baggage. (Numeric)
15)enginetype: Type of engine. (Categorical)
16)cylindernumber: Number of cylinders in the car (Categorical)
17)enginesize: Size of the engine (Numeric)
18)fuelsystem: Fuel system of car (Categorical)
19)boreratio: Boreratio of a car (Numeric)
20)stroke: Stroke or volume inside the engine (Numeric)
21)compressionratio: compression ratio of a car (Numeric)
22)horsepower: Horsepower (Numeric)
23)peakrpm: car peak rpm (Numeric)
24)citympg: Mileage in the city (Numeric)
25)highwaympg: Mileage on highway (Numeric)
26)price: Price of the car (Numeric)(Dependent variable)
# Importing necessary packages and functions required
import numpy as np               # for numerical computations
import pandas as pd              # for data processing and I/O file operations
import matplotlib.pyplot as plt  # for visualization of different kinds of plots
%matplotlib inline
# for matplotlib graphs to be included in the notebook, next to the code
import seaborn as sns            # for visualization
import warnings                  # to silence warnings
warnings.filterwarnings('ignore')

# display and plotting defaults
pd.options.display.float_format = '{:.4f}'.format
plt.rcParams['figure.figsize'] = [8, 8]
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_colwidth', None)
sns.set(style='darkgrid')

# import all libraries and dependencies for machine learning
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Reading the automobile consulting company file on which analysis needs to be done
df_auto = pd.read_csv("...\\CarPrice_Assignment.csv")
df_auto.head()

# Understanding the shape of the data frame
df_auto.shape
(205, 26)
This shows that there are 205 rows, 26 columns in the data.
# information of the data set
df_auto.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 205 entries, 0 to 204
Data columns (total 26 columns):
car_ID 205 non-null int64
symboling 205 non-null int64
CarName 205 non-null object
fueltype 205 non-null object
aspiration 205 non-null object
doornumber 205 non-null object
carbody 205 non-null object
drivewheel 205 non-null object
enginelocation 205 non-null object
wheelbase 205 non-null float64
carlength 205 non-null float64
carwidth 205 non-null float64
carheight 205 non-null float64
curbweight 205 non-null int64
enginetype 205 non-null object
cylindernumber 205 non-null object
enginesize 205 non-null int64
fuelsystem 205 non-null object
boreratio 205 non-null float64
stroke 205 non-null float64
compressionratio 205 non-null float64
horsepower 205 non-null int64
peakrpm 205 non-null int64
citympg 205 non-null int64
highwaympg 205 non-null int64
price 205 non-null float64
dtypes: float64(8), int64(8), object(10)
memory usage: 41.7+ KB

# summary statistics of the data set
df_auto.describe()

Cleaning the Data
What is Data Cleaning?
Data comes in all forms, and most of it is messy and unstructured. It rarely comes ready to use. Data sets, large and small, arrive with a variety of issues: invalid fields, missing or extra values, and values in forms different from the ones we require. To bring the data into a workable, structured form, we need to “clean” it and make it ready to use. Common cleaning steps include parsing, converting to one-hot encoding, removing unnecessary data, etc.
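As a minimal sketch of these cleaning steps, consider the toy frame below (hypothetical data, not from the car data set): a missing value is dropped, a numeric column stored as text is parsed, and a categorical column is one-hot encoded.

```python
import pandas as pd

# Toy frame (illustrative only) with typical issues: a missing value,
# a numeric column stored as strings, and a categorical column.
raw = pd.DataFrame({
    "price": ["13495", "16500", None],
    "fueltype": ["gas", "diesel", "gas"],
})

clean = raw.dropna(subset=["price"]).copy()          # drop rows with a missing target
clean["price"] = clean["price"].astype(int)          # parse numbers stored as strings
clean = pd.get_dummies(clean, columns=["fueltype"])  # one-hot encode the categorical

print(clean.shape)  # 2 rows survive; fueltype expands into indicator columns
```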

We need to do some basic cleansing activity to feed our model the correct data.
# Dropping car_ID as it is just for reference and is of no use.
df_auto = df_auto.drop('car_ID', axis=1)

# Missing Values % contribution in df
df_null = df_auto.isna().mean().round(4) * 100
df_null.sort_values(ascending=False)
price 0.0000
carheight 0.0000
CarName 0.0000
fueltype 0.0000
aspiration 0.0000
doornumber 0.0000
carbody 0.0000
drivewheel 0.0000
enginelocation 0.0000
wheelbase 0.0000
carlength 0.0000
carwidth 0.0000
curbweight 0.0000
highwaympg 0.0000
enginetype 0.0000
cylindernumber 0.0000
enginesize 0.0000
fuelsystem 0.0000
boreratio 0.0000
stroke 0.0000
compressionratio 0.0000
horsepower 0.0000
peakrpm 0.0000
citympg 0.0000
symboling 0.0000
dtype: float64

# Outlier Analysis of the target variable
plt.figure(figsize = [8,8])
sns.boxplot(data=df_auto['price'], orient="v", palette="Set2")
plt.title("Price Variable Distribution", fontsize = 14, fontweight = 'bold')
plt.ylabel("Price Range", fontweight = 'bold')
plt.xlabel("Continuous Variable", fontweight = 'bold')
plt.show()

# Extracting Car Company from the CarName
df_auto['CarName'] = df_auto['CarName'].str.split(' ', expand=True)[0]  # keep only the first token, the company name
df_auto.head(5)

# Checking for Unique Car companies
df_auto['CarName'].unique()
array(['alfa-romero', 'audi', 'bmw', 'chevrolet', 'dodge', 'honda',
       'isuzu', 'jaguar', 'maxda', 'mazda', 'buick', 'mercury',
       'mitsubishi', 'Nissan', 'nissan', 'peugeot', 'plymouth', 'porsche',
       'porcshce', 'renault', 'saab', 'subaru', 'toyota', 'toyouta',
       'vokswagen', 'volkswagen', 'vw', 'volvo'], dtype=object)
We find that some car company names contain typing mistakes, which we will correct as follows:
1) maxda → mazda 2) Nissan → nissan 3) porcshce → porsche 4) toyouta → toyota 5) vokswagen, vw → volkswagen
# Renaming the typing errors in Car Company names
df_auto['CarName'] = df_auto['CarName'].replace({'maxda': 'mazda', 'Nissan': 'nissan', 'porcshce': 'porsche',
                                                 'toyouta': 'toyota', 'vokswagen': 'volkswagen', 'vw': 'volkswagen'})

# changing the datatype of symboling from integer to string as it is a categorical variable as per the dictionary file
df_auto['symboling'] = df_auto['symboling'].astype(str)

# To check if there are duplicates present in the data set
df_auto.loc[df_auto.duplicated()]

# Separating the numerical and categorical variables/columns in the data set
cat_col = df_auto.select_dtypes(include=['object']).columns
num_col = df_auto.select_dtypes(exclude=['object']).columns
df_cat = df_auto[cat_col]
df_num = df_auto[num_col]

df_cat.head()
df_num.head()

print(df_cat.shape)
print(df_num.shape)
(205, 11)
(205, 14)
Visualizing the Data
# Visualizing number of cars for each car name in the data set
plt.figure(figsize = [15,8])
ax=df_auto['CarName'].value_counts().plot(kind='bar',stacked=False, colormap = 'rainbow')
ax.title.set_text('Carcount')
plt.xlabel("Names of the Car",fontweight = 'bold')
plt.ylabel("Count of Cars",fontweight = 'bold')
plt.show()
# Visualizing the distribution of car prices
plt.figure(figsize=(8,8))
plt.title('Car Price Distribution Plot')
sns.distplot(df_auto['price'])
plt.show()

# Pair plot for all the numeric variables
ax = sns.pairplot(df_auto[num_col])
# Box plot for all the categorical variables
plt.figure(figsize=(20, 15))
plt.subplot(3,3,1)
sns.boxplot(x = 'doornumber', y = 'price', data = df_auto)
plt.subplot(3,3,2)
sns.boxplot(x = 'fueltype', y = 'price', data = df_auto)
plt.subplot(3,3,3)
sns.boxplot(x = 'aspiration', y = 'price', data = df_auto)
plt.subplot(3,3,4)
sns.boxplot(x = 'carbody', y = 'price', data = df_auto)
plt.subplot(3,3,5)
sns.boxplot(x = 'enginelocation', y = 'price', data = df_auto)
plt.subplot(3,3,6)
sns.boxplot(x = 'drivewheel', y = 'price', data = df_auto)
plt.subplot(3,3,7)
sns.boxplot(x = 'enginetype', y = 'price', data = df_auto)
plt.subplot(3,3,8)
sns.boxplot(x = 'cylindernumber', y = 'price', data = df_auto)
plt.subplot(3,3,9)
sns.boxplot(x = 'fuelsystem', y = 'price', data = df_auto)
plt.show()
# Visualizing some more variables
plt.figure(figsize=(25, 6))

plt.subplot(1,3,1)
plt1 = df_auto['cylindernumber'].value_counts().plot(kind='bar')
plt.title('Number of cylinders')
plt1.set(xlabel = 'Number of cylinders', ylabel='Frequency of Number of cylinders')

plt.subplot(1,3,2)
plt1 = df_auto['fueltype'].value_counts().plot(kind='bar')
plt.title('Fuel Type')
plt1.set(xlabel = 'Fuel Type', ylabel='Frequency of Fuel type')

plt.subplot(1,3,3)
plt1 = df_auto['carbody'].value_counts().plot(kind='bar')
plt.title('Car body')
plt1.set(xlabel = 'Car Body', ylabel='Frequency of Car Body')
plt.show()

Derived Metrics:
We will use the mean of the car prices (“Average Price”) and visualize some variables against it.
plt.figure(figsize=(50, 5))
df_autox = pd.DataFrame(df_auto.groupby(['CarName'])['price'].mean().sort_values(ascending = False))
df_autox.plot.bar()
plt.title('Car Company Name vs Average Price')
plt.show()
plt.figure(figsize=(20, 6))
df_autoy = pd.DataFrame(df_auto.groupby(['carbody'])['price'].mean().sort_values(ascending = False))
df_autoy.plot.bar()
plt.title('Carbody type vs Average Price')
plt.show()
#Binning the Car Companies based on avg prices of each car Company using groupby and merge functions
df_auto['price'] = df_auto['price'].astype('int')
df_auto_temp = df_auto.copy()
t = df_auto_temp.groupby(['CarName'])['price'].mean()
df_auto_temp = df_auto_temp.merge(t.reset_index(), how='left',on='CarName')
bins = [0,10000,20000,40000]
label =['Budget_Friendly','Medium_Range','TopNotch_Cars']
df_auto['Cars_Category'] = pd.cut(df_auto_temp['price_y'],bins,right=False,labels=label)
df_auto.head()
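The binning step above relies on `pd.cut`, which maps each value into a labelled interval. A minimal sketch with hypothetical average prices (not from the data set) shows how the three bins behave; `right=False` makes each bin closed on the left, i.e. [0, 10000), [10000, 20000), [20000, 40000).

```python
import pandas as pd

# Hypothetical average prices, one per bin, to illustrate pd.cut
avg_price = pd.Series([8500, 15000, 35000])
bins = [0, 10000, 20000, 40000]
labels = ['Budget_Friendly', 'Medium_Range', 'TopNotch_Cars']

category = pd.cut(avg_price, bins, right=False, labels=labels)
print(list(category))  # ['Budget_Friendly', 'Medium_Range', 'TopNotch_Cars']
```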
Significant variables after Visualization
After all the visualizations, we find the following variables to be significant: Cars_Category, Engine Type, Fuel Type, Car Body, Aspiration, Cylinder Number, Drive wheel, Curb weight, Car Length, Car width, Engine Size, Boreratio, Horse Power, Wheelbase, citympg and highwaympg.
sig_col = ['price','Cars_Category','enginetype','fueltype', 'aspiration','carbody','cylindernumber', 'drivewheel',
'wheelbase','curbweight', 'enginesize', 'boreratio','horsepower',
'citympg','highwaympg', 'carlength','carwidth']

df_auto = df_auto[sig_col]
df_auto.shape
(205, 17)
Data Preparation
Dummy Variables
We need to convert the categorical variables to numeric. For this, we will use something called dummy variables.
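The key detail, used below with `drop_first=True`, is that k categorical levels need only k-1 indicator columns: the dropped level is implied when all remaining dummies are 0. A minimal sketch with a toy column (illustrative names, not the real data):

```python
import pandas as pd

# Toy categorical column with three levels
body = pd.DataFrame({'carbody': ['sedan', 'hatchback', 'wagon', 'sedan']})

# drop_first=True removes the first (alphabetical) level, leaving k-1 = 2 dummies
dummies = pd.get_dummies(body['carbody'], drop_first=True)
print(list(dummies.columns))  # ['sedan', 'wagon'] -- 'hatchback' is the implied baseline
```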
sig_cat_col = ['Cars_Category','enginetype','fueltype','aspiration','carbody','cylindernumber','drivewheel']

# Get the dummy variables for the categorical features and store them in a new variable - 'dummies'
dummies = pd.get_dummies(df_auto[sig_cat_col])
dummies.shape
(205, 29)

# To get k-1 dummies out of k categorical levels by removing the first level
dummies = pd.get_dummies(df_auto[sig_cat_col], drop_first = True)
dummies.shape
(205, 22)

# Add the results to the original dataframe
df_auto = pd.concat([df_auto, dummies], axis = 1)
df_auto.shape
(205, 39)

df_auto.sample(5)

# Drop the original cat variables as dummies are already created
df_auto.drop( sig_cat_col, axis = 1, inplace = True)
df_auto.shape
(205, 32)
Splitting the Data into Training and Testing Sets
df_auto.sample(10)
# We specify this so that the train and test data set always have the same rows, respectively
df_train, df_test = train_test_split(df_auto, test_size = 0.3, random_state = 100)
# We divide the dataframe in a 70/30 ratio

print(df_train.shape)
print(df_test.shape)
(143, 32)
(62, 32)
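The `random_state` argument is what makes the split reproducible: the same seed always selects the same rows for train and test. A small sketch with toy data (illustrative only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame of 10 rows; the same random_state yields identical splits
toy = pd.DataFrame({'x': range(10)})
a_train, a_test = train_test_split(toy, test_size=0.3, random_state=100)
b_train, b_test = train_test_split(toy, test_size=0.3, random_state=100)

print(a_train.index.equals(b_train.index))  # True: identical row selection
print(len(a_train), len(a_test))            # 7 3
```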
Rescaling the Features
For a simple linear regression with one predictor, scaling does not change the quality of the fit. But once we have several predictors measured on very different scales, it becomes important to rescale the variables so that they are comparable. If the scales are not comparable, some of the coefficients obtained by fitting the regression model may be very large or very small compared to the others. There are two common ways of rescaling:
- Min-Max scaling
- Standardisation (mean-0, sigma-1)
Here, we will use Standardisation Scaling.
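Standardisation maps each value to z = (x − mean) / std, so every scaled column has mean 0 and unit variance. A minimal sketch on a toy column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy column; after scaling, mean is ~0 and standard deviation is 1
x = np.array([[10.0], [20.0], [30.0]])
scaler = StandardScaler()
z = scaler.fit_transform(x)

print(z.ravel())           # approximately [-1.2247, 0.0, 1.2247]
print(z.mean(), z.std())   # ~0.0 and 1.0
```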
scaler = preprocessing.StandardScaler()

sig_num_col = ['wheelbase','carlength','carwidth','curbweight','enginesize','boreratio','horsepower','citympg','highwaympg','price']

# Apply the scaler to all the columns except the 'dummy' variables
df_train[sig_num_col] = scaler.fit_transform(df_train[sig_num_col])
df_train.sample(5)

# Checking the correlation coefficients to see which variables are highly correlated
plt.figure(figsize = (25, 20))
sns.heatmap(df_train.corr(), cmap="RdYlBu",annot=True)
plt.show()
Scatter plot for few correlated variables vs price.
col = ['highwaympg','citympg','horsepower','enginesize','curbweight','carwidth']

# Scatter plot of independent variables vs the dependent variable
plt.figure(figsize=(17, 10))
plt.subplot(2,3,1)
sns.scatterplot(x = 'highwaympg', y = 'price', data = df_auto)
plt.subplot(2,3,2)
sns.scatterplot(x = 'citympg', y = 'price', data = df_auto)
plt.subplot(2,3,3)
sns.scatterplot(x = 'horsepower', y = 'price', data = df_auto)
plt.subplot(2,3,4)
sns.scatterplot(x = 'enginesize', y = 'price', data = df_auto)
plt.subplot(2,3,5)
sns.scatterplot(x = 'curbweight', y = 'price', data = df_auto)
plt.subplot(2,3,6)
sns.scatterplot(x = 'carwidth', y = 'price', data = df_auto)
plt.show()

# Dividing into X and Y sets for model building
y_train = df_train.pop('price')
X_train = df_train

y_train.sample(2)
66 0.6797
96 -0.7143
Name: price, dtype: float64

X_train.sample(2)

# Shapes of X_train,y_train
print(X_train.shape)
print(y_train.shape)
(143, 31)
(143,)
Building a linear model
# Building a simple linear model with the most highly correlated variable enginesize
X_train_1 = X_train['enginesize']
# Add a constant
X_train_1c = sm.add_constant(X_train_1)
# Create a first fitted model
lr_1 = sm.OLS(y_train, X_train_1c).fit()

# Check parameters created
lr_1.params
const 0.0000
enginesize 0.8679
dtype: float64

# Let's visualise the data with a scatter plot and the fitted regression line
plt.scatter(X_train_1c.iloc[:, 1], y_train)
plt.plot(X_train_1c.iloc[:, 1], 0.8679*X_train_1c.iloc[:, 1], 'g')
plt.show()

# Print a summary of the linear regression model obtained
print(lr_1.summary())

OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.753
Model: OLS Adj. R-squared: 0.752
Method: Least Squares F-statistic: 430.5
Date: Fri, 01 Nov 2019 Prob (F-statistic): 1.09e-44
Time: 13:09:05 Log-Likelihood: -102.84
No. Observations: 143 AIC: 209.7
Df Residuals: 141 BIC: 215.6
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 8.674e-17 0.042 2.07e-15 1.000 -0.083 0.083
enginesize 0.8679 0.042 20.748 0.000 0.785 0.951
==============================================================================
Omnibus: 23.258 Durbin-Watson: 1.990
Prob(Omnibus): 0.000 Jarque-Bera (JB): 32.411
Skew: 0.885 Prob(JB): 9.16e-08
Kurtosis: 4.520 Cond. No. 1.00
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
With simple linear regression, i.e., enginesize vs price, we get an adjusted R-squared value of 0.75.
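As a sanity check, adjusted R-squared can be recomputed directly from the formula adj R² = 1 − (1 − R²)(n − 1)/(n − p − 1), which penalises additional predictors. A quick sketch with the numbers from the summary above (n = 143 observations, p = 1 predictor, R² = 0.753):

```python
# Adjusted R-squared penalises each additional predictor p
def adjusted_r2(r2, n, p):
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Numbers from the OLS summary above; the small gap vs the reported 0.752
# comes from the rounding of R-squared to three decimals.
print(round(adjusted_r2(0.753, 143, 1), 3))  # ≈ 0.751
```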
Adding more variables
The adjusted R-squared value obtained is 0.75. Since we have so many variables, let’s add the other highly correlated variables, i.e. curbweight, horsepower.
X_train_2 = X_train[['enginesize','horsepower', 'curbweight']]
# Add a constant
X_train_2c = sm.add_constant(X_train_2)
# Create a second fitted model
lr_2 = sm.OLS(y_train, X_train_2c).fit()

lr_2.params
const 0.0000
enginesize 0.3400
horsepower 0.2288
curbweight 0.3938
dtype: float64

print(lr_2.summary())

OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.819
Model: OLS Adj. R-squared: 0.815
Method: Least Squares F-statistic: 209.7
Date: Fri, 01 Nov 2019 Prob (F-statistic): 2.16e-51
Time: 13:09:07 Log-Likelihood: -80.681
No. Observations: 143 AIC: 169.4
Df Residuals: 139 BIC: 181.2
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 9.021e-17 0.036 2.5e-15 1.000 -0.071 0.071
enginesize 0.3400 0.083 4.114 0.000 0.177 0.503
horsepower 0.2288 0.064 3.589 0.000 0.103 0.355
curbweight 0.3938 0.073 5.385 0.000 0.249 0.538
==============================================================================
Omnibus: 25.598 Durbin-Watson: 1.805
Prob(Omnibus): 0.000 Jarque-Bera (JB): 55.392
Skew: 0.751 Prob(JB): 9.37e-13
Kurtosis: 5.653 Cond. No. 4.60
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The adjusted R-squared increased from 0.75 to 0.81.
Considering all 11 correlated variables from the correlation heat map
The adjusted R-squared value obtained with 3 highly correlated variables is 0.81. Since we have many more correlated variables, we can do better than this. So let's consider all 11 highly correlated variables in order (from high to low), i.e., the 7 positively correlated ones, enginesize, curbweight, horsepower, carwidth, Cars_Category_TopNotch_Cars, carlength and drivewheel_rwd, and the 4 negatively correlated ones, drivewheel_fwd, cylindernumber_four, citympg and highwaympg, and fit the multiple linear regression model.
X_train_3 = X_train[['enginesize', 'curbweight','horsepower', 'carwidth','Cars_Category_TopNotch_Cars','carlength','drivewheel_rwd','drivewheel_fwd','cylindernumber_four','citympg','highwaympg']]
# Add a constant
X_train_3c = sm.add_constant(X_train_3)
# Create a third fitted model
lr_3 = sm.OLS(y_train, X_train_3c).fit()

lr_3.params
const 0.0121
enginesize 0.0427
curbweight 0.1971
horsepower 0.1961
carwidth 0.1642
Cars_Category_TopNotch_Cars 1.1336
carlength 0.0480
drivewheel_rwd 0.1203
drivewheel_fwd -0.0262
cylindernumber_four -0.2338
citympg 0.0738
highwaympg -0.0423
dtype: float64

print(lr_3.summary())

OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.919
Model: OLS Adj. R-squared: 0.912
Method: Least Squares F-statistic: 134.8
Date: Fri, 01 Nov 2019 Prob (F-statistic): 8.62e-66
Time: 13:09:09 Log-Likelihood: -23.354
No. Observations: 143 AIC: 70.71
Df Residuals: 131 BIC: 106.3
Df Model: 11
Covariance Type: nonrobust
===============================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------------
const 0.0121 0.134 0.090 0.928 -0.253 0.277
enginesize 0.0427 0.073 0.585 0.560 -0.102 0.187
curbweight 0.1971 0.104 1.895 0.060 -0.009 0.403
horsepower 0.1961 0.064 3.048 0.003 0.069 0.323
carwidth 0.1642 0.059 2.764 0.007 0.047 0.282
Cars_Category_TopNotch_Cars 1.1336 0.114 9.935 0.000 0.908 1.359
carlength 0.0480 0.066 0.728 0.468 -0.082 0.179
drivewheel_rwd 0.1203 0.124 0.973 0.332 -0.124 0.365
drivewheel_fwd -0.0262 0.130 -0.202 0.841 -0.283 0.231
cylindernumber_four -0.2338 0.081 -2.888 0.005 -0.394 -0.074
citympg 0.0738 0.127 0.580 0.563 -0.178 0.325
highwaympg -0.0423 0.126 -0.336 0.738 -0.292 0.207
==============================================================================
Omnibus: 49.295 Durbin-Watson: 2.033
Prob(Omnibus): 0.000 Jarque-Bera (JB): 144.942
Skew: 1.324 Prob(JB): 3.36e-32
Kurtosis: 7.160 Cond. No. 21.6
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
We have achieved an adjusted R-squared of 0.91 by manually picking the highly correlated variables.
Making Predictions Using the Final Model
Now that we have fitted the model, it’s time to go ahead and make predictions using the final model.
# Applying the scaling on the test sets
df_test[sig_num_col] = scaler.transform(df_test[sig_num_col])
df_test.shape
(62, 32)

# Dividing the test set into X_test and y_test
y_test = df_test.pop('price')
X_test = df_test

# Adding a constant
X_test_1 = sm.add_constant(X_test)
X_test_new = X_test_1[X_train_3c.columns]

# Making predictions using the final model
y_pred = lr_3.predict(X_test_new)
Model Evaluation
Let’s now plot the graph for actual versus predicted values.
# Plotting y_test and y_pred to understand the spread.
fig = plt.figure()
plt.scatter(y_test,y_pred)
fig.suptitle('y_test vs y_pred', fontsize=18)
plt.xlabel('y_test ', fontsize=15)
plt.ylabel('y_pred', fontsize=15)
plt.show()
R Square value:
r2_score(y_test, y_pred)
0.8926458920564233
The R2 score of the training set is 0.91 and that of the test set is 0.89, which are very close. Hence, we can say that our model is good enough to predict car prices with the above variables.
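Alongside R², it is often useful to report the root mean squared error (RMSE), which expresses the typical prediction error in the units of the target. A minimal sketch on toy arrays (illustrative values only; on the real data you would pass y_test and y_pred instead):

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error

# Toy actual/predicted values to show how the two metrics are computed
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_hat  = np.array([2.5,  0.0, 2.0, 8.0])

r2 = r2_score(y_true, y_hat)                     # fraction of variance explained
rmse = np.sqrt(mean_squared_error(y_true, y_hat))  # typical error, in target units
print(round(r2, 3), round(rmse, 3))  # 0.949 0.612
```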
Equation of Line to predict the Car prices values
Carprice = 0.0121 + 0.0427 × enginesize + 0.1971 × curbweight + 0.1961 × horsepower + 0.1642 × carwidth + 1.1336 × Cars_Category_TopNotch_Cars + 0.0480 × carlength + 0.1203 × drivewheel_rwd − 0.0262 × drivewheel_fwd − 0.2338 × cylindernumber_four + 0.0738 × citympg − 0.0423 × highwaympg
Github gist
https://gist.github.com/SAMEERA-DS/51eca626e2460aaf9bb3b9584a8d375f
