From Theory to Practice: Implementing Multilinear Regression in Python — A step by step guide for beginners

Hadia Butt
11 min read · Feb 17, 2024


Regression is one of the most popular predictive modelling techniques in machine learning, used to predict future outcomes based on the relationships between dependent and independent variables. This article illustrates the step-by-step process of conducting a simple multilinear regression analysis in Python.

The article assumes basic familiarity with Python, Jupyter notebooks and fundamental regression concepts. If you are not familiar with these preliminaries, the links at the bottom of this article can bring you up to speed with the basics.

This analysis was conducted on the Ford Cars dataset, which can be accessed here. The dataset consists of approximately 18,000 rows and 9 columns containing a mix of categorical and numerical data on the features of different Ford cars. The core objective of the analysis was to determine whether a statistically significant relationship existed between independent variables such as fuel type, transmission, engine size and mileage and the dependent variable price, and subsequently to quantify the impact of these explanatory variables on price.

Step 1: Loading the required libraries

As the first step, a Jupyter notebook was created and the essential Python libraries were loaded. At this point, only the libraries relevant for data exploration were imported: pandas, numpy and matplotlib. The read_excel function was used to read the Ford dataset into the notebook.
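As a sketch, the loading step might look like the snippet below. The file name `ford.xlsx` is an assumption, and a tiny stand-in frame is created when the file is absent so the snippet runs end to end:

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from pathlib import Path

# The file name is an assumption; point this at your copy of the dataset.
data_path = Path("ford.xlsx")

if data_path.exists():
    # read_excel loads the spreadsheet into a DataFrame
    df = pd.read_excel(data_path)
else:
    # Tiny stand-in so the snippet runs even without the file
    df = pd.DataFrame({"model": ["Fiesta", "Focus"], "price": [12000, 14000]})

print(df.shape)
print(df.head())
```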

Step 2: Exploratory Data Analysis (EDA)

As the next step of the analysis, data exploration was done to understand the statistical properties of the data. Before performing EDA, however, some structural pre-processing and cleaning was done: removal of white spaces, conversion of data into the correct data types, removal of duplicate entries, filling of missing data and treatment of outliers.

You can find the detailed article on how to perform these exact steps involved in data preparation here.
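As a sketch, the cleaning steps listed above might look like the following. The column names, the median fill and the 1.5 × IQR capping rule are assumptions for illustration, not the author's exact code:

```python
import pandas as pd
import numpy as np

def clean_ford_data(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of the structural cleaning described above."""
    df = df.copy()
    # Strip stray white space from column names and string values
    df.columns = df.columns.str.strip()
    for col in df.select_dtypes(include="object"):
        df[col] = df[col].str.strip()
    # Coerce numeric-looking columns to proper numeric dtypes
    for col in ["price", "mileage", "tax", "mpg", "engineSize"]:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors="coerce")
    # Drop exact duplicates, then fill numeric gaps with the column median
    df = df.drop_duplicates()
    df = df.fillna(df.median(numeric_only=True))
    # Cap outliers at 1.5 * IQR (one common convention; the author may differ)
    for col in df.select_dtypes(include="number"):
        q1, q3 = df[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        df[col] = df[col].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    return df
```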

Once the data was appropriately cleaned, the next step was to compute a summary of descriptive statistics to understand the patterns in the data. A summary table was built showing statistics such as the mean, standard deviation, median, minimum and maximum values. These are as follows:
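A summary table of this kind comes straight from pandas' describe(); the snippet below demonstrates it on a tiny stand-in frame, since the real dataset is not bundled here:

```python
import pandas as pd

# Toy stand-in for the cleaned Ford dataset (the real frame has ~18,000 rows)
df = pd.DataFrame({
    "price": [12000, 9500, 15800, 11000],
    "engineSize": [1.0, 1.2, 2.0, 1.5],
})

# describe() gives count, mean, std, min, quartiles and max per numeric column
summary = df.describe()

# Transposing makes the table easier to read when there are many columns
print(summary.T[["mean", "std", "50%", "min", "max"]])
```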

The following insights were drawn from the summary table:

  1. engineSize had a mean of 1.35 and a low standard deviation of 0.39, indicating that the data was mostly centered around the mean. As the median was less than the mean, the data was right-skewed. Moreover, there was initially a large difference between the 75th percentile and the maximum value, indicating the presence of outliers, which were treated during cleaning.
  2. The mean price of used cars in the dataset was $12,197, while the standard deviation was $4,459, indicating that the data was fairly concentrated around the mean. The median price was slightly less than the mean, which meant the price data was right-skewed as well.
  3. The mean mileage was 22,454, while the standard deviation was quite high at 16,421, meaning the data was spread out with considerable variation.
  4. The tax column had a mean of 135 and a low standard deviation of 21.09, which meant the data was mostly concentrated around the mean.
  5. The mpg column had a mean of 57.9 and a low standard deviation of 9.7, which meant most of the data was concentrated around the mean. The mean and median values were close to each other, implying that the data was fairly symmetrical.
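The mean-versus-median reasoning in the points above can also be checked numerically with pandas' skew() method; the series below is toy data, not the Ford dataset:

```python
import pandas as pd

# Right-skewed toy data: a few large values pull the mean above the median
prices = pd.Series([8000, 9000, 10000, 11000, 30000])

print(prices.mean())    # pulled up by the large value
print(prices.median())
print(prices.skew())    # positive => right-skewed
```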

After the assessment of the numerical variables, the 3 categorical variables were analyzed: model, transmission type and fuel type. The following insights were gathered from the charts:

  1. Fiesta was the most frequently occurring model followed by Focus, Kuga and EcoSport.
#Frequency of each car model (seaborn is imported here as it was not
#among the initial exploration libraries)
import seaborn as sns

plt.figure(figsize=(15,5))
sns.countplot(x='model', data=df, order=df['model'].value_counts().index, palette="husl", edgecolor='black')
plt.xticks(rotation=60, fontsize=12)
plt.xlabel("Car Model", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.show()

2. Approximately 15,000 cars had a manual transmission, 1,300 had an automatic transmission and 1,000 had a semi-auto transmission.

#bar plot for transmission
plt.figure(figsize=(10,5))
sns.countplot(x='transmission', data=df, order = df['transmission'].value_counts().index, palette='husl',edgecolor='black')
plt.xticks(rotation=0,fontsize=12)
plt.xlabel("Transmission Type",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.show()

3. About 12,000 cars had petrol as their fuel type while 6,000 had diesel. Cars with hybrid and electric fuel types were very few, so these were combined into one category, “Other”, leaving only 3 fuel types: petrol, diesel and other.

#combine electric and hybrid cars into 'Other'
df['fuelType'] = df['fuelType'].replace(['Hybrid', 'Electric', 'Other'], 'Other')

#bar plot for Fuel Type
plt.figure(figsize=(10,5))
sns.countplot(x='fuelType', data=df,palette="husl", edgecolor='black')
plt.xticks(rotation=0,fontsize=12)
plt.xlabel("Fuel Type",fontsize=15)
plt.ylabel("Frequency",fontsize=15)
plt.show()

Next, bivariate analysis was conducted and a correlation matrix was built to understand the relationships between the numerical variables.
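A correlation heatmap of the kind described can be built with DataFrame.corr and seaborn's heatmap; the frame below is a small stand-in for the cleaned dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Toy stand-in for the cleaned Ford dataset
df = pd.DataFrame({
    "price":      [12000, 9500, 15800, 11000, 14000],
    "engineSize": [1.0, 1.2, 2.0, 1.5, 1.6],
    "mileage":    [30000, 45000, 12000, 38000, 20000],
})

# Pairwise Pearson correlations between the numerical columns
corr = df.corr(numeric_only=True)

plt.figure(figsize=(6, 5))
sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
plt.title("Correlation matrix of numerical variables")
plt.savefig("heatmap.png")
```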

The following insights were drawn from the heat map above:

  1. Price had a positive correlation with engineSize and a weaker positive correlation with tax, whereas it had a stronger negative correlation with mileage and a weaker negative correlation with mpg.
  2. engineSize had a weak positive correlation with mileage and tax and a weak negative correlation with mpg.
  3. Fuel type had an impact on price: cars with diesel fuel were slightly more expensive than petrol cars.
  4. Transmission type also affected the car price: automatic cars were the most expensive, while cars with semi-auto and manual transmissions were less expensive.

Step 3: Building the Regression Model

Once the exploratory data analysis was done, the next step was to build a regression model. A multilinear regression model was built, as the number of independent variables was more than one. The dependent variable was price, while engineSize, mileage, mpg, fuelType and transmission were used as the independent variables. Dummy variables were created for the categorical data, fuel type and transmission. Additionally, the data was divided into training and test sets, with a test size of 30%.

At this point, the libraries required to run a regression analysis in Python were loaded; essentially, the sklearn and statsmodels libraries were used.

Additionally, to include the categorical variables in the regression model, they were converted into dummy variables using the pandas get_dummies function; fuelType and transmission were the categorical variables that had to be encoded.

It is important to note that one of the dummy variables for each category was dropped before including the variables in the regression model, to avoid perfect multicollinearity. Keeping all three levels for both categories (petrol, diesel and other for fuelType; automatic, manual and semi-auto for transmission) would have provided no extra information, as each dropped level is fully implied by the remaining ones.
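The encoding-plus-drop step might look like the sketch below: get_dummies creates one 0/1 column per level, and the baseline columns named in the article (fuelType_Other, transmission_Semi-Auto) are then dropped explicitly. The tiny frame is illustrative only:

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset
df = pd.DataFrame({
    "price": [12000, 9500, 15800],
    "fuelType": ["Petrol", "Diesel", "Other"],
    "transmission": ["Manual", "Semi-Auto", "Automatic"],
})

# One 0/1 indicator column per category level
dummies = pd.get_dummies(df, columns=["fuelType", "transmission"], dtype=int)

# Drop one level per category to avoid the dummy-variable trap;
# the article uses 'Other' and 'Semi-Auto' as the baseline levels
regression_data1 = dummies.drop(columns=["fuelType_Other", "transmission_Semi-Auto"])
print(regression_data1.columns.tolist())
```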

Model 1:

The first model, linear_model1, was built using both the numerical and categorical variables, via the statsmodels Ordinary Least Squares (OLS) module. The data was also split into training and test sets so that prediction accuracy could be measured later: the model was fitted on the training set (70% of the data) and evaluated on the test set (30%).

#Train-test split
from sklearn.model_selection import train_test_split
import statsmodels.formula.api as smf

#Note: the formula interface pulls `price` out of x_train itself,
#so y_train is not used directly when fitting
x_train, x_test, y_train, y_test = train_test_split(
    regression_data1, regression_data1['price'], test_size=0.3, random_state=0)

#Building the multilinear regression model
linear_model1 = smf.ols(
    formula='price ~ engineSize + mileage + mpg + tax + fuelType_Petrol + fuelType_Diesel + transmission_Automatic + transmission_Manual',
    data=x_train).fit()
print(linear_model1.summary())

Interpretation:

Our first regression model had an adjusted R-squared of 0.625, meaning the model explained approximately 63% of the variance in prices. The F-statistic was high at 2594 and statistically significant with a p-value < 0.05, indicating that the model fit the data well.

The coefficients of the intercept and the independent variables also appeared statistically significant, with p-values < 0.05, except for transmission_Automatic, whose p-value was well above 0.05.

As the variables fuelType_Other and transmission_Semi-Auto were dropped, their effect was absorbed into the intercept: the estimated mean price of a car with fuel type 'other' and semi-auto transmission was 23,130. This estimate was significant, as its p-value was < 0.05.

Model coefficients:

  1. The coefficient of engineSize was 2215, indicating that with every unit increase in engine size, the price of a car would go up by 2215.
  2. The coefficient of fuelType_Diesel was -4040, indicating that the estimated mean price of diesel cars was 4040 less than that of cars with fuel type 'other'. Thus, the estimated mean price of a diesel car was 23130 - 4040 = 19,090. This made intuitive sense, as the 'other' category included electric and hybrid cars, which are generally more expensive.
  3. Similarly, the coefficient of transmission_Manual was -302, indicating that the estimated mean price of manual cars was 302 less than that of semi-auto cars. Thus, the estimated mean price of a manual car was 23130 - 302 = 22,828. This also made intuitive sense, as cars with manual transmission are less expensive than those with automatic or semi-auto transmissions.

Regression Equation:

Our regression model’s equation was as follows:

Prediction Accuracy:

One of the major uses of regression models is their ability to predict outcomes on unseen data.

Next, the prediction accuracy of the model was calculated using metrics such as Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Mean Squared Error (MSE).

The accuracy was calculated for both the training and test sets.
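All three metrics are available in sklearn.metrics; the snippet below shows the computation on toy true/predicted prices rather than the article's actual model output:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Toy true prices and model predictions, standing in for
# y_train and linear_model1.predict(x_train) from the article
y_true = np.array([12000.0, 9500.0, 15800.0, 11000.0])
y_pred = np.array([11500.0, 10200.0, 15000.0, 11800.0])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)   # RMSE is in the same units as price

print(f"MAE:  {mae:.0f}")
print(f"MSE:  {mse:.0f}")
print(f"RMSE: {rmse:.0f}")
```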

The training set had a Mean Absolute Error of 2158 and a Mean Squared Error of 7,575,629. The Root Mean Squared Error was 2752, which meant that on average, the model's price predictions were off by $2752.

The test set had a Mean Absolute Error of 2121 and a Mean Squared Error of 7,273,331. The Root Mean Squared Error was 2696, which meant that on average, the model's price predictions on the unseen test data were off by $2696. As this was close to the training-set metrics, the model generalized well.

Next, the variable transmission_Automatic, which was not statistically significant, was dropped to see if the adjusted R-squared would improve.

Model 2:

#Building the multilinear regression model without transmission_Automatic
linear_model2 = smf.ols(
    formula='price ~ engineSize + mileage + mpg + tax + fuelType_Diesel + fuelType_Petrol + transmission_Manual',
    data=x_train).fit()
print(linear_model2.summary())

Interpretation:

The second regression model showed that the adjusted R-squared did not improve when transmission_Automatic was removed: the model still explained approximately 63% of the variance in prices. However, the F-statistic improved to 2963 and remained statistically significant with a p-value < 0.05, indicating a slightly better fit than model 1.

The coefficients of the intercept and all the independent variables also appeared statistically significant, as their p-values were < 0.05.

The intercept, i.e. the estimated mean price of a car with fuel type 'other' and semi-auto transmission, was now 23,310. This estimate was significant, as the p-value was < 0.05.

Model coefficients:

  1. The coefficient of engineSize was 2217, indicating that with every unit increase in engine size, the price of a car would go up by 2217.
  2. The coefficient of fuelType_Diesel was -4107.3, indicating that the estimated mean price of diesel cars was 4107.3 less than that of cars with fuel type 'other'. Thus, the estimated mean price of a diesel car was 23310 - 4107.3 = 19,202.7. This made intuitive sense, as the 'other' category included electric and hybrid cars, which are generally more expensive.
  3. Similarly, the coefficient of transmission_Manual was now -422.6, indicating that the estimated mean price of manual cars was 422.6 less than that of semi-auto cars. Thus, the estimated mean price of a manual car was 23310 - 422.6 = 22,887.4.

Regression Equation:

Our regression model’s new equation was as follows:

Prediction Accuracy:

Similarly, the error metrics were used once again to check the prediction accuracy of this model.

With model 2, the training set had a Mean Absolute Error of 2158 and a Mean Squared Error of 7,577,242. The Root Mean Squared Error was also 2752, which meant that on average, the model's price predictions were off by $2752. These were close to the metrics for model 1.

The test set had a Mean Absolute Error of 2121 and a Mean Squared Error of 7,273,331. The Root Mean Squared Error was 2696, which meant that on average, the model's price predictions on the unseen test data were off by $2696. As this was close to the training-set metrics, the model generalized well; the values were also similar to those of model 1, meaning both models had similar accuracy.

At this point, the iterations were stopped and Model 2 was chosen as the final model.
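To illustrate how a final model of this form can be used for pricing, the sketch below fits a reduced OLS formula on invented toy data with the same statsmodels API and predicts the price of a hypothetical car; none of these numbers come from the article:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Toy training data of the same shape as the article's regression data
train = pd.DataFrame({
    "price":      [12000, 9500, 15800, 11000, 14000, 13000],
    "engineSize": [1.0, 1.2, 2.0, 1.5, 1.6, 1.4],
    "mileage":    [30000, 45000, 12000, 38000, 20000, 25000],
})

# Same formula API as linear_model2, with a reduced formula for the toy data
model = smf.ols(formula="price ~ engineSize + mileage", data=train).fit()

# Predict the price of a hypothetical car
new_car = pd.DataFrame({"engineSize": [1.5], "mileage": [22000]})
predicted_price = model.predict(new_car)
print(predicted_price.iloc[0])
```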

Limitations and Recommendations

One key limitation of the models was that even when the statistically non-significant variable was removed, the adjusted R-squared did not improve. While the F-statistic improved slightly, the overall adjusted R-squared of 0.63 is not especially high; a commonly cited benchmark for a good adjusted R-squared is 0.8 to 0.9.

Additional data points, such as the age of the car or customer purchase decisions, could also have been incorporated, for example to predict the probability of a customer purchasing a used car.

Nonetheless, such a regression model can be used by stakeholders like car dealerships to price used cars appropriately, so that cars with bigger engines are priced higher than those with smaller ones. The quantified impact of the other variables can also inform pricing decisions.

To further assist business managers, an additional machine learning technique, K-means clustering, could be used to create segments of customers based on car prices, so that targeted campaigns highlight the right product features for the right customer group.
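A minimal sketch of that K-means idea, assuming hypothetical price and mileage features (this segmentation is not part of the original analysis):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical car features: [price, mileage] for six cars
X = np.array([
    [9000, 60000], [9500, 55000],     # budget, high-mileage
    [14000, 20000], [13500, 22000],   # mid-range
    [25000, 5000], [24000, 8000],     # premium, low-mileage
], dtype=float)

# Scale first so price and mileage contribute comparably to distances
X_scaled = StandardScaler().fit_transform(X)

# Three segments, one per intuitive price band
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_)
```

Each resulting segment could then be matched to a targeted campaign, e.g. emphasizing low running costs for the budget cluster.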
