
Linear Regression A-Z (Using Car Price Prediction dataset)

Linear regression is a way to explain the relationship between a dependent variable (Y) and one or more explanatory variables (X) using a straight line. It is a special case of regression analysis, and it was the first type of regression analysis to be studied rigorously.

It is one of the simplest, yet most powerful algorithms when used for the right problem.

We will look at all the aspects of this algorithm:
1. Conditions for using Linear Regression
2. Assumptions
3. Objective
4. Equation for Linear Regression
5. Interpreting the different model outputs
6. Evaluating the Model

We will look at all these concepts practically using the “Car Price Prediction” dataset as we go.

Linear Regression

1. Mandatory Conditions for using Linear Regression.

  • Target variable “Y” should be Numeric & Continuous.

2. Assumptions for Linear Regression.
*Note* : These are not hard and fast rules, just assumptions that, when they hold, indicate the model will behave well.

  • Assumption 1 : There is a linear relationship between the independent variables “X” and the dependent variable “Y”.
  • Assumption 2 : Minimal multicollinearity amongst the independent variables “X”.
  • Assumption 3 : Error terms are normally distributed and there are no patterns among them.
  • Assumption 4 : Homoscedasticity :- Variance around the regression line is the same for all independent variables “X”.

Once we build the Model on our Data we will check all these Assumptions. Now let’s get started with the Dataset.

  • As we can see, the dataset consists of 205 rows and 26 columns.
    The column “Price” is our target variable (Y), and the other 25 columns, such as Fuel_type, wheel_base, engine_type etc., give information about various attributes of the car.

So our initial condition is satisfied: as our target variable is numeric and continuous, we can move ahead and implement Linear Regression.
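As a quick sketch, loading the data and checking the target might look like this (the file name CarPrice.csv and the lowercase column name price are assumptions; adjust them to your copy of the dataset):

```python
import pandas as pd

# Load the dataset (the file name is an assumption -- use your local copy)
df = pd.read_csv("CarPrice.csv")

print(df.shape)           # (205, 26)
print(df["price"].dtype)  # the target must be numeric and continuous
```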

Preparing the Data

Data cleaning and preparation were performed to make the data ready for Linear Regression. The data had some issues:
1. Some columns contained a stray “?” character that had to be found and removed.
2. Missing values :- Numeric columns were imputed using the median and categorical columns using the mode.
3. Converting columns to the correct datatype :- The columns that contained “?” were originally int, but because of this stray character they were read in as Object.
4. Label encoding :- To make the data numeric.
5. Splitting into train and test sets.

I will not be explaining these steps in detail here; we will focus only on Linear Regression. However, all of these steps are explained in the linked notebook.
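Still, for reference, here is a minimal sketch of what those five steps could look like in pandas/scikit-learn (the column name price, the 70/30 split, and the exact imputation order are assumptions; the notebook has the full version):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# 1. Treat the stray "?" characters as missing values
df = df.replace("?", np.nan)

# 3. Convert columns that were read as Object back to numeric where possible
for col in df.columns:
    try:
        df[col] = pd.to_numeric(df[col])
    except (ValueError, TypeError):
        pass  # genuinely categorical columns stay as Object

# 2. Impute missing values: median for numeric, mode for categorical
for col in df.columns:
    if df[col].dtype == "object":
        df[col] = df[col].fillna(df[col].mode()[0])
    else:
        df[col] = df[col].fillna(df[col].median())

# 4. Label-encode the categorical columns to make the data numeric
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# 5. Split into train and test sets (70/30 split is an assumption)
X = df.drop("price", axis=1)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```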

As our Data has been Preprocessed and made ready for the Model, let us check some Assumptions :

Assumption 1 : There is a linear relationship between the independent variables “X” and the dependent variable “Y”. We will use the corr() method provided by pandas to check this.
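A one-line sketch of that check, assuming the prepared DataFrame df from the preprocessing sketch above (after label encoding, every column is numeric):

```python
# Correlation of each feature with the target "price"
corr_with_price = df.corr()["price"].sort_values(ascending=False)
print(corr_with_price)
```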

Correlation Between X and Y

Here the highlighted cell shows the correlation between engine_size and price. Similarly, we can see the correlation between the different X variables and Y, i.e. Price.
We can see that most of the X variables are correlated with Y, i.e. they have a linear relationship with it, so we can consider this assumption satisfied.

Assumption 2 : Minimal multicollinearity amongst the independent variables “X”.

Correlation Matrix for all Variables.

From the figure we can conclude there is no issue of multicollinearity amongst our X variables, so this assumption holds true as well.
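One common way to produce such a matrix is a seaborn heatmap; a sketch, again assuming the prepared df:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations among the independent variables only
corr_matrix = df.drop("price", axis=1).corr()

plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, cmap="coolwarm", center=0)
plt.title("Correlation Matrix for all Variables")
plt.show()
```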

Assumptions 3 and 4 can only be checked after we build the model, so let's go ahead and build it. For now, let's just calculate the errors and check the assumptions; we will look into validating model performance after this.

Standard Steps To Train the Model, Predict and Calculate the Error.
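A minimal sketch of those standard steps with scikit-learn, reusing the X_train/X_test split from the preprocessing sketch above:

```python
from sklearn.linear_model import LinearRegression

# Train the model on the training split
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test split and compute the errors (residuals)
y_pred = model.predict(X_test)
errors = y_test - y_pred
```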

Assumption 3 : Error terms are Normally distributed and there are no Patterns among them.

Check Distribution of Errors.
Check for any Patterns in Errors.
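Both checks can be sketched with matplotlib/seaborn, reusing errors and y_pred from the training sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Left: distribution of the errors -- should look roughly normal
sns.histplot(errors, kde=True, ax=ax1)
ax1.set_title("Distribution of Errors")

# Right: errors vs. predicted values -- should show no pattern
ax2.scatter(y_pred, errors, alpha=0.6)
ax2.axhline(0, color="red", linestyle="--")
ax2.set_title("Errors vs. Predicted Values")

plt.show()
```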

The distribution looks to be normal; not perfect, but good enough.
Also, there doesn't seem to be any pattern in the errors. There are some outliers present, but I have dealt with them in the notebook; please refer to it if you are interested in that.
So we can say Assumption 3 is satisfied.

Assumption 4 : Homoscedasticity : Variance around the regression line is the same for all independent variables “X”.

Check Variance around the Regression line.

The blue highlighted region is the variance around the line. There is some variance, but it is fine to carry on, as the model is not going to be perfect. So we can say Assumption 4 is satisfied.

As all the assumptions now hold true, we can be confident in saying that Linear Regression will perform well on the given dataset!

Now, let's take a step back and look at what Linear Regression really is: the equation, the objective, and how to interpret the model output.

I waited until now for this part because, with our model ready, we can look at the theory alongside the practical output from our model and really understand what is happening.

Equation for Multivariate Linear Regression :
Y = ß0 + ß1.X1 + ß2.X2 + … + ßn.Xn
  • - Y : Target variable.
    - ß0 : Intercept of the line.
    - ßi : Regression coefficient for feature Xi.
    - Xi : Input feature.
  • “ßi” tells how much a change of 1 unit in “Xi” affects the target variable “Y”.

The main objective is to find ß0, ßi, i.e. the best-fit regression line.
The best-fit regression line is simply the line with the least error.

The error function used here is MSE (Mean Squared Error), the average of the squared errors over all data points:
MSE = (1/n) * Σ (ActualYi - PredictedYi)²
(Error = ActualY - PredictedY)
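A sketch of this calculation, reusing y_test and y_pred from the training sketch (mean_squared_error is scikit-learn's implementation of the same formula):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# MSE = mean of (ActualY - PredictedY)^2 over the test set
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
print("By hand:", np.mean((y_test - y_pred) ** 2))  # same quantity
```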

Finding ß0,ßi from the Model.
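With a fitted scikit-learn model, these values come from the intercept_ and coef_ attributes; a sketch:

```python
# ß0 (intercept) and ß1..ßn (one coefficient per feature) of the fitted line
print("ß0 :", model.intercept_)
print("ßi :", model.coef_)
```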

We have used our model to obtain the best-fit line, and from it we can read off the values:
ß0 : -55465.16338267594
ß1 : 1.94262470e+02, ß2 : -1.78293505e+01, ß3 : -3.11714543e+03, … ßn

We can predict new values using this regression line:
y = -55465.16338267594 + 1.94262470e+02*x1 - 1.78293505e+01*x2 + … + 5.44607468e+01*xn
which is the same as : y = ß0 + ß1.x1 + ß2.x2 + … + ßn.xn
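As a sanity check, plugging one row of features into this equation by hand should reproduce model.predict(); a sketch using the first test row:

```python
import numpy as np

# Predicting "by hand" with the regression equation
x_new = X_test.iloc[0].values                        # one car's features
manual = model.intercept_ + np.dot(model.coef_, x_new)

# ...matches the library prediction for the same row
print(manual, model.predict(X_test.iloc[[0]])[0])
```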

We are now at the last step that is Evaluating the Model.

R-square :- One of the most important aspects of evaluating the model.

R-square : It measures the proportion of the variation in your dependent variable (Y) explained by your independent variables (X) for a linear regression model.

R-square for our Model.

Here the R-square = 0.9090492008852977. We read this as: our model is able to explain about 90% of the variation in the “Y” variable.
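A sketch of how this number can be computed, reusing y_test and y_pred from above:

```python
from sklearn.metrics import r2_score

r2 = r2_score(y_test, y_pred)
print("R-square :", r2)
```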

Issue with R-square : Even if we add insignificant “X” variables to the model, the R-square score will increase.
Example : Suppose that to predict the mileage of a car we add a variable “Colour_of_car”. Ideally this does not affect the mileage at all, but the R-square score will still increase after adding this variable. Use Adjusted R-square to overcome this.

Adjusted R-square :- Scores the model while accounting for the number of independent variables in it.

Adjusted R-square tells the same story as R-square, but penalizes the model if insignificant features are added, thereby overcoming the issue faced by R-square.
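Scikit-learn has no built-in adjusted R-square, but it follows directly from the standard formula Adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of features; a sketch reusing r2 from the previous snippet:

```python
# Adjusted R-square = 1 - (1 - R²) * (n - 1) / (n - p - 1)
n, p = X_test.shape  # n samples, p features
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("Adjusted R-square :", adj_r2)
```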

Adjusted R-Square :  0.8920807682146443

As the R-square and Adjusted R-square are very close, our model does not really have any insignificant features, which is a good thing.

Conclusion

  • My main objective here was to explain the different aspects of Linear Regression; from a model-building point of view there are many more steps involved. You can refer to them in the notebook here.
  • Please write to me in case of any queries/feedback : darekarabhishek@gmail.com
