# Linear Regression A-Z (Using Car Price Prediction dataset)

Linear regression is a way to explain the relationship between a dependent variable (Y) and one or more explanatory variables (X) using a straight line. It is a special case of regression analysis. Linear regression was the first type of regression analysis to be studied rigorously.

It is one of the simplest yet most powerful algorithms when used for the right problem.

We will look at all the aspects of this algorithm:

1. **Conditions** for using Linear Regression

2. **Assumptions**

3. **Objective**

4. **Equation** for Linear Regression

5. **Inferring** the different model **outputs**

6. **Evaluating** the Model

We will look at all these concepts **practically** using the “**Car Price Prediction**” dataset as we go.

## 1. Mandatory Conditions for using Linear Regression

- Target variable “Y” should be Numeric & Continuous.

## 2. Assumptions for Linear Regression

*Note*: These are not hard and fast rules, just assumptions that, when satisfied, indicate the model should perform well.

**Assumption 1:** There is a linear relationship between the independent variables “X” and the dependent variable “Y”.

**Assumption 2:** Minimum collinearity amongst the independent variables “X”.

**Assumption 3:** Error terms are Normally distributed and there are no patterns among them.

**Assumption 4:** Homoscedasticity: the variance around the regression line is the same across all values of the independent variables “X”.

**Once we build the model on our data, we will check all these assumptions. Now let’s get started with the dataset.**

- As we can see, the dataset consists of **205 rows and 26 columns.**

The column “**Price**” is our **Target Variable (Y)**, and the other 25 columns like Fuel_type, wheel_base, engine_type etc. give information about various attributes of the car.

So our initial condition is satisfied: as our Target variable is **Numeric and Continuous**, we can move ahead and implement Linear Regression for it.

## Preparing the Data

We performed data cleaning and preparation to make the data ready for Linear Regression. The data had some issues:

1. Some columns had a **weird character “?”** which needed to be found and removed.

2. Dealing with **Missing Values**: imputed numeric columns using the Median and categorical columns using the Mode.

3. **Converting** columns to the **correct Datatype**: the columns which had “?” were originally int, but because of this weird character they were read as Object.

4. **Label Encoder**: to make the data Numeric.

5. **Sampling** into Train and Test sets.

**I will not be explaining these steps in detail here; we will focus only on Linear Regression. However, please find the link to the Notebook where all these steps have been explained.**
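As a rough, hedged sketch of what those steps could look like, assuming pandas and scikit-learn; the file name `CarPrice.csv`, the target column name `price`, and the 70/30 split are assumptions, so the exact details may differ from the Notebook:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Load the dataset (file name assumed; use the one from the Notebook)
df = pd.read_csv("CarPrice.csv")

# 1. Treat the weird "?" character as a missing value
df = df.replace("?", pd.NA)

# 2 & 3. Convert columns that are really numeric (they were read as Object
# because of "?") and impute: Median for numeric, Mode for categorical
for col in df.columns:
    converted = pd.to_numeric(df[col], errors="coerce")
    if converted.notna().any():          # the column is actually numeric
        df[col] = converted.fillna(converted.median())
    else:                                # a genuinely categorical column
        df[col] = df[col].fillna(df[col].mode()[0])

# 4. Label-encode the remaining categorical columns
for col in df.select_dtypes(include="object").columns:
    df[col] = LabelEncoder().fit_transform(df[col])

# 5. Split into train and test sets ("price" is the target column)
X = df.drop(columns="price")
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```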

As our data has been preprocessed and made ready for the model, let us check some **Assumptions**:

Assumption 1: There is a **Linear Relationship** between the independent variables “X” and the dependent variable “Y”. We will use the `corr()` method provided by pandas to check this.
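One way to run that check, continuing from the preparation sketch above:

```python
# Correlation of every numeric X variable with the target "price";
# values far from 0 indicate a linear relationship with Y
corr_matrix = df.corr(numeric_only=True)
print(corr_matrix["price"].sort_values(ascending=False))
```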

Here we can see, for example, the correlation between engine_size and price; similarly we can see the correlation of each of the other X variables with Y, i.e. Price.

We can see that **most of the X variables are correlated with Y, i.e. they have a linear relationship with it, so we can consider this Assumption as satisfied.**

Assumption 2: Minimum **multicollinearity** amongst the independent variables “X”.
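One way to produce a figure for this check is a correlation heatmap amongst the X variables; a sketch assuming seaborn and the `df` from the preparation step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise correlations amongst the X variables; strong off-diagonal values
# (close to +1 or -1) would signal multicollinearity
plt.figure(figsize=(12, 10))
sns.heatmap(df.drop(columns="price").corr(), cmap="coolwarm", center=0)
plt.title("Correlation amongst X variables")
plt.show()
```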

From the resulting figure we can conclude there is no real issue of **multicollinearity** amongst our X variables, so **this Assumption holds true** as well.

**Assumptions 3 and 4 can only be checked after we build the model, so let's go ahead and build it.** For now let's just calculate the errors and check the assumptions; we will look into model performance validation after this.
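A minimal model-building sketch with scikit-learn, continuing from the train/test split above; the residuals computed here are what the next two assumption checks look at:

```python
from sklearn.linear_model import LinearRegression

# Fit an ordinary least squares line on the training split
model = LinearRegression()
model.fit(X_train, y_train)

# Errors (residuals) on the test split, needed for Assumptions 3 and 4
y_pred = model.predict(X_test)
residuals = y_test - y_pred
```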

Assumption 3: **Error terms are Normally distributed** and there are **no patterns** among them.
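One way to draw the distribution of the error terms; a sketch assuming seaborn, using the `residuals` from the model-building step:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram (with a KDE curve) of the error terms; Assumption 3 expects a
# roughly bell-shaped distribution centred at 0
sns.histplot(residuals, kde=True)
plt.xlabel("Error (Actual Y - Predicted Y)")
plt.show()
```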

The distribution looks to be Normal; not perfect, but good enough.

Also, there doesn’t seem to be any pattern in the errors. There are some outliers present, but I have dealt with them in the notebook; please refer to it if you are interested.

So we can say **Assumption 3 is satisfied.**

Assumption 4: **Homoscedasticity**: the variance around the regression line is the same across all values of the independent variables “X”.
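A common way to visualize this is a residuals-vs-predicted plot; a sketch using matplotlib and the `residuals` from above:

```python
import matplotlib.pyplot as plt

# Residuals vs. predicted values; a roughly constant band around the zero
# line (no funnel shape) is what homoscedasticity looks like
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted price")
plt.ylabel("Error")
plt.show()
```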

The blue highlighted region is the variance around the line. There is some variance, but it is fine to carry on, as the model is not going to be perfect. So we can say **Assumption 4** is satisfied.

As all the assumptions hold true, we can be confident in saying that Linear Regression will perform well on the given dataset!!

Now, let's take a step back and look at what Linear Regression really is: the **Equation**, the **Objective**, and how to infer the Model Output.

**I waited until now for this part because, with our model ready, we can look at the theory alongside the practical output from our model and really understand what is happening.**

The Linear Regression equation is: **y = ß0 + ß1.x1 + ß2.x2 + … + ßn.xn**

- **Y**: Target Variable.

- **ß0**: Intercept of the line.

- **ßi**: Regression coefficients.

- **Xi**: Input Features.

**“ßi” tells how much a change of 1 unit in “Xi” affects the Target Variable “Y”.**

The main **Objective** is to find **ß0, ßi**, i.e. the **Best Fit Regression line**.

The Best Fit Regression line is nothing but the **line with the Least Error**.

The **Error function** used here is **MSE (Mean Squared Error)**:

**MSE = (1/n) . Σ (ActualY − PredictedY)²**

(where Error = ActualY − PredictedY, and n is the number of data points)
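As a sketch, this error can be computed with scikit-learn using the test split and predictions from above:

```python
from sklearn.metrics import mean_squared_error

# MSE: the mean of the squared errors over the test samples
mse = mean_squared_error(y_test, y_pred)
print("MSE:", mse)
```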

We have used our model and obtained the best fit line, and from it we can read off the values:

- **ß0** : -55465.16338267594
- **ß1** : 1.94262470e+02
- **ß2** : -1.78293505e+01
- **ß3** : -3.11714543e+03
- … **ßn**

We can **predict** new values **using this Regression Line**: **y = -55465.16338267594 + 1.94262470e+02*x1 - 1.78293505e+01*x2 + … + 5.44607468e+01*xn**

which is the same as: **y = ß0 + ß1.x1 + ß2.x2 + … + ßn.xn**
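In scikit-learn terms, these values come from the fitted estimator's attributes; a sketch reusing the `model` from the model-building step (the exact numbers depend on the actual preprocessing):

```python
# The fitted line's parameters: intercept ß0 and coefficients ß1..ßn
print("ß0:", model.intercept_)
for name, beta in zip(X_train.columns, model.coef_):
    print(f"{name}: {beta:.8e}")

# model.predict applies y = ß0 + ß1*x1 + ... + ßn*xn to each new row
new_prices = model.predict(X_test)
```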

We are now at the last step:

## Evaluating the Model

**R-square**: one of the most important aspects for evaluating the model.

**R-square** measures the proportion of the variation in your dependent variable (Y) **explained** by your independent variables (X) for a linear regression model.

Here the **R-square** = 0.9090492008852977. We read this as: our model is able to explain about 90% of the variation in the “Y” variable.
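A sketch of how this score can be computed with scikit-learn on the test split:

```python
from sklearn.metrics import r2_score

# Proportion of the variation in Y explained by the model
r2 = r2_score(y_test, y_pred)
print("R-square:", r2)
```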

**Issue** with R-square: even if we add insignificant “X” variables to the model, the R-square score will increase. **Example**: suppose that to predict the **Mileage** of a car we add a variable “**Colour_of_car**”. Ideally this does not affect the **Mileage** at all, but the R-square score will still increase after adding this variable. Use **Adjusted R-square** to overcome this.

**Adjusted R-square**: scores the model based on the number of independent variables in it.

**Adjusted R-square** tells the same thing as **R-square**, but will penalize the model if insignificant features are added to it, therefore overcoming the issue faced by **R-square**.
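A sketch of the standard formula, reusing `r2` and the test split from above:

```python
# Adjusted R-square = 1 - (1 - R^2) * (n - 1) / (n - p - 1),
# where n = number of samples and p = number of features
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print("Adjusted R-square:", adj_r2)
```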

`Adjusted R-Square : 0.8920807682146443`

As the **R-square** and **Adjusted R-square** are very close, it means our model does not really have any insignificant features, so that is a good thing.

## Conclusion

- My main objective here was only to explain the different aspects of Linear Regression; from a model-building point of view there are many more steps involved. You can refer to them in the Notebook here.
- Please write to me in case of any queries/feedback : darekarabhishek@gmail.com