Multiple Linear Regression Using Python

Manja Bogicevic
9 min read · Nov 2, 2018

### Introduction

Multiple Linear Regression is a simple and widely used extension of linear regression. The model is often used for predictive analysis since it describes the relationship between one dependent variable and two or more independent variables.

Multiple linear regression used to be a lot simpler, taking only a couple of variables into consideration. Nowadays, we have to consider a large variety of variables and decide which ones to keep and which ones to throw out.
A logical question would be:

> “Why throw out variables? Don’t more variables mean more data, and doesn’t more data mean a more accurate prediction?”

There are two main reasons why we would want to exclude some variables:

- **Garbage in, garbage out** — If you throw everything into your model, it won’t be a good model. It won’t be reliable, and it won’t do what it’s supposed to do. It’s going to be a garbage model.

- **Too many variables** — At the end of the day, you have to understand these variables and explain them to your executives. With a thousand variables, that simply isn’t practical. You want to keep only the important ones, the ones that actually predict something.

So how do we construct a model? The process of building the model and selecting the right variables is what we’ll be discussing today, using:
- Backward Elimination

### Dataset

A venture capital fund has hired you as a data scientist to analyze 30 companies. Your task is to analyze this dataset and create a model that tells the fund which types of companies it should be most interested in investing in. Their main criterion is profit.

That being said, *profit* is the dependent variable.

We’re tasked with creating a model that predicts profit based on Salary, Administration, Marketing and State.

What they’re looking for is to understand from this sample, for instance, where companies perform better: in Germany or in Norway? Does a company that spends more on marketing perform better than one that spends less?

Basically, you’re helping them create a business model based on this sample that will allow them to assess where and into which companies they want to invest to achieve their goal of maximizing profit.

#### Data Preprocessing Template

#### Importing the Libraries
```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
```
#### Importing the Dataset
```python
# Load the dataset; the last column (Profit) is the dependent variable
dataset = pd.read_csv('Vc_Startups.csv')
X = dataset.iloc[:, :-1].values   # independent variables: every column except the last
y = dataset.iloc[:, 4].values     # dependent variable: Profit (column index 4)
```

We have to first import the Python libraries and our dataset for analysis.

Since we’re looking to predict profit based on the other columns and we’re familiar with the file, we know which indexes to use to capture the data we need. We take every column from the spending figures through State, but we leave out profit itself. We’ll use the profit values from the `.csv` file to validate whether our model’s predictions are accurate.

With that, we have our matrix of independent variables `X` and our dependent variable vector `y`.
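
If you want to sanity-check the slicing, a quick look at the shapes and first few rows (a minimal sketch, nothing dataset-specific) confirms that `X` holds the predictors and `y` holds the profits:
```python
# Peek at the raw data and at the arrays we just built
print(dataset.head())       # first rows of the CSV, including Profit
print(X[:3])                # first three rows of the independent variables
print(y[:3])                # first three profit values
print(X.shape, y.shape)     # the row counts should match
```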

**Note:** Categorical variables must be encoded. If they’re kept in literal text format, they can cause issues in the machine learning model’s equations. Our *State* variable is in text format, and it can’t go into the algorithm as-is.

We’ll use `LabelEncoder` to encode this column into numbers, and then, to remove any implied ordering between the states, we’ll use `OneHotEncoder` to create dummy variables.

#### Encoding Categorical Data
```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encode the State column (index 3) from text labels to integers
labelencoder = LabelEncoder()
X[:, 3] = labelencoder.fit_transform(X[:, 3])

# Turn the encoded column into dummy (one-hot) variables
onehotencoder = OneHotEncoder(categorical_features = [3])
X = onehotencoder.fit_transform(X).toarray()
```

We only have one categorical variable here, **State**, and it is an independent variable. We don’t need to encode the dependent variable, because it’s already numerical and won’t cause any problems.

The encoding completes successfully and all our columns now share the matrix’s single 64-bit numeric type.

Our fourth column, *State*, has been replaced by three new dummy-variable columns, each of which corresponds to one state.
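
One caveat: the `categorical_features` argument was deprecated and later removed from `OneHotEncoder` in newer scikit-learn releases. If the code above errors on your version, a roughly equivalent sketch (assuming *State* is still at column index 3) uses `ColumnTransformer`:
```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode column 3 (State); pass every other column through unchanged.
# The encoded dummy columns end up first, matching the layout described above.
ct = ColumnTransformer(
    transformers=[('state', OneHotEncoder(), [3])],
    remainder='passthrough'
)
X = np.array(ct.fit_transform(X), dtype=float)
```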

#### Avoiding the Dummy Variable Trap
```python
X = X[:, 1:]
```
This line of code removes the first column from `X` by selecting every column from index 1 onwards instead of from index 0. We drop one of the dummy columns because the three state dummies always add up to one, so keeping all of them would duplicate the information already carried by the regression’s intercept. This redundancy is known as the dummy variable trap.

In our case we wouldn’t strictly need to do this by hand, since the Python library for linear regression takes care of the dummy variable trap for us.

This example is included simply to remind you to keep the trap in mind, since not all frameworks and libraries handle it for you.
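
To see why the trap matters, note that the full set of dummy columns always sums to one, which makes them perfectly collinear with the model’s intercept. A tiny illustration with made-up rows:
```python
import numpy as np

# Three dummy columns for three states; each row contains exactly one 1
dummies = np.array([
    [1, 0, 0],   # state A
    [0, 1, 0],   # state B
    [0, 0, 1],   # state C
])

print(dummies.sum(axis=1))   # [1 1 1] -- identical to an intercept column
# Dropping any one column (as X[:, 1:] does) removes the redundancy
```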

#### Splitting the Dataset
At this point, we’re ready to make the “split”. We’ll split the dataset into the training set and the test set.

There’s no official rule to follow when deciding on a split proportion, though in most cases you’d dedicate around 70–80% of the data to the training set and the rest to the test set; here we use an 80/20 split:
```python
# train_test_split lives in sklearn.model_selection in current scikit-learn
# (older releases exposed it via sklearn.cross_validation)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
```
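
As a quick check, the shapes of the resulting arrays should reflect the 80/20 split:
```python
# Roughly 80% of the rows go to training, 20% to testing
print(X_train.shape, X_test.shape)
print(y_train.shape, y_test.shape)
```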

Plain linear regression doesn’t actually require feature scaling, since the fitted coefficients absorb the different scales of the variables, but here’s an example of what it looks like if you’d like to take care of it yourself:
#### Feature Scaling
```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training features and reuse it on the test features
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)

# StandardScaler expects a 2-D array, so the target vector is reshaped first
sc_y = StandardScaler()
y_train = sc_y.fit_transform(y_train.reshape(-1, 1)).ravel()
```

The class we import here is the same one used for simple linear regression. By running the piece of code below, we create an object of the class `LinearRegression`; this object is our *regressor*.

We then fit this object to the training set: we create the regressor and call its `.fit()` method with the training data.

#### Fitting Multiple Linear Regression to the Training set
```python
from sklearn.linear_model import LinearRegression

# Create the regressor and fit it to the training data
regressor = LinearRegression()
regressor.fit(X_train, y_train)
```
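
Once fitted, the regressor exposes the learned intercept and one coefficient per column of `X`, which together define the linear equation used for prediction:
```python
# profit ≈ intercept + coef_0 * x_0 + coef_1 * x_1 + ...
print(regressor.intercept_)
print(regressor.coef_)
```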
#### Predicting the Test Set Results

And now, we can test the performance of the model on a separate set, the test set:
```python
# Vector of predicted profits for the test-set companies
y_pred = regressor.predict(X_test)
```

We’ll compare two columns:

- **y_test** — the column containing the real profits from the test set.

- **y_pred** — the column containing the predicted profits; it is the vector of predictions.

Now we can compare some of the profits, since each row of `y_pred` corresponds to the same observation as the matching row of `y_test`.
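
A convenient way to put the two vectors side by side, together with a couple of standard accuracy metrics, is a small sketch like this (the table below lists a few such pairs):
```python
import pandas as pd
from sklearn.metrics import mean_absolute_error, r2_score

# Side-by-side comparison of real and predicted profits
comparison = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
print(comparison.head())

# Overall test-set accuracy
print('MAE:', mean_absolute_error(y_test, y_pred))
print('R^2:', r2_score(y_test, y_pred))
```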

| Prediction | Actual Value | Comment |
| --- | --- | --- |
| $175076 | $191040 | A pretty good prediction. |
| $99103 | $103298 | Congratulations, that’s very close! |
| $90550 | $108442 | Another good prediction. |

**Note:** Comparing a handful of rows is encouraging, but there are better ways of evaluating and improving the model. One of them is Backward Elimination, which we build step by step below and automate at the end of the article.

### Building the Optimal Model Using Backward Elimination
```python
# statsmodels' OLS does not add an intercept automatically,
# so we prepend a column of ones to X
import statsmodels.api as sm
X = np.append(arr = np.ones((30, 1)).astype(int), values = X, axis = 1)

# Start with all independent variables (column 0 is the intercept)
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

When we built the model above, we used all of the independent variables. Some of them are highly statistically significant and some are not. With Backward Elimination we build the optimal model that keeps only the highly significant variables: fit the model with every predictor, find the predictor with the highest `p-value`, remove it if that value exceeds the chosen significance level, refit, and repeat.

Before we get into Backward Elimination, make sure you’ve been introduced to the `p-value` and have a basic understanding of how it works. Understanding the p-value will also help you deepen your understanding of hypothesis testing in general.

Before I talk about what the p-value is, let’s talk about what it isn’t.

- The p-value is NOT the probability the claim is true. Of course, this would be an amazing thing to know! Think of it as “there is a 10% chance that this medicine works”. Unfortunately, this just isn’t the case. Actually determining this probability would be really tough, if not impossible!

- The p-value is NOT the probability the null hypothesis is true. Another one that seems so logical it has to be right! This one is much closer to reality, but again it is far too strong a statement.

- The p-value IS the probability of getting a sample like ours, or more extreme than ours, IF the null hypothesis is true. So we assume the null hypothesis is true and then determine how “strange” our sample really is. If it is not that strange (a large p-value), we don’t change our mind about the null hypothesis. As the p-value gets smaller, we start wondering whether the null really is true and whether perhaps we should change our minds (and reject the null hypothesis).

In each round we look at the independent variable with the highest `p-value` and compare it to the significance level. If the `p-value` is greater than the significance level (0.05), we remove that variable and refit the model.
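
Instead of reading the p-values off the printed summary, you can also pull them directly from the fitted results object. A small sketch:
```python
# p-values come back in the same order as the columns of X_opt
pvalues = regressor_OLS.pvalues
worst = pvalues.argmax()          # index of the least significant variable
print(worst, pvalues[worst])      # drop this column if the value exceeds 0.05
```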

In the first pass, variable `X5` has the highest `p-value` (0.674), so we remove it:

```python
X_opt = X[:, [0, 1, 2, 3, 4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

In the next step, variable `X4` has the highest `p-value` (0.813), so we remove it as well. We keep repeating this until every remaining variable is below the significance level.

```python
X_opt = X[:, [0, 1, 2, 3]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
```

We now have our optimal model. The set of independent variables that predicts profit with the highest statistical significance is composed of three variables: *Salary*, *Marketing* and *State*.

If you are also interested in automatic implementations of Backward Elimination in Python, you can find two of them below:

### Automatic Backward Elimination

#### Backward Elimination with p-values only
```python
import statsmodels.api as sm

# Add the intercept column of ones, as before
X = np.append(arr = np.ones((30, 1)).astype(int), values = X, axis = 1)

def backwardElimination(x, sl):
    # Repeatedly drop the column with the highest p-value while it exceeds sl
    numVars = len(x[0])
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        if maxVar > sl:
            for j in range(0, numVars - i):
                if regressor_OLS.pvalues[j].astype(float) == maxVar:
                    x = np.delete(x, j, 1)
    regressor_OLS.summary()
    return x

SL = 0.05
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)
```

#### Backward Elimination with p-values and Adjusted R Squared:
```python
import statsmodels.api as sm

# Add the intercept column of ones, as before
X = np.append(arr = np.ones((30, 1)).astype(int), values = X, axis = 1)

def backwardElimination(x, SL):
    # Drop the column with the highest p-value, but roll the deletion back
    # if removing it lowers the adjusted R-squared
    numVars = len(x[0])
    temp = np.zeros((30, 6)).astype(int)
    for i in range(0, numVars):
        regressor_OLS = sm.OLS(y, x).fit()
        maxVar = max(regressor_OLS.pvalues).astype(float)
        adjR_before = regressor_OLS.rsquared_adj.astype(float)
        if maxVar > SL:
            for j in range(0, numVars - i):
                if regressor_OLS.pvalues[j].astype(float) == maxVar:
                    temp[:, j] = x[:, j]
                    x = np.delete(x, j, 1)
                    tmp_regressor = sm.OLS(y, x).fit()
                    adjR_after = tmp_regressor.rsquared_adj.astype(float)
                    if adjR_before >= adjR_after:
                        X_rollback = np.hstack((x, temp[:, [0, j]]))
                        X_rollback = np.delete(X_rollback, j, 1)
                        print(regressor_OLS.summary())
                        return X_rollback
                    else:
                        continue
    regressor_OLS.summary()
    return x

SL = 0.05
X_opt = X[:, [0, 1, 2, 3, 4, 5]]
X_Modeled = backwardElimination(X_opt, SL)
```
### Conclusion
We can see that our model did quite a good job. There is a clear linear relationship between the independent variables and the dependent variable, which is why we were able to fit a linear model to this dataset of multiple variables. You now know how to build a multiple linear regression model in Python.

Source: https://www.kaggle.com/farhanmd29/50-startups
