Multi-Variate Linear Regression


Understanding what happens behind the scenes of popular libraries like scikit-learn when they implement various machine learning algorithms is one of the most challenging parts of any data scientist's journey. So, I came up with yet another machine learning algorithm today. In this post, we'll get insights into the method, code it from scratch, and then apply it as a prediction model.

What is Multi-Variate Linear Regression?

Multi-variate linear regression (linear regression on multiple variables) is similar to the simple, uni-variate linear regression model (click here if you haven't checked my blog on uni-variate linear regression), but it has multiple independent variables contributing to the dependent variable. This results in more coefficients to calculate and a more involved computation because of the additional variables.

Model Representation

Notations
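Following the convention used throughout the rest of this post: m is the number of training examples, n is the number of features (input columns), x⁽ⁱ⁾ denotes the features of the i-th training example, xⱼ⁽ⁱ⁾ the value of feature j in that example, y⁽ⁱ⁾ its target value, and θ = (θ₀, θ₁, …, θₙ) the vector of model parameters we want to learn.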

Loading Data

Suppose you are selling your house and want to know what a good market price would be. One way to do this is to first collect information on recently sold houses and build a model of housing prices. Consider a dataset that contains housing prices in Portland, Oregon.

Let’s import the required libraries and load the dataset into a Pandas Dataframe:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# The file has no header row: column 0 is the house size, column 1 the
# number of bedrooms and column 2 the price.
house_data = pd.read_csv('ex1data2.txt', header=None)
house_data.head()
Output: house_data.head()

Here the first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house. So n = 2 (i.e., size and number of bedrooms).

house_data.describe()
Output: house_data.describe()

Feature Normalization

Before we begin working on a problem, we must first examine and analyse the data. This step appears simple at first glance, but if not done correctly, it can be really painful.

What is the purpose of data normalization (feature scaling)?
Some of our features may be in the 0–1 range, while others may be in the 0–1000 range. If you feed the data in as-is, gradient descent can converge very slowly or behave poorly, as the quick check below illustrates.
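A quick, illustrative check of the raw scales (assuming the data was loaded as above, with integer column labels 0 and 1):

# Column 0 (size in sq ft) and column 1 (bedrooms) live on very different
# scales: size is in the thousands while the bedroom count is a single digit.
print(house_data[[0, 1]].agg(['min', 'max']))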

Without Feature Scaling

From the above image we can see that, without feature scaling, the contours of the cost function are elongated. Gradient descent then needs a large number of iterations to find the minimum, which slows down our approach and sometimes makes it difficult to reach the minimum at all.

With Feature Scaling

From the above image, we can see that after feature scaling, gradient descent finds the minimum in a small number of steps. This speeds up our algorithm.

We are going to use mean normalization to bring our features onto a similar scale:

Replace xᵢ with xᵢ − μᵢ so that the features have approximately zero mean (do not apply this to x₀), and then divide by the standard deviation σᵢ of that feature.

Mean Normalization: xᵢ := (xᵢ − μᵢ) / σᵢ, where μᵢ is the mean and σᵢ the standard deviation of feature i

Example:

Before feature scaling
After feature scaling
def featureNormalize(X):
    mean = np.mean(X, axis=0)        # per-feature mean
    std = np.std(X, axis=0)          # per-feature standard deviation
    X_norm = (X - mean)/std
    return X_norm, mean, std

mod_house_data = house_data.values
m2 = len(mod_house_data[:, 2])                   # number of training examples
X2 = mod_house_data[:, :2].reshape(m2, 2)        # features: size, bedrooms
y2 = mod_house_data[:, 2].reshape(m2, 1)         # target: price
X2, mu, sigma = featureNormalize(X2)
X2 = np.column_stack((np.ones((m2, 1)), X2))     # add the intercept term x0 = 1
After Normalization
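A quick sanity check worth running at this point: apart from the intercept column of ones, every feature column should now have mean ≈ 0 and standard deviation ≈ 1.

# Ignore column 0 (the intercept of ones); the remaining columns should be
# approximately standard-normalized.
print(X2[:, 1:].mean(axis=0).round(6))
print(X2[:, 1:].std(axis=0).round(6))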

Hypothesis for Multi-Variate Regression

In multi-variate regression we will use multiple variables to predict the output, so our hypothesis is,

Hypothesis: hθ(x) = θ₀x₀ + θ₁x₁ + θ₂x₂ + ⋯ + θₙxₙ = θᵀx, with x₀ = 1

Here, θᵀ (theta transpose) is a row vector containing all the parameters θᵢ (size ℝⁿ⁺¹, where n is the number of features/columns), and x is a column vector (also of size ℝⁿ⁺¹) containing the feature values xᵢ of a single example, with x₀ = 1 for the intercept term.
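For this housing dataset, with n = 2 and x₀ = 1, the hypothesis expands to hθ(x) = θ₀ + θ₁x₁ + θ₂x₂, where x₁ is the (normalized) size of the house and x₂ the (normalized) number of bedrooms.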

Cost Function

The goal is to set the parameters so that hθ(x) is close to y for each training example, i.e., choose θ₀, θ₁, …, θₙ so that hθ(x) is close to y for every x.
This condition can be expressed numerically as follows:

Cost Function: J(θ) = (1 / 2m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

To know more about the cost function, click here to read about it in my previous post.

def compute_cost(X, y, theta):
    m = len(y)
    h_theta = X.dot(theta)                      # predictions for all m examples
    J = 1/(2*m) * np.sum((h_theta - y)**2)
    return J

theta2 = np.zeros((3, 1))                       # theta_0, theta_1, theta_2 initialized to zero
compute_cost(X2, y2, theta2)

Gradient Descent on Multi-Variate Regression

In our earlier post, we implemented gradient descent for uni-variate regression. The only difference now is that there is more than one feature in the matrix X. (To know more about gradient descent, click here.)

Update rule: θⱼ := θⱼ − α · (1/m) · Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾   (update all θⱼ, j = 0, …, n, simultaneously)
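The code below relies on the gradientDescent helper from the uni-variate post, which is not repeated in this article. Here is a minimal sketch that matches how it is called later (returning the learned theta and the history of J(θ)); the original implementation may differ in details such as how it plots when graph=True.

def gradientDescent(X, y, theta, alpha, n_iters, graph=True):
    """Batch gradient descent: returns the learned theta and the cost history."""
    m = len(y)
    J_history = []
    for _ in range(n_iters):
        h_theta = X.dot(theta)                       # predictions, shape (m, 1)
        gradient = (1/m) * X.T.dot(h_theta - y)      # vectorized partial derivatives
        theta = theta - alpha * gradient             # simultaneous update of all theta_j
        J_history.append(compute_cost(X, y, theta))
    if graph:
        plt.plot(J_history)
        plt.xlabel("No. of Iterations")
        plt.ylabel("J(theta)")
        plt.title("Cost function using Gradient Descent")
    return theta, J_history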

Selecting Learning Rates

Alpha, or the learning rate, decides the size of the steps that the algorithm takes to reach the minimum of the cost function (i.e., the parameter values that give the minimum cost).

Now, we will try different learning rates for the dataset and find a learning rate that converges quickly.

When α is small, gradient descent takes more time to converge; when α is too large, it overshoots and fails to converge. We can find a good learning rate (α) by trying out different values. I recommend trying values of α on a log scale, at multiplicative steps of about 3× the previous value (i.e., 0.3, 0.1, 0.03, 0.01 and so on).

lr = [0.01, 0.03, 0.09, 0.1, 0.3, 0.9]
J_histories = []
for x in lr:
    # Start every run from a fresh theta of zeros so the learning rates are comparable
    theta_init = np.zeros((3, 1))
    _, J_history = gradientDescent(X=X2, y=y2, theta=theta_init, alpha=x, n_iters=100, graph=False)
    J_histories.append(J_history)

Visualizing J(θ) for different learning rates,

for x in range(len(lr)):
    plt.plot(J_histories[x], label=lr[x])
plt.xlabel("No. of Iterations")
plt.ylabel("J(theta)")
plt.title("Cost function J(theta) for different learning rates")
plt.legend(title="Learning rates", loc=1)
J(θ) for different learning rates (α)

From the above graph, we can see that α = 0.01 is a good choice, so we train our model using a learning rate of 0.01.

theta2 = np.zeros((3, 1))     # make sure theta starts from zeros for the final training run
theta2, J_history2 = gradientDescent(X2, y2, theta2, 0.01, 400)
print(f"h(x) = {round(theta2[0,0],2)} + {round(theta2[1,0],2)}x1 + {round(theta2[2,0],2)}x2")

Predictions

The prediction formula remains unchanged:

def predict(X, theta):
    # X is a single example (including the intercept term) as a 1-D array of length n+1
    predictions = np.dot(theta.T, X)
    return predictions[0]

Don't forget to scale a new example using the training-set mean and standard deviation before using it to make predictions:

# Scale the new example with the training-set statistics (mu, sigma) returned
# by featureNormalize, then add the intercept term before predicting.
x_sample = (np.array([1650, 3]) - mu) / sigma
x_sample = np.append(np.ones(1), x_sample)
predict3 = predict(x_sample, theta2)
print(f"For size of house = 1650, Number of bedrooms = 3, we predict a house value of ${round(predict3, 0)}")

Output: For size of house = 1650, Number of bedrooms = 3, we predict a house value of $430447.0

Conclusion

Today, we saw the concepts behind the hypothesis, cost function, and gradient descent for multi-variate regression, and then implemented everything from scratch using Python's NumPy, Pandas and Matplotlib. The dataset and final code are uploaded on GitHub.

Check it out here: Linear Regression.

If you like this post, then check out my other posts in this series:

1. What is Machine Learning?

2. What are the Types of Machine Learning?

3. Uni-Variate Linear Regression

4. Logistic Regression

5. What are Neural Networks?

6. Digit Classifier using Neural Networks

7. Image Compressing with K-means Clustering

8. Dimensionality Reduction on Face using PCA

9. Detect Failing Servers on a Network using Anomaly Detection

Last Thing

If you enjoyed my article, a clap 👏 and a follow would be 📈 regressive, and it helps Medium promote this article so that others may read it. I am Jagajith and I will catch you in the next one.
