What is Linear Regression? Loss Function | Gradient Descent | Polynomial Regression with Python code

Preethi Thakur
Oct 7, 2022


Linear regression is a predictive modeling technique that lets us build a model which predicts a continuous response (target) variable as a linear combination of one or more explanatory (predictor) variables.

While building a linear regression model, the goal is to identify a linear equation that best predicts the relationship between the response variable and one or more predictor variables.

Preparing Data For Linear Regression

There are some rules of thumb for preparing our data before training our linear model. Try different preparations of your data and see what works best for your problem.

  • Linear Assumption: Linear regression assumes that the relationship between the input and the output variables is linear. You may need to transform data to make the relationship linear (e.g. log transform for an exponential relationship).
  • Remove Noise: Linear regression assumes that the input and output variables are not noisy. Consider using data cleaning operations that let you better expose and clarify the signal in your data. This is most important for the output variable and you want to remove outliers in the output variable (y) if possible.
  • Remove Collinearity: Linear regression will over-fit your data when you have highly correlated input variables. Consider calculating pairwise correlations for your input data and remove the most correlated.
  • Gaussian Distributions: Linear regression will make more reliable predictions if your input and output variables have a Gaussian distribution. You can get some benefit from transforms (e.g. log or Box-Cox transforms) on your variables to make their distribution look more Gaussian.
  • Rescale Inputs: Linear regression will often make more reliable predictions if you rescale input variables using standardization or normalization (a minimal preprocessing sketch follows this list).
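To make the last two points concrete, here is a minimal preprocessing sketch with NumPy and Scikit-Learn; the small X_raw array and the choice of a log transform are illustrative assumptions, not data from this article:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative skewed, positive-valued feature (assumed for demonstration)
X_raw = np.array([[1.0], [10.0], [100.0], [1000.0]])

# Log transform: makes an exponential-looking relationship more linear/Gaussian
X_log = np.log(X_raw)

# Standardization: rescale to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_log)

print(X_scaled.ravel())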

Types of linear regression models

There are different types of Linear Regression models. Let’s discuss them!

Simple or Univariate linear regression models

This type of linear regression model is used to build a linear relationship between one response variable (dependent variable) and one predictor variable (independent variable). The equation that represents a simple linear regression model is:

ŷ = β0 + β1x
  • ŷ is the predicted value
  • x is the input value of the predictor variable
  • β0 and β1 are the model parameters: β0 is the bias term and β1 is the feature weight. β0 is simply the intercept of the linear regression line, while β1 is the slope of the line.
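For example, with illustrative parameter values β0 = 2 and β1 = 0.5 (made up purely for demonstration), the prediction is just the line evaluated at each input:

import numpy as np

beta0, beta1 = 2.0, 0.5          # illustrative intercept and slope
x = np.array([1.0, 2.0, 3.0])    # input values

y_hat = beta0 + beta1 * x        # ŷ = β0 + β1·x
print(y_hat)                     # [2.5 3.  3.5]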

Multiple or Multi-variate linear regression models

This type of linear regression model is used to build a linear relationship between one dependent variable and more than one independent variable. The equation that represents a multiple linear regression model is:

ŷ = β0 + β1x1 + β2x2 + … + βnxn
  • ŷ is the predicted value
  • n is the number of features, with i ranging over 1, 2, …, n
  • xi is the input value of the ith feature
  • β0 and βi are the model parameters, where β0 is the bias term and βi is the weight of the ith feature

Linear Regression Model Representation

From the above equations we can understand that a linear model makes a prediction simply by computing a weighted sum of the input features, plus one additional coefficient (the bias), which gives the line an additional degree of freedom and is often called the intercept or bias coefficient.

Now this linear equation ŷ = β0 + β1x1 + … + βnxn can be written concisely in vectorized form as:

ŷ = hθ(x) = θ · x
  • θ is the model's parameter vector, containing the bias term θ0 and the feature weights θ1 to θn.
  • x is the instance's feature vector, containing x0 to xn, with x0 always equal to 1.
  • θ · x is the dot product of the vectors θ and x.
  • hθ is the hypothesis function, using the model parameters θ.
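As a minimal sketch of this vectorized prediction (the parameter values and the single instance below are made up for illustration):

import numpy as np

theta = np.array([2.0, 0.5, -1.0])   # [θ0 (bias), θ1, θ2], illustrative values
x = np.array([1.0, 3.0, 4.0])        # [x0, x1, x2], with x0 always equal to 1

y_hat = theta.dot(x)                 # hθ(x) = θ · x = 2 + 0.5*3 - 1*4
print(y_hat)                         # -0.5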

Cost Function

While training the linear regression model, we are required to determine the coefficients θ that minimize the model’s cost function (error), which results in the best-fitted linear regression line.

The cost function quantifies the error between the predicted and the actual (observed) values, and it determines how accurate our model is.

Meaning? When our linear model makes predictions, not all of them will match the actual values. The distance between a predicted value and the actual value is the amount of error the model has made on that prediction; this is measured by the loss function. The sum of the losses over all predictions is the cost function. It tells us how well or badly our model is performing on the data we gave it.

How about a description with a picture? Let's check out this graph.

Here Y is our target variable, Ŷ is the predicted value and X is the predictor (independent) variable. Taking a point (Xi, Yi), ei is the amount of error the linear model has made in predicting Yi.

So, our job is to reduce this error so that the linear regression line fits the actual target values as closely as possible for the given input variables. Sometimes this regression line is also called the best-fit line.

Cost Function Equation

The most common performance measures for a linear regression model are the Root Mean Square Error (RMSE) and the Mean Square Error (MSE). Let's take MSE as our linear regression cost function. MSE is the mean of the sum of all squared errors:

MSE = (1/n) · Σᵢ (Yi − Ŷi)²

Here Yi is the actual value, Ŷi is the predicted value and n is the number of samples (rows in our dataset). We already know Ŷi = hθ(xi) = θ · xi.
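Here is a minimal NumPy sketch of this cost computation, using small made-up arrays for the actual and predicted values:

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])      # actual values Yi (illustrative)
y_pred = np.array([2.5, 5.0, 8.0])      # predicted values Ŷi (illustrative)

mse = np.mean((y_true - y_pred) ** 2)   # (1/n) · Σ (Yi − Ŷi)²
print(mse)                              # (0.25 + 0 + 1) / 3 ≈ 0.4167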

So, all we need is to find the θ values that reduce the cost function to its minimum.

There are different optimization techniques in Machine Learning to determine the coefficients θ that reduce the cost function. Let's see what they are.

Techniques for Estimating Linear Regression Coefficients

1. Ordinary Least Squares

The Ordinary Least Squares (OLS) method is an optimization technique that minimizes the cost function and finds the best-fit line for the input data. Here the cost function is the sum of squared residuals (Residual Sum of Squares, RSS) between the actual and predicted response values:

RSS = Σᵢ (yi − ŷi)²

This residual sum can be re-written as

RSS = Σᵢ (yi − β0 − β1xi)²

since we know ŷi = β0 + β1xi.

The main objective of the OLS method is to minimize this residual, or error (the cost function). How do we do this? We take the partial derivatives of the cost function with respect to the coefficients β0 and β1, set the partial derivatives equal to zero, and solve for each of the coefficients.

Let’s do some math

Taking ∂RSS/∂β0 = 0 and ∂RSS/∂β1 = 0 and solving gives the closed-form estimates:

β1 = Σᵢ (xi − x̄)(yi − ȳ) / Σᵢ (xi − x̄)²

β0 = ȳ − β1 · x̄

where x̄ and ȳ are the means of the input values and the output values.

These β0 and β1 values minimize the cost function of a simple linear regression model, where there is only one independent variable xi and one dependent variable yi.
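A minimal NumPy sketch of these closed-form estimates, using a small made-up dataset (roughly y ≈ 2 + 3x plus noise):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # illustrative inputs
y = np.array([5.1, 7.9, 11.2, 14.1, 16.8])    # illustrative targets

x_mean, y_mean = x.mean(), y.mean()

# OLS closed-form estimates for simple linear regression
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

print(beta0, beta1)   # close to the underlying intercept 2 and slope 3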


For a multiple linear regression problem, where there is more than one input variable, the coefficients that minimize the cost function are given by a closed-form mathematical equation called the Normal Equation:

θ̂ = (Xᵀ · X)⁻¹ · Xᵀ · y

  • θ̂ is the value of θ that minimizes the cost function.
  • y is the vector of target values, one per training instance.
  • X is the matrix of input feature values, with one row per instance and x0 = 1 in each row for the bias term.
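A minimal NumPy sketch of the Normal Equation; the data here is an illustrative assumption, generated roughly as y = 4 + 3x plus Gaussian noise (which is consistent with the intercept and coefficient shown in the Scikit-Learn output below):

import numpy as np

# Illustrative data: y ≈ 4 + 3x + Gaussian noise (assumed for demonstration)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

X_b = np.c_[np.ones((m, 1)), X]                       # add x0 = 1 to each instance
theta_best = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y   # (Xᵀ·X)⁻¹ · Xᵀ · y

print(theta_best)   # should be close to [[4.], [3.]]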

Performing linear regression using Scikit-Learn is quite simple (here X and y are the training inputs and targets):

>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([4.21509616]), array([[2.77011339]]))
>>> X_new = np.array([[0], [2]])  # two new instances, at x = 0 and x = 2
>>> lin_reg.predict(X_new)
array([[4.21509616],
       [9.75532293]])

Here intercept_ and coef_ are β0 and β1 respectively.

Computational Complexity

The Normal Equation computes the inverse of Xᵀ · X (X transpose times X), which is an (n + 1) × (n + 1) matrix, where n is the number of features. The computational complexity of inverting such a matrix is typically about O(n³), so the Normal Equation approach gets very slow when the number of features grows large. To solve this problem we use the Gradient Descent optimization technique.

Gradient Descent

Gradient Descent is an optimization technique where the idea is to start by picking random values for θ (the weights), then iteratively improve the estimate by updating all the values of θ simultaneously in the direction that reduces the cost function. Eventually we reach the optimal θ, where the cost function is at its minimum. Gradient Descent is especially useful when you have a very large dataset or a large number of features.

The MSE (here it is the cost function) is calculated over all pairs of input and output values. The cost function, as we already discussed, is

MSE(θ) = (1/n) · Σᵢ (Yi − Ŷi)²

This cost function can also be written as follows, since Ŷi = hθ(xi):

MSE(θ) = (1/n) · Σᵢ (hθ(xi) − Yi)²

where hθ(xi) = θ · xi is the model's prediction for the ith instance.

After taking the partial derivative of the MSE with respect to each parameter θj (the coefficient of the jth feature), we get the following equation.

∂MSE(θ)/∂θj = (2/n) · Σᵢ (hθ(xi) − Yi) · xij

Gradient Descent then updates each parameter in the opposite direction of its partial derivative: θj := θj − η · ∂MSE(θ)/∂θj, where η is the learning rate.

This learning rate η is an important hyperparameter: it determines the size of each step and therefore how fast or slowly we move towards the minimum (the global minimum). It has to be set appropriately; too small and the algorithm needs many iterations to converge, too large and it may overshoot the minimum. The updates are repeated until the cost function reaches (or gets close enough to) its minimum.

This figure demonstrates how the learning rate affects the time taken to reach the minimum; θ is the coefficient (the weights of the input variables) and J(θ) is the cost function.
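Here is a minimal Batch Gradient Descent sketch with NumPy; the data generation, learning rate and iteration count are illustrative choices, not prescriptions:

import numpy as np

# Illustrative data: y ≈ 4 + 3x + Gaussian noise (assumed for demonstration)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)
X_b = np.c_[np.ones((m, 1)), X]        # add x0 = 1 for the bias term

eta = 0.1                              # learning rate η (illustrative)
n_iterations = 1000
theta = np.random.randn(2, 1)          # random initialization of θ

for _ in range(n_iterations):
    gradients = (2 / m) * X_b.T @ (X_b @ theta - y)   # ∂MSE/∂θ over the full batch
    theta = theta - eta * gradients                   # step towards the minimum

print(theta)   # should end up close to [[4.], [3.]]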

There are three different methods in Gradient Descent which we can use to get the optimal coefficients. They are:

  • Batch Gradient Descent
  • Stochastic Gradient Descent
  • Mini-Batch Gradient Descent
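As a quick pointer to the stochastic variant, Scikit-Learn's SGDRegressor trains a linear regression model with Stochastic Gradient Descent; this is a minimal sketch, and the data and hyperparameter values (max_iter, tol, eta0) are illustrative:

import numpy as np
from sklearn.linear_model import SGDRegressor

# Illustrative data: y ≈ 4 + 3x + Gaussian noise (assumed for demonstration)
m = 100
X = 2 * np.random.rand(m, 1)
y = 4 + 3 * X + np.random.randn(m, 1)

sgd_reg = SGDRegressor(max_iter=1000, tol=1e-3, penalty=None, eta0=0.1)
sgd_reg.fit(X, y.ravel())

print(sgd_reg.intercept_, sgd_reg.coef_)   # should end up close to 4 and 3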

Click below for an in-depth understanding of Gradient Descent. I have explained how each type of Gradient Descent works, including the learning rate, step size, global minimum and local minimum.

Polynomial Regression

What if your data is actually more complex than a simple straight line? Surprisingly, you can actually use a linear model to fit non-linear data. A simple way to do this is to add powers of each feature as new features, then train a linear model on this extended set of features. This technique is called Polynomial Regression.

The equation of the polynomial becomes something like this:

y = a0 + a1x1 + a2x1² + … + anx1ⁿ

Let's generate some nonlinear data, based on a simple quadratic equation:

import numpy as np

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

let’s use Scikit-Learn’s PolynomialFeatures class to transform our training data, adding the square (2nd-degree polynomial) of each feature in the training set as new features (in this case there is just one feature):

>>> from sklearn.preprocessing import PolynomialFeatures
>>> poly_features = PolynomialFeatures(degree=2, include_bias=False)
>>> X_poly = poly_features.fit_transform(X)
>>> X[0]
array([-0.75275929])
>>> X_poly[0]
array([-0.75275929, 0.56664654])

X_poly now contains the original feature of X plus the square of this feature. Now you can fit a LinearRegression model to this extended training data:

>>> lin_reg = LinearRegression()
>>> lin_reg.fit(X_poly, y)
>>> lin_reg.intercept_, lin_reg.coef_
(array([1.78134581]), array([[0.93366893, 0.56456263]]))

Not bad: the model estimates ŷ = 0.56x1² + 0.93x1 + 1.78 when in fact the original function was y = 0.5x1² + 1.0x1 + 2.0 + Gaussian noise.

When there are multiple features, Polynomial Regression is capable of finding relationships between features. This is made possible by the fact that PolynomialFeatures also adds all combinations of features up to the given degree. For example, if there were two features a and b, PolynomialFeatures with degree=3 would not only add the features a^2, a^3, b^2, and b^3, but also the combinations ab, a^2b, and ab^2.
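A small sketch of this behaviour, using two illustrative features a and b with degree=3 (get_feature_names_out is available in recent Scikit-Learn versions; older versions expose a similar get_feature_names method):

import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_ab = np.array([[2.0, 3.0]])   # one instance with two illustrative features a and b

poly = PolynomialFeatures(degree=3, include_bias=False)
X_ab_poly = poly.fit_transform(X_ab)

print(poly.get_feature_names_out(["a", "b"]))
# ['a' 'b' 'a^2' 'a b' 'b^2' 'a^3' 'a^2 b' 'a b^2' 'b^3']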

If you perform high-degree Polynomial Regression, you will likely fit the training data much better than with plain Linear Regression. However, such a high-degree Polynomial Regression model will severely overfit the training data, while the linear model underfits it. The model that generalizes best in this case is the quadratic model. That makes sense, since the data was generated using a quadratic function, but in general you won't know what function generated the data.
