Linear Regression Using Normal Equations and Polynomial Regression.

Rajwrita Nath
Feb 23, 2020


Linear Regression is a supervised machine learning algorithm that performs a regression task. It is mostly used for finding the relationship between variables and for forecasting. This blog gives a brief idea of two regression algorithms, linear and polynomial regression, and of how the regression line is derived mathematically using normal equations.

Data on two variables recorded simultaneously for a group of individuals are called bi-variate data. Examples of bi-variate data are heights and weights of the students in a class, the rainfall and the yield of paddy in a state for several consecutive years, etc.

When we have bi-variate data, we can, no doubt, consider the values of each variable separately to know the different measures like the mean and standard deviation of the variable; but here we are mainly concerned with two other problems.

Firstly, we want to study the nature and extent of association, if any, between the variables.

Secondly, if the variables are found to be associated we express one of them (regarded as the dependent variable) as a mathematical function of the other (considered as an independent variable), so that we can predict the value of the dependent variable when the value of the independent variable is known.

The first problem is called correlation analysis and the second, regression analysis.

To find the relationship between continuous, correlated variables we use linear regression. Linear regression looks for a statistical relationship between a set of correlated values. The model is a linear equation that combines a specific set of input values (x) to produce the predicted output (y) for those inputs. As such, both the input values (x) and the output value (y) are numeric. When there is a single input variable (x), the method is referred to as simple linear regression. When there are multiple input variables, the method is referred to as multiple linear regression.

Derivation of linear regression equation:

Let the linear regression equation of y on x be

y = a + bx

Since we would like to use this equation for prediction, the constants a and b have to be estimated on the basis of observed values of x and y. Suppose we are given n pairs of values (xi, yi), i = 1, 2, …, n, of x and y. Among the different methods available for determining a and b, we use the method of least squares, which has many desirable properties.

When x=xi, the observed value of y is yi and the predicted value of y is a+bxi. So,

ei = yi − (a + bxi)

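This ei is the error in taking a + bxi for yi. This is called the error of estimation. The method of least squares requires that a and b be so determined that the sum of squared errors

Σ ei² = Σ (yi − a − bxi)²

is a minimum, where Σ denotes summation over i = 1, 2, …, n. Setting the partial derivatives of this sum with respect to a and b equal to zero, we get the

Normal Equations

Σ yi = na + b Σ xi
Σ xiyi = a Σ xi + b Σ xi²

Solving these for a and b gives the

Linear Regression Equation of y on x

y − ȳ = r (sy / sx) (x − x̄)

where x̄ and ȳ are the means of x and y, sx and sy are their standard deviations, and r is the correlation coefficient between x and y.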

The quantity r (sy / sx), usually denoted by byx, is called the regression coefficient of y on x. It gives the increment in y for a unit increase in x.
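As a rough illustration of the derivation, the two normal equations for y = a + bx can be solved directly with NumPy. This is only a sketch: the data values and the names `x_demo`, `y_demo` and `b_from_r` are made up for this example and are not part of the original data set. The slope is also compared with r (sy / sx).

```python
import numpy as np

# Made-up sample data (an assumption, for illustration only)
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(x_demo)

# Normal equations for y = a + b*x:
#   sum(y)   = n*a      + b*sum(x)
#   sum(x*y) = a*sum(x) + b*sum(x**2)
A = np.array([[n, x_demo.sum()],
              [x_demo.sum(), (x_demo ** 2).sum()]])
rhs = np.array([y_demo.sum(), (x_demo * y_demo).sum()])
a, b = np.linalg.solve(A, rhs)

# The slope should agree with r * (sy / sx)
r = np.corrcoef(x_demo, y_demo)[0, 1]
b_from_r = r * y_demo.std() / x_demo.std()

print(f"a = {a:.4f}, b = {b:.4f}, r*(sy/sx) = {b_from_r:.4f}")
```

The two slope values should match up to floating-point error, which is a quick way to check that the normal equations and the regression coefficient byx describe the same line.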

Modelling Simple Linear Regression

The very first step is to import the libraries.

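import operator
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error, r2_score
from sklearn import metrics
import statsmodels.api as sm
import matplotlib.pyplot as plt
# Jupyter/IPython magic: remove this line when running as a plain Python script
%matplotlib inline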

We import a data set having (x, y) pairs of values.

df = pd.read_csv('test.csv', index_col=False)

We use matplotlib, a popular Python plotting library, to make a scatter plot.

plt.figure(figsize=(16, 8))
plt.scatter(df['x'], df['y'])  # plot the (x, y) pairs
Scatter Diagram

As you can see, there is a clear relationship between the variables ‘x’ and ‘y’.

Now we will focus on getting a linear approximation of the data.

X = df['x'].values.reshape(-1,1)
y = df['y'].values.reshape(-1,1)
reg = LinearRegression().fit(X, y)
print("The linear model is: Y = {:.5} + {:.5}X".format(reg.intercept_[0], reg.coef_[0][0]))
The linear model is: Y = -0.46181 + 1.0143X

Next, we visualize how the line fits the data.

predictions = reg.predict(X)
plt.figure(figsize=(16, 8))
plt.scatter(X, y)                    # observed data
plt.plot(X, predictions, color='r')  # fitted regression line
Linear Fit

How relevant is my model?

The relevance of the model is judged by the R² value. The R² metric measures the proportion of variability in the target that can be explained using a feature X. Therefore, assuming a linear relationship, if feature X can explain (predict) the target, then the proportion is high and the R² value will be close to 1. If the opposite is true, the R² value is closer to 0.

Here is how the process is done.

X = df['x']
y = df['y']
X2 = sm.add_constant(X)
est = sm.OLS(y, X2)
est2 = est.fit()
print(est2.summary())

These lines of code give us the following output:

Determination of R² value

R-squared is a statistical measure of how close the data are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination for multiple regression. 0% indicates that the model explains none of the variability of the response data around its mean, while 100% indicates that it explains all of it.

In this case, an R² value of 0.989 indicates that about 98.9% of the variability of ‘y’ is explained by ‘x’.
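To make the definition concrete, R² can be computed by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares. The snippet below is a minimal sketch: the arrays `y_true` and `y_hat` are made-up values, not the output of the model fitted above.

```python
import numpy as np

# Made-up observations and predictions (assumptions, for illustration only)
y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_hat = np.array([2.8, 5.3, 6.9, 9.2, 10.8])

ss_res = np.sum((y_true - y_hat) ** 2)          # variation left unexplained by the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r_squared = 1 - ss_res / ss_tot

print(f"R^2 = {r_squared:.4f}")
```

When the predictions track the observations closely, `ss_res` is small relative to `ss_tot` and R² approaches 1.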

Linear Regression works on data where the dependent and independent variables have a linear relationship.

Where the data do not have a linear relationship and instead follow a more complex pattern, Polynomial Regression is used.

What is Polynomial Regression?

Polynomial Regression is a form of linear regression in which the relationship between the independent variable x and dependent variable y is modeled as an nth degree polynomial. Polynomial regression fits a nonlinear relationship between the value of x and the corresponding conditional mean of y, denoted E(y |x).

Polynomial Regression is used to overcome the under-fitting that simple linear regression shows on such data.

The linear equation used earlier:

y = a + bx

is now converted to:

y = a + bx + cx²

This is still considered to be a linear model, because the equation is linear in the coefficients (weights) a, b and c; x² is just another feature. However, the curve that we are fitting is quadratic in nature.

The equation of Polynomial Regression can be generalized as (up to nth degree):

y = a + b1x + b2x² + … + bnxⁿ
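One way to see why this is still a linear model is to build the columns 1, x and x² by hand and run ordinary least squares on them. The following is a minimal sketch under stated assumptions: the data are noise-free values generated from y = 1 + 2x + 3x², and the names `x_demo`, `y_demo` and `X_design` are made up for this example.

```python
import numpy as np

# Noise-free made-up data from y = 1 + 2x + 3x^2 (an assumption, for illustration)
x_demo = np.array([-2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y_demo = 1 + 2 * x_demo + 3 * x_demo ** 2

# Design matrix with columns [1, x, x^2]: x^2 is treated as just another feature
X_design = np.column_stack([np.ones_like(x_demo), x_demo, x_demo ** 2])

# Ordinary least squares on the expanded features
coeffs, *_ = np.linalg.lstsq(X_design, y_demo, rcond=None)
a, b, c = coeffs

print(f"a = {a:.3f}, b = {b:.3f}, c = {c:.3f}")
```

Because the data contain no noise, the fit recovers the coefficients used to generate them. The same least-squares machinery is used as for a straight line; only the feature columns change.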

Modelling Polynomial Regression

To understand polynomial regression, we first generate a data set. The following code is used to generate a random set of values.

x = 2 - 3 * np.random.normal(0, 1, 20)
y = x - 2 * (x ** 2) + 0.5 * (x ** 3) + np.random.normal(-5, 5, 20)
plt.scatter(x,y, s=10)

First we apply the linear regression model. This step gives us an idea of the drawback of using a linear regression model in this case.

x = x[:, np.newaxis]
y = y[:, np.newaxis]
model = LinearRegression().fit(x, y)
y_pred = model.predict(x)
plt.scatter(x, y, s=10)
plt.plot(x, y_pred, color='r')

We observe a case of under-fitting here. The R² value is also calculated and found to be 0.605. To overcome this under-fitting, we increase the model complexity and fit a higher-order equation.

To convert the original features into their higher order terms we will use the PolynomialFeatures class provided by scikit-learn. Next, we train the model using Linear Regression.

# x and y are already column vectors from the previous step;
# apply x[:, np.newaxis] and y[:, np.newaxis] only when starting from the raw 1-D arrays
polynomial_features = PolynomialFeatures(degree=2)
x_poly = polynomial_features.fit_transform(x)
model = LinearRegression().fit(x_poly, y)
y_poly_pred = model.predict(x_poly)
rmse = np.sqrt(mean_squared_error(y, y_poly_pred))
r2 = r2_score(y, y_poly_pred)
print(rmse, r2)
plt.scatter(x, y, s=10)
# sort the values of x before line plot
sort_axis = operator.itemgetter(0)
sorted_zip = sorted(zip(x, y_poly_pred), key=sort_axis)
# keep x itself unchanged so the block can be re-run with another degree
x_sorted, y_poly_pred_sorted = zip(*sorted_zip)
plt.plot(x_sorted, y_poly_pred_sorted, color='m')
Degree 2

Similarly, degree 3 and another arbitrary degree 20 graphs are also plotted.

Degree 3
Degree 20
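The comparison across degrees can be sketched in a single loop. The snippet below is an illustration rather than the exact code behind the plots: it assumes a fixed random seed (not used in the original), reuses the same data-generating formula, and chains `PolynomialFeatures` and `LinearRegression` with scikit-learn's `make_pipeline`. The names `x_demo`, `y_demo`, `X_demo` and `train_r2` are made up for this example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed is an assumption, for repeatability
x_demo = 2 - 3 * rng.normal(0, 1, 20)
y_demo = x_demo - 2 * x_demo ** 2 + 0.5 * x_demo ** 3 + rng.normal(-5, 5, 20)
X_demo = x_demo[:, np.newaxis]  # scikit-learn expects a 2-D feature array

train_r2 = {}
for degree in (1, 2, 3, 20):
    # Expand x into polynomial features, then fit ordinary least squares
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_demo, y_demo)
    train_r2[degree] = model.score(X_demo, y_demo)  # R² on the training data
    print(f"degree {degree:>2}: training R^2 = {train_r2[degree]:.3f}")
```

Note that these are training scores. A higher training R² on its own does not show that a higher degree generalizes better.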

While observing these graphs, the key question that comes to mind is which curve is the best fit.

Degree 2 handles the under-fitting better than a simple linear regression model. However, the R² value can be improved even more.

Degree 3 follows the data points more closely than degree 2. This curve is the best fit in this case, with low variance and low bias.

Degree 20 passes through most of the data points. However, this is a case of over-fitting, so it will fail to generalize to unseen data.

To prevent over-fitting, we can add more training samples so that the algorithm doesn’t learn the noise in the system and can become more generalized.

To understand the best-fit line, the bias-variance trade-off must be understood.

Bias refers to the simplifying assumptions made by a model to make the target function easier to learn, and variance is the amount by which the estimate of the target function would change if different training data were used.

The goal of any supervised machine learning algorithm is to achieve low bias and low variance. In turn the algorithm should achieve good prediction performance.
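The trade-off can be estimated numerically by refitting a model on many freshly generated training sets and measuring how the predictions behave. The following is a rough sketch under stated assumptions: the design points `x_fixed`, the zero-mean noise with standard deviation 5, the seed and the choice of degrees 1, 3 and 9 are all made up for illustration, and the true function is the cubic used to generate the data earlier.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed is an assumption, for repeatability
x_fixed = np.linspace(-3, 3, 30)  # hypothetical fixed design points
f_true = x_fixed - 2 * x_fixed ** 2 + 0.5 * x_fixed ** 3  # noise-free cubic target
noise_sd = 5.0                    # assumed noise level
n_repeats = 200                   # number of simulated training sets

results = {}
for degree in (1, 3, 9):
    preds = np.empty((n_repeats, x_fixed.size))
    for i in range(n_repeats):
        # Draw a fresh noisy training set and refit the polynomial
        y_noisy = f_true + rng.normal(0, noise_sd, x_fixed.size)
        coeffs = np.polyfit(x_fixed, y_noisy, degree)
        preds[i] = np.polyval(coeffs, x_fixed)
    # Squared bias: how far the average prediction is from the true function
    bias_sq = np.mean((preds.mean(axis=0) - f_true) ** 2)
    # Variance: how much predictions move between training sets
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias_sq, variance)
    print(f"degree {degree}: bias^2 = {bias_sq:.2f}, variance = {variance:.2f}")
```

In a setup like this, the straight line typically shows the largest squared bias and the highest-degree polynomial the largest variance, with the cubic sitting in between on both counts.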

For the detailed code, head over to my GitHub repository on Regression!

This blog covers preliminary concepts of Linear and Polynomial Regression. Logistic Regression shall be covered in the next blog.



Rajwrita Nath

Women Techmakers Scholar 2020, DSC NSEC Lead, Moderator at Manning Publications Co.