Designing Linear Regression from scratch

Rituparna Gupta · Published in The Startup · Jan 18, 2019

Objective

In this notebook, we will construct the Linear Regression algorithm from scratch, using Linear Algebra principles, with Python as the programming language.

The idea behind Linear Regression

To begin understanding how Machine Learning algorithms work internally, Linear Regression often serves as the most logical & intuitive starting point, since it is based on basic Linear Algebra.

In Linear Algebra, a line in a coordinate system is defined by the equation:

y = mx + c

where,
m is the coefficient (also called the slope/gradient of the line, or the rate at which y varies with x)
c is the intercept, which denotes the point at which the line intersects the y-axis

Provided m & c are given, the equation can be used to find the values of y corresponding to a given set of values of x.
An alternative way to think about this is to ask whether there exist values of m & c for which a linear relationship can be established between x & y. If yes, then one variable (y, the dependent/response variable) can be said to be linearly related to the other variable (x, the independent/predictor variable).
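For instance, a minimal sketch in Python (the slope & intercept here are toy values, assumed purely for illustration):

m, c = 2.0, 1.0  # toy slope & intercept
xs = [0.0, 1.0, 2.0, 3.0]
ys = [m * x + c for x in xs]  # y = mx + c for each x
print(ys)  # [1.0, 3.0, 5.0, 7.0]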

In Machine Learning, the same concept can be used to determine the relationship between real-world entities.
For instance, suppose we are given the attributes “weight” & “height” for a particular set of people at a certain age, and the objective is to find out whether their weight is related to their height. Applying the Linear Regression algorithm here would help determine whether there are values of m & c for which height is linearly related to weight.

How it works:

The Linear Regression algorithm starts with the assumption that there IS a linear relationship between the 2 given variables; this is called the initial hypothesis.
The algorithm then attempts to find the values of the coefficient ‘m’ & the intercept ‘c’. However, for real-world entities, absolute linearity is rarely possible, so the objective is to find values for which the relationship between x & y is linear to the maximum possible extent; that is, the resulting line should be as close as possible to all the given data points. There are different methods for optimizing this; in this notebook we will explore the Ordinary Least Squares technique, whose objective is sketched below.
The calculated values can then be used to predict the response variable (in Predictive Analytics). In this way, the algorithm determines whether the initial hypothesis was correct, i.e. whether (at all) or how strongly the given variables are linearly related to one another.
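To make “as close as possible” concrete, Ordinary Least Squares scores a candidate line by its sum of squared residuals & then picks the line that minimizes it. A minimal sketch, with toy data assumed purely for illustration:

# Sum of squared residuals for a candidate line y = m*x + c
def sum_squared_error(m, c, xs, ys):
    return sum((y - (m * x + c)) ** 2 for x, y in zip(xs, ys))

xs, ys = [1.0, 2.0, 3.0], [2.1, 3.9, 6.2]  # toy, roughly linear data
print(sum_squared_error(2.0, 0.0, xs, ys))  # 0.06; OLS finds the (m, c) minimizing this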

Designing the algorithm

About the data

We will use the Wine Quality dataset available at the UCI datasets repository for this exercise: https://archive.ics.uci.edu/ml/datasets/Wine+Quality
The dataset lists recorded values for different attributes which contribute to the quality of Portuguese “Vinho Verde” wine. The analysis aims to find out whether there is a linear relationship between any of those attributes & the target variable “quality”.

Here, since the objective is to build the algorithm from scratch, we will use only one of the attributes (univariate regression) as the predictor variable (Volatile Acidity), & try to determine its relationship to the response variable (Wine Quality).
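The code below assumes the data has already been loaded into a DataFrame df. A minimal loading sketch, assuming a local copy of the red-wine file (winequality-red.csv, which is semicolon-separated) & renaming the columns so that “volatile acidity” becomes the identifier volatile_acidity used below:

import numpy as np
import pandas as pd

# Assumed local copy of the UCI red-wine file; values are semicolon-separated
df = pd.read_csv("winequality-red.csv", sep=";")
# Replace spaces in column names, e.g. "volatile acidity" -> "volatile_acidity"
df.columns = df.columns.str.replace(" ", "_")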

Steps:

  • Determine the coefficient & intercept based on the Ordinary Least Squares criterion
  • Use those values to predict the response variable
  • Calculate accuracy, using the R-squared metric
  • Implement the same using the Linear Regression models available in the Python libraries statsmodels & scikit-learn
  • Compare the accuracy of this model with the accuracy from the statsmodels & scikit-learn implementations

Determine the coefficient & intercept using OLS

Ordinary Least Squares is a method for estimating the unknown parameters in a linear regression model, with the goal of minimizing the sum of the squares of the differences between the points on the resulting line & the actual data points.

Per this method,
the coefficient is given by the formula:

m = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)²

the intercept is given by the formula:

c = ȳ − m·x̄

where x̄ & ȳ denote the means of x & y respectively.

We will now apply these formulas to our dataset to determine the coefficient (m) & the intercept (c):

xm=df["volatile_acidity"].mean()
ym=df["quality"].mean()
m = np.sum((df["volatile_acidity"] - xm) * (df["quality"] - ym))/np.sum(np.square(df["volatile_acidity"] - xm))
c = ym - b1*xm

On running the above code, we get the following:
Coefficient: -1.761437780112675
Intercept: 6.565745506471793
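As a quick sanity check (not part of the original steps), numpy can fit the same line by least squares; a degree-1 polyfit should reproduce the values above:

# Degree-1 polynomial fit = a straight line fitted by least squares
m_np, c_np = np.polyfit(df["volatile_acidity"], df["quality"], 1)
print(m_np, c_np)  # expected to match: -1.7614... & 6.5657...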

Predict the response variable “quality”

Now, use the coefficient & intercept values obtained above to predict wine quality:

# y=mx + c
df["pred"] = m*df["volatile_acidity"] + c

Calculate accuracy (R-squared metric)

The R-squared metric is given by the formula:

R² = 1 − RSS / TSS

where,
RSS = Residual Sum of Squares = Σ eᵢ²
TSS = Total Sum of Squares = Σ (yᵢ − ȳ)²

and eᵢ = ŷᵢ − yᵢ represents the ith residual, that is, the difference between the ith predicted value & the ith observed response.
We will use the above formula to calculate R-squared for our predictions:

# R² = 1 − RSS/TSS
r2 = 1 - (np.sum(np.square(df["pred"] - df["quality"])) / np.sum(np.square(df["quality"] - ym)))

R-squared: 0.15253537972475092

statsmodels Linear Regression

Now let’s implement the same using statsmodels & scikit-learn.

from statsmodels.formula.api import ols

X = df[["volatile_acidity"]]
y = df[["quality"]]

# Fit quality ~ volatile_acidity via statsmodels' formula API
model = ols("quality ~ volatile_acidity", data=df)
model = model.fit()
predictions = model.predict(X)
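The summary values below can be read off the fitted results object; a sketch of how one might print them:

# Slope & intercept live in model.params, indexed by term name
print("Coefficient:", model.params["volatile_acidity"])
print("Intercept:", model.params["Intercept"])
print("R-squared:", model.rsquared)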

Model Summary:

Coefficient: -1.761438
Intercept: 6.565746
R-squared: 0.15253537972474862

scikit-learn Linear Regression

from sklearn import linear_model

# Fit the same univariate model with scikit-learn
lm = linear_model.LinearRegression()
model1 = lm.fit(X, y)
predictions1 = lm.predict(X)
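Again, a sketch of how the values below can be obtained from the fitted estimator (since X & y are DataFrames, coef_ & intercept_ come back as arrays):

print("Coefficient:", lm.coef_[0][0])
print("Intercept:", lm.intercept_[0])
print("R-squared:", lm.score(X, y))  # LinearRegression.score returns R²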

Coefficient: -1.76143778
Intercept: 6.56574551
R-squared: 0.1525353797247485

Conclusion

As we can see, the values agree across the different implementations:

  • Coefficient: -1.761
  • Intercept: 6.566
  • R-squared: 0.153

This shows that our own implementation of Linear Regression is consistent with the library-defined models.
Hence, we can conclude that we have successfully built a Linear Regression model from scratch.

