Understanding and Writing your own Machine Learning Algorithms: Linear and Logistic Regression

Gregory Simon · Analytics Vidhya · Jan 21, 2020 · 5 min read

If you have built a model for either a continuous outcome (such as price) or a classification problem, chances are you’ve come across Linear and Logistic Regression. These are probably the two most popular machine learning algorithms today; they are available in many libraries (such as scikit-learn in Python) and can be imported and run in just a few lines of code. However, the simplicity of this process means it is very easy to build these models without a true understanding of the underlying mechanisms.

Before relying on these algorithms, you should understand the basics of how they work, where you can use them and where you can’t. Once you have that grounding, implementing your own version is a really good way to deepen your understanding of what they actually do.

In this blog post I will explain the intuition and the mathematics behind these well-known algorithms, and provide a way for you to write and test them yourself to gain a better understanding of what is going on under the hood. This also means you will be able to optimise your algorithm for a specific scenario.

The following can be done in a variety of languages depending on what you are most comfortable with. I performed the calculations using NumPy (Python) and did the initial implementation in Octave, which is similar to MATLAB but free for those who can’t afford a licence. Octave files can be run from Python with the Oct2Py library.
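If you go the Octave route, a minimal sketch of calling an Octave function from Python via Oct2Py might look like the following. The function name gradientDescent.m and the toy data are placeholders for your own files and inputs:

```python
import numpy as np
from oct2py import octave

# Directory containing your .m files, e.g. gradientDescent.m
octave.addpath('.')

# Toy data: m x (n+1) feature matrix with a leading column of ones
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([[2.0], [2.5], [3.5]])
theta = np.zeros((2, 1))

# Call the Octave function as if it were a Python function
theta = octave.gradientDescent(X, y, theta, 0.01, 1500)
```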

Linear Regression

Linear Regression is used to predict a continuous outcome variable y, based on a combination of features X and corresponding weights/coefficients θ. We call our prediction the hypothesis, h_θ(x), and calculate the cost function, J(θ), for which the Mean Squared Error is most commonly used.

Our prediction h_θ(x) and the associated cost function/error term, where n is the number of features and m is the number of training examples. We set x_0 = 1 so that θ_0 is our intercept.
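In this notation, the hypothesis and the Mean Squared Error cost function are:

```latex
h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \dots + \theta_n x_n = \theta^T x, \qquad x_0 = 1

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```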

Because J(θ) is a sum of squared terms, the cost function takes the shape of a parabola (a convex bowl when there is more than one coefficient), and its minimum is where the error is smallest. The factor of 1/2m is there to make the maths easier when we differentiate to find the rate of change of the error term: the 2 that comes down from the square cancels the 1/2.

Minimising the cost function by varying θ.

Now here comes the machine learning. We essentially calculate the error in our prediction and change the value of each coefficient in the direction that reduces that error. This works out nicely because we can use the partial derivative of our cost function with respect to each coefficient, holding the other coefficients constant. All coefficients are then updated together, which is known as the simultaneous update:

The simultaneous update for our coefficients, where alpha is the learning rate and J is our cost function.
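For the Mean Squared Error cost, each coefficient θ_j is updated as follows, with every θ_j computed from the values held at the previous iteration:

```latex
\theta_j := \theta_j - \alpha \frac{\partial J(\theta)}{\partial \theta_j}
          = \theta_j - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)}
```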

This is a convenient equation: as θ approaches the minimum, the gradient dJ/dθ gets smaller, which means smaller and smaller steps are taken automatically as we close in on the minimum. The value of alpha is the learning rate and must be chosen carefully, both so that we don’t overshoot the minimum and so that the algorithm doesn’t take too long to run.

Too large an alpha (fast learning rate) results in the cost function blowing up, while too small an alpha (slow learning rate) means gradient descent converges too slowly to reach the minimum in a reasonable number of iterations.

Vectorisation

In order to predict from multiple features, we can use vectorisation to make the computation efficient with only minor adjustments to our code. This is the process of arranging our features, X, into a matrix and our coefficients, θ, into a vector.
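With X stored as an m × (n+1) matrix (one training example per row, the first column all ones), y as an m-vector and θ as an (n+1)-vector, the whole simultaneous update collapses into a single expression:

```latex
\theta := \theta - \frac{\alpha}{m} X^{T} \left( X\theta - y \right)
```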

Below is the code, written in Octave and then Python, for the simultaneous update:

The entire gradient descent algorithm, vectorised, in Octave.
Simultaneous update, vectorised, in Python.
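A minimal NumPy version of this vectorised update might look something like the following; the function name gradient_descent and the toy data at the end are illustrative:

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Vectorised batch gradient descent for linear regression.

    X     : (m, n+1) feature matrix with a leading column of ones
    y     : (m, 1) vector of targets
    theta : (n+1, 1) vector of initial coefficients
    """
    m = len(y)
    J_history = []
    for _ in range(num_iters):
        error = X @ theta - y                          # (m, 1) residuals, h_theta(x) - y
        J_history.append((error.T @ error).item() / (2 * m))  # MSE cost at current theta
        theta = theta - (alpha / m) * (X.T @ error)    # simultaneous update of all theta_j
    return theta, J_history

# Illustrative usage on a tiny synthetic data set where y = 2 + 3x
X = np.c_[np.ones(5), np.arange(5, dtype=float)]       # add the x_0 = 1 intercept column
y = (2 + 3 * np.arange(5, dtype=float)).reshape(-1, 1)
theta, costs = gradient_descent(X, y, np.zeros((2, 1)), alpha=0.1, num_iters=1000)
print(theta.ravel())                                   # should be close to [2, 3]
```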

Logistic Regression

When we need to predict discrete values we use Logistic Regression. A lot of people would call Logistic Regression a classifier, but strictly speaking it isn’t: it predicts a probability, and it only becomes a classifier once we apply a decision threshold to that probability. This will become clear as we go through the maths.

In the case of Logistic Regression our hypothesis takes on a different form:

The equation for the sigmoid function.
The sigmoid function with decision boundary set at 0.5.
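In symbols, the hypothesis passes the linear combination of the features through the sigmoid (logistic) function:

```latex
h_\theta(x) = g(\theta^T x), \qquad g(z) = \frac{1}{1 + e^{-z}}
```

With the decision boundary set at 0.5, we predict y = 1 whenever h_θ(x) ≥ 0.5, which is exactly when θᵀx ≥ 0.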

The cost function changes because we are now using the sigmoid function to convert our prediction into a probability between 0 and 1. If we were to square the error, as in the MSE for Linear Regression, we would end up with a non-convex cost function with many local minima, and gradient descent could fail to find the global minimum.

Instead we use the Cross-Entropy or Log Loss function below:

The two cases of the cost function can be combined into one equation where, depending on whether y is 1 or 0, one of the two terms disappears.
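Per training example the cost is −log(h_θ(x)) when y = 1 and −log(1 − h_θ(x)) when y = 0, and the combined form averaged over all m examples is:

```latex
J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log\!\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\!\left(1 - h_\theta(x^{(i)})\right) \right]
```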

This formulation also conveniently penalises confident wrong predictions most heavily: as the predicted probability approaches the wrong class, the cost grows towards infinity.

Log-Loss is on the y-axis and our prediction is on the x-axis. Taken from the Stanford Machine Learning course on Coursera, by Andrew Ng.

Now we can put this into code:

Regularised Logistic Regression, vectorised, in Octave.
The vectorised algorithm for Regularised Logistic Regression in Python.
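As a rough NumPy sketch of what such a function can look like, assuming L2 regularisation with strength lambda that leaves the intercept θ_0 unpenalised (the usual convention, and the one used in the Coursera course referenced above); the function and variable names are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularised_logistic_regression(X, y, theta, alpha, lam, num_iters):
    """Vectorised gradient descent for L2-regularised logistic regression.

    X     : (m, n+1) feature matrix with a leading column of ones
    y     : (m, 1) vector of 0/1 labels
    theta : (n+1, 1) vector of initial coefficients
    lam   : regularisation strength; theta_0 (the intercept) is not regularised
    """
    m = len(y)
    for _ in range(num_iters):
        h = sigmoid(X @ theta)              # predicted probabilities, (m, 1)
        grad = (X.T @ (h - y)) / m          # gradient of the cross-entropy cost
        reg = (lam / m) * theta             # L2 penalty term
        reg[0] = 0.0                        # do not regularise the intercept
        theta = theta - alpha * (grad + reg)
    return theta

# Turning the probabilities into class labels with a 0.5 decision boundary:
# predictions = (sigmoid(X @ theta) >= 0.5).astype(int)
```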

So there you are, your own Linear and Logistic Regression algorithms! I would still recommend sticking with the pre-built models, as they have been tried and tested across many different use cases, although you may find a scenario where it pays to optimise and use your own.

Hopefully this has provided a useful understanding of how these frequently used models work and how you might debug one. If you’ve found this helpful or interesting please leave a comment!
