Logistic Regression Implementation From Scratch—A Step By Step Approach

Ali Naeem Chaudhry
12 min read · Jun 20, 2024


Fig. 1

Suppose we want to detect the presence or absence of something like a cat or a spam message. To do so, we’d need a binary classifier that can differentiate between two classes, for instance, a cat classifier distinguishes between “CAT” and “NOT CAT”, and a spam classifier identifies “SPAM” and “NOT SPAM” classes.

If you are lucky enough and the features of your classes are well separated, like in Fig. 1, then you may be able to use the Logistic Regression model which defines a linear boundary (like the green dashed line in Fig. 1) to distinguish between the two classes.

The Decision Boundary

The example shown in Fig. 1 has a 1-D decision boundary (a line): each example has only two features, so the classes live in a 2-D plane and can be separated by a 1-D boundary, as we'll see. The equation of this decision boundary is given as:

Eqn. 1:  w₁𝑥₁ + w₂𝑥₂ + b = 0

Variables Description:

𝑥₁: Feature 1

𝑥₂: Feature 2

w₁: Weight value corresponding to Feature 1

w₂: Weight value corresponding to Feature 2

b: Bias or offset

The ideal decision boundary is one that separates the two classes completely, with all "CLASS 0" instances on one side of the boundary and all "CLASS 1" instances on the other. Realistically, however, this might not be possible, as Fig. 1 demonstrates: a few examples of one class are intermingled with the other and cannot be separated. Hence, an optimal decision boundary is one that correctly separates as many instances of the two classes as possible.

Fig. 2

Every point (𝑥₁, 𝑥₂) that lies on the decision boundary, as shown in Fig. 2(b), is described by Eqn. 1. This boundary divides the 𝑥₁𝑥₂-plane into two regions: the region below it and the region above it. The points lying in these two regions, when substituted into the equation of the boundary, produce outputs with opposite signs, as shown in Fig. 2(a) and (c), respectively.

But what is the point (𝑥₁, 𝑥₂)? It is the pair of Feature 1 and Feature 2 for a single example. Our goal is to find a decision boundary such that most of the feature pairs (𝑥₁, 𝑥₂) associated with "CLASS 0" lie below the boundary and those associated with "CLASS 1" lie above it, or vice versa. The equation of the boundary helps us navigate: its sign tells us whether a point lies on the expected side of the boundary or not.

Increasing the Dimensionality

Before talking about how to achieve an optimal decision boundary, let's generalize its dimensionality.

For better visualization and understanding, we've only considered two features, 𝑥₁ and 𝑥₂. In practice, however, the number of features may be far greater than two.

What do you think increasing the number of features will do to the decision boundary?

Because our classes had two features, we were able to plot them in a 2-D plane, and a 1-D decision boundary (a line) could separate them by defining two regions in which the two classes reside (Fig. 1).

Now suppose our classes have 3 features, 𝑥₁, 𝑥₂, and 𝑥₃, which have to be plotted in 3-D space. They can no longer be separated by a line, but they can be separated by a 2-D plane, as shown in the animated Fig. 3 below:

Fig. 3

The equation of a 2-D decision boundary (a plane) is given as:

Eqn. 2:  w₁𝑥₁ + w₂𝑥₂ + w₃𝑥₃ + b = 0

In general, two classes in an n-dimensional feature space can be distinguished using an (n−1)-dimensional hyperplane as the decision boundary. Beyond a 3-D feature space, an (n−1)-dimensional hyperplane can't be imagined or visualized like a 1-D line or a 2-D plane; however, its equation can still easily be expressed as:

Eqn. 3:  w₁𝑥₁ + w₂𝑥₂ + … + wₙ𝑥ₙ + b = 0

Using matrix notation, we can succinctly write the equation as:

Eqn. 4:  w = [w₁, w₂, …, wₙ]ᵀ,  𝑥 = [𝑥₁, 𝑥₂, …, 𝑥ₙ]ᵀ
Eqn. 5:  wᵀ𝑥 + b = 0

The idea that an (n−1)-dimensional hyperplane's equation (Eqn. 5) produces a positive output for the points (features) lying above it and a negative output for those lying below it carries over seamlessly to the n-dimensional feature space.

In Eqn. 5, notice that 𝑥 is the input feature vector, which is not under our control, whereas the weights in the weight vector w and the bias b are the decision boundary's parameters. We have to adjust them according to the training data points so that most of the points associated with the two classes end up on opposite sides of the decision boundary. After that, we can say our model is trained and can classify similar unseen data.
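To make this concrete, here is a minimal NumPy sketch (the function name and the example boundary are illustrative, not taken from the article's code) of evaluating wᵀ𝑥 + b and checking which side of the boundary a point falls on:

```python
import numpy as np

def linear_score(x, w, b):
    """Evaluate the hyperplane expression z = w^T x + b for one feature vector x."""
    return np.dot(w, x) + b

# Hypothetical 2-D example: the boundary x1 + x2 - 1 = 0
w = np.array([1.0, 1.0])
b = -1.0
print(linear_score(np.array([2.0, 2.0]), w, b))   # 3.0  -> lies on the positive side
print(linear_score(np.array([0.0, 0.0]), w, b))   # -1.0 -> lies on the negative side
```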

Later in this article, we discuss how we can adjust the values of w and b to achieve an optimal decision boundary.

The Sigmoid Function

Fig. 4

In logistic regression, whether the input feature vector 𝑥 belongs to "CLASS 0" or "CLASS 1" is reported as a probability using the sigmoid function, which is given as:

Eqn. 6:  σ(z) = 1 / (1 + e⁻ᶻ)

The sigmoid function approaches 0 as z → −∞, outputs 0.5 at z = 0, and approaches 1 as z → +∞. Since its range is from 0 to 1, it serves nicely as a probability function. We feed the output of the hyperplane equation (Eqn. 5) into the sigmoid function, which maps it to a probability:

ŷ = σ(wᵀ𝑥 + b)

By convention, the sigmoid output region between 0.5 and 1 corresponds to "CLASS 1" and the region between 0 and 0.5 belongs to "CLASS 0". So, the prediction can be made based on the following conditions: predict "CLASS 1" if ŷ ≥ 0.5 (equivalently, wᵀ𝑥 + b ≥ 0), and "CLASS 0" otherwise.

Consider the output of the sigmoid function as the probability that the input belongs to “CLASS 1”. This means that if the output of the sigmoid function is 0.8 then the model has a confidence of 0.8 that the input belongs to “CLASS 1” and, consequently, a confidence of 0.2 that it belongs to “CLASS 0”.
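A minimal sketch of the sigmoid function and the resulting class prediction, assuming the conventional 0.5 threshold described above (the function names are mine, not the article's notebook):

```python
import numpy as np

def sigmoid(z):
    """Map the hyperplane output z to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(x, w, b, threshold=0.5):
    """Return 1 ("CLASS 1") if the estimated probability meets the threshold, else 0."""
    y_hat = sigmoid(np.dot(w, x) + b)
    return int(y_hat >= threshold)
```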

Since the sigmoid function reports a probability, you may pick a decision threshold other than 0.5, based on your priority: whether you care more about the fraction of predictions made in favor of a particular class that turn out to be correct (precision), or about detecting most of the instances of that class (recall). Read more about this here:

The Likelihood and Loss Function

The probability that an input feature vector x belongs to "CLASS 1", given the model parameters w and b, is:

Eqn. 7:  P(y = 1 | 𝑥; w, b) = σ(wᵀ𝑥 + b) = ŷ

Conversely, the probability that an input feature vector x belongs to "CLASS 0", given the model parameters w and b, is:

Eqn. 8:  P(y = 0 | 𝑥; w, b) = 1 − σ(wᵀ𝑥 + b) = 1 − ŷ

Combining the two, the probability that an input feature vector x belongs to its class y (0 or 1), given the model parameters w and b, is:

Eqn. 9:  P(y | 𝑥; w, b) = ŷ^y · (1 − ŷ)^(1 − y)

Suppose y is the actual label of an example from the training dataset. Then Eqn. 9 gives the probability, estimated by the model, that the input belongs to its actual class y.

Hence, we want Eqn. 9 to produce a value close to 1, because that means the model is highly confident that a given input belongs to "CLASS 0" when it actually does, and, similarly, that the model outputs a high probability for "CLASS 1" when the actual label is also "CLASS 1".

Assuming the training examples are independent, we can extend this probability to the whole training dataset by using the multiplication rule of probability, which gives the likelihood function:

Eqn. 10:  𝓛(w, b) = ∏ᵢ₌₁ᵐ P(y⁽ⁱ⁾ | 𝑥⁽ⁱ⁾; w, b)

Variables Description:

y⁽ⁱ⁾: Label for iᵗʰ training example (0 or 1)

x⁽ⁱ⁾: iᵗʰ training example

m: Total number of training examples

The likelihood tells us how well the model's estimates align with the actual labels of the training examples, and our goal is to maximize it.

Fig. 5

It is more convenient to work with the log of the likelihood, which turns the product into a sum:

Eqn. 11:  𝓁(w, b) = log 𝓛(w, b) = Σᵢ₌₁ᵐ log P(y⁽ⁱ⁾ | 𝑥⁽ⁱ⁾; w, b)

Substituting from Eqn. 9:

Eqn. 12:  𝓁(w, b) = Σᵢ₌₁ᵐ log[ (ŷ⁽ⁱ⁾)^y⁽ⁱ⁾ · (1 − ŷ⁽ⁱ⁾)^(1 − y⁽ⁱ⁾) ]
Eqn. 13:  𝓁(w, b) = Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ],  where ŷ⁽ⁱ⁾ = σ(wᵀ𝑥⁽ⁱ⁾ + b)

Note that the likelihood, 𝓛(w, b), ranges from 0 to 1, which means the log-likelihood, 𝓁(w, b), ranges from −∞ to 0. Values of the likelihood close to zero indicate a severe discrepancy between the model's estimates and the actual labels of the training data; this poor performance translates into large negative values of the log-likelihood, 𝓁(w, b).

What if we negate the log-likelihood?

The range of the negative log-likelihood becomes 0 to +∞. Large positive values indicate a large error in the model's estimates, while values close to zero signify that the model's estimates align closely with the actual labels of the training data. This means the negative log-likelihood can be treated as an error, cost, or loss function, where large values mean large error and small values signal small error.

The loss function in logistic regression, aka Binary Cross Entropy, is the negative log-likelihood, normalized by the number of training examples.

Eqn. 14:  J(w, b) = −(1/m) · 𝓁(w, b)

Substituting from Eqn. 13:

Eqn. 15:  J(w, b) = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]

Normalizing by the number of training examples m gives the average loss and prevents the size of the training dataset from influencing the value of the loss. As a result, the performance of models trained on datasets of different sizes can be compared easily.
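Eqn. 15 can be written in a few lines of NumPy. This is a sketch that assumes X is an m × n matrix of training examples and y a vector of 0/1 labels; the small eps clamp is my addition to avoid log(0), not part of the derivation:

```python
import numpy as np

def binary_cross_entropy(X, y, w, b, eps=1e-12):
    """Average negative log-likelihood (Eqn. 15) over the m rows of X."""
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # model outputs for every example
    y_hat = np.clip(y_hat, eps, 1.0 - eps)       # guard against log(0)
    return -np.mean(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat))
```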

Finding the Optimal Decision Boundary

Our objective is to maximize the likelihood. We can achieve the same effect with an alternative and more convenient approach: minimizing the loss function.

From calculus, you may remember that at the minimum of a function, the gradient is zero. Notice that the loss function depends on the model parameters (w, b), and we can reach its minimum by varying these parameter values.

Let’s find the gradient of the loss function w.r.t the weight vector w and bias b so that we can reach its minimum using the gradient descent optimization algorithm.

The gradient of the loss function w.r.t. the jᵗʰ element of the weight vector w is given as:

Eqn. 16:  ∂J/∂wⱼ = −(1/m) Σᵢ₌₁ᵐ ∂/∂wⱼ [ y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]
Eqn. 17:  ∂J/∂wⱼ = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾/ŷ⁽ⁱ⁾ − (1 − y⁽ⁱ⁾)/(1 − ŷ⁽ⁱ⁾) ] · ∂ŷ⁽ⁱ⁾/∂wⱼ

The gradient of the sigmoid function is given as:

Eqn. 18:  dσ(z)/dz = σ(z) · (1 − σ(z))

To see the detailed derivation of the gradient of the sigmoid function, follow this link:

Substituting Eqn. 18 into Eqn. 17 and applying the chain rule, with z⁽ⁱ⁾ = wᵀ𝑥⁽ⁱ⁾ + b and ŷ⁽ⁱ⁾ = σ(z⁽ⁱ⁾):

Eqn. 19:  ∂ŷ⁽ⁱ⁾/∂wⱼ = σ(z⁽ⁱ⁾) · (1 − σ(z⁽ⁱ⁾)) · ∂z⁽ⁱ⁾/∂wⱼ
Eqn. 20:  ∂z⁽ⁱ⁾/∂wⱼ = 𝑥ⱼ⁽ⁱ⁾
Eqn. 21:  ∂J/∂wⱼ = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾/ŷ⁽ⁱ⁾ − (1 − y⁽ⁱ⁾)/(1 − ŷ⁽ⁱ⁾) ] · ŷ⁽ⁱ⁾ (1 − ŷ⁽ⁱ⁾) · 𝑥ⱼ⁽ⁱ⁾
Eqn. 22:  ∂J/∂wⱼ = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ (1 − ŷ⁽ⁱ⁾) − (1 − y⁽ⁱ⁾) ŷ⁽ⁱ⁾ ] · 𝑥ⱼ⁽ⁱ⁾
Eqn. 23:  ∂J/∂wⱼ = (1/m) Σᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾) · 𝑥ⱼ⁽ⁱ⁾

Similarly, the gradient of the loss function w.r.t. the bias b can be computed as:

Eqn. 24:  ∂J/∂b = −(1/m) Σᵢ₌₁ᵐ ∂/∂b [ y⁽ⁱ⁾ log ŷ⁽ⁱ⁾ + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾) ]
Eqn. 25:  ∂J/∂b = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾/ŷ⁽ⁱ⁾ − (1 − y⁽ⁱ⁾)/(1 − ŷ⁽ⁱ⁾) ] · ∂ŷ⁽ⁱ⁾/∂b

Substituting Eqn. 18 into Eqn. 25 and applying the chain rule:

Eqn. 26:  ∂ŷ⁽ⁱ⁾/∂b = σ(z⁽ⁱ⁾) · (1 − σ(z⁽ⁱ⁾)) · ∂z⁽ⁱ⁾/∂b
Eqn. 27:  ∂z⁽ⁱ⁾/∂b = 1
Eqn. 28:  ∂J/∂b = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾/ŷ⁽ⁱ⁾ − (1 − y⁽ⁱ⁾)/(1 − ŷ⁽ⁱ⁾) ] · ŷ⁽ⁱ⁾ (1 − ŷ⁽ⁱ⁾)
Eqn. 29:  ∂J/∂b = −(1/m) Σᵢ₌₁ᵐ [ y⁽ⁱ⁾ (1 − ŷ⁽ⁱ⁾) − (1 − y⁽ⁱ⁾) ŷ⁽ⁱ⁾ ]
Eqn. 30:  ∂J/∂b = (1/m) Σᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
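Eqn. 23 and Eqn. 30 can be computed for all weights at once in vectorized form. A sketch, under the same assumptions as before (X is m × n, y holds 0/1 labels; function names are illustrative):

```python
import numpy as np

def gradients(X, y, w, b):
    """Gradients of the loss w.r.t. w (Eqn. 23) and b (Eqn. 30), over all examples."""
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # model outputs
    error = y_hat - y                            # (y_hat - y) term for each example
    dw = (X.T @ error) / m                       # one entry per weight w_j
    db = np.sum(error) / m
    return dw, db
```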

Gradient Descent

Fig. 6

The loss function may look like the curve in Fig. 6, and we want to reach its bottom, where the loss is minimal. The gradient helps us approach this minimum point. As you can see in the figure, the gradient is negative to the left of the minimum point, positive to its right, and zero at the minimum point itself.

Suppose you are to the left of the minimum point; then, to approach it, you have to move to the right, which amounts to increasing the value of wⱼ or b. Similarly, if you are to the right of the minimum point, you have to move to the left, which means decreasing the value of wⱼ or b. In both cases you move in the direction opposite to the gradient; this process of moving against the gradient to reduce the loss is known as gradient descent. Mathematically:

Eqn. 31:  wⱼ := wⱼ − α · ∂J/∂wⱼ
Eqn. 32:  b := b − α · ∂J/∂b

Here α is the learning rate, which determines the step size taken by wⱼ and b toward the minimum point. Its value typically lies between 0 and 1.

After updating the model parameters wⱼ and b, the loss should decrease, since you have taken a step toward the minimum point of the loss function.
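A single gradient-descent update (Eqn. 31 and Eqn. 32) might then look like the following sketch; the function name and the default alpha are placeholders of my own choosing:

```python
import numpy as np

def gradient_descent_step(X, y, w, b, alpha=0.1):
    """One update of w and b against the gradient (Eqn. 31 and Eqn. 32)."""
    m = X.shape[0]
    y_hat = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    error = y_hat - y
    w = w - alpha * (X.T @ error) / m    # Eqn. 31, using Eqn. 23 for the gradient
    b = b - alpha * np.sum(error) / m    # Eqn. 32, using Eqn. 30 for the gradient
    return w, b
```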

Training the Logistic Regression Model

Before starting the training process, we'll normalize the feature values (all the values of 𝑥 in the training dataset) so that they lie on a common scale, which helps the model converge smoothly. There are several normalization techniques; we'll go with the Min-Max scaler, which maps each feature to the range 0 to 1:

Eqn. 33:  𝑥ₙₒᵣₘ = (𝑥 − 𝑥ₘᵢₙ) / (𝑥ₘₐₓ − 𝑥ₘᵢₙ)
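A short sketch of the Min-Max scaler (Eqn. 33); the parameters 𝑥ₘᵢₙ and 𝑥ₘₐₓ are computed per feature from the training set only and reused for the test set, as step 2 of the list below emphasizes (function names are illustrative):

```python
import numpy as np

def min_max_fit(X_train):
    """Compute the per-feature minimum and maximum from the training set only."""
    return X_train.min(axis=0), X_train.max(axis=0)

def min_max_transform(X, x_min, x_max):
    """Apply Eqn. 33: scale each feature to the range [0, 1]."""
    return (X - x_min) / (x_max - x_min)
```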

Following are the steps to train the logistic regression model (a code sketch of the full loop follows the list):

  1. Prepare the training and test datasets.
  2. Normalize all the features of the training dataset to bring them on a standard scale using Eqn. 33. Use the same parameters calculated from the training set (𝑥ₘᵢₙ and 𝑥ₘₐₓ) to normalize the test set.
  3. Start with random values of weight vector and bias, typically around zero.
  4. Calculate the loss (binary cross-entropy) on the training dataset using Eqn. 15.
  5. Calculate the gradient of loss w.r.t all the weights and the bias using Eqn. 23 and Eqn. 30, respectively.
  6. Update all the weight values in the weight vector and the bias simultaneously using Eqn. 31 and Eqn. 32, respectively.
  7. Go back to step 4 and continue until the loss stops showing significant improvement, or repeat for a fixed number of iterations.
  8. Test the performance of the model by calculating the loss, accuracy, precision, or recall using the test set.
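Putting the steps together, here is a minimal end-to-end training sketch under the same assumptions as the earlier snippets; the function and variable names are mine, and the dataset preparation of step 1 is not shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, n_iters=1000):
    """Steps 3-7: initialize parameters, then repeatedly compute gradients and update."""
    m, n = X.shape
    w = np.zeros(n)   # step 3: start around zero
    b = 0.0
    for _ in range(n_iters):
        y_hat = sigmoid(X @ w + b)
        error = y_hat - y
        w -= alpha * (X.T @ error) / m   # Eqn. 31 with Eqn. 23
        b -= alpha * np.sum(error) / m   # Eqn. 32 with Eqn. 30
    return w, b

# Usage sketch: normalize with the training-set min/max, train, then evaluate accuracy.
# X_train, y_train, X_test, y_test = ...   # step 1 (dataset preparation not shown)
# x_min, x_max = X_train.min(axis=0), X_train.max(axis=0)
# w, b = train_logistic_regression((X_train - x_min) / (x_max - x_min), y_train)
# y_pred = (sigmoid(((X_test - x_min) / (x_max - x_min)) @ w + b) >= 0.5).astype(int)
# accuracy = np.mean(y_pred == y_test)     # step 8
```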

Results

I’ve implemented the logistic regression model on the example discussed in this article, i.e., binary classification with a 3-D feature space. Below are the results and the link to the code:

Fig. 7
Fig. 8

The accuracy on the test set comes out to be 100%; however, this was only possible because the contrived data points were very simple and easily separable by a linear decision boundary.

Find the code here:


Ali Naeem Chaudhry

An enthusiastic Machine Learning Engineer with a passion for gaining deep insights into the nitty-gritty of AI and imparting them to others.