Logistic Regression: The Math and The Code

A Deep Down look into Logistic Regression, the basic element of Neural Networks

Gautham Sreekumar

Published in

ADGVIT

9 min readSep 23, 2020

Logistic Regression is a machine learning technique used in classification problems.

When we try to ignore the fact that the dependent variable (y) in a dataset is discrete-valued and uses a linear regression algorithm to predict y using the independent variable (X), then the output will be highly influenced by the outliers. This, in turn, affects the threshold value(Decision boundary) of y which determines the class to which it belongs.

Outliers: “ An outlier is an observation that lies an abnormal distance from other values in a random sample from a population ”

The logistic regression is used when the dependent variable (target) is categorical. The model uses the sigmoid function to squeeze the output of a linear equation between 0 and 1, which can then be mapped to two or more discrete classes. Every real value can be mapped to a value between 0 and 1, which signifies the probability of belonging to a class.

In deep learning, the two frequently used terms are Activations and Parameters. Activations are the result of either a matrix multiplication or an activation function, and parameters are the numbers inside the matrix that we multiply.

𝛉 is regression parameter.

The Math

Let us assume that we have a binary classification problem (y = 0 or 1). Then, the probabilities that the class y = 1 or y = 0 given x is:

combining the above two equations,

Let there are N independent samples, then the Likelihood, L(𝛉) of the parameters is

Where, 𝑝(𝑦 |𝑥; 𝜃) is read as “ Probability of y given x and parametrized by θ ”, i.e., the probability that the function takes the value of y, at a given value of x and θ.

For the easiness of computation, we take the log:

Natural logarithm is a monotonically increasing function and it ensures that the maximum value of both the log of the probability and the original probability function occurs at the same point. Therefore, we use the simplier log-likelihood.

Maximizing the log-likelihood will give the best values for the parameters, which is the same as minimizing the negative of this function. This loss function, J(𝛉) is typically known as the cross-entropy loss function (also known as negative log-loss). Since scaling a function does not change its maximum or minimum point, usually, we divide the cross-entropy loss by the total number of samples.

To compute the partial derivatives of loss function w.r.t parameters in the network, we use Backpropagation.

The Matrix form of the variables is be used hereafter in all equations.

The derivative of the loss function with respect to its input can be calculated as:

The derivative of the sigmoid function with respect to its input is:

The derivative of the linear equation with respect to its weights and bias is:

Note: The gradient of any matrix multiplication AB w.r.t A is Transpose of B.

So, the derivative of the loss function with respect to parameters (weights and biases) is calculated using the chain rule:

Gradient descent algorithm is used for finding out a minimum of a differentiable function by taking steps proportional to the negative of the gradient of the function at the current point. We use mini-Batch Gradient Descent, which takes random samples (eg., 64) from the training dataset for computing gradients.

The parameter θ is updated by adding the product of the negative of the gradient of the loss function w.r.t parameters and the learning rate(𝛼):

In logistic regression, this gradient step can be represented as follows:

Interpretation of Logistic Regression

Logistic Regression assumes that the classes are almost/perfectly linearly separable, which means it can be divided with the help of a straight line (in 2D) or a plane/hyperplane (3D or more).

Consider the sigmoid function g(z) with a decision boundary of 0.5.

If g(z) ≥ 0.5, then that point is classified as +ve class.
If g(z) < 0.5, then it is classified as -ve class.
To get g(z) >= 0.5, z should be greater than or equal to 0.
To get g(z) < 0.5, z should be less than or equal to 0.

Consider two points, Xi and Xj, and the normal, W to the plane z.

The distance of the point Xi and Xj from the plane z is

Hence, we can say that the distance of a point, Xi from the plane z is the value of z when x = Xi.

So,

When the value of z > 0 if x= Xi, the point Xi is lying above z.
When the value of z < 0 if x= Xi, the point Xi is lying below z.

After finding the best fitting plane i.e., z, for the data points, the prediction is done as

All the points above the plane z, i.e., z ≥ 0 is classified as +ve class.
All the points below the plane z, i.e., z < 0 is classified as -ve class.

The Code: Implementation of Logistic Regression

class Module:
    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out
    
    def forward(self): 
        raise Exception('not implemented')
    
    def backward(self): 
        self.bwd(self.out, *self.args)

Class Module

It is an abstract class. The __call__ method takes arguments from the user and saves it to the ‘args’ variable. Then, it calls the forward function and stores the return value in the ‘out’ variable, and finally, returns the ‘out’ value. Forward and backward methods are used for forwarding and backward pass of neural networks. All the layers are defined by inheriting the Module class.

The __call__ method in python enables to write classes where its instances behave like functions. Instances with __call__ methods are said to be callable objects.

Class Linear

class Linear(Module):
    def __init__(self, m, out):
        #Xavier Initialization
        self.W = torch.randn(m, out)*math.sqrt(2/(m+out)) 
        self.b = torch.zeros(out)
        
    def forward(self, inp): 
        return inp @ self.W + self.b
    
    def bwd(self, out, inp):
        inp.g = out.g @ self.W.T
        self.W.g = inp.T @ out.g
        self.b.g = out.g.sum(0)

The Weights W in a layer is initialized using a technique called Xavier initialization. The random normal distribution is multiplied by a factor of 2/(n_in + n_out)to avoid the gradient vanishing/explosion problem, where n_in, n_out are the number of features and out features in the layer respectively. The bias matrix b is initialized to zero. The forward method of the linear layer is the sum of the matrix multiplication of the input matrix with the weight matrix and the bias matrix.

torch.matmul() or @ in pyTorch is used for matrix multiplication.

Class Sigmoid

class Sigmoid(Module):
    def sigmoid(self, x):
        return 1/(1+ torch.exp(-x))    
    
    def forward(self, x): 
        return self.sigmoid(x)
    
    def bwd(self, out, inp): 
        inp.g = (self.sigmoid(inp)*(1-self.sigmoid(inp))) * out.g

Class Cross-Entropy

class Cross_Entropy(Module):
    
    def __init__(self, model):
        self.model = model    def cross_entropy(self, y_pred, y):
        return -(y*torch.log(y_pred) + (1-y)*torch.log(1-y_pred)).mean()
    
    def forward(self, y_pred, y):
        return self.cross_entropy(y_pred.view(-1), y)
    
    def bwd(self, out, inp, targ):
        inp.g = ((1-targ.unsqueeze(-1))/(1-inp) - targ.unsqueeze(-1)/inp)/targ.shape[0]
        self.model.backward()

Backward Pass

As we all know for the backward pass, we use the chain rule to compute gradients of loss function w.r.t parameters of the model.

The variable inp is the input of the current layer (i.e., the output of the previous layer), and out is the output of the current layer (i.e., input of the next layer).

Forward and Backward Pass through Sigmoid Function

In bwd method of class Cross-Entropy, the gradient of loss w.r.t the output of the previous layer is stored in attribute g of the previous layer(inp), i.e., inp.g = ((1-targ.unsqueeze(-1))/(1-inp) — targ.unsqueeze(-1)/inp)/targ.shape[0] where, targ is the target variable and .unsqueeze(-1) adds dimension to the matrix.

Similarly, in class Sigmoid, the product of the gradient of sigmoid w.r.t its inputs, and the gradient of its next layer is stored, i.e., ing.g = (sigmoid(inp) * (1-sigmoid(inp))) * out.g

In general, we compute the gradient of the current layer w.r.t the output of previous layer and multiply it with the gradient of its next layer (out.g). Then it is stored in the attribute g of the previous layer (inp).

In class Linear, the gradient of matrix multiplication w.r.t the Weight matrix is calculated and stored in W.g , and b.g is the gradient of the next layer summed over the columns (axis=0). Also, the matrix multiplication out.g with the transpose of the Weight matrix is stored ininp.g.

inp.g = out.g @ self.W.T
self.W.g = inp.T @ out.g
self.b.g = out.g.sum(0)

The method ‘step’ in the Class SGD is used to update the parameters of the linear layer. The variable lr is a hyper parameter called learning rate, which we need to use trial and error method to find the best value.

class SGD:
    def __init__(self, model, lr=0.01):
        self.lr = lr
        self.model = model
        
    def step(self):
        self.model.linear.W -= self.model.linear.W.g * self.lr
        self.model.linear.b -= self.model.linear.b.g * self.lr

Logistic Regression class is created by sequentially combining the linear and sigmoid layers, also we define the forward and backward methods.

class LogisticRegression:
    
    def __init__(self, in_features, out_features):
        self.linear = Linear(in_features, out_features)
        self.sigmoid = Sigmoid()    def forward(self, x):
        return self.sigmoid(self.linear(x))
    
    def backward(self):
        self.sigmoid.backward()
        self.linear.backward()    def __call__(self, *args):
        self.args = args
        self.out = self.forward(*args)
        return self.out

Then we initialize the logistic regression model, optimizer, and the loss function with appropriate hyperparameters.

model = LogisticRegression(30,1)
optimizer = SGD(model, lr=1e-2)
loss_func = Cross_Entropy(model)

The model is then trained by calling the fit function with the number of epochs as an argument. A fit function performs forward pass, computation of loss, backward pass, and updation of the parameters. Then it validates the model by computing the loss using the validation dataset.

def fit(epochs):
    train_loss = []
    valid_loss = []
    for epoch in range(epochs):
        for X, y in trainset:
            ## forward pass
            output = model(X)
            ## Computing loss
            loss = loss_func(output, y)
            ## Backward pass
            loss_func.backward()
            ## Updating the parameters
            optimizer.step()
        train_loss.append(loss)
        
        ## Validating the model
        val_loss= []
        for X_val, y_val in validset:
            out = model(X_val)
            vloss = loss_func(out, y_val)
            val_loss.append(vloss)
        valid_loss.append(torch.stack(val_loss).mean())
        print(f'Epoch : {epoch+1} | Train Loss: {train_loss[epoch]:.4f} | val Loss: {valid_loss[epoch]:.4f}')
    return train_loss, valid_lossEPOCHS = 60
trainLoss, validLoss = fit(EPOCHS)

Plotting the Loss graph:

plt.plot(trainLoss)
plt.plot(validLoss)
plt.legend(['train', 'valid'])
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Loss Graph');

The full code can be found at the following GitHub repo (https://github.com/GauthamSree/Logistic-Regression).