Logistic Regression from Scratch: Multi-class Classification with One-vs-All

Arya Mohapatra
Published in Analytics Vidhya
7 min read · Jan 25, 2020

Yes or No!! Red or Black!! Worthy or Not Worthy… Oh God!! Why do we have options? 🤔 Our world consists of lots of option-oriented problems. Isn't it?

So, in real life, how do we go about these scenarios where we have to choose an option? We analyse a few aspects before the decision. Take the problem "What to wear (Red or Black)?" We consider: what kind of occasion is it? When is the event? What would go well with me being tanned 😁? So yes, there are multiple factors that affect our decision. That is exactly what Logistic Regression does. In logistic regression, we look at the existing data, the independent variables (features), draw a relationship between them and the outcome, and predict the dependent variable from the details we have. And what we predict is always dichotomous (binary).

But can we not use logistic regression for a multi-class problem, i.e. when the target variable can take more than two values? In this article we will build a model that uses logistic regression to classify more than two classes.

But let's begin by understanding the idea behind Logistic Regression. "In logistic regression, with the help of a hypothesis function, also called the sigmoid function, we calculate probabilities from some input data (known variables), and based on that the model predicts the required classification." What does this mean? If we have some input variables like x1, x2, x3, … and a function f(X) that, using those variables, computes a number lying between 0 and 1, then we can successfully classify the problem statement. So for a classification problem the basic notation can be interpreted as the target taking values of either 0 or 1, i.e. y ∈ {0, 1}, where y = target or dependent variable.

So, if by some means we can make the value of our hypothesis function, given X (the feature values), fall within the range 0 to 1, then we can successfully classify the data set. If hθ(x) is our logistic regression function, or hypothesis, then hθ(x) needs to satisfy the condition 0 ≤ hθ(x) ≤ 1. We can achieve this by using θᵀx (theta transpose x) inside our logistic function. The logistic function can be noted as below:

hθ(x) = g(θᵀx), where g(z) = 1 / (1 + e^(−z))

The following image shows us what the sigmoid function (logistic function) looks like:

[Figure: the sigmoid curve, an S-shape that approaches 0 for large negative z, equals 0.5 at z = 0, and approaches 1 for large positive z.]

As we see above, the g(z) function, where z = θᵀx, converts the differently valued features into a number between 0 and 1. This will definitely help us in categorizing the target variable.
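As a quick illustration, a minimal NumPy sketch of the sigmoid might look like this (the function name sigmoid is just a convention, not code from the original post):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Large negative z maps near 0, z = 0 maps to exactly 0.5, large positive z maps near 1
print(sigmoid(np.array([-6.0, 0.0, 6.0])))  # ~[0.0025, 0.5, 0.9975]
```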

As mentioned earlier, logistic regression works on probability, so our function hθ(x) gives us the probability that the output is 1. For example, if hθ(x) = 0.8, it says that the chance of our output being 1 is 80%. From this we can deduce that the chance of our output being 0 is 20%.

Decision Boundary:

In order to get our classification values 0 or 1, we can classify our features using the following rule:

If hθ(x) ≥ 0.5 → y = 1, and if hθ(x) < 0.5 → y = 0

So, if our input to g is θᵀx, that means if θᵀx is greater than or equal to 0 then our output probability will be at least 0.5 and we can classify the output as 1, and vice versa. This idea can be noted as:

hθ(x) = g(θᵀx) ≥ 0.5 when θᵀx ≥ 0

θᵀx ≥ 0 → y = 1

θᵀx < 0 → y = 0

The decision boundary is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function.
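To make the thresholding rule concrete, here is a tiny sketch (the helper name predict_class is my own, not from the original code):

```python
import numpy as np

def predict_class(theta, X):
    """Predict y = 1 when theta^T x >= 0, which is equivalent to h(x) >= 0.5."""
    return (X @ theta >= 0).astype(int)
```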

Cost Function:

The main objective of our model is to find the weights, i.e. the θ values. In order to find the θ values we need a function (let's call it the cost function) whose value we minimize by trying different values of θ. The weights that minimize the cost function will be our best-suited weights. In logistic regression we need to select a convex cost function so that gradient descent can reliably reach the global minimum.

The cost function for logistic regression looks like:


Cost(hθ(x), y) = −log(hθ(x)) if y = 1

Cost(hθ(x), y) = −log(1 − hθ(x)) if y = 0

Cost(hθ(x), y) = 0 if hθ(x) = y

Cost(hθ(x), y) → ∞ if y = 0 and hθ(x) → 1

Cost(hθ(x), y) → ∞ if y = 1 and hθ(x) → 0

We have plotted the cost function in our solution, and it shows a similar pattern. We generally start by picking random θ values, then iteratively update them with the help of gradient descent and measure how good our model has become. That value is computed using the cost function, defined as:

J(θ) = −(1/m) Σ [ y⁽ⁱ⁾ log(hθ(x⁽ⁱ⁾)) + (1 − y⁽ⁱ⁾) log(1 − hθ(x⁽ⁱ⁾)) ], summing over the m training examples.
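As a sketch, this cost can be computed in NumPy roughly as follows (the function and argument names are mine, and a small epsilon is added to avoid log(0)):

```python
import numpy as np

def compute_cost(theta, X, y):
    """Average cross-entropy cost J(theta) over the m training examples."""
    m = len(y)
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))   # h_theta(x) for every example
    eps = 1e-15                              # guard against log(0)
    return -np.sum(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps)) / m
```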

Gradient Descent:

Our goal is to minimize the cost function, and we can achieve this by changing the weights, i.e. the θ values. We update the θ values by taking the derivative of the cost function with respect to each weight. We repeat the formula below until convergence:

θⱼ := θⱼ − α · ∂J(θ)/∂θⱼ (updating every θⱼ simultaneously)

The same formula can be written as:

θⱼ := θⱼ − (α/m) Σ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾) xⱼ⁽ⁱ⁾, summing over the m training examples

Below is the vectorized notation of the above formula:

θ := θ − (α/m) Xᵀ (g(Xθ) − y)

The α notation used here is called the learning rate. It controls the size of each step as we move toward the minimum of our convex cost function. I would suggest reading the following article (Gradient Descent) to know more about gradient descent and the associated terms.
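Putting the vectorized update into code, a minimal gradient descent loop might look like the sketch below (alpha and num_iters are illustrative defaults, not values from the original post):

```python
import numpy as np

def gradient_descent(X, y, theta, alpha=0.1, num_iters=1000):
    """Minimize J(theta) by repeating theta := theta - (alpha/m) * X^T (g(X theta) - y)."""
    m = len(y)
    cost_history = []
    for _ in range(num_iters):
        h = 1.0 / (1.0 + np.exp(-(X @ theta)))            # g(X theta)
        cost_history.append(-(y @ np.log(h + 1e-15)
                              + (1 - y) @ np.log(1 - h + 1e-15)) / m)
        theta = theta - (alpha / m) * (X.T @ (h - y))     # vectorized update
    return theta, cost_history
```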

Multi-class Classification: One-vs-All:

Now we are going to approach data classification when we have more than two categories. We extend our definition so that instead of y ∈ {0, 1} we have y ∈ {0, 1, …, n}.

One-vs-all is a strategy that involves training N distinct binary classifiers, each designed to recognize a specific class. We then use those N classifiers together to predict the correct class. How do we do it in code? By treating one class as 1 and all the rest as 0, we train a model and get the requisite weights (θ values). We store the weights in a dictionary, one entry per classifier. Then, with the help of the sigmoid function, we calculate a probability for each class. The highest probability takes precedence, and we assign the data point to the corresponding class.

The Code:
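The original post embeds the full code as a gist; since the listing does not survive here, the sketch below shows one way the fit()/predict() logic described next could be written, keeping the weights in a dictionary per class. All names, defaults, and details are my own reconstruction, not necessarily the author's exact code.

```python
import numpy as np

class OneVsAllLogisticRegression:
    """One-vs-all logistic regression trained with batch gradient descent."""

    def __init__(self, alpha=0.1, num_iters=1000):
        self.alpha = alpha
        self.num_iters = num_iters
        self.thetas = {}          # one weight vector per class label
        self.cost_histories = {}  # cost-vs-iteration curve per class label

    @staticmethod
    def _sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def fit(self, X, y):
        X = np.c_[np.ones(len(X)), X]                   # prepend bias column
        for label in np.unique(y):
            y_bin = (y == label).astype(float)          # current class = 1, rest = 0
            theta = np.zeros(X.shape[1])
            costs = []
            m = len(y_bin)
            for _ in range(self.num_iters):
                h = self._sigmoid(X @ theta)
                costs.append(-(y_bin @ np.log(h + 1e-15)
                               + (1 - y_bin) @ np.log(1 - h + 1e-15)) / m)
                theta = theta - (self.alpha / m) * (X.T @ (h - y_bin))
            self.thetas[label] = theta
            self.cost_histories[label] = costs
        return self

    def predict(self, X):
        X = np.c_[np.ones(len(X)), X]
        labels = list(self.thetas.keys())
        # sigmoid probability of each class for every row; keep the most probable class
        probs = np.column_stack([self._sigmoid(X @ self.thetas[k]) for k in labels])
        return np.array([labels[i] for i in probs.argmax(axis=1)])
```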

Execution of the Model:

In the fit() method we have implemented the One-vs-Rest algorithm, as the data set demands a multi-class classification model. We iterate once for each distinct label and find the θ values by gradient descent. Similarly, we also record the cost value with respect to the corresponding θ values. After we get the optimal θ values for each label, the predict() method finds, for a given input, the maximum probability using the θ values we found. Hence, we classify each input by picking the class with the maximum probability for that input.
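For illustration only, running the sketch above on a three-class dataset such as scikit-learn's Iris (my example, not necessarily the dataset used in the original post) could look like this:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42)

model = OneVsAllLogisticRegression(alpha=0.1, num_iters=3000).fit(X_train, y_train)
predictions = model.predict(X_test)
print("Accuracy:", np.mean(predictions == y_test))
```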

Plotting the Cost Value:

As this data set contains a multi-class classification problem, we expanded our definition so that y ∈ {0, 1, …, n}. In this case we have 3 categories, so following the One-vs-All methodology we computed cost values on 3 different occasions. Hence, we have 3 different plots for the corresponding cost values. Below are the plots of cost versus number of iterations for each category. We can clearly see that the cost value reduces and heads towards zero with each iteration.
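A matplotlib sketch of such plots, assuming the cost_histories dictionary from the class sketched earlier:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, len(model.cost_histories), figsize=(12, 3))
for ax, (label, costs) in zip(axes, model.cost_histories.items()):
    ax.plot(costs)                      # cost should fall toward zero with iterations
    ax.set_title(f"Class {label} vs. rest")
    ax.set_xlabel("Iteration")
    ax.set_ylabel("Cost")
plt.tight_layout()
plt.show()
```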

We have now successfully implemented a multi-class logistic regression algorithm that can be used to classify more than two target classes.
