Logistic Regression in Simple Words

Arun Ramji Shanmugam
Published in Analytics Vidhya

Dec 1, 2019 · 7 min read
Image by NASA: a celestial view of Earth's atmosphere at night

In this article we are going to look at the underlying concept of logistic regression and explain it in simple terms with just elementary math.

Let's get started.

Logistic regression is one of the most popular and widely used supervised learning algorithms in machine learning. It is a classification technique for predicting a categorical variable.

A few real-world applications are:

a. Fraud detection in online transactions (yes or no)

b. Email classification (spam or not)

c. Cancer detection (cancerous tumour or not)

d. Rain forecasting, etc.

Types of Classification

There are two types of classification:

  1. Binary classification: the final outcome is binary, e.g. yes/no, 1/0, success/failure.
  2. Multi-class classification: the final outcome has more than two possibilities, e.g. bad/average/good.

How is it different from linear regression?

Linear regression is a mathematical technique for fitting a line to data points with minimal average deviation from the actual values, so that we can use that line to predict outputs for future inputs.

Linear Regression

On the other hand, logistic regression is a classification technique that separates the data points by plotting a decision boundary, based on their respective outcome values, while minimising a loss function.

Logistic regression

Other than using a line to separate data points, linear regression and logistic regression don't have much in common.

Okay! So how does it separate the data points mathematically? Let's start with a simple example so that we can understand the intuition behind how the algorithm works.

Assume that we have two training examples: a tennis ball with a weight of 85.5 g and a circumference of 8.25 inches, and a football with a weight of 450 g and a circumference of 22 inches. Our goal is to ask the logistic regression algorithm to classify a given sample based on the feature values of this labelled data.

ball weight = variable one = x1 = [85.5, 450]

ball circumference = variable two = x2 = [8.25, 22]

ball type = output = y = [tennis ball, football]

Before going further, let's visualise the data so that it will be easier to grasp the idea.

Visualising given data
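If you want to reproduce a similar plot yourself, here is a minimal sketch, assuming matplotlib (the styling choices are mine, not from the original figure):

import matplotlib.pyplot as plt

x1 = [85.5, 450]   # ball weight in grams
x2 = [8.25, 22]    # ball circumference in inches
labels = ["tennis ball", "football"]

# plot each training example as its own point so it gets its own legend entry
for weight, circumference, label in zip(x1, x2, labels):
    plt.scatter(weight, circumference, label=label)

plt.xlabel("ball weight (g)")
plt.ylabel("ball circumference (inches)")
plt.legend()
plt.show()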

Remember, our aim is to plot a line which separates the data points based on their feature values (x1, x2). We will resist the temptation to draw the classifier ourselves; instead we will seek the help of an algorithm, because an algorithm can handle a huge number of training samples, which is humanly not possible.

One way of achieving this is to randomly plot multiple classifiers and choose the one which performs well on both the training data and the test data.

Random line classifier

This is one random line classifier we could plot, but it doesn't actually separate the data points, so we cannot say which one is a football and which one is a tennis ball. Selecting a classifier randomly won't be an efficient way of solving the problem, especially as the data size increases.

Here comes the sigmoid function, which can help us find the most accurate classifier by using its elegant mathematical properties.

Sigmoid Function

We can define the sigmoid function as follows.

g(z) = 1 / (1 + e^(-z))

Let's understand the formula first. What does the sigmoid function do? In simple words, if you plug in any value for z, the sigmoid function will produce a number between 0 and 1.

Don't be afraid of the term 'e'. It is known as Euler's number, a transcendental number with interesting mathematical properties (see https://en.wikipedia.org/wiki/Transcendental_number for more information). For now, just think of it as a constant with the value 2.71828182845… and so on.

For example,

if z = 3, the sigmoid function produces 0.9526, which is close to 1

if z = -3, the sigmoid function produces 0.047, which is close to 0

import numpy as np

z = 3
g = 1 / (1 + np.exp(-z))
print(g)

Output: 0.9525741268224334

From this you can see that when z > 0, g(z) approaches 1 as z grows; when z < 0, g(z) approaches 0; and when z = 0, g(z) is exactly 0.5.

We can visualise the sigmoid function in the graph below for a mix of negative and positive values.

z = [-10, -8, -6, 0, 2, 3, 4, 10]
f = []
for i in range(len(z)):
    g = 1 / (1 + np.exp(-z[i]))
    f.append(g)
print(f)

Output: [4.5397868702434395e-05, 0.0003353501304664781, 0.0024726231566347743, 0.5, 0.8807970779778823, 0.9525741268224334, 0.9820137900379085, 0.9999546021312976]

For the above values of z, we get the sigmoid curve shown in the graph below.

Sigmoid Function
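To draw the curve itself rather than a handful of points, a short plotting sketch, again assuming matplotlib, could look like this:

import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-10, 10, 200)   # a dense range of z values
g = 1 / (1 + np.exp(-z))        # sigmoid squashes every z into (0, 1)

plt.plot(z, g)
plt.axhline(0.5, linestyle="--")   # g(z) = 0.5 exactly at z = 0
plt.xlabel("z")
plt.ylabel("g(z)")
plt.title("Sigmoid Function")
plt.show()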

The intuition behind using the sigmoid function is to find the right decision boundary (the separating line): wherever θ^t * x ≥ 0 the model predicts one class, and wherever θ^t * x < 0 it predicts the other. Either way, we get a classifier that separates the two possible outcomes by probability.

Hypothesis and cost function

Since we are going to plot the decision boundary as a linear or polynomial line, it is obvious that we will need a line equation. At the same time, we want the line to act as a classifier of the probability of an event, hence we plug the line equation into the sigmoid function.

The hypothesis for logistic regression is as follows:

h(x) = g(z) = g(θ^t * x) = 1 / (1+e^(-θ^t * x))
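In code, the hypothesis is just the sigmoid applied to the linear combination θ^t * x. A minimal NumPy sketch (the function names are my own):

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def hypothesis(theta, x):
    # theta and x are 1-D arrays; x[0] is set to 1 so that theta[0] acts as the intercept
    return sigmoid(np.dot(theta, x))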

In logistic regression, the cost function measures the difference between the actual outcome and the hypothesis.

The logistic regression cost function is as follows:

cost(h(x), y) = { -log(h(x)) if y = 1 ; -log(1 - h(x)) if y = 0 }

If h(x) = 1 and y = 1, then the cost is 0

If h(x) → 1 and y = 0, then the cost approaches ∞
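The two branches can be folded into a single expression, since one of the terms vanishes depending on y. A sketch of the per-example cost (again, my own naming):

import numpy as np

def cost(h, y):
    # -log(h) when y = 1, -log(1 - h) when y = 0
    return -y * np.log(h) - (1 - y) * np.log(1 - h)

print(cost(0.99, 1))   # near 0: a confident, correct prediction is barely penalised
print(cost(0.99, 0))   # large: a confident, wrong prediction is penalised heavily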

Now, let’s use the sigmoid function for our previous example to find the classifier.

We want g(z) ≥ 0.5, which is possible only when z ≥ 0,

where z = θ^t * x.

Hence θ^t * x ≥ 0, with θ^t * x = θ0 + θ1x1 + θ2x2.

We know the values of x1 and x2, but θ0, θ1 and θ2 are unknown and have to be computed by an optimisation algorithm (like gradient descent). Let's assume that we have found the parameters that minimise the cost function using gradient descent, as below.

Let's consider the parameters theta as → [9.7, 2.09, -0.47]
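For completeness, here is a rough sketch of what a batch gradient descent loop could look like on this tiny dataset. The learning rate, iteration count, and class encoding are my own assumptions; this is not necessarily how the θ above was produced, and in practice you would also scale the features first:

import numpy as np

X = np.array([[1, 85.5, 8.25],    # tennis ball: intercept, weight, circumference
              [1, 450.0, 22.0]])  # football
y = np.array([0, 1])              # 0 = tennis ball, 1 = football

theta = np.zeros(3)
alpha = 0.001                     # learning rate (assumed)

for _ in range(10000):
    h = 1 / (1 + np.exp(-X @ theta))     # hypothesis for every training sample
    gradient = X.T @ (h - y) / len(y)    # gradient of the logistic cost
    theta -= alpha * gradient

print(theta)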

We get the decision boundary equation 9.7 + 2.09x1 - 0.47x2 ≥ 0, with which we can split the data points between the probabilities of the two events.

We need only two end points to plot the decision boundary, so the above equation can be rearranged as:

x2 = (9.7 + 2.09*x1) / 0.47

Evaluating this at two anchor values, x1 = [8.25, 85.5] (the minimums of the two features), gives x2 ≈ [57.50, 401.93].
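The same rearrangement in code, using the θ assumed above (small numeric differences from the rounded parameters are expected):

theta = [9.7, 2.09, -0.47]   # parameters assumed above
x1_ends = [8.25, 85.5]       # two x1 values to anchor the line

# solve theta0 + theta1*x1 + theta2*x2 = 0 for x2
x2_ends = [(theta[0] + theta[1] * v) / -theta[2] for v in x1_ends]
print(x2_ends)   # roughly [57.3, 400.8], close to the values quoted above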

Finally, by plotting the line which connects these two end points, we get an approximation of the decision boundary which separates our data points (ball weight and circumference in our example), as below:

Decision Boundary on training data

Cool! We separated the two data points mathematically based on their probable outcome (here, tennis ball or football), so whenever we get a new input with features x1 and x2, we can say whether it belongs to the tennis ball category or the football category.

For example, if we feed in a new ball with weight → 50 and circumference → 15.5, it will be hypothesised (assumed) to be a tennis ball because it falls below the decision boundary, and vice versa for the football category.
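As an illustration of the prediction step, here is a sketch using a hand-picked hypothetical θ that separates the two training points (it is not the θ derived above):

import numpy as np

theta = np.array([-150.0, 0.5, 1.0])   # hypothetical parameters, chosen by hand
x_new = np.array([1, 50.0, 15.5])      # intercept, weight, circumference

h = 1 / (1 + np.exp(-np.dot(theta, x_new)))        # predicted probability of "football"
print("football" if h >= 0.5 else "tennis ball")   # prints: tennis ball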

The same approach can be used for larger datasets with many features too, but depending on our goal and the nature of the data, we may need to bend the line by adding polynomial terms and a few other adjustment techniques. Either way, the underlying concept remains the same:

"we need a good classifier to separate the data points into two or more groups based on the probable outcome; that same classifier will then be used on new, unknown data points to predict their outcome."

End Note

Logistic regression is one of the most widely used classification techniques. It may look a bit complicated compared to linear regression, but once we understand it and implement it on a small dataset, we realise how elegant and simple a technique it is for classifying and predicting a categorical variable.

Also refer to my previous blog to understand Linear Regression in NumPy.

Keep hustling for a better future!
