Introduction to Logistic Regression

Nishad S · Analytics Vidhya · Feb 16, 2021

Have you ever wondered why banks extend loans or credit cards to some people but not to others? How do banks know, with some degree of certainty, that the individuals they lend to will not default on their loans or fall behind on their credit card payments? The answer is simple: they rely on what is called a credit score. A higher credit score means you are highly credit-worthy and a bank can safely extend a loan to you, whereas a lower credit score means your credit-worthiness is low and lending to you is risky, since there is a good chance you will default on your repayments.

One of my company's biggest clients is a credit bureau based in the United States, and I work on that account. They are in the business of building credit scoring models for banks and other financial institutions. Can you guess what kind of models they build to calculate credit scores? Yes, you guessed it right: Logistic Regression models!

If you are new to machine learning, understand for now that there are two main types of machine learning techniques: supervised and unsupervised. Supervised algorithms work with labeled data, which means your data already has a target variable and the algorithm tries to learn from this target. With unsupervised algorithms, there is no labeled data. For the sake of this topic, let us not digress and stick to supervised algorithms.

Supervised learning algorithms have two main use cases: regression and classification. Regression refers to predicting a continuous variable, for example predicting a house's price from its features. You probably remember the equation y = mx + c from your high school maths class; that equation is a simple linear regression model. Classification deals with predicting a categorical variable, for example predicting whether a person will default on a loan or not.

Logistic regression is a supervised machine learning algorithm that is used for classification problems. Wait! A few moments ago, I said regression means predicting a continuous variable, but now I am saying logistic regression is used for classification. Did I get it wrong? No! You will find out why shortly.

Assume that we have a dataset containing the historical credit data of many individuals, and one of the variables says whether a particular individual defaulted on his loan or not. We will call this the target variable. Now, let us choose one independent variable from the data: the number of active credit cards. When you closely observe the data, you find that individuals with a larger number of active credit cards tend to default on their loan payments. Seems logical, right? Since there appears to be a linear relationship, we can fit a regression line to this data, as shown in the figure below. The 0 on the y-axis represents "not defaulted" and the 1 represents "defaulted".
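To make this concrete, here is a minimal Python sketch that fits a plain regression line to this kind of data. The card counts and the 0/1 default flags below are made up purely for illustration, not taken from any real dataset.

# A minimal sketch: fitting an ordinary regression line to a 0/1 default flag.
# The card counts and default labels are made up for illustration.
import numpy as np

active_cards = np.array([1, 2, 3, 4, 5, 6, 7, 8])
defaulted    = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # 0 = not defaulted, 1 = defaulted

# np.polyfit with degree 1 fits a straight line and returns the slope m and intercept c
m, c = np.polyfit(active_cards, defaulted, 1)
print(f"predicted 'default' = {m:.2f} * cards + {c:.2f}")
# Note: this straight line can output values below 0 or above 1, which is exactly
# the problem the sigmoid transformation fixes in the next section.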

Now we run into a problem. The regression line can extend from -∞ to +∞, which does not fit our case: the default status can only be a 0 or a 1. This is where a mathematical transformation called the sigmoid transformation comes to our rescue. It converts the output of the regression line into a probability and transforms the best-fit line into an S-shaped curve using the formula below, where y stands for the output of the regression line and p stands for the probability.

p = 1 / (1 + exp(-y))
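Here is a tiny Python sketch of the sigmoid, just to confirm what it does at the extremes:

# The sigmoid transformation: squashes any real number into the range (0, 1).
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

print(sigmoid(-10))   # ~0.000045, close to 0 for very negative inputs
print(sigmoid(0))     # 0.5, the midpoint
print(sigmoid(10))    # ~0.999955, close to 1 for very positive inputs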

One end of this curve approaches 0 as the output of the regression line goes to -∞, and the other end approaches 1 as the output goes to +∞. The transformed curve looks like the picture below. The x-axis shows the output of the regression line and the y-axis shows the probability of default on the loan.

Now, we will set a threshold at 0.5. If the probability of default is above 0.5, we say that the person will default on his loan; if it is below 0.5, we say he will not. So, how do we find the best S-curve, one that classifies a person correctly as a defaulter or a non-defaulter most of the time? This is achieved by minimising the following loss function, called log loss.

Log loss = -(1/N) ∑ [ yᵢ · log(pᵢ) + (1 - yᵢ) · log(1 - pᵢ) ]

At first glance, the above function looks intimidating, but it is actually pretty simple. Let us see how this formula works with the help of the following four cases.

Case 1: A person is actually a defaulter, and through the S-curve we found that the person has a high probability of default. Here, y = 1 and p is a very high value; let us take p to be 1. Then log p becomes log 1, which is 0. Substituting these values into the log loss function gives a value of 0, which means the loss is zero and the data point was correctly classified.

Case 2: A person is actually a defaulter, but through the S-curve we found that the person has a low probability of default. Here, y = 1 and p is a very low value; let us take p to be 0. Then log p becomes log 0, which is -∞. Substituting these values into the log loss function gives a value of ∞, which means the loss is huge and the data point was incorrectly classified.

Case 3: A person is actually a non-defaulter, and through the S-curve we found that the person has a low probability of default. Here, y = 0 and p is a very low value; let us take p to be 0. Then log(1-p) becomes log 1, which is 0. Substituting these values into the log loss function gives a value of 0, which means the loss is zero and the data point was correctly classified.

Case 4: A person is actually a non-defaulter, but through the S-curve we found that the person has a high probability of default. Here, y = 0 and p is a very high value; let us take p to be 1. Then log(1-p) becomes log 0, which is -∞. Substituting these values into the log loss function gives a value of ∞, which means the loss is huge and the data point was incorrectly classified.
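You can verify all four cases numerically with a small Python helper that implements the per-point loss from the formula above (using 0.99 and 0.01 in place of exactly 1 and 0, since log(0) is undefined):

# Per-point log loss for the four cases, with 0.99/0.01 standing in for 1/0 to avoid log(0).
import numpy as np

def point_loss(y, p):
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

print(point_loss(1, 0.99))   # Case 1: defaulter, high predicted probability -> ~0.01 (tiny loss)
print(point_loss(1, 0.01))   # Case 2: defaulter, low predicted probability  -> ~4.6  (huge loss)
print(point_loss(0, 0.01))   # Case 3: non-defaulter, low probability        -> ~0.01 (tiny loss)
print(point_loss(0, 0.99))   # Case 4: non-defaulter, high probability       -> ~4.6  (huge loss)

Averaging these per-point losses over all the data points gives the log loss that the training process minimises.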

So, now you have probably found that the loss function is not that complex! And above all, you now understand why this classification algorithm is called Logistic "Regression": the sigmoid curve is actually a transformed regression line!
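If you want to see the whole idea end to end, here is a minimal sketch using scikit-learn on the same made-up card-count data from earlier. The library fits the S-curve by minimising log loss internally, and we apply the 0.5 threshold ourselves.

# A minimal end-to-end sketch with scikit-learn on made-up data.
# LogisticRegression fits the sigmoid curve by minimising log loss internally.
import numpy as np
from sklearn.linear_model import LogisticRegression

active_cards = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
defaulted    = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(active_cards, defaulted)

# predict_proba returns [P(not default), P(default)]; we apply the 0.5 threshold ourselves
prob_default = model.predict_proba([[5]])[0, 1]
print(f"P(default | 5 active cards) = {prob_default:.2f}")
print("defaulter" if prob_default > 0.5 else "non-defaulter")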

Tip: If you want to have a very good credit score, take loans and use your credit card often and never default on the repayments.
