The Math and Intuition behind Logistic Regression

Santosh Kumar Pothabattula
Published in Analytics Vidhya · 25 min read · Aug 26, 2019

Logistic regression (LR) is one of the most popular classification algorithms in Machine Learning (ML). Although the name suggests regression, it is used for classification. It is widely chosen when an application has low-latency requirements, and it is well known for its feature interpretability. Unlike KNN, Random Forest, or XGBoost, LR does not require heavy computational resources. However, its performance can degrade when its underlying assumption (roughly linear separability of the classes) fails. Let us go deep and understand how the math behind LR works.

Let’s first understand: how does a human classify?

Have a look at the following beautiful portraits, which are at two different locations A and B.

sources are here and here

What if someone asks you to predict at which location there will be higher rainfall after 1 hour? Looking at these two locations, you can definitely say that “at Location B there is a higher chance of rainfall.”

Now ask yourself: how did you judge that there is a high chance of rainfall at Location B?

Yes, of course, you might say something like: there are dark clouds at Location B, and the presence of dark clouds is a very good sign of rainfall. Hence the answer is Location B, which is absolutely right. But how did you know the fact that “the presence of dark clouds is a very good sign of rainfall”? We learned this fact from past experience, by observing many similar situations.

So we humans can judge some simple future outcomes (like the one above) by applying the knowledge we learned from experience. Classification and regression algorithms predict future outcomes in a similar way: they first learn the hidden pattern from a large amount of data, just as we learned that the presence of dark clouds is a very good sign of rainfall, but these algorithms can exploit much more complex data to predict more effectively.

Take a look at the following images and try to predict which location has the higher chance of rainfall in the next hour, based only on a single feature: the presence of dark clouds.

sources are here and here

Here the prediction is somewhat more complicated because the two locations have a similarly cloudy climate. In the above two examples, we had only one variable to judge rainfall: the presence of dark clouds. Suppose, for every location, you are provided with a few more recorded measurements, such as:
  • Wind speed
  • Humidity
  • Temperature
  • Atmospheric pressure
  • Distance from the coast
  • Presence of dark clouds, etc.

Considering all these variables is difficult for a human, but for an ML classifier it is a simple task, provided we have enough data. We can train a classifier model; the classifier learns from the data more efficiently than humans and predicts the outcome based on the given present conditions.

A high-level overview of how an ML classifier works?

Consider the following pseudo data as historical data collected over the last few years, which may contain a few thousand observations, as shown below.

Here we have data with 6 variables and one dependent class label. To build an ML classifier, about 70–80% of the data is used to train the model, and the remaining 20–30% is used to test it. Once we achieve good results, we can use this model for future predictions based on present conditions. Below is a flow diagram describing the same thing.
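To make the split concrete, here is a minimal sketch of that 70–80 / 20–30 split, assuming the historical observations sit in a pandas DataFrame; the file name and the "RainNext1Hr" label column are hypothetical placeholders.

# A minimal sketch of the train/test split described above.
# 'historical_weather.csv' and the column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split

weather_df = pd.read_csv("historical_weather.csv")   # assumed file
X = weather_df.drop(columns=["RainNext1Hr"])          # the 6 feature columns
y = weather_df["RainNext1Hr"]                         # class label (1 = rain, 0 = no rain)

# Hold out 30% of the observations for testing the trained model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)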

The Logistic Regression Classifier

Yes, the name looks ambiguous, but not the concept. Before understanding it, let's understand what regression and classification are.

  • Classification: A model whose output is limited to a finite set of discrete values.

Example: Consider our ongoing example, where the task is to find whether or not rain will fall in the next hour. Here the output can only be ‘1’ or ‘0’.

1: YES

0: NO

  • Regression: A model whose output is a continuous value that cannot be limited to a finite set of discrete values.

Example: Let’s say the task is to find the amount of rain that will fall in the next hour. Here we cannot limit the amount of rain to a fixed set of numbers; it can be any floating-point value.

Both classification and regression require a class label, hence these algorithms come under the category called supervised learning; in other words, they require previously recorded data that contains the class label.

Even though the word “Regression” is part of its name, by the end of its mechanism LR acts as a classifier.

Rainfall prediction is not the only use case; here are a few more scenarios where we can use classification algorithms like LR.

  • Imagine you want to develop a health-care mobile application for your business that can predict the chance of a heart attack three years from now, based on an individual's present symptoms.
  • Assume, in your e-commerce business, you need to gauge customer satisfaction from comments; you can build a system using LR that tells you whether a customer is extremely happy, satisfied, or disappointed.
  • For some types of disease diagnosis, like cancer, doctors put a lot of effort and time into analyzing multiple test results to confirm the final diagnosis. In such cases, one can develop an LR-based model that predicts the result and also provides the reasoning behind it within a short time.

However, we cannot claim these are the only good use cases. LR is an excellent choice only when its assumption holds. The beauty of LR is that it not only predicts the result but also provides the confidence of the prediction, and it gives the reasoning behind the result in very little time.

How LR can make decisions

Now let us consider a new example. Assume we have some new data; if I plot it in 2-D space, it may look like the figure below.

For better understanding, assume the plotted data belongs to two classes (Yes/No) of points for a set of people, indicating whether or not a person has a chance of a heart attack after 5 years, based on only 2 variables (say “Age” and “Cholesterol level”).

Looking at the above scatter plot, we can say that all the positively diagnosed points lie on the upper side of the 2-D space and the negatively diagnosed points lie on the lower side. Let me state the same thing more precisely by taking one suitable reference.

Imagine I find the best line that separates the red and green dots, so that I can say points on one side of the line are of one kind and points on the other side are of the other kind.

That’s it: the main role of LR is to find the best linear surface (line/plane/hyperplane) that can almost separate the two classes, which is also called the decision surface. Finding the best line is nothing but finding the equation of that line. The generic form of the equation of a line is ax + by + c = 0. In this example, finding the equation of the line that separates the classes mathematically means finding the values of [a b c], where x and y take the values of “Cholesterol level” and “Age”.

The data I plotted in the above example used only two variables, hence it fit perfectly in 2-D space. In real applications we can have thousands of such variables, where a line or plane is not sufficient to separate the classes in such a high-dimensional space. To keep things general, let us stick to the notation for a linear surface in d dimensions, which is the equation of a hyperplane.

Equation of a hyperplane in d dimensions: w1·x1 + w2·x2 + … + wd·xd + C = 0

Using linear algebra, we can write the same thing more compactly as Wᵀx + C = 0, where W = [w1, w2, …, wd]ᵀ is the vector of weights and x = [x1, x2, …, xd]ᵀ is any point in the d-dimensional space.

To find the best hyperplane, we have to find the values of the constant C and the W vector, also referred to as the weight vector; w1, w2, w3, …, wd are called the weights of the plane. During the training phase, the LR model tries to find and learn the best weight values, which can almost separate the two varieties of points. Since the LR decision is based purely on a reference to a linear surface, it performs well if and only if the data is almost linearly separable; hence we take this assumption before applying LR.
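Before deriving how the weights are learned, here is a toy sketch (with made-up numbers) of how a learned weight vector and constant classify a point: the sign of W·x + C decides which side of the surface the point falls on.

# A toy sketch of how a learned hyperplane classifies a point.
# The weights and intercept below are made-up, purely illustrative values.
import numpy as np

W = np.array([0.8, -0.5])      # hypothetical learned weights (w1, w2)
C = 0.1                        # hypothetical intercept

def classify(x):
    """Return +1 if the point lies on W's side of the surface, else -1."""
    return 1 if np.dot(W, x) + C >= 0 else -1

print(classify(np.array([2.0, 1.0])))    # +1 : lies in the direction of W
print(classify(np.array([-1.0, 3.0])))   # -1 : lies opposite to W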

To calculate the weights, we need an objective function, which also lets us reduce the loss in every iteration.

Let’s understand how to get the objective function for LR

To derive the objective function, we take the help of geometry. Since LR generalizes to higher-dimensional spaces, and imagining geometry in more than three dimensions is not possible for a human, we visualize the whole concept in 2-D space and extend the same understanding to higher-dimensional space.

  • Imagine a d-dimensional space in which a hyperplane with weight vector ‘W’ separates the two varieties of points.

  • Since it is a higher-dimensional space, we cannot simply say that points above the hyperplane belong to one class and points below it belong to the other; this intuition may not hold in d-dimensional space. The generalized idea is to consider the direction of the W vector: if a point lies in the same direction as W it is a positive point, and if it lies in the opposite direction of W it is a negative point.

Assume the hyperplane passes through the origin and the weight vector is a unit vector. Then the equation of the hyperplane reduces to Wᵀx = 0.

Let “xi” be any point. The signed distance between the point and the hyperplane is

di = Wᵀxi / ||W|| = Wᵀxi, since W is a unit vector (||W|| = 1).

If the distance of ‘xi’ is positive, the point lies in the direction of W and the corresponding prediction is “yi” = +1. If the distance is negative, the point lies opposite to the direction of W and the prediction is “yi” = -1.

Suppose I want to check one query point “xq” on the model, given that we already know the actual label ‘yq’ for this ‘xq’.

Suppose we get a positive distance from “xq” to the hyperplane; then the predicted value is +1, and the actual value is also “yq” = +1, hence the product is

1 x 1 = 1

Let’s also consider the other case: say the actual value is “yq” = -1.

If we get a negative distance from “xq” to the hyperplane, the predicted output is -1, which means the point is correctly classified by the model.

-1 x -1 = +1

Hence for a correctly classified point, the product of distance and the actual value is always positive.

In the same way, for an incorrectly classified point, the product is always negative.

Incorrect classification: +1 x -1 = -1
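As a quick numeric sanity check of this sign rule, the following small NumPy snippet (toy points and assumed labels, not from the original post) multiplies each actual label by its signed distance; only the misclassified point yields a negative product.

# A tiny numeric check: yi * (W.xi) is positive for a correctly classified
# point and negative for a misclassified one.
import numpy as np

W = np.array([0.6, 0.8])                 # unit weight vector (0.36 + 0.64 = 1)
points = np.array([[ 1.0,  2.0],         # actual label +1
                   [-2.0, -1.0],         # actual label -1
                   [ 1.0,  1.0]])        # actual label -1 (will be misclassified)
labels = np.array([1, -1, -1])

signed_dist = points @ W                 # W is unit length, so W.xi is the distance
print(labels * signed_dist)              # [ 2.2  2.  -1.4] -> last point is misclassified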

Our main aim is to find the hyperplane that separates the points correctly as far as possible. For a reasonable model, it is necessary to maximize the number of correct classifications and reduce the number of misclassifications. Writing the same thing in math:

W* = argmax over W of Σᵢ yᵢ · Wᵀxᵢ, summed over all n points in the train data.

The above is the optimization problem: it searches over values of W, iteration by iteration, for the one that maximizes the summation computed over all n points in the train data.

Ideally, the larger this summation becomes, the fewer misclassifications we get. Hence we look for the best “W” that makes the summation as large as possible.

For LR, however, we do not use the above equation as it is, for some practical reasons; have a look at the following points.

  • So far we are computing the signed distance from a point to the hyperplane and judging its class from the sign alone (ignoring the magnitude); the raw magnitude does not tell us how confidently the queried point is positive or negative.
  • Any real data set is likely to contain outliers, and the distance magnitudes for such outliers are relatively very large; even a few outliers can strongly affect the choice of the optimal hyperplane.
  • The distance from a point to the plane can be any value in (-∞, ∞). With real-valued outputs, at this stage LR acts like a regressor; we need to squash this output range and apply a threshold to obtain the final classification.

To satisfy all the above requirements, we squash the distance values. For squashing we use the sigmoid function; the reason we use the sigmoid in particular is that it gives a nice probabilistic interpretation: for any real-valued input, the sigmoid always returns a value in the range [0, 1].

Graph of Sigmoid function (source is here)

Here σ(z) = 1 / (1 + e^(-z)), where z is the signed distance Wᵀx.
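A tiny sketch of this squashing behaviour, using a hand-written sigmoid (nothing sklearn-specific, just an illustration):

# A minimal sketch of the sigmoid squashing: any real-valued signed distance
# is mapped into [0, 1], which we can read as P(y = +1 | x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for dist in [-100.0, -2.0, 0.0, 2.0, 100.0]:
    print(dist, "->", round(sigmoid(dist), 4))
# large negative distances -> close to 0, large positive -> close to 1,
# and a point on the plane (distance 0) -> exactly 0.5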

Based on the sigmoid output, the decision is as follows: if σ(Wᵀxq) ≥ 0.5, predict the positive class (+1); otherwise predict the negative class (-1).

Therefore, after applying the sigmoid, the optimization problem becomes:

W* = argmax over W of Σᵢ σ(yᵢ · Wᵀxᵢ)

Now we apply log( ) to the optimization problem. Since log( ) is a monotonic function, it does not affect the overall optimization; it also has the nice properties of converting multiplication into addition and division into subtraction.

The reason we apply log( ) is that our main goal is to find the best W by optimizing the above equation. After applying the sigmoid we get values in [0, 1], so there is a high chance of very small decimal values; when we combine the values over all the data points, this may create numerical instability. Applying log( ) takes care of the numerical computation during optimization without affecting the goal of the optimization.

Using logarithmic properties, the same objective can be rewritten as:

W* = argmin over W of Σᵢ log(1 + exp(-yᵢ · Wᵀxᵢ))

This optimization problem can be solved using gradient-descent-style algorithms such as SGD. When we train the model on large data using the above optimization, the best W we obtain forms an equation that may separate all the train data very well, including outliers and noise points, but then it may not work well on future unseen data points; this is called overfitting. To control this effect, we add a regularization term to the above equation.

This is our final objective function for LR:

W* = argmin over W of [ Σᵢ log(1 + exp(-yᵢ · Wᵀxᵢ)) + λ·||W||² ]

  • Here λ is called a hyperparameter, which means we can control the impact of the regularization through it.

  • If λ increases too much, the model tends to underfit, meaning the performance even on the train data is poor.

  • If λ decreases too much, the model tends to overfit; hence choosing the optimal λ is very important, and it can be determined by tuning this hyperparameter using a cross-validation technique.

  • In the regularization term we used the product of λ and the L2-norm of W; we can also try the L1-norm of W, but the L1-norm drives the weights of all less important features to zero, which means it creates more sparsity than the L2-norm.

This final objective function for LR is then solved using an algorithm such as SGD. Looking at the equation, we can see it is the sum of a loss term and a regularization term. The loss term is called the logistic loss, which is an approximation of the 0–1 loss function (see the graph below).

The graph showing Logistic-Loss and 0–1 Loss
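To make the final objective concrete, here is a minimal NumPy sketch that computes it for a given W on toy data, assuming labels in {-1, +1} and ignoring the intercept for brevity.

# A minimal NumPy sketch of the final objective derived above:
# the sum of logistic losses over the training points plus an L2 penalty on W.
import numpy as np

def lr_objective(W, X, y, lam):
    """X: (n, d) data, y: labels in {-1, +1}, lam: regularization strength."""
    margins = y * (X @ W)                         # yi * W.xi for every point
    logistic_loss = np.sum(np.log1p(np.exp(-margins)))
    return logistic_loss + lam * np.dot(W, W)     # loss term + lambda * ||W||^2

X = np.array([[1.0, 2.0], [-2.0, -1.0], [0.5, -1.5]])   # toy data
y = np.array([1, -1, -1])
print(lr_objective(np.array([0.3, 0.4]), X, y, lam=0.1))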

What happens internally when we Train and Test LR

Well, when we train LR on data, the objective function internally tries to minimize the log-loss, and it keeps updating the weight values every iteration until it reaches convergence. Here is a graphical interpretation of what usually happens during the training phase of LR.

A simple graphical interpretation of Logistic-Loss Reduction for each iteration during its training phase
A simple idea of how the LR learns the best weights from every iteration during training

Once the training phase is done, we have the final W values, which can separate almost all the data according to its class. When we pass any point to test, the point is substituted into the equation of the plane formed by W, and the obtained value is passed through the sigmoid function, which provides the final classification result.

A mathematical calculation to predict the class value for each test point
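The following toy gradient-descent sketch illustrates the training loop and the test-time prediction described above. It is only an illustration of the idea, not sklearn's actual solver; the learning rate, iteration count, and toy data are arbitrary choices of mine.

# A toy gradient-descent sketch of the training loop and prediction described above
# (sklearn uses more sophisticated solvers; this only illustrates the idea).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr(X, y, lam=0.01, lr=0.1, n_iter=500):
    """X: (n, d), y in {-1, +1}. Returns the learned weight vector W."""
    n, d = X.shape
    W = np.zeros(d)
    for _ in range(n_iter):
        margins = y * (X @ W)
        # gradient of sum(log(1 + exp(-yi*W.xi))) + lam*||W||^2
        grad = -(X.T @ (y * sigmoid(-margins))) + 2 * lam * W
        W -= lr * grad / n                  # update the weights each iteration
    return W

def predict(W, X):
    """Pass W.x through the sigmoid and threshold at 0.5."""
    return np.where(sigmoid(X @ W) >= 0.5, 1, -1)

X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
W = fit_lr(X, y)
print(predict(W, X))        # should recover [ 1  1 -1 -1] on this toy data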

Time and Space complexities

During the training phase, the objective function is optimized using every point in the train data. Say we have ‘n’ points in the train data with ‘d’ features; then the train time complexity is O(nd).

At test time, we need to store only the final weight vector, which is an array of length ‘d’; hence the test space complexity is O(d). For a single test point we perform ‘d’ multiplications and an addition, and then pass the result through the sigmoid function; hence the test time complexity is also O(d).

Since the run-time and space complexities are very small when ‘d’ is small, LR can easily meet low-latency requirements.

Understanding Sklearn’s LR

Scikit-learn (sklearn) is an open-source Python library that provides complete functional support for ML classification, regression, and clustering, which means most of the algorithms are already implemented and available in sklearn.

To apply LR as well, we typically do not write the whole working algorithm explicitly. We simply instantiate the LogisticRegression class from the sklearn package and work with its parameters so that it provides the best fit to the data at hand.

Sklearn implementation of LR; the bold italic terms are the parameter names, and the values shown in green are their default values.

Please note that I am explaining version 0.21.3 of sklearn here. If you are using a later or older version, please go through sklearn’s documentation here.

To understand this better, I am considering a simple data set called Heart Disease UCI, which contains 13 features and one target variable indicating whether or not the patient has heart disease. The data was taken from this Kaggle link; to know more about the data, please click here. All required exploratory data analysis and data pre-processing were done, and the train and test data were stored in the variables X_train and X_test respectively.
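For completeness, here is a minimal sketch of the setup the snippets below rely on; the file name "heart.csv" and the "target" column follow the Kaggle dataset, but treat them as assumptions if your copy differs, and the scaling choice is mine rather than the author's.

# A minimal sketch of the setup assumed by the snippets below.
# 'heart.csv' and the 'target' column are assumptions based on the Kaggle dataset.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression   # used by the snippets that follow

df = pd.read_csv("heart.csv")
X = df.drop(columns=["target"])          # 13 features
y = df["target"]                         # 1 = heart disease, 0 = no disease

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)

# Standardizing the features so the learned weights are comparable across features
scaler = StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)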

The objective function for Logistic Regression: W* = argmin over W of [ Σᵢ log(1 + exp(-yᵢ · Wᵀxᵢ)) + λ·||W||² ]

Using the above function, we need to get the best values of ‘W’ such that they can reasonably classify the data (let’s not forget our objective). As we discussed, to get the best ‘W’ on any data it is necessary to tune the parameters: the lambda, the type of norm in the regularization, the maximum number of iterations, the tolerance for convergence, and so on. Thankfully, sklearn lets us do all these operations, so here I explain how sklearn’s LR relates to the objective function, and also how these parameters affect model performance.

To control the ‘λ’ value we have ‘C’ in sklearn’s LR

As we discussed in detail about ‘λ’, to get a good bias-variance trade-off its value should be appropriate. This knob is available as ‘C’ in sklearn, with the relation λ = 1/C. Let’s try some values and observe how they affect underfitting/overfitting.
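One note before the snippets: the evaluate_this_model helper used below is not shown in the post, so here is a minimal stand-in (my assumption of what it does, inferred from the captions) that prints the train/test log-loss and the learned weight vector.

# evaluate_this_model is not defined in the post; this is an assumed minimal
# stand-in that prints what the result captions describe.
from sklearn.metrics import log_loss

def evaluate_this_model(model):
    train_loss = log_loss(y_train, model.predict_proba(X_train)[:, 1])
    test_loss = log_loss(y_test, model.predict_proba(X_test)[:, 1])
    print("Log-loss on train data:", train_loss)
    print("Log-loss on test data :", test_loss)
    print("Weight vector:", model.coef_[0])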

#case 1:
# 'λ' = 0.001 checking with small value
Lambda = 0.001
clf = LogisticRegression(C=1/Lambda ) #instantiating LR into "clf"
clf.fit(X_train, y_train)
evaluate_this_model(clf)
loss on the train and test data when λ = 0.001

We considered λ = 0.001, which is a relatively small value; hence we can see a little overfitting when we compare the log-loss on the train and test data sets.

#Case 2:
# 'λ' = 100000 checking with large value
Lambda = 100000
clf = LogisticRegression(C=1/Lambda) #instantiating LR into "clf"
clf.fit(X_train, y_train)
evaluate_this_model(clf)
loss on the train and test data when λ = 100000
Feature weights when λ = 100000

Here I have chosen a very large value for ‘λ’ to show how it underfits: the losses on both the train and test sets are very high, and almost all the values in the weight vector are very close to zero.

Choosing Norms of the weight vector

In sklearn, the regularization term by default scales the L2-norm of W by ‘λ’, but we can also choose the L1-norm: when you set LogisticRegression(penalty = ‘l1’), the regularization term becomes λ times the L1-norm of W, i.e. λ·Σ|wj|.

Let’s see how these norms can affect the final prediction weight vector ‘W’ .

Norm = "l2"  #("l2" is defult norm value)
Lambda = 100
clf = LogisticRegression(penalty=Norm)
clf.fit(X_train, y_train)
evaluate_this_model(clf)

Now change the penalty to the L1 norm, keeping all the other parameters the same, and observe the resulting weight vector in both cases.

Norm = "l1" 
Lambda = 100
clf = LogisticRegression(penalty=Norm,C=1/Lambda)
clf.fit(X_train, y_train)
evaluate_this_model(clf)

Looking at the above two vectors, we can see there are more zeros when penalty = ‘l1’. This confirms that the L1-norm causes more sparsity by driving the weights of all less important features to zero.
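Rather than eyeballing the two vectors, one can count the exact zeros; a small check (my addition, not from the original post) might look like this.

# Counting exact zeros in the weight vector to compare L1 vs L2 sparsity
import numpy as np
print("zero weights:", np.sum(clf.coef_[0] == 0), "out of", clf.coef_[0].size)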

Choosing the type of algorithm solver

Sklearn can solve the objective function in different ways: it can use different algorithms for the same optimization. We choose the algorithm through the “solver” parameter, which can take ‘newton-cg’, ‘lbfgs’, ‘liblinear’, ‘sag’, or ‘saga’, each a different algorithmic style for optimizing the objective function. The default solver (in this version) is ‘liblinear’.

Even though the objective is the same, these algorithms differ in which penalties they support, how they perform on small data sets, and how they handle multi-class classification. An immediate question is when to use which solver; here are summary points captured from the sklearn documentation that describe which solver has which advantage.

Captured from sklearn’s documents

For example, if we use solver = ‘saga’, it enables the ‘elasticnet’ penalty, which means we can add a penalty to the loss term that is a combination of the L1 and L2 regularizers.

This combination is controlled with another parameter, “l1_ratio”, whose default value is None. If we set l1_ratio = 1, it is equivalent to setting penalty = ‘l1’; if we set l1_ratio = 0, it is equivalent to penalty = ‘l2’; and we can set any value in between (0, 1), e.g. l1_ratio = 0.8 means the L1-norm has an 80% impact and the remaining 20% comes from the L2-norm. Let’s see a case with this.

Norm = "elasticnet" 
Lambda = 100
algo_style="saga"
clf = LogisticRegression(penalty=Norm, C=1/Lambda, l1_ratio=0.3, solver=algo_style)
clf.fit(X_train, y_train)
evaluate_this_model(clf)
Weight vector when L1-norm contributes 30% in the combination of L1+L2.
Norm = "elasticnet" 
Lambda = 100
algo_style="saga"
clf = LogisticRegression(penalty=Norm, C=1/Lambda, l1_ratio=0.8, solver=algo_style)
clf.fit(X_train, y_train)
evaluate_this_model(clf)
Weight vector when L1-norm contributes 80% in the combination of L1+L2.

As we increase l1_ratio, the sparsity increases, because the impact of the L1-norm grows with l1_ratio.

Choosing the intercept term

“fit_intercept” and “intercept_scaling” are used when you have a specific preference to increase or decrease the impact of the intercept term. When you set fit_intercept = False, we get the weights of an equation that passes through the origin, which means the intercept value in the hyperplane equation is zero.

Lambda = 100
clf = LogisticRegression(C=1/Lambda,
intercept_scaling=0,fit_intercept=False)
clf.fit(X_train, y_train)
#evaluate_this_model(clf)
print("Intercept value is: {} ".format(clf.intercept_))
print("\nAnd weights vaector is : ")
(clf.coef_[0])
weights and intercept values when we set fit_intercept=False

By default sklearn sets fit_intercept = True, so we get an intercept term. Note that we are already regularizing (L1/L2) the weights, so if we want to increase the impact of the intercept term in the final prediction, we can scale it up using “intercept_scaling”; internally this reduces the effect of regularization on the intercept, and in turn the returned weight values shrink.

So, as we increase the intercept_scaling value, the impact of the intercept increases, which indirectly reduces the impact of the weights.

Lambda = 100
clf = LogisticRegression(C=1/Lambda,intercept_scaling=100,fit_intercept=True)
clf.fit(X_train, y_train)
#evaluate_this_model(clf)
print("Intercept value is: {} ".format(clf.intercept_))
print("\nAnd weights vaector is : ")
(clf.coef_[0])
Lambda = 100
clf = LogisticRegression(C=1/Lambda,intercept_scaling=10000000,fit_intercept=True)
clf.fit(X_train, y_train)
#evaluate_this_model(clf)
print("Intercept value is: {} ".format(clf.intercept_))
print("\nAnd weights vaector is : ")
(clf.coef_[0])

Look at the above two cases: when we change intercept_scaling to a higher value, the values in the weight vector decrease, which implies that the impact of the intercept term increases while the weights shrink.

Controlling the iterations

During training, the algorithm tries to minimize the loss. It checks convergence by computing the difference between the loss at the current iteration and the previous one; the threshold on this residual is called the tolerance.

If the algorithm reaches the given tolerance value, it stops training at that iteration. In sklearn the default tolerance is tol = 0.0001. If we give a large ‘tol’ value it will stop early, which may result in poor classification; if we choose a very small value, the algorithm will take more time to converge. We typically use the default value for the tolerance.

We can also choose the maximum number of iterations using the max_iter parameter, which we typically increase when we have a very large amount of training data; the default value is max_iter = 100.

Lambda = 100
clf = LogisticRegression(C=1/Lambda,max_iter =1000, tol=1e-3)
clf.fit(X_train, y_train)
evaluate_this_model(clf)

In the above example, we have given max_iter = 1000 and tol = 1e-3, which means the algorithm is allowed up to 1000 iterations to reach the given convergence tolerance.

Let’s see what happens if we give a much higher tolerance value.

Lambda = 100
clf = LogisticRegression(C=1/Lambda,max_iter =1000, tol=3 )
clf.fit(X_train, y_train)
evaluate_this_model(clf)

Observe the losses and the weight vector values: here I gave tol = 3, which is a significantly larger value, so the algorithm stops in the very first iteration, resulting in higher loss values and zero weights.

Handling imbalanced data sets

Generally, when we have imbalanced data, we need to take care of it by applying techniques like over-sampling or under-sampling; when we use the sklearn library for modelling, we can get a similar balancing effect using the class_weight parameter.

When the data has imbalanced classes, we can set class_weight = ‘balanced’, so that the model re-weights the classes as if it were fitting on balanced data. This parameter also accepts input in dict format, class_weight = {class_label: weight}, where we can explicitly define the weight for each class.

clf = LogisticRegression(class_weight = 'balanced') 
clf.fit(X_train, y_train)
evaluate_this_model(clf)
Model weights and loss when class_weight = ‘balanced’
clf = LogisticRegression(class_weight = None) 
clf.fit(X_train, y_train)
evaluate_this_model(clf)
Model weights and loss when class_weight = None

Here we cannot observe any significant difference between the above two cases because, fortunately, our data set is already balanced. If it were not, we could increase the strength of the weaker class using class_weight, as sketched below.
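For an imbalanced data set, the dict form could look like the following sketch; the 1:3 ratio here is purely illustrative, not a recommendation.

# Up-weighting the minority class explicitly; the 1:3 ratio is illustrative only
clf = LogisticRegression(class_weight={0: 1, 1: 3})
clf.fit(X_train, y_train)
evaluate_this_model(clf)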

Other parameters in sklearn LR

dual:

The objective function we have seen so far is called the primal formulation; there is another formulation of the LR objective using Lagrange multipliers, called the dual formulation. In sklearn we can choose between them using the “dual” parameter. Setting dual = True makes the algorithm solve the dual formulation; by default it is False, meaning the primal formulation is used. We typically prefer dual = False when the number of samples is greater than the number of features. Please note that the dual formulation is only implemented for penalty = ‘l2’ with solver = ‘liblinear’.
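A quick sketch of switching the dual formulation on, remembering the l2/liblinear restriction above; the C value here is an arbitrary choice of mine.

# Dual formulation only works with penalty='l2' and solver='liblinear'
clf = LogisticRegression(dual=True, penalty='l2', solver='liblinear', C=1.0)
clf.fit(X_train, y_train)
evaluate_this_model(clf)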

n_jobs :

This parameter lets the fitting job run in parallel. If you choose n_jobs = 2, two cores in your system work on the task in parallel; if you choose n_jobs = -1, all the cores in the system are used, which helps reduce the computation time.

random_state :

This controls the randomness in the algorithm: the value given to random_state is used as the seed for the random number generator. It ensures that all the randomness involved in the algorithm is reproduced in the same order.

multi_class :

If we have a binary class label, sklearn automatically fits the data with the one-vs-rest (ovr) strategy. If our data has a multi-class label, we can select the “multinomial” option, which internally minimizes the multinomial log-loss.

verbose:

This parameter controls the verbosity of the algorithm: it displays the messages produced during optimization. We can pass an integer value; the larger the value, the more messages we see.

warm_start

As discussed earlier, to determine the best model we need to experiment with multiple values of the hyperparameters and regularization, for example using sklearn’s grid search CV, which fits the estimator repeatedly on the same data set for different values. If we want to reuse the previous fit’s learnings as the starting point for the next fit, we can set warm_start = True; by default it is set to False.

However, experimenting with all these parameters one by one is a big task, so we choose a CV technique provided by sklearn and pass a set of values in a single shot. The CV algorithm returns the best fit from the provided values. Look at the code below.

parameters={'C':[10**-6,10**-5,10**-3,10**-4, 10**-2, 10**-1,10**0, 10**2, 10**3,10**4,10**5,10**6],
'penalty':['l1','l2'],
'tol':[0.0001,1e-4,1e-5,0.01],
'fit_intercept':[True,False],
'intercept_scaling':[0.1,0.01,1,10],
'warm_start': [True,False]
} #Setting all parameters in a single pipeline
clf_log = LogisticRegression(n_jobs=-1)
clf = GridSearchCV(clf_log, parameters, cv=5, scoring='neg_log_loss',
                   return_train_score=True, n_jobs=-1, verbose=5)
clf.fit(X_train, y_train)
train_loss= clf.cv_results_['mean_train_score']
cv_loss = clf.cv_results_['mean_test_score']

After cross-validation, GridSearchCV returns the best fit out of all provided parameters.

clf = clf.best_estimator_
clf
Best fit from GridSearchCV

This best estimator is then trained again with the same parameter values and tested on unseen data.

Please refer to this GitHub link for the detailed code used in this example, and refer here to access sklearn’s official documentation.

Now let’s apply LR on real rainfall data and see how it works

As I started this blog with a pseudo rainfall example, I would like to end with a similar, real rainfall example.

Now let’s work on real rainfall data: train and test an LR model, and interpret the predictions, using the same Python libraries. Here I am considering recorded rainfall data from Australia.

This data was taken from this Kaggle link. The dataset contains daily weather observations from numerous Australian weather stations, with 24 variables. Our objective is to predict whether or not it will rain tomorrow.

You can explore more details about the data at this Kaggle link.

Before applying any model, it is very important to analyze the data to get a better understanding: its size, what kind of features it has, whether the data is balanced or imbalanced, whether there are outliers or missing values, whether any feature scaling or feature transformations are required, and so on. All these steps are typically covered when you perform exploratory data analysis and data pre-processing, which are very important stages before modelling the data. I preprocessed all the data and split it into train and test sets. Here is the code for hyperparameter tuning of logistic regression using sklearn’s GridSearchCV.

#taking different set of values for C where C = 1/λ
parameters={'C':[10**-6,10**-5,10**-4, 10**-2, 10**0, 10**2, 10**3]}
#for plotting on a log scale
import math
log_c = list(map(lambda x: float(math.log(x)), parameters['C']))
#using sklearn's LogisticRegression classifier with L2- norm
clf_log = LogisticRegression(penalty='l2')
# hyperparameter tuning with 5-fold CV using grid search
clf = GridSearchCV(clf_log, parameters, cv=5,scoring='neg_log_loss',return_train_score =True)
clf.fit(X_train, y_train)

train_loss= clf.cv_results_['mean_train_score']
cv_loss = clf.cv_results_['mean_test_score']
#A function defined for plotting CV and train errors
plotErrors(k=log_c,train=train_loss,cv=cv_loss)

Looking at the graph, we can observe how the negative log-loss changes with C during hyperparameter tuning.

clf = clf.best_estimator_
#Training the model with the best value of C
clf.fit(X_train, y_train)

From GridSearchCV, we take the model with the best bias-variance trade-off and train that optimal model.

Now that the training part is done, let us check the model performance on the test data.

#Printing the log-loss for both train and test data
train_loss = log_loss(y_train, clf.predict_proba(X_train)[:,1])
test_loss =log_loss(y_test, clf.predict_proba(X_test)[:,1])


print("Log_loss on train data is :{}".format(train_loss))
print("Log_loss on test data is :{}".format(test_loss))

Looking at the above log-loss values, we can tell the model is not suffering from high bias or high variance, but log-loss alone cannot tell us how good the model is, so we also check the AUC metric.

#Plotting AUC
train_fpr, train_tpr, thresholds = roc_curve(y_train, clf.predict_proba(X_train)[:,1])
test_fpr, test_tpr, thresholds = roc_curve(y_test, clf.predict_proba(X_test)[:,1])

plt.plot(train_fpr, train_tpr, label="train AUC = "+str(auc(train_fpr, train_tpr)))
plt.plot(test_fpr, test_tpr, label="test AUC = "+str(auc(test_fpr, test_tpr)))
plt.legend()
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC for Train and Test data with best_fit")
plt.grid()
plt.show()

Looking at the test AUC, we can say that if we pick one positive and one negative point at random, there is an 87.42% chance the model ranks the positive point higher than the negative one.

clf.coef_[0]
The Weight vector array.

The above array is the final weight vector obtained after the training phase is done.

These values help us interpret feature importance: a feature with a larger weight has more influence on the classification.

feature_weights =sorted(zip(clf.coef_[0],column_names),reverse=True)

Above is the array of features sorted by their weight values; a higher weight value means a more important feature.
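Since a large negative weight is just as influential as a large positive one, ranking by absolute weight is often more informative; a small variation on the author's line (my addition, not from the original post):

# Ranking features by the absolute value of their weights, because a large
# negative weight influences the classification as much as a large positive one
feature_importance = sorted(zip(abs(clf.coef_[0]), column_names), reverse=True)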

Now let’s interpret the model result by sending one new query point with the feature values below.

#Giving one query point here  
MinTemp = 26.2
MaxTemp = 31.7
Rainfall = 2.8
Evaporation = 5.4
Sunshine = 3.5
WindGustDir = "NNW"
WindGustSpeed = 57
WindDir9am = "NNW"
WindDir3pm = "NNW"
WindSpeed9am = 20
WindSpeed3pm = 13
Humidity9am = 81
Humidity3pm = 95
Pressure9am = 1007.2
Pressure3pm = 1006.1
Cloud9am = 7
Cloud3pm = 8
Temp9am = 28.8
Temp3pm = 25.4
RainToday ="Yes"
point = [MinTemp,MaxTemp,Rainfall,
Evaporation,Sunshine,WindGustDir,
WindGustSpeed,WindDir9am,WindDir3pm,
WindSpeed9am,WindSpeed3pm,Humidity9am,
Humidity3pm,Pressure9am,Pressure3pm,
Cloud9am,Cloud3pm,Temp9am,Temp3pm,RainToday]

xq = dict()
for i, name in enumerate(column_names):
    xq[name] = point[i]

# will_rain_fall_for_this_conditions is a function defined to do all the
# pre-processing steps and to predict the output from the classifier
will_rain_fall_for_this_conditions(xq)
Printing the classifier result with confidence value and interpreting the result based on feature importance.

Using the feature weight values, we can present the result to the end user as shown above.

Click here for the complete source code for this example from my GitHub profile; it contains all the code for exploratory data analysis, data pre-processing, and modelling.

References:

  1. https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/
  2. https://www.youtube.com/watch?v=yIYKR4sgzI8&list=PLblh5JKOoLUKxzEP5HA2d-Li7IJkHfXSe
  3. https://scikit-learn.org/stable/user_guide.html

