Deep Dive Into Logistic Regression and Data Pre-Processing

Vardaan Bajaj · Published in Analytics Vidhya · 16 min read · Jun 19, 2020

Contents

In this post, we’ll be going through:

1. The Problem Solved By Logistic Regression

2. Activation Functions

3. Cost Function for Logistic Regression

4. Gradient Descent for Logistic Regression

5. The Need for Data Pre-processing

6. Techniques of Data Pre-processing

7. Solving the Titanic dataset on Kaggle through Logistic Regression

In the previous post, we looked at the Linear Regression algorithm in detail and also solved a problem from Kaggle using Multivariate Linear Regression. In this post, we'll be looking at Logistic Regression, which despite its name is used for classification rather than regression. Later in the post, we'll discuss data pre-processing, since it is an important and usually the most time consuming step of any machine learning task.

Logistic Regression is a type of supervised learning problem where the output values are discrete, i.e. the output is one of a fixed number of classes. Logistic Regression problems are of 2 types:

(i) Binary Classification

(ii) Multi-class Classification

Binary classification is the task of classifying the input data into 2 groups, whereas multi-class classification is the task of classifying the input data into more than 2 groups. Pretty much self-explanatory, right? To understand things, we'll first look at binary classification and later generalize logistic regression to multi-class classification.

Binary Classification

Let's denote the 2 output classes by 0 and 1. Our logistic regression model should not output the values 0 and 1 with 100% confidence: as we know from the previous post, the model should only learn the patterns (general rules) and not go into the specifics of each training example, because if it does, we're probably dealing with overfitting. So we expect our binary classification model to output a value between 0 and 1 for each class, i.e. a probability, with the probabilities of all classes (here 2) adding up to 1. The class with the higher probability is then assigned as the output.

Binary Classification Example

Now, the obvious question arises: how do we make the model output probabilities? In linear regression, we saw that our output was given by the equation y = Wx + b, where y can take any real value. If we use this equation as our baseline, then to convert its output into a probability we need a function that takes any real number as input and outputs a value between 0 and 1. The function we use for this purpose is called the activation function.

Activation Functions

Consider the final output of 0 or 1 as a switch, where 0 is the 'off' state and 1 is the 'on' state. The values that we feed into the activation function can be thought of as current (electricity) values. When the current is above a certain threshold, our switch gets 'activated' into the on state; otherwise, it stays in the off state. Various activation functions mimic this behaviour, and the most common one is the sigmoid activation function.

Sigmoid Activation Function (Source)

For binary classification, the sigmoid function takes the output of the equation y = Wx + b as its input and returns the probability of class 1 (the probability of class 0 is simply one minus this value). From the formula for the sigmoid function shown above, we see that for x = 0 the output is exactly 0.5. So for all values of x > 0, the output switch is activated and the corresponding input is classified as belonging to class 1; otherwise it belongs to class 0.
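As a minimal sketch (the weight, bias and input values here are made up purely for illustration), the sigmoid decision rule looks like this in Python:

import numpy as np

def sigmoid(x):
    # maps any real number to a value strictly between 0 and 1
    return 1 / (1 + np.exp(-x))

W, b = 2.0, -1.0          # hypothetical learned parameters
x = 0.8                   # a single input feature value

probability = sigmoid(W * x + b)                  # P(class = 1), here about 0.65
predicted_class = 1 if probability >= 0.5 else 0  # threshold at 0.5
print(probability, predicted_class)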

Other examples of activation functions are:

(i) Hyperbolic Tangent function (tanh) — tanh(x) is defined as (e^x - e^-x)/(e^x + e^-x) and takes values in the range (-1, 1).

Tanh activation function (Source)

(ii) Rectified Linear Unit function (ReLU) — The ReLU activation function, defined as max(0, x), is deliberately simple: although it is not differentiable at x = 0, it is very cheap to compute, which reduces computational cost significantly. The ReLU activation function is widely used in deep learning problems.

ReLU Activation Function (Source)

(iii) Softmax Function — The softmax activation function works well for multi-class classification problems and computes the probability of occurrence of each class. The class corresponding to the highest probability value is then taken as the output class for the input.
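A minimal NumPy sketch of the tanh, ReLU and softmax functions described above (my own illustration, not taken from any particular library):

import numpy as np

def tanh(x):
    # equivalent to np.tanh(x); output lies in (-1, 1)
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def relu(x):
    # zero for negative inputs, identity for positive inputs
    return np.maximum(0, x)

def softmax(z):
    # subtracting the max keeps the exponentials numerically stable;
    # the outputs are positive and sum to 1, so they behave like class probabilities
    exp_z = np.exp(z - np.max(z))
    return exp_z / exp_z.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical raw scores for 3 classes
print(softmax(scores))               # roughly [0.66, 0.24, 0.10], so class 0 wins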

Now that we know the kind of problem logistic regression deals with and how activation functions work, let's have a look at the cost function for logistic regression.

Cost Function For Logistic Regression

The output of our logistic regression binary classifier is the composition of two functions, the linear regression function y = Wx + b and the sigmoid activation function. If f(x) = Wx + b and S(x) = 1/(1 + e^-x), then the output of the logistic regression binary classifier is S(f(x)).

So far, we only know about the Euclidean distance cost function which we saw in Linear Regression. It is represented by: J = sum_over_all_training_examples((y_hat_i - y_i)²), where y_hat_i is the predicted output and y_i is the actual output for training example i. But can we use this same cost function for logistic regression? The answer is no, because the output values in linear regression were real values, whereas in logistic regression y_hat_i is a probability and y_i is a discrete class label. For example, consider y_i = 1. If y_hat_i is 0.9, then the loss for this training example is (1 - 0.9)², i.e. 0.01. If y_hat_i is 0.1, then the loss is (1 - 0.1)², i.e. 0.81. The difference between these two loss values is not large, even though the first prediction gives the correct class (1) while the second gives the wrong class (0). So we need to penalize the probability values which lead to a wrong classification much more harshly than those leading to the right classification. For this reason, we use a different kind of cost function for logistic regression, the binary crossentropy, which lets us pose the learning task as a minimization problem over the parameters W and b.

Binary Crossentropy

In order to gain insight into why this function is chosen as the cost function for logistic regression, let's understand the meaning of binary crossentropy. Information entropy, often just entropy, is a basic quantity in information theory associated with any random variable, which can be interpreted as the average level of "information", "surprise" or "uncertainty" inherent in the variable's possible outcomes. Crossentropy measures how close the predicted probability distribution over the output classes is to the true distribution. Binary crossentropy as a cost function therefore compares the probabilistic output of the sigmoid function with the correct class label (out of 2 classes) and computes the cost. The lower the cost, the better the prediction.

Binary Crossentropy Cost Function (N training examples): J = -(1/N) * sum_over_all_training_examples(y_i * log(y_hat_i) + (1 - y_i) * log(1 - y_hat_i))

We chose binary crossentropy as the cost function for logistic regression to penalize the probability values that lead to misclassification. How does using the log function help, though? Arriving at the log function for this purpose is the work of many statisticians over the years, but I can help explain the intuition. Let's look at the edge cases. Consider that for the class label 1 (y), our logistic regression model outputs a probability of 0.9 (y_hat). Then the loss corresponding to this prediction will be -{1*log(0.9) + 0*log(0.1)}, i.e. about 0.105. If the logistic regression output for this class had instead been 0.1, the loss for this training example would have been -{1*log(0.1) + 0*log(0.9)}, i.e. about 2.30. Comparing these with the squared-error losses above (0.01 and 0.81), we can clearly see that binary crossentropy penalizes confident wrong predictions much more harshly, and the penalty grows without bound as the predicted probability of the correct class approaches 0. So, when we perform the learning operation by minimizing the binary crossentropy loss through gradient descent, the parameter values are updated in a way that yields higher accuracy than with the Euclidean distance cost function.
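A quick sanity check of these numbers (natural logarithms, computed with NumPy; this snippet is my own illustration):

import numpy as np

def binary_crossentropy(y, y_hat):
    # loss for one training example: true label y (0 or 1), predicted probability y_hat
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(binary_crossentropy(1, 0.9))   # ~0.105, small penalty for a good prediction
print(binary_crossentropy(1, 0.1))   # ~2.303, large penalty for a confident wrong prediction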

Gradient Descent for Logistic Regression

Now that we have the equation to represent the output values of logistic regression and a cost function that penalizes the parameter values for misclassifications, we can use Gradient Descent to update our parameters W and b until we reach the values corresponding to the global minimum of the cost function. The Gradient Descent algorithm for Logistic Regression is exactly the same as that of Linear Regression discussed in the previous post; the only difference lies in the formula for the partial derivatives. Let's walk through the whole process of gradient descent with some pseudocode.

W = b = random_values
converged = False
while not converged:
    new_W = W - alpha * partial_derivative_wrt_W(J)
    new_b = b - alpha * partial_derivative_wrt_b(J)
    # stop once both parameters have effectively stopped changing
    if abs(new_W - W) < delta_W and abs(new_b - b) < delta_b:
        converged = True
    W = new_W
    b = new_b

To calculate the partial derivatives of the cost function J wrt the parameters W and b, we first replace y_hat in the cost function with the composite function used to represent the output of the logistic regression binary classifier, as shown above, and then calculate the derivatives. The whole calculation can be found in the answer to this question on math.stackexchange.com. That answer ignores the bias parameter b and only computes the partial derivative of J wrt W, but the partial derivative of J wrt b can be calculated in a similar way.
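To make the result of that derivation concrete, here is a small, self-contained NumPy sketch of gradient descent for logistic regression (my own illustration; the function and variable names are made up). The expressions X.T @ (y_hat - y) / N and mean(y_hat - y) are what the derivatives of the binary crossentropy work out to:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def train_logistic_regression(X, y, alpha=0.1, n_iterations=1000):
    # X: (N, n) feature matrix, y: (N,) vector of 0/1 labels
    N, n = X.shape
    W = np.zeros(n)
    b = 0.0
    for _ in range(n_iterations):
        y_hat = sigmoid(X @ W + b)      # predicted probabilities
        dW = X.T @ (y_hat - y) / N      # partial derivative of J wrt W
        db = np.mean(y_hat - y)         # partial derivative of J wrt b
        W -= alpha * dW
        b -= alpha * db
    return W, b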

Now that we have seen how binary classification with Logistic Regression works, let's extend this idea to multi-class classification. Additionally, we can extend the problem to more than 1 input feature quite easily.

Let's start by looking at the equations for binary classification with 'n' input features. We can use the same equation we used for the multivariate linear regression problem and then take the sigmoid of its value.

f(x) = W1x1 + W2x2 + W3x3 + … + Wnxn + b

S(x) = 1/(1 + e^-x)

Then, the output for binary classification with multiple input features can be represented as S(f(x)).

Consider that there are a total of C classes labelled from 0 to C-1. Then there are 2 ways we can solve this problem.

(i) Let's say that we are currently interested in predicting the class label k. We divide the entire data into 2 classes, class k represented as 1 and all other classes represented as 0, and then solve the binary classification problem. This process is repeated for every class, and at prediction time we pick the class whose classifier is most confident. This approach, known as one-vs-rest (or one-vs-all), works fine but is an inefficient method; a short sketch follows after this list.

(ii) We can use a neural network with C outputs (and 1–2 hidden layers depending on the size and complexity of the data) and use the softmax activation function for the output layer. I’ll be covering more on this in the deep learning series.
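To illustrate approach (i), here is a rough one-vs-rest sketch built on scikit-learn's LogisticRegression (the function names and data layout are placeholders of my own, not code from this post):

import numpy as np
from sklearn.linear_model import LogisticRegression

def one_vs_rest_fit(X, y, num_classes):
    # Train one binary classifier per class: class k vs. everything else
    # X: (N, n) feature matrix, y: (N,) array of integer labels 0..num_classes-1
    classifiers = []
    for k in range(num_classes):
        binary_labels = (y == k).astype(int)
        clf = LogisticRegression().fit(X, binary_labels)
        classifiers.append(clf)
    return classifiers

def one_vs_rest_predict(classifiers, X):
    # Pick the class whose classifier is most confident
    scores = np.column_stack([clf.predict_proba(X)[:, 1] for clf in classifiers])
    return np.argmax(scores, axis=1)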

Now that we’ve developed the understanding of how Logistic Regression works, let’s learn a bit about data pre-processing as not all data is as clean as the data we dealt with in the previous post.

The Need for Data Pre-processing

One thing that may take you by surprise is the fact that, to solve a machine learning problem, almost 90% of the time is spent on data pre-processing. Sounds strange, right? I assume you might have conducted a survey at least once in your life. When we go through the responses, we can see that not all people answer all the questions. Some people may write numbers as words. There are tonnes of variations that can be seen in survey responses. Many real world datasets are prepared through similar kinds of data collection, so it's quite possible that a given dataset has missing values. There are many errors that humans can automatically resolve owing to their intelligence but a machine cannot. Additionally, we do not want our machine learning model to favour any input feature just because of its scale, so we need to bring all the values into a common range. Now that we have intuition about the need for data pre-processing, let's have a look at some techniques one by one:

(i) Taking care of missing data

Some or all of the features of the data may have missing values, which, when loaded as a pandas dataframe, are represented by NaN (Not a Number). For each input feature, we first check the percentage of NaN values in it. If the percentage of NaN values is very large for a feature, we can simply drop that feature from the training process, since there is too little information to fill it in reliably. If we have the training data loaded in a pandas dataframe as 'train_df', we can check the total number of NaN values across all features with the following line of code:

train_df.isna().sum()

This outputs all the features along with the total number of NaN values in each feature. Once we have decided on the features that we want to keep, the next step is to decide how to fill missing values in those features.
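If you prefer to see the missing values as percentages of the dataset, a small variation of my own on the line above does the trick:

missing_percentage = train_df.isna().mean() * 100
print(missing_percentage.sort_values(ascending=False))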

(ii) Filling the missing values

To fill missing values with the mean or median, the feature must have numerical values; for categorical features, the mode (most frequent value) can be used instead. For example, if the 'age' column has missing values, the code for filling them with the median of the 'age' values looks like:

train_df["age"].fillna(train_df["age"].median(skipna=True), inplace=True)

(iii) Encoding categorical variables

Categorical variables are those features in the data which take a discrete number of non-numerical values. Suppose the dataset describes the countries of a subcontinent; then the 'country' feature will only contain a specific set of countries. The distinct values in a feature, along with how often each occurs, can be found with one line of Python:

train_df['country'].value_counts()

This outputs the number of instances of each country in the dataset. Suppose there are 8 distinct countries; then one way to replace the non-numerical values in the country column is by assigning each country an integer (scikit-learn's LabelEncoder assigns the numbers 0 through 7). This process is known as Label Encoding. In Python, the label encoding operation can be performed as follows:

from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
train_df['country_encoded'] = labelencoder.fit_transform(train_df['country'])

This creates a column called 'country_encoded' which holds the numerical representation of each of the 8 countries. Other techniques for encoding categorical variables, like one-hot encoding, are also quite effective, but the main idea here was to introduce you to the concept. More on one-hot encoding here.
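For comparison, a one-hot encoding of the same hypothetical 'country' column can be produced with pandas (a minimal sketch, assuming train_df is already loaded):

import pandas as pd

# each distinct country becomes its own 0/1 column, so no artificial ordering is implied
train_df = pd.get_dummies(train_df, columns=['country'], prefix='country')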

(iv) Feature Scaling

In feature scaling, we convert all the values of our input features to a common range so that no feature is preferred over another during the learning process simply because of its scale. Many machine learning libraries in Python provide tools for feature scaling, but it is often something we have to apply ourselves, so it is worth mentioning here.
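As a minimal sketch of doing this by hand with scikit-learn's StandardScaler (assuming a numerical 'age' column in train_df):

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# each scaled value becomes (value - column mean) / (column standard deviation)
train_df[['age']] = scaler.fit_transform(train_df[['age']])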

If you've made it this far in the post, you should be proud of yourself. We explored each of the elements of logistic regression in detail, learnt about various activation functions, understood why data pre-processing is important and looked at some of the ways to pre-process data. Now, to consolidate all our concepts and gain confidence in the topics covered in this post, let's solve the Titanic dataset on Kaggle using the Logistic Regression binary classifier. This problem was part of a competition, so even if we get 'average' results, we should be in good shape for solving any of the logistic regression problems we encounter in the future. I'm talking about average results because there are many techniques that perform better than logistic regression, which we'll see in later posts, and they come with practice.

Problem statement

Click here for the dataset. The given data has 12 features. The training dataset has 891 examples, and the data on which we have to submit predictions has 418 examples. This problem can be stated as a logistic regression problem as:

Y = sigmoid(W1x1 + W2x2 + W3x3 + W4x4 + W5x5 + W6x6 + W7x7 + W8x8 + W9x9 + W10x10 + W11x11 + W12x12 + b)

First of all, let's import all the Python packages for data pre-processing, mathematical calculations, plotting graphs and performing logistic regression.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, classification_report, precision_score, recall_score
from sklearn.metrics import confusion_matrix, precision_recall_curve, roc_curve, auc, log_loss
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Now, let's look at a few examples from the train and test datasets.

train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")
train_df.head()
test_df.head()

From the above outputs, it's evident that we have to predict whether a passenger survived on the Titanic or not, given their details. Now, before actually applying the logistic regression binary classifier to this data, let's move on to the data pre-processing phase, which will consume most of our time.

Let’s start by checking missing values.

train_df.isna().sum()

We can see that out of 891 records, 177 records for age are missing, 687 records for cabin are missing and 2 records for embarked are missing, which in percentages can be represented as:

print("Percentage missing age values: " + str((train_df['Age'].isna().sum()/train_df.shape[0])*100))
print("Percentage missing cabin values: " + str((train_df['Cabin'].isna().sum()/train_df.shape[0])*100))
print("Percentage missing embarked values: " + str((train_df['Embarked'].isna().sum()/train_df.shape[0])*100))

We observe that close to 80% of the values in the Cabin column are missing, so we can't confidently fill them in; we therefore drop the entire Cabin column. To decide which values to fill in for age, let's see how the age values are distributed across the dataset.

age_distribution = dict(train_df["Age"].value_counts())
lists = sorted(age_distribution.items())
x, y = zip(*lists)
plt.plot(x, y)
plt.show()

We can see that the age distribution is noticeably skewed rather than symmetric, so the median is a better choice than the mean for filling the missing age values. On the other hand, the Embarked column only has 3 distinct values, so we fill the 2 missing values with the most frequent value, i.e. the mode of the Embarked column.

train_data = train_df.copy()
train_data["Age"].fillna(train_df["Age"].median(skipna=True), inplace=True)
train_data["Embarked"].fillna(train_df["Embarked"].value_counts().idxmax(), inplace=True)
train_data.drop('Cabin', axis=1, inplace=True)

We can now see that all the missing values are gone.

train_data.isna().sum()

Now, let’s have a look at the current state of the training data.

train_data.head()

There are still a few things we need to fine-tune. We don't need the Name and Ticket columns at all because they clearly have no role to play in the predictions (unless you can make your machines learn astrology). Next, we need to encode the categorical columns, i.e. Pclass, Sex and Embarked. Even though Pclass has integer values, these integers represent the ticket class the passenger travelled in on the Titanic, which is a category rather than a quantity. We use one-hot encoding to encode these columns.

training = pd.get_dummies(train_data, columns=["Pclass", "Embarked", "Sex"])
training.drop('Sex_female', axis=1, inplace=True)
training.drop('PassengerId', axis=1, inplace=True)
training.drop('Name', axis=1, inplace=True)
training.drop('Ticket', axis=1, inplace=True)
final_train = training
final_train.head()

Now, our training set looks good. Let’s check for missing values in the test set and apply all the transformations on it that we did for the training set.

test_df.isna().sum()
test_data = test_df.copy()
test_data["Age"].fillna(train_df["Age"].median(skipna=True), inplace=True)
test_data["Fare"].fillna(train_df["Fare"].median(skipna=True), inplace=True)
test_data.drop('Cabin', axis=1, inplace=True)
testing = pd.get_dummies(test_data, columns=["Pclass", "Embarked", "Sex"])
testing.drop('Sex_female', axis=1, inplace=True)
testing.drop('PassengerId', axis=1, inplace=True)
testing.drop('Name', axis=1, inplace=True)
testing.drop('Ticket', axis=1, inplace=True)
final_test = testing
final_test.head()

Now, for a small dataset like this, we are left with too many features. In the previous post, we used only 8 features, so let's try selecting the top 8 features here as well. Feature ranking with recursive feature elimination (RFE) helps us do this; it is available as a class in scikit-learn. More on RFE can be found here. But for the models fitted inside this recursive procedure to converge well, we need to scale the numerical values in the training data first, since RFE does not apply any scaling itself, and unscaled features can lead to bad feature selection.

sc = StandardScaler()
final_train[["Age", "Fare"]] = sc.fit_transform(final_train[["Age", "Fare"]])
# reuse the scaler fitted on the training data so the test set is scaled consistently
final_test[["Age", "Fare"]] = sc.transform(final_test[["Age", "Fare"]])
final_train.head()
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
cols = ["Age", "SibSp", "Parch", "Fare", "Pclass_1", "Pclass_2", "Pclass_3", "Embarked_C", "Embarked_Q", "Embarked_S", "Sex_male"]
X = final_train[cols]
y = final_train['Survived']
model = LogisticRegression()
# selecting top 8 features
rfe = RFE(model, n_features_to_select=8)
rfe = rfe.fit(X, y)
print('Top 8 most important features: ' + str(list(X.columns[rfe.support_])))

Now that we have these features, let's finally go ahead and apply Logistic Regression to our training data.

selected_features = ['Age', 'SibSp', 'Pclass_1', 'Pclass_3', 'Embarked_C', 'Embarked_Q', 'Embarked_S', 'Sex_male']
X = final_train[selected_features]
y = final_train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.15, random_state=2)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
y_pred_proba = logreg.predict_proba(X_test)[:, 1]
print('Train/Test split results:')
print("Logistic Regression accuracy is: " + str(accuracy_score(y_test, y_pred)))
print("Logistic Regression log_loss is: " + str(log_loss(y_test, y_pred_proba)))
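Since cross_val_score was already imported, here is a quick optional check (my addition, not part of the original notebook) that the accuracy holds up across different splits of the training data:

# 10-fold cross-validation on the full training set gives a more robust accuracy estimate
scores = cross_val_score(LogisticRegression(), X, y, cv=10, scoring='accuracy')
print('Cross-validated accuracy: ' + str(scores.mean()))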

We get around 78% accuracy, which is decent. When I submitted my submission file on Kaggle by predicting the Survived values on the test dataset they provided, I got a public score of 0.7703, which is good considering the fact that we applied nothing other than logistic regression. The way we processed data helped us in getting this score. Without data pre-processing, our results would have been way worse. The entire code for this post can be found here.

Phew!! This was a long post. Thank you all for bearing with me and making it this far. Believe me, the strategies we went through in this article are used by Data Scientists across the world on a daily basis; I used them myself during my internship. In the next post, we'll have a detailed look at the underfitting and overfitting problems in machine learning and see techniques to address them.

