TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Understand & Implement Logistic Regression in Python

Do Lee
14 min readJun 15, 2021

--

Photo by Meg Boulden on Unsplash

Objective

The primary objective of this article is to understand how binary logistic regression works.

  • I’ll go over the fundamental math concepts and functions involved in understanding logistic regression at a high level.
  • In the process, I’ll go over two well-known gradient approaches (ascent/descent) to estimate the 𝜃 parameters using log-likelihood and cross-entropy loss functions.
  • I’ll use Kaggle’s Titanic dataset to create a logistic regression model from scratch to predict passenger survival. The train.csv and test.csv files are available on Kaggle’s Titanic data page. To expedite the model build, I’ll only use three features, age (continuous), pclass (1st Class = 1, 2nd Class = 2, and 3rd Class = 3), and sex (female = 1 and male = 0).
Figure 0: Image by Author; Data Source: Kaggle’s Titanic Data

Table of Contents

  • Algorithm Synopsis
  • Math Concepts & Functions
  • Code in Action
  • Final Thoughts

Algorithm Synopsis

Logistic regression, a classification algorithm, outputs predicted probabilities for a given set of instances with features paired with optimized 𝜃 parameters plus a bias term. The parameters are also known as weights or coefficients. The probabilities are turned into target classes (e.g., 0 or 1) that predict, for example, success (“1”) or failure (“0”). Although I’ll be closely examining a binary logistic regression model, logistic regression can also be used to make multiclass predictions.

How do we take linearly combined input features and parameters and make binary predictions? The answer is natural-logarithm (log base e). More specifically, log-odds. The linearly combined input features and parameters are summed to generate a value in the form of log-odds. Then, the log-odds value is plugged into the sigmoid function and generates a probability. At its core, like many other machine learning problems, it’s an optimization problem. The best parameters are estimated using gradient ascent (e.g., maximizing log-likelihood) or descent (e.g., minimizing cross-entropy loss), where the chosen objective (e.g., cost, loss, etc.) function determines the gradient approach.

Math Concepts & Functions

For everything to be more straightforward, we have to dive deeper into the math. In logistic regression, the sigmoid function plays a key role because it outputs a value between 0 and 1 — perfect for probabilities. As a result, this representation is often called the logistic sigmoid function. Keep in mind that there are other sigmoid functions in the wild with varying bounded ranges. And because the response is binary (e.g., True vs. False, Yes vs. No, Survived vs. Not Survived), the response variable will have a Bernoulli distribution. Furthermore, each response outcome is determined by the predicted probability of success, as shown in Figure 5.

About Math Notations: The lowercase “iwill represent the row position in the dataset while the lowercase “jwill represent the feature or column position in the dataset. You will also come across lowercase bolded non-italic “x”. This represents a feature vector. More specifically, when “i” is accompanied by “x” (xi), as shown in Figures 5, 6, 7, and 9, this represents a vector (an instance/a single row) with all the feature values.

When you see “i” and “j” with lowercase italic “x” (xi,j) in Figures 8 and 10, the value is a representation of a jth feature in an ith (a single feature vector) instance. The number of features (columns) in the dataset will be represented as “n” while number of instances (rows) will be represented by the “m” variable.

Sigmoid Function

In Figure 1, the first equation is the sigmoid function, which creates the S curve we often see with logistic regression. The output equals the conditional probability of y = 1 given x, which is parameterized by 𝜃. In this case, the x is a single instance (an observation in the training set) represented as a feature vector.

Figure 1: Sigmoid function [sigma(z)] and [1 — sigma(z)]; Image by Author

Each feature in the vector will have a corresponding 𝜃 parameter estimated using an optimization algorithm. I’ll talk more about this later in the gradient ascent/descent section. It’s also important to note that by solving for p in log(odds) = log(p/(1-p)) we get the sigmoid function with z = log(odds).

Figure 2: Sigmoid Curve; Image by Author

proba1, proba0 = [], []
feature_example = [i for i in range(-10, 11)]
for instance in feature_example:
p1 = 1/(1 + np.exp(-instance)) # sigmoid function
p0 = 1 — p1
proba1.append(p1)
proba0.append(p0)
plt.figure(figsize=(12,6))
plt.plot(feature_example, proba1, marker=’.’, label=’predict proba “1”’)
plt.plot(feature_example, proba0, marker=’.’, linestyle=’dashed’, label=’predict proba “0”’)
plt.title(‘Sigmoid Curve’)
plt.xlabel(‘feature_example (in log-odds)’)
plt.ylabel(‘probability’)
plt.legend(prop={‘size’: 12});

Odds and Log-Odds

The estimated y value (y-hat) using the linear regression function represents log-odds. The process of wrapping log around odds or odds ratios is called the logit transformation. The key takeaway is that log-odds are unbounded (-infinity to +infinity). However, we need a value to fall between 0 and 1 to predict probability. So, in essence, log-odds is the bridge that closes the gap between the linear and the probability form.

Figure 3: Linear regression function and y-estimate as log-odds; Image by Author

What is log-odds? We first need to know the definition of odds — the probability of success divided by failure, P(success)/P(failure). For example, the probability of tails and heads is both 0.5 for a fair coin. Therefore, the odds are 0.5/0.5, and this means that odds of getting tails is one. If we were to use a biased coin in favor of tails, where the probability of tails is now 0.7, then the odds of getting tails is 2.33 (0.7/0.3). As shown in Figure 3, the odds are equal to p/(1-p). The next step is to transform odds into log-odds.

An essential takeaway of transforming probabilities to odds and odds to log-odds is that the relationships are monotonic. When probability increase, the odds increase, and vice versa. When odds increase, so do log-odds and vice versa. In Figure 4, I created two plots using the Titanic training set and Scikit-Learn’s logistic regression function to illustrate this point. We can clearly see the monotonic relationships between probability, odds, and log-odds.

Why is this important? The higher the log-odds value, the higher the probability. In Figure 2, we can see this pretty clearly. Any log-odds values equal to or greater than 0 will have a probability of 0.5 or higher. And this is due to the monotonic relationships we observed in Figure 4. This will also come in handy when we are interpreting the estimated parameters.

Figure 4: Probability vs. Odds & Odds vs. Log Odds; Image by Author
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
import matplotlib.pyplot as plt
plt.style.use(‘ggplot’)
%matplotlib inline
logreg = LogisticRegression(random_state=0)
model_pipe = make_pipeline(StandardScaler(), logreg)
X = train[[‘age’,’pclass’,’sex’]]
y = train[‘survived’]
model_pipe.fit(X, y)
model_pipe.predict_proba(X)
y_pred_proba_1 = model_pipe.predict_proba(X)[:,1]
y_pred_proba_0 = model_pipe.predict_proba(X)[:,0]
odds = y_pred_proba_1 / y_neg_pred_proba_0
log_odds = np.log(odds)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16,6))
ax1.set_title(‘Probability vs. Odds’)
ax1.set(xlabel=’Probability’, ylabel=’Odds’)
ax1.scatter(y_pos_pred_proba, odds)
ax2.set_title(‘Odds vs. Log Odds’)
ax2.set(xlabel=’Odds’, ylabel=’Log Odds’)
ax2.scatter(odds, log_odds);

Sigmoid Combined Probability, Likelihood, and Log-likelihood

We covered a lot of ground, and we are now at the last mile of understanding logistic regression at a high level. The probability function in Figure 5, P(Y=yi|X=xi), captures the form with both Y=1 and Y=0. Based on Y (0 or 1), one of the terms in the dot product becomes 1 and drops off. This combined form becomes crucial in understanding likelihood.

Figure 5: Probability function as a combined form with Y=0 and Y=1 terms; Image by Author

Let’s walk through how we get likelihood, L(𝜃). For every instance in the training set, we calculate the log-odds using randomly estimated parameters (𝜃’s) and predict the probability using the sigmoid function corresponding to a specific binary target variable (0 or 1). The multiplication of these probabilities would give us the probability of all instances and the likelihood, as shown in Figure 6. It is important to note that likelihood is represented as the likelihood of 𝜃 while probability is designated as the probability of Y.

Figure 6: Likelihood function; Source Reading: Chris Piech, CS109 @ Stanford University; Image by Author

As a result, by maximizing likelihood, we converge to the optimal parameters. In other words, maximizing the likelihood to estimate the best parameters, we directly maximize the probability of Y. This is called the Maximum Likelihood Estimation (MLE). Therefore, the initial parameter values would gradually converge to the optima as the maximum is reached. The convergence is driven by the optimization algorithm — gradient ascent/descent.

How does log-likelihood fit into the picture? By taking the log of the likelihood function, it becomes a summation problem versus a multiplication problem. We know that log(XY) = log(X) + log(Y) and log(X^b) = b * log(X). Therefore, we can easily transform likelihood, L(𝜃), to log-likelihood, LL(𝜃), as shown in Figure 7. Because likelihood to log-likelihood is a monotonic transformation, maximizing log-likelihood will also produce the best parameters — this is called the Maximum Log-Likelihood. Again, keep in mind that it is the log-likelihood of 𝜃, which we are optimizing.

Figure 7: Log-likelihood function; Source Reading: Chris Piech, CS109 @ Stanford University; Image by Author

How do we reach the maximum using log-likelihood? We take the partial derivative of the log-likelihood function with respect to each 𝜃 parameter. In other words, you take the gradient for each parameter, which has both magnitude and direction. For example, in the Titanic training set, we have three features plus a bias term with x0 equal to 1 for all instances. The partial derivative in Figure 8 represents a single instance (i) in the training set and a single 𝜃 parameter (j). The x (i, j) represents a single feature in an instance paired with its corresponding 𝜃 (i, j)parameter. As a result, for a single instance, a total of four partial derivatives — bias term, pclass, sex, and age — are created. These make up the gradient vector.

The η is the learning rate determining how big a step the gradient ascent algorithm will take for each iteration. We don’t want the learning rate to be too low, which will take a long time to converge, and we don’t want the learning rate to be too high, which can overshoot and jump around. The learning rate is also a hyperparameter that can be optimized, but I’ll use a fixed learning rate of 0.1 for the Titanic exercise.

Once you have the gradient vector and the learning rate, two entities are multiplied and added to the current parameters to be updated, as shown in the second equation in Figure 8. Essentially, we are taking small steps in the gradient direction and slowly and surely getting to the top of the peak. This updating step repeats until the parameters converge to their optima — this is the gradient ascent algorithm at work. Because the log-likelihood function is concave, eventually, the small uphill steps will reach the global maximum.

Figure 8: Partial derivative for each 𝜃 parameter and gradient ascent algorithm; Image by Author

What about minimizing the cost function? We often hear that we need to minimize the cost or the loss function. It is also called an objective function because we are trying to either maximize or minimize some numeric value. In the context of a cost or loss function, the goal is converging to the global minimum. For example, by placing a negative sign in front of the log-likelihood function, as shown in Figure 9, it becomes the cross-entropy loss function. The goal is to minimize this negative function using the gradient descent algorithm (second equation in Figure 10). This process is the same as maximizing the log-likelihood, except we minimize it by descending to the minimum.

Figure 9: Cross-entropy loss function or negative log-likelihood (NLL) function; Source Reading: Machine Learning: A Probabilistic Perspective by Kevin P. Murphy — Chapter 8; Image by Author

Once the partial derivative (Figure 10) is derived for each 𝜃 parameter, the form is the same as in Figure 8. The big difference is the subtraction term, where it is re-ordered with sigmoid predicted probability minus actual y (0 or 1). This term is then multiplied by the x (i, j) feature. The process is the same as the process described in the gradient ascent section above. Thus, we will end up with four partial derivatives for every instance in the training set. And using the gradient descent algorithm, we update the parameters until they converge to their optima.

Figure 10: Partial derivative for each theta parameter and gradient descent algorithm; Source Reading: Speech and Language Process by Dan Jurafsky and James H. Martin (3rd Edition Draft) — Chapter 5; Image by Author

Thankfully, the cross-entropy loss function is convex and naturally has one global minimum. Eventually, with enough small steps in the direction of the gradient, which is the steepest descent, it will end up at the bottom of the hill.

Code in Action

Now that we have reviewed the math involved, it is only fitting to demonstrate the power of logistic regression and gradient algorithms using code. Let’s start with our data. We have the train and test sets from Kaggle’s Titanic Challenge. As mentioned earlier, I’m only using three features — age, pclass, and sex — to predict passenger survival.

First, we need to scale the features, which will help with the convergence process. I’ll be using the standardization method to scale the numeric features. In standardization, we take the mean for each numeric feature and subtract the mean from each value. This term is then divided by the standard deviation of the feature. Next, we’ll add a column with all ones to represent x0. This is for the bias term.

Feature Standardization Code by Author

With the above code, we have prepared the train input dataset. We now know that log-odds is the output of the linear regression function, and this output is the input in the sigmoid function. The only missing pieces are the 𝜃 parameters. Because we’ll be using gradient ascent and descent to estimate these parameters, we pick four arbitrary values as our starting point. I’ll be using four zeroes as the initial values. We also need to define the sigmoid function in code because this will generate our probabilities.

# sigmoid function where z = log-odds
def sigmoid(z):
predict_proba = 1 / (1 + np.exp(-z))
return predict_proba

We have all the pieces in place. Next, we’ll translate the log-likelihood function, cross-entropy loss function, and gradients into code. We also need to determine how many times we want to go through the training set. We need to define the number of epochs (designated as n_epoch in code below, which is a hyperparameter helping with the learning process). What is an epoch? In the context of gradient ascent/descent algorithm, an epoch is a single iteration, where it determines how many training instances will pass through the gradient algorithm to update the 𝜃 parameters (shown in Figures 8 and 10).

Therefore, we commonly come across three gradient ascent/descent algorithms: batch, stochastic, and mini-batch. For the Titanic exercise, I’ll be using the batch approach. This means, for every epoch, the entire training set will pass through the gradient algorithm to update the 𝜃 parameters.

Gradient Ascent Code by Author

Here you have it! By maximizing the log-likelihood through gradient ascent algorithm, we have derived the best 𝜃 parameters for the Titanic training set to predict passenger survival.

  • 𝜃0: -0.89362715 (bias term)
  • 𝜃1: -1.41685703 (pclass)
  • 𝜃2: 1.24119596 (sex)
  • 𝜃3: -0.60707722 (age)

Let’s visualize the maximizing process. In Figure 11, we can see that we reached the maximum after the first epoch and continues to stay at this level.

Figure 11: Maximizing log-likelihood function per epoch iteration; Image by Author
x_axis = [i for i in range(1, 11)]
plt.figure(figsize=(14,6))
plt.title(‘Maximizing Log-Likelihood’)
plt.xticks(x_axis)
plt.xlabel(‘epoch iteration’)
plt.ylabel(‘log-likelihood’)
plt.plot(x_axis, log_likelihood_vals, marker=’o’)
plt.tight_layout()

We can also visualize the 𝜃 parameters converging for every epoch iteration. As we saw in Figure 11, log-likelihood reached the maximum after the first epoch; we should see the same for the parameters. In Figure 12, we see the parameters converging to their optimum levels after the first epoch, and the optimum levels are maintained as the code iterates through the remaining epochs. Iterating through the training set once was enough to reach the optimal 𝜃 parameters. Let’s examine what is going on during each epoch interval.

The plots on the right side in Figure 12 show 𝜃 parameter values quickly moving towards their optima. As it continues to iterate through the training instances in each epoch, the parameter values oscillate up and down (epoch intervals are denoted as black dashed vertical lines). At the end of each epoch, we end with the optimal parameter values and these values are maintained.

Figure 12: 𝜃 parameter convergence per epoch (left) and per instance (right); Image by Author
Figure 12 — Code by Author

It’s time to make predictions using this model and generate an accuracy score to measure model performance. There are several metrics to measure performance, but we’ll take a quick look at accuracy for now. The code below generated an accuracy score of 79.8%.

from sklearn.metrics import accuracy_scoredef make_predictions(X, thetas):
X = X.copy()
predictions = []
for x_row in X:
log_odds = sum(thetas * x_row[:4])
pred_proba = sigmoid(log_odds)

# if probability >= 0.5, predicted class is 1
predictions.append(1 if pred_proba >= 0.5 else 0)
return predictions
y_train = X_train[:, -1]
y_predict = make_predictions(X_train[:, :4], thetas)
accuracy_score(y_train, y_predict)

What about cross-entropy loss function? The results from minimizing the cross-entropy loss function will be the same as above. There are only a few lines of code changes and then the code is ready to go (see ‘# changed’ in code below). Here’s the code.

Gradient Descent Code by Author

Let’s take a look at the cross-entropy loss function being minimized using gradient descent. We reached the minimum after the first epoch, as we observed with maximum log-likelihood. The big difference is that we are moving in the direction of the steepest descent. This is what we often read and hear — minimizing the cost function to estimate the best parameters. We are now equipped with all the components to build a binary logistic regression model from scratch. I hope this article helped you as much as it has helped me develop a deeper understanding of logistic regression and gradient algorithms.

Figure 13: Minimizing cross-entropy loss function; Image by Author
x_axis = [i for i in range(1, 11)]
plt.figure(figsize=(14,6))
plt.title(‘Minimizing Cross-Entropy Loss’)
plt.xticks(x_axis)
plt.xlabel(‘epoch iteration’)
plt.ylabel(‘cross-entropy loss’)
plt.plot(x_axis, ce_loss_vals, marker=’o’)
plt.tight_layout()

Final Thoughts

In this article, my goal was to provide a solid introductory overview of the binary logistic regression model and two approaches in estimating the best 𝜃 parameters. As we saw in the Titanic example, the main obstacle was estimating the optimal 𝜃 parameters to fit the model and using the estimates to predict passenger survival. We examined the (maximum) log-likelihood function using the gradient ascent algorithm. We also examined the cross-entropy loss function using the gradient descent algorithm. Of course, you can apply other cost functions to this problem, but we covered enough ground to get a taste of what we are trying to achieve with gradient ascent/descent.

Where do we go from here? There are several areas that we can explore in terms of improving the model. We can start with the learning rate. The learning rate is a hyperparameter and can be tuned. In many cases, a learning rate schedule is introduced to decrease the step size as the gradient ascent/descent algorithm progresses forward.

If the dataset is massive, the batch approach might not be ideal. Understanding the mechanics of stochastic and mini-batch gradient descent algorithms will be much more helpful. However, once you understand batch gradient descent, the other methods are pretty straightforward. There are also different feature scaling techniques in the wild beyond the standardization method I used in this article.

If you encounter any issues or have feedback for me, feel free to leave a comment. Thanks for reading!

--

--

TDS Archive
TDS Archive

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Do Lee
Do Lee