Unveiling the Power of Logistic Regression: A Practical Guide to Predictive Modeling

Rayyan Physicist
7 min read · Jun 1, 2024


Logistic regression is a statistical method used in various fields like medicine, social sciences, and machine learning to predict the probability of a binary outcome. In simpler terms, it’s a way to predict if something will happen or not, based on given data.

Brief Overview:

What is Logistic Regression?

Logistic regression is a supervised learning algorithm. Unlike linear regression, which predicts continuous values, logistic regression is used for classification problems where the outcomes are categorical. Most commonly, these outcomes are binary, meaning there are only two possible results, such as:

  • Yes/No
  • Success/Failure
  • Win/Lose

Common Use Cases of Logistic Regression:

Logistic regression is a versatile tool used across various industries to solve classification problems. Here are some common use cases where logistic regression is applied:

1. Healthcare: Disease Diagnosis:

  • Predicting the Presence of Diseases: Logistic regression is used to predict the likelihood of a patient having a particular disease based on symptoms, test results, and demographic data. For example, it can help determine the probability of heart disease based on factors like age, cholesterol levels, and blood pressure.

2. Finance: Credit Scoring and Fraud Detection:

  • Credit Scoring: Lenders use logistic regression to estimate the probability that a borrower will default on a loan, based on factors such as credit history, income, and existing debt.
  • Fraud Detection: Logistic regression is used to detect fraudulent transactions by analyzing patterns in transaction data. It helps in identifying unusual behavior that may indicate fraud.

3. Marketing: Customer Behavior Prediction:

  • Customer Churn: Businesses use logistic regression to predict whether a customer is likely to stop using a service or product. By understanding the factors that lead to churn, companies can take proactive measures to retain customers.

4. E-commerce: Click-Through Rate (CTR) Prediction:

  • Ad Click Prediction: Logistic regression is used to predict whether a user will click on an online advertisement. This helps in optimizing ad placements and improving the efficiency of online marketing campaigns.

Explanation of Binary Classification:

Binary classification is a type of predictive modeling problem where the goal is to classify instances into one of two possible categories. These categories are often represented as 0 and 1, such as:

  • Spam/Not Spam: In email filtering, classify emails as spam (1) or not spam (0).
  • Disease/No Disease: In medical diagnosis, classify patients as having a disease (1) or not (0).

The objective is to predict which category a new instance belongs to, based on various input features.

Comparison with Linear Regression:

  • Linear Regression: Used for predicting continuous outcomes, like predicting a person’s weight based on their height. The relationship between input variables and the output is modeled with a straight line.
  • Logistic Regression: Used for predicting binary outcomes, like whether a person will buy a product or not. Instead of fitting a straight line, logistic regression uses a logistic function to model the probability of the outcome.

Logistic Function (Sigmoid Function):

The logistic function, also known as the sigmoid function, is used in logistic regression to map predicted values to probabilities. It has an S-shaped curve and is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

where $z$ is a linear combination of the input features; for a single feature, $z = mx + c$ (more generally, $z = w \cdot x + b$ with weights $w$ and bias $b$).

  • It outputs values between 0 and 1.
  • As $z$ approaches positive infinity, the value of the function approaches 1.
  • As $z$ approaches negative infinity, the value of the function approaches 0.
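As a quick illustration, here is a minimal NumPy sketch of the sigmoid function (the function name and sample values are my own, for demonstration only):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued input to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# As z grows large the output approaches 1; as it becomes very negative, 0.
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# [4.53978687e-05 5.00000000e-01 9.99954602e-01]
```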

What is a Decision Boundary?

In logistic regression, the decision boundary is the set of points where the predicted probability equals the classification threshold (typically 0.5). It is the line (in two dimensions) or surface (in higher dimensions) that separates the data points of different classes.

For binary classification:

  • If the predicted probability is greater than or equal to 0.5, the instance is classified as class 1.
  • If the predicted probability is less than 0.5, the instance is classified as class 0.
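Applying this threshold in code is a one-liner; a small self-contained sketch (variable names and values are illustrative):

```python
import numpy as np

probs = 1.0 / (1.0 + np.exp(-np.array([-2.0, 0.3, 1.5])))  # predicted probabilities
labels = (probs >= 0.5).astype(int)  # class 1 if p >= 0.5, else class 0
print(probs.round(3), labels)        # [0.119 0.574 0.818] [0 1 1]
```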

Cost Function in Logistic Regression: Log Loss:

Log loss, also known as logistic loss or cross-entropy loss, is the cost function used in logistic regression. It measures the performance of a classification model whose output is a probability value between 0 and 1. Log loss increases as the predicted probability diverges from the actual label.

Formula:

The log loss for a single instance can be defined as:

$$\ell_i = -\left[\, y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \,\right]$$

For a dataset with N instances, the total log loss is the average of the log loss across all instances:

$$J = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \,\right]$$

$y_i$: The actual binary label for the i-th instance. It can be either 0 or 1, indicating the true class of the instance.

$p_i$: The predicted probability for the i-th instance. This value is between 0 and 1.
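A minimal NumPy sketch of this computation (the function name and toy values are my own):

```python
import numpy as np

def log_loss(y_true, y_pred, eps=1e-15):
    """Average cross-entropy loss; eps guards against log(0)."""
    p = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.3])
print(log_loss(y_true, y_pred))  # confident, correct predictions keep the loss low
```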

Why Use Log Loss in Logistic Regression Instead of Mean Squared Error:

In logistic regression, choosing the appropriate cost function is critical for effective training and accurate predictions. A key reason log loss (cross-entropy loss) is preferred over mean squared error (MSE) is convexity and its impact on optimization. Log loss is a convex function of the model parameters, meaning it has a single global minimum. This property ensures that optimization algorithms like gradient descent can reliably converge to that global minimum, resulting in the best possible model.

In contrast, using MSE in logistic regression leads to a non-convex cost function: the sigmoid transformation creates a non-linear relationship between the parameters and the predicted probabilities, producing multiple local minima. Non-convexity makes it challenging for optimization algorithms to find the global minimum, often resulting in suboptimal models that perform poorly in classification tasks. The convexity of log loss therefore both simplifies the optimization process and yields more reliable parameter estimates, making it the suitable choice for logistic regression.

Gradient descent:

Gradient descent is a fundamental optimization algorithm used in various machine learning models, including logistic regression. Its primary objective is to minimize the cost function iteratively by adjusting the model parameters. Initially, the algorithm starts with arbitrary or predefined parameter values. Subsequently, it computes the gradient of the cost function with respect to each parameter, indicating the direction of the steepest increase of the function. Then, it updates the parameters in the opposite direction of the gradient, with the size of the step determined by the learning rate. This process is repeated until convergence criteria are satisfied, typically when further iterations no longer substantially reduce the cost function. At convergence, the parameters represent the optimized values that minimize the cost function, yielding the best-fitted model.

As the cost function we have:

$$J(w, b) = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \,\right], \qquad p_i = \sigma(w \cdot x_i + b)$$

So, during each iteration of gradient descent, we update the weights and bias as follows:

$$w := w - \alpha \frac{\partial J}{\partial w} = w - \frac{\alpha}{N} \sum_{i=1}^{N} (p_i - y_i)\, x_i$$

$$b := b - \alpha \frac{\partial J}{\partial b} = b - \frac{\alpha}{N} \sum_{i=1}^{N} (p_i - y_i)$$

Where α is the learning rate, which determines the step size of each update. By iteratively updating the weights and bias using these equations, we gradually minimize the cost function and obtain the optimal parameters for our logistic regression model.
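To make these updates concrete, here is a from-scratch NumPy sketch of the training loop (the toy data, learning rate, and iteration budget are assumptions for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D dataset: larger x values tend to belong to class 1.
X = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])

w, b = 0.0, 0.0        # start from arbitrary (here zero) parameter values
alpha = 0.1            # learning rate: the step size of each update
for _ in range(5000):  # fixed iteration budget as a simple stopping rule
    p = sigmoid(w * X + b)     # predicted probabilities for all instances
    dw = np.mean((p - y) * X)  # dJ/dw
    db = np.mean(p - y)        # dJ/db
    w -= alpha * dw            # step opposite the gradient
    b -= alpha * db

print(-b / w)  # learned decision boundary in x; lands between the classes (~2.25)
```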

Regularization:

Regularization, broadly speaking, is a technique used in machine learning to reduce the risk of overfitting by adding a penalty term to the model’s objective function. This penalty term discourages the model from learning overly complex patterns from the training data.

L1 Regularization (Lasso):

L1 regularization adds a penalty term to the model’s cost function that is proportional to the absolute values of the model’s coefficients.
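In symbols, with $\lambda$ denoting the regularization strength (my notation, building on the cost function $J$ above):

$$J_{L1}(w, b) = J(w, b) + \lambda \sum_{j} |w_j|$$

Because the absolute-value penalty can drive some coefficients exactly to zero, L1 regularization also acts as a form of feature selection.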

L2 Regularization (Ridge):

L2 regularization adds a penalty term to the model’s cost function that is proportional to the square of the model’s coefficients.
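In the same notation:

$$J_{L2}(w, b) = J(w, b) + \lambda \sum_{j} w_j^2$$

The squared penalty shrinks coefficients toward zero but rarely makes them exactly zero.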

The choice between L1 and L2 regularization (or a combination of both, known as Elastic Net regularization) depends on factors such as the specific characteristics of the data and the interpretability of the model.
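In scikit-learn, the penalty type is selected with the penalty argument of LogisticRegression, and C is the inverse of the regularization strength; a brief sketch (the synthetic dataset and the values of C are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# L2 (ridge) is the default penalty; smaller C means stronger regularization.
ridge = LogisticRegression(penalty="l2", C=1.0).fit(X, y)

# L1 (lasso) requires a solver that supports it, e.g. liblinear or saga.
lasso = LogisticRegression(penalty="l1", C=0.1, solver="liblinear").fit(X, y)

print((lasso.coef_ == 0).sum(), "coefficients driven exactly to zero by L1")
```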

Implementation of Logistic Regression in Python:
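Below is a minimal end-to-end sketch using scikit-learn; the synthetic dataset from make_classification and the default hyperparameters are stand-ins for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data (an illustrative stand-in for a real dataset).
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the model on the training split.
model = LogisticRegression()
model.fit(X_train, y_train)

# Evaluate on the held-out test split.
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))  # accuracy, precision, recall, f1-score
```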

Accuracy, precision, recall, and F1-score are the metrics used to evaluate the model; we will discuss these in more detail in another article.

Challenges and limitations:

Limited Expressiveness:

  • Logistic regression is a linear classifier, which means it can only learn linear decision boundaries. It may struggle with complex, non-linear relationships in the data.

Sensitive to Outliers:

  • Logistic regression is sensitive to outliers: influential observations can distort the estimated coefficients and degrade the model’s performance.

Imbalanced Data:

  • Logistic regression may perform poorly on imbalanced datasets, where one class is significantly more prevalent than the other. It tends to be biased towards the majority class, leading to poor predictions for the minority class.

Limited to Binary Classification:

  • Logistic regression is inherently a binary classifier and cannot directly handle multi-class classification problems. While there are extensions like multinomial logistic regression, logistic regression is not as flexible as some other algorithms for multi-class classification tasks.

Hope it helps!
Feel free to reach out with any suggestions or queries: https://www.linkedin.com/in/md-rayyan/
