Advanced Techniques for Logistic Regression and Classification — Part 2

Vincent Favilla
12 min read · Jun 2, 2023


View the accompanying Colab notebook.

In part one of this series on advanced regression techniques, we explored several techniques for logistic regression, including handling nonlinear relationships, addressing multicollinearity, feature scaling and normalization, and handling categorical variables. This time around, we’ll focus on additional techniques: feature selection, hyperparameter tuning, model evaluation metrics, ensemble methods, and regularization path visualization.

While this tutorial is part of my logistic regression series, you’ll find that most of the topics here apply to any classification problem. Let’s dive in!

Feature Selection Techniques

Selecting relevant features for logistic regression is crucial for building accurate and interpretable models. There are various feature selection techniques that can be employed:

  • Filter methods: These techniques rank features based on their individual relationship with the target variable, such as correlation or mutual information. Examples include Pearson’s correlation coefficient and the chi-squared test (a quick sketch follows this list).
  • Wrapper methods: These techniques evaluate feature subsets by training a model on each subset and assessing its performance. Examples include recursive feature elimination and forward selection.
  • Embedded methods: These techniques incorporate feature selection as part of the model training process. Examples include Lasso (L1 regularization) and decision trees.
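
To make the filter category concrete, here’s a minimal sketch of a filter method using scikit-learn’s SelectKBest with mutual information; keeping the top 5 features is an arbitrary choice for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Score each feature independently against the target and keep the top 5
selector = SelectKBest(score_func=mutual_info_classif, k=5)
X_selected = selector.fit_transform(X, y)

# Print the names of the selected features
print("Selected features:")
for name in data.feature_names[selector.get_support()]:
    print(name)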

We’ve previously discussed regularization methods and reducing collinearity, so let’s focus on wrapper methods. Here’s an example of using recursive feature elimination with logistic regression in scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# Create a logistic regression model
model = LogisticRegression(max_iter=10000)

# Use RFE to select the top 5 features
rfe = RFE(model, n_features_to_select=5)
rfe.fit(X_train, y_train)

# Get the selected features
selected_features = rfe.support_

# Print the selected features
feature_names = data.feature_names
print("Selected features:")
for i, feature in enumerate(feature_names):
    if selected_features[i]:
        print(feature)
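
From here, you can shrink the feature matrices with the fitted selector and retrain on just those columns; a brief sketch continuing from the variables above:

# Keep only the columns RFE selected
X_train_selected = rfe.transform(X_train)
X_test_selected = rfe.transform(X_test)

# Refit the model on the reduced feature set and check test accuracy
model.fit(X_train_selected, y_train)
print("Test accuracy with 5 features:", model.score(X_test_selected, y_test))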

Hyperparameter Tuning with Grid Search and Randomized Search

Hyperparameter tuning is essential for optimizing the performance of logistic regression models. Two popular techniques for hyperparameter tuning are grid search and randomized search:

  • Grid search: This technique exhaustively searches through a predefined set of hyperparameter values, evaluating each combination using cross-validation.
  • Randomized search: This technique samples a fixed number of hyperparameter combinations from a predefined distribution, evaluating each combination using cross-validation.

We saw this back in part two of the series, but it’s a good time to revisit grid search now that we have more context for understanding logistic regression.

Here’s an example of using grid search in scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# Create a logistic regression model
model = LogisticRegression(max_iter=5000, solver='liblinear')

# Define the parameter grid
# (try different orders of magnitude for C)
param_grid = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

# Perform grid search with 5-fold cross-validation
grid_search = GridSearchCV(model, param_grid, cv=5, scoring='f1')
grid_search.fit(X_train, y_train)

# Get the best hyperparameters
best_model = grid_search.best_estimator_
print(best_model)

Grid search is extremely powerful, but it comes with at least two notable drawbacks. For one, it’s slow: it has to test every combination of hyperparameters you specify. In the example above, we have 6 values of C and 2 penalties, giving 12 combinations, each evaluated with 5-fold cross-validation (60 model fits in total), and I’d consider this a relatively simple grid search compared to many I’ve done before.

Secondly, there’s a small risk of overfitting your model: you may find a set of hyperparameters that perform exceptionally well on the specific cross-validation folds but do not generalize well to unseen data. On the other hand, randomized search samples a random subset of the hyperparameter space, which can help prevent overfitting by not searching the entire space exhaustively.

Let’s take a look at how to implement a randomized search in scikit-learn.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.linear_model import LogisticRegression
import numpy as np

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# Create a logistic regression model
model = LogisticRegression(max_iter=5000, solver='liblinear')

# Define the parameter distribution
param_dist = {
    'C': np.logspace(start=-3, stop=2, num=6),  # 6 values from 1e-3 to 1e2
    'penalty': ['l1', 'l2']
}

# Perform randomized search with 5-fold cross-validation
random_search = RandomizedSearchCV(model, param_dist, cv=5,
                                   scoring='f1', n_iter=6)
random_search.fit(X_train, y_train)

# Get the best hyperparameters
best_model = random_search.best_estimator_
print(best_model)

This will test 6 random combinations of hyperparameters, and you can make your random search more or less exhaustive by changing n_iter.
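
If you want to see exactly which combinations were sampled and how each scored, the fitted search object exposes a cv_results_ dictionary; here’s a quick sketch that assumes pandas is available:

import pandas as pd

# Tabulate the sampled hyperparameter combinations and their mean CV scores
results = pd.DataFrame(random_search.cv_results_)
print(results[['param_C', 'param_penalty', 'mean_test_score']]
      .sort_values('mean_test_score', ascending=False))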

Advanced Model Evaluation Metrics and Techniques

We previously discussed common performance metrics for logistic regression, including precision, recall, F1 score, confusion matrix, and ROC-AUC score. I recommend reviewing that article for a comprehensive understanding of these metrics. In this advanced discussion, I’ll introduce some additional evaluation metrics and techniques, such as the precision-recall curve, cost-sensitive metrics, model calibration, and cross-validation techniques.

Precision-Recall Curve

This curve plots precision against recall for different threshold values, helping you visualize the trade-off between precision and recall and choose an appropriate threshold for your specific problem.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Train a logistic regression model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Calculate predicted probabilities
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate precision-recall curve
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_proba)

# Plot the curve
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.show()

You’ll get a graph like this for the breast cancer dataset:

It’s not the nicest-looking example, but you get the idea. The graph isn’t the most useful part, anyway; now that we’ve calculated the curve, we can set a target recall or precision score. Since we’re dealing with cancer in this example, it makes sense to target a recall of 1.0: we want our model to identify cancer every time it exists, even at the cost of a few false positives.

Here’s how we can do that:

target_recall = 1.0

# `thresholds` has one fewer element than `precision` and `recall`,
# so we drop the last recall value to align the arrays, then take the
# highest threshold that still meets the target recall
idx = np.where(recall[:-1] >= target_recall)[0][-1]
optimal_threshold = thresholds[idx]

print("Optimal decision threshold:", optimal_threshold)

# Outputs...
> Optimal decision threshold: 0.08248430372180598

This means that if we want to do our very best to identify each case of cancer, we should have the model output a positive prediction when its predicted probability is as low as 0.08.

By the way, this is why I always have classifiers output their probabilities rather than their class predictions. When you have access to the probabilities, you can decide on any threshold you want for class predictions. Perhaps the clinic wants to do something like this:

classifications = [
    "immediate follow-up",
    "additional tests recommended",
    "no cancer"
]

conditions = [
    y_pred_proba >= optimal_threshold,
    (y_pred_proba >= 0.01) & (y_pred_proba < optimal_threshold),
    y_pred_proba < 0.01
]

new_labels = np.select(conditions, classifications)

This would flag scores:

  • Above 0.08 as requiring immediate follow-up.
  • Between 0.01 and 0.08 as lower priority (suggesting additional tests).
  • Below 0.01 as safe and not requiring follow-up.

The clinic could even take it a step further and queue their follow-up patients according to their score, ensuring that those with the highest risk are seen first.

Having access to the underlying predicted probabilities allows for this kind of flexibility in decision making. Each organization or individual can tailor their use of the model to suit their specific needs and risk tolerance.

To summarize: while it can be helpful to follow typical thresholds like 0.5 for binary classification problems, there are many situations where finding and applying an optimal decision threshold is critical to make the best use of the model. This can help maximize recall or any other metric that is most relevant to the specific application, while also allowing for flexible decision-making.
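
And if you just need hard class predictions at the chosen threshold, it’s a one-liner; here’s a small sketch continuing from the variables above:

from sklearn.metrics import recall_score

# Apply the custom decision threshold instead of the default 0.5
y_pred_custom = (y_pred_proba >= optimal_threshold).astype(int)

# Recall should now hit the target we optimized for on this test set
print("Recall at custom threshold:", recall_score(y_test, y_pred_custom))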

Now let’s look at a few other scoring metrics and techniques.

F-beta Score

We discussed the F1 score, which is the harmonic mean of precision and recall. The F1 score is a useful metric when dealing with imbalanced datasets, as it balances the trade-off between precision and recall. However, there might be situations where you want to give more importance to either precision or recall. This is where the F-beta score comes into play.

The F-beta score is a generalization of the F1 score, allowing you to assign different weights to precision and recall by changing the value of beta. The formula for the F-beta score is defined as:

F-beta = (1 + beta²) * (precision * recall) / ((beta² * precision) + recall)

If beta > 1, then recall has a higher importance. If 0 < beta < 1, then precision has a higher importance. The F1 score is just a special case of the F-beta score when beta = 1.
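
As a quick sanity check of the formula (with made-up precision and recall values), notice how beta = 2 pulls the score toward recall:

# Hypothetical precision/recall values, just to illustrate the weighting
precision, recall, beta = 0.9, 0.6, 2

f_beta = (1 + beta**2) * (precision * recall) / ((beta**2 * precision) + recall)
print(round(f_beta, 3))  # 0.643, much closer to recall (0.6) than the F1 score of 0.72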

Let’s see how to calculate the F-beta score using the breast cancer dataset example:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import fbeta_score, make_scorer

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Define the logistic regression model
log_reg = LogisticRegression(max_iter=5000, solver='liblinear')

# Define the hyperparameter search space
param_dist = {
    'C': np.logspace(-4, 4, 20),
    'penalty': ['l1', 'l2']
}

# Create the F-beta scorer with beta = 2
f_beta_scorer = make_scorer(fbeta_score, beta=2)

# Create the RandomizedSearchCV object
random_search = RandomizedSearchCV(
    log_reg, param_distributions=param_dist,
    n_iter=20, scoring=f_beta_scorer, cv=5
)

# Fit the RandomizedSearchCV object to the data
random_search.fit(X, y)

# Print the best parameters and the corresponding F-beta score
print("Best parameters found: ", random_search.best_params_)
print("Best F-beta score: ", random_search.best_score_)

In this example, we’ve set beta to 2, which means that recall has a higher importance than precision. You can change the value of beta to suit your needs.

By using the F-beta score, you can fine-tune your model’s performance according to the specific requirements of your application. We’ve seen an example where recall is more important than precision, but there are many instances where you should prioritize precision instead. For example, in a spam detection system, you might want to prioritize precision (0 < beta < 1) to avoid marking legitimate emails as spam.

The F-beta score is a powerful evaluation metric that allows you to balance the trade-off between precision and recall according to your specific needs. By combining this metric with the precision-recall curve and optimal decision thresholds, you can build more effective and flexible classification models that cater to a wide range of applications and risk tolerances.

Model Calibration

Calibration measures how well the predicted probabilities of a model match the true probabilities of the outcomes. A well-calibrated model should have predicted probabilities close to the true probabilities.

Let’s use weather forecasts as an example. If a weather service says there’s an 80% chance of rain, you should ideally be able to look at their past forecasts and find that it rained 80% of the time when the forecast was for an 80% chance of rain.

However, if their weather model is poorly calibrated, the actual probability of it raining might be much lower or higher than 80%. For weather this is usually just a mild inconvenience, but for cancer detection it can lead to incorrect decisions and harmful consequences.

There are several ways to measure calibration, but one common method is to use a calibration plot. A calibration plot shows the predicted probabilities on the x-axis and the observed frequencies on the y-axis. Ideally, the plot should show a diagonal line, indicating that the predicted probabilities match the true probabilities.

Here’s an example of a calibration plot for a logistic regression model:

from sklearn.calibration import calibration_curve
import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Train a logistic regression model
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# Calculate predicted probabilities on the test set
y_pred_proba = model.predict_proba(X_test)[:, 1]

# Calculate calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_test, y_pred_proba, n_bins=10)

# Plot the calibration curve
plt.plot(mean_predicted_value, fraction_of_positives, 'x')
plt.plot([0, 1], [0, 1], '--', color='gray')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve')
plt.show()

As you can see, the blue x’s fall close to the gray dashed line, indicating that the model is well calibrated. Points below the line mean the model’s predicted probabilities are higher than the observed frequencies in that bin, while points above the line mean they’re lower; an overconfident model, with predictions pushed toward 0 and 1, typically sits above the line at the low end and below it at the high end.
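
If the curve reveals miscalibration, scikit-learn’s CalibratedClassifierCV can wrap the model and learn a correction on top of its scores. Here’s a minimal sketch, assuming scikit-learn 1.2+ (where the wrapped model is passed as estimator) and continuing from the variables above:

from sklearn.calibration import CalibratedClassifierCV

# Wrap the logistic regression in a calibrator: 'isotonic' fits a
# non-parametric mapping, while 'sigmoid' (Platt scaling) fits a parametric one
calibrated_model = CalibratedClassifierCV(
    LogisticRegression(max_iter=5000), method='isotonic', cv=5)
calibrated_model.fit(X_train, y_train)

# Recompute the calibration curve with the calibrated probabilities
y_pred_proba_cal = calibrated_model.predict_proba(X_test)[:, 1]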

Ensemble Methods for Logistic Regression

Ensemble methods combine multiple models to improve overall performance. Two popular ensemble techniques for logistic regression are:

  • Bagging: This technique trains multiple base models independently on random subsets of the training data, then combines their predictions through majority voting or averaging. An example is the BaggingClassifier in scikit-learn.
  • Boosting: This technique trains multiple base models sequentially, with each model learning from the errors of its predecessor. An example is the AdaBoostClassifier in scikit-learn.

Here’s an example of using bagging with logistic regression in scikit-learn:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# Create a logistic regression model
base_model = LogisticRegression(max_iter=5000, solver='liblinear')

# Create a bagging ensemble of logistic regression models
bagging_model = BaggingClassifier(
    estimator=base_model, n_estimators=10)
bagging_model.fit(X_train, y_train)

# Evaluate the ensemble model (ROC-AUC uses predicted probabilities)
y_pred_proba = bagging_model.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, y_pred_proba)
print("Bagging ROC-AUC score:", score)

And an example of boosting using AdaBoost:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Load the breast cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2)

# Create a logistic regression model
base_model = LogisticRegression(solver='liblinear')

# Create an AdaBoost ensemble of logistic regression models
boosting_model = AdaBoostClassifier(
    estimator=base_model, n_estimators=50)
boosting_model.fit(X_train, y_train)

# Evaluate the ensemble model (ROC-AUC uses predicted probabilities)
y_pred_proba = boosting_model.predict_proba(X_test)[:, 1]
score = roc_auc_score(y_test, y_pred_proba)
print("AdaBoost ROC-AUC score:", score)

Regularization Path Visualization

Visualizing the regularization path can help you understand the effect of regularization on the coefficients of a logistic regression model and the trade-off between model complexity and generalization. In scikit-learn, C is the inverse of the regularization strength: as C increases, the penalty term in the loss function becomes less dominant, allowing the coefficients to take on a wider range of values.

Here’s an example of visualizing the regularization path using scikit-learn and matplotlib:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load the breast cancer dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


# Define a range of regularization strengths
C_values = np.logspace(-2, 4, 20)

# Train logistic regression models with different regularization strengths
coefficients = []
for C in C_values:
    model = LogisticRegression(penalty='l1', solver='liblinear', C=C, max_iter=5000)
    model.fit(X_train, y_train)
    coefficients.append(model.coef_)

# Stack the per-model coefficients into a 2D array: one row per C value, one column per feature
coefficients = np.array(coefficients).reshape(len(C_values), -1)

# Plot the regularization path
plt.figure(figsize=(10, 6))
plt.plot(np.log10(C_values), coefficients)
plt.xlabel('log10(C)')
plt.ylabel('Coefficients')
plt.title('Regularization Path')
plt.show()

You can see how the coefficients change as C increases. The lines start close to zero at C = 1e-2 and branch out all over the place as C approaches 1e4. This branching occurs because the penalty term in the loss function becomes less dominant as C increases (that is, as regularization weakens), allowing the coefficients to take on a wider range of values.

While grid search and cross-validation remain the go-to methods for actually selecting the regularization strength, the regularization path is a useful complementary tool for exploring and interpreting your logistic regression model. It shows how each coefficient responds to regularization, which features persist the longest as the penalty tightens, and how model complexity trades off against generalization, helping you make a more informed choice of C.

Conclusion

In this tutorial, we explored more advanced techniques for classification and logistic regression, including feature selection, hyperparameter tuning, model evaluation metrics, ensemble methods, and regularization path visualization. By applying these techniques, you can build more accurate and interpretable logistic regression models for various real-world applications.

My complete series on logistic regression:
