Advanced Techniques in Logistic Regression — Part 1

Vincent Favilla
Jun 1, 2023


View the accompanying Colab notebook.

So far in this logistic regression series, we’ve covered the basics of logistic regression and regularization, discussed L1 and L2 regularization, and explored the concept of convexity in the context of regularization. Now it’s time to delve into advanced techniques in logistic regression, including handling nonlinear relationships, addressing multicollinearity, feature scaling and normalization, and handling categorical variables. Buckle up — there’s lots to cover!

Handling Nonlinear Relationships

Logistic regression assumes a linear relationship between the features and the log-odds of the target variable. However, in real-world problems, the relationship between features and the target variable can often be nonlinear. To capture nonlinear relationships, we can use the following techniques:

Polynomial features

Creating new features by raising existing features to higher degrees can help capture nonlinear relationships. For example, if we have a feature x, we can create new features x², x³, etc. Scikit-learn provides the PolynomialFeatures class to generate polynomial features easily. Here’s the simplest example:

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame()
df['x'] = pd.Series([1,2,3])

poly = PolynomialFeatures(degree=2)

# Make sure to use double brackets around the column name
X_poly = poly.fit_transform(df[['x']])

# Give your new polynomial features descriptive column names
X_poly = pd.DataFrame(X_poly, columns=['x**0', 'x**1', 'x**2'])

# Concat to your original dataframe
df_concat = pd.concat([df, X_poly], axis=1)

X_poly

Your dataframe will look like this:

   x**0  x**1  x**2
0   1.0   1.0   1.0
1   2.0   2.0   4.0
2   3.0   3.0   9.0

The x**0 column represents x⁰ (i.e., 1), x**1 contains the original values, and x**2 is x squared.

Now you’re ready to see if raising this feature to a higher degree will improve your model’s performance.
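For instance, here’s a minimal sketch of that comparison using cross-validation. The data here is made up purely for illustration (the dataframe above has no target column), so treat it as a pattern rather than a benchmark:

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data: 'x' is a feature, 'y' is a binary target with a nonlinear boundary
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.normal(size=200)})
df['y'] = (df['x']**2 + rng.normal(scale=0.5, size=200) > 1).astype(int)

model = LogisticRegression()

# Baseline: the original feature only
base_score = cross_val_score(model, df[['x']], df['y'], cv=5).mean()

# With the squared term added
df['x**2'] = df['x']**2
poly_score = cross_val_score(model, df[['x', 'x**2']], df['y'], cv=5).mean()

print(f'Linear only: {base_score:.3f}, with x**2: {poly_score:.3f}')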

Interaction terms

Now, if you already have some experience with pandas, you might be thinking that you could just as easily create new columns by doing something along the lines of df['x squared'] = df['x']**2.

You’d be right. The real power of PolynomialFeatures comes from its ability to create interaction terms.

Creating new features by multiplying two or more existing features can help capture the combined effect of multiple features on the target variable. For example, if we have features x and y, PolynomialFeatures will by default create a new feature x*y.

import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame()
df['x'] = pd.Series([1,2,3])
df['y'] = pd.Series([4,7,11])

poly = PolynomialFeatures(degree=2)

X_poly = poly.fit_transform(df[['x','y']])

X_poly = pd.DataFrame(
    X_poly,
    columns=['bias', 'x', 'y', 'x**2', 'x*y', 'y**2']
)

X_poly will now look like this:

   bias    x     y  x**2   x*y   y**2
0   1.0  1.0   4.0   1.0   4.0   16.0
1   1.0  2.0   7.0   4.0  14.0   49.0
2   1.0  3.0  11.0   9.0  33.0  121.0

And it becomes even more handy as you add more columns.
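One thing that helps as the feature count grows: rather than typing out the new column names by hand, you can ask PolynomialFeatures for them. A quick sketch (get_feature_names_out requires a reasonably recent version of scikit-learn):

# Let scikit-learn generate the column names instead of typing them out
X_poly = pd.DataFrame(
    poly.fit_transform(df[['x', 'y']]),
    columns=poly.get_feature_names_out(['x', 'y'])
)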

Some other nonlinear relationships you may run into include exponential and logarithmic relationships. But PolynomialFeatures is, well, polynomial: it only produces integer powers of your features, so it can’t apply transformations like logarithms. Instead we can use scikit-learn’s FunctionTransformer.

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Create some sample data
df = pd.DataFrame()
df['x'] = pd.Series([1,2,3])
df['y'] = pd.Series([4,7,11])

# Define a function to apply the logarithm transformation
def log_transform(X):
    # it's helpful to add a small constant since log(0) is undefined
    return np.log(X + 0.01)

# Create a FunctionTransformer with the log_transform function
log_transformer = FunctionTransformer(log_transform)

# Fit and transform the data
X_log = log_transformer.fit_transform(df)

# Rename the transformed columns so they don't collide with the originals
X_log.columns = [f'log_{col}' for col in X_log.columns]

df_concat = pd.concat([df, X_log], axis=1)

Alternatively, this may in fact be a case where it’s easier to use a vector operation:

import pandas as pd
import numpy as np

df = pd.DataFrame()
df['x'] = pd.Series([1,2,3])
df['y'] = pd.Series([4,7,11])

cols = df.columns

for col in cols:
    df[f'log_{col}'] = np.log(df[col] + 0.01)

You could do something similar for creating columns for square roots, if you’re so inclined.
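For example, a square-root version of the same loop, continuing from the dataframe above (np.sqrt is fine here because the sample values are all non-negative):

for col in cols:
    df[f'sqrt_{col}'] = np.sqrt(df[col])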

Now that we’ve done some feature transformations, let’s look at other ways to improve a model’s performance.

Addressing Multicollinearity

Multicollinearity occurs when two or more features in a dataset are highly correlated, which can lead to unstable estimates of the model coefficients and reduced interpretability. To detect and handle multicollinearity, we can use the following techniques:

Correlation matrix

Calculate the correlation coefficients between all pairs of features and visualize them using a heatmap. Features with high correlation coefficients can be considered for removal or combined into a single feature.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'a': [1,2,3,4],
    'b': [4,2,4,3],
    'c': [5,5,1,0],
    'd': [2,2,3,0],
})

corr_matrix = df.corr()
sns.heatmap(corr_matrix, annot=True)
plt.show()

This gives us the following heatmap:

[heatmap of the correlation matrix]

We can see that there’s a very strong negative relationship between variables “a” and “c”, and your model may improve if you remove one of them. If you want a good rule of thumb, any variables with a ±0.7 correlation or stronger should be considered for removal.
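If you’d rather not eyeball the heatmap, a small sketch like this lists every pair at or above that threshold (the 0.7 cutoff is just the rule of thumb above):

import numpy as np

# Keep only the upper triangle so each pair appears once
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape, dtype=bool), k=1))

# List feature pairs whose absolute correlation is 0.7 or stronger
high_corr = upper.stack().loc[lambda s: s.abs() >= 0.7]
print(high_corr)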

Variance Inflation Factor (VIF)

VIF measures the extent to which the variance of a model coefficient is inflated due to multicollinearity. Features with a VIF greater than a certain threshold (e.g., 5 or 10) can be considered for removal or combined into a single feature.

from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(df.values, i) for i in range(df.shape[1])]

# Display VIF
pd.DataFrame({'vif': vif}, index=df.columns).T

Interestingly, this method singles out “b” as having a high VIF and thus a candidate for removal. Bear in mind that although “a” and “c” have a strong correlation with each other, they may not have a strong correlation with the other variables in the dataset. On the other hand, “b” may have a high correlation with multiple other variables, leading to a high VIF.

As always, experimentation and cross-validation are key to building a robust model.
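One way to run that experiment, sketched with a hypothetical feature matrix X and binary target y (neither is defined in the toy example above), is to compare cross-validated scores with and without the suspect column:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

model = LogisticRegression()

# Compare cross-validated accuracy with and without the high-VIF feature
full_score = cross_val_score(model, X, y, cv=5).mean()
reduced_score = cross_val_score(model, X.drop(columns=['b']), y, cv=5).mean()

print(f'All features: {full_score:.3f}, without b: {reduced_score:.3f}')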

Regularization

As discussed in my last article, regularization techniques like L1 and L2 can help mitigate the effects of multicollinearity by shrinking the coefficients of correlated features.
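In scikit-learn this just means choosing a penalty and a regularization strength when you create the model. A quick sketch:

from sklearn.linear_model import LogisticRegression

# L2 (ridge) is the default penalty; smaller C means stronger regularization
l2_model = LogisticRegression(penalty='l2', C=0.1)

# L1 (lasso) needs a solver that supports it, such as 'liblinear' or 'saga'
l1_model = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')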

Feature Scaling and Normalization

Feature scaling is important in logistic regression, as it ensures that all features contribute equally to the model. Different scaling techniques include:

- MinMaxScaler: Scales features to a specific range, usually such that 0 represents the minimum value and 1 represents the maximum value.

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)

- StandardScaler: Standardizes features by converting numbers to their z-scores. A z-score is the number of standard deviations (positive or negative) a number is away from the mean.

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Which scaling technique to use depends on the data and the model. For example, if the data is normally distributed, then StandardScaler is a good choice. If the data is not normally distributed, then MinMaxScaler may be a better choice.
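Whichever scaler you choose, it’s worth wrapping it in a Pipeline with the model so that during cross-validation the scaler is fit only on the training folds, which prevents information from the test folds leaking into the scaling. A minimal sketch (X and y are assumed to be your features and target):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is refit on each training fold, then applied to the matching test fold
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipeline, X, y, cv=5)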

Handling Categorical Variables

Logistic regression requires numerical input features. To include categorical variables in the model, we need to convert them into numerical values using encoding techniques:

One-hot encoding: Creates binary features for each category of a categorical variable. This can be done in a number of ways, but I think the pandas get_dummies() function is the easiest:

import pandas as pd

# drop_first helps to reduce collinearity
df_dummified = pd.get_dummies(df, prefix='category', drop_first=True)

Be careful doing this if you have a lot of different categorical values, as it can easily create hundreds or even thousands of new columns. If you want finer control over what gets one-hot encoded, you can always pass a column to get_dummies() instead and then concatenate it to your dataframe:

import pandas as pd

# Let's say we have a "month" column we need to one-hot encode
month_dummified = pd.get_dummies(df['month'], prefix='month',
                                 drop_first=True)

df = pd.concat([df, month_dummified], axis=1)

Ordinal encoding: Assigns an integer value to each category of a categorical variable. This can be done using scikit-learn’s OrdinalEncoder class; note that by default it assigns the integers in sorted order, so if your categories have a meaningful order you’ll want to specify it (see the sketch after the code below).

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(df[['month']])
# Note the double brackets to keep it as a DataFrame

# Convert the encoded data to a dataframe and assign column names
X_encoded_df = pd.DataFrame(X_encoded, columns=['encoded_month'])
# Replace 'encoded_month' with an appropriate column name

# Concatenate the encoded dataframe with the original dataframe
df = pd.concat([df, X_encoded_df], axis=1)
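For month names specifically, the default sorted order would be alphabetical (April would come first and September last) rather than chronological. You can pass the desired order explicitly through the categories parameter; a sketch assuming the column holds full month names:

months = ['January', 'February', 'March', 'April', 'May', 'June',
          'July', 'August', 'September', 'October', 'November', 'December']

# Explicit ordering: January -> 0, February -> 1, and so on
encoder = OrdinalEncoder(categories=[months])
X_encoded = encoder.fit_transform(df[['month']])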

Conclusion

We covered a lot here. In this post we delved into advanced techniques in logistic regression, including handling nonlinear relationships, addressing multicollinearity, feature scaling and normalization, and handling categorical variables. These techniques will help you to build more accurate and robust logistic regression models.

In the next part of the series, we’ll continue exploring advanced techniques, such as feature selection, hyperparameter tuning, ensemble methods, and regularization path visualization. For further reading and resources, consider exploring the following links:

My complete series on logistic regression:
