Ode to Logistic Regression

ABN AMRO
ABN AMRO Developer Blog
12 min read · May 13, 2024

By Christian te Riet

In the 2024 Oscar acceptance speech for Best Cinematography, Dutch cinematographer Hoyte van Hoytema urged “(…) all aspiring filmmakers out there to (…) try shooting that incredible new hip thing called celluloid” (see full video here). I personally interpreted the essence of this statement to mean that ‘older’ or ‘classical’ techniques can lead to great results when applied correctly. It may even be the case that van Hoytema means that the traditional technique is (still) superior to modern or alternative techniques.

In this post, I wish to make a similar call to action for the use of logistic regression and discuss the benefits of using simple logistic regression models in a day and age where XGBoost models and transformers are all the rage. I hope to convince you that logistic regression can not only be an effective way to create a model, but that it may offer benefits over more complex data science or machine learning approaches that often get applied without much consideration.

Let’s dive in!

The business case

Over the past months we’ve been working on a personalized marketing model at ABN AMRO, one of the largest Dutch banks. The intended use of this model is to quantify the expected outcome of sending a marketing communication to a specific customer for a specific product. The end goal is to create a probabilistic model P(Y = y | X, T) for the event Y that a customer with features X will buy this product after receiving a marketing communication T (T is often used to mean “treatment” in the scientific literature). The variable X consists of features such as age, which banking products a customer currently owns, click behavior and so on. T is a binary variable indicating whether a customer received a marketing communication or not.

To collect unbiased data, we performed controlled experiments in which we sent a marketing communication to a subset of customers (the group T = 1) while others (T = 0) received no such message. Each record in our dataset thus contains the customer features X, the treatment indicator T and the observed outcome Y.

If we create a contingency table of our data, it might look like this:
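
For illustration (the counts below are purely hypothetical and only meant to mimic the kind of imbalance we describe next), such a table and the accompanying chi-squared test take only a few lines in Python:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical contingency table: rows are the treatment groups (T = 0, T = 1),
# columns are the outcomes (no conversion, conversion).
table = np.array([
    [99_500, 500],   # control group: conversion rate 0.5%
    [99_200, 800],   # test group: conversion rate 0.8%
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}, dof = {dof}")
```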

By looking at the datasets collected in our experiments, we made the following observations. Firstly, conversion in general is a relatively rare event, with P(y) typically somewhere between 10⁻³ and 10⁻². We also typically saw a statistically significant improvement in conversion for the test group (using the chi-squared test), but this improvement was not very large in an absolute sense. Our data is thus quite imbalanced, even for the test group. To predict accurately from this data, we need a model that can handle this imbalance.

Advantages of Logistic Regression

There are many possible choices for modelling P(Y = y | X, T). Let’s first set some criteria for what we want from our model:

  1. The model should provide good (low variance) and unbiased probability estimates of P(y | X, T).
  2. The model should provide insight into the data generating process and the drivers of conversion.
  3. We want to use as little data as possible.
  4. Model development, deployment and maintenance should be fast, scalable and cheap.

Note that these model criteria differ slightly from typical machine learning requirements. We are not interested in classifying customers into “conversions” or “non-conversions” per se, but we are rather interested in creating a probabilistic model that gives insight into both the probability of conversion and the drivers thereof. In a more modern wording, what we are looking for is a causal model.

In the next sections, I give three arguments for why logistic regression is a good choice if you have the above requirements, and I will briefly discuss some of the disadvantages of this model as well.

1. Logistic Regression models provide good and unbiased probability estimates.

Logistic regression works well in the low data or imbalanced data regime, because it has a very low number of degrees of freedom (parameters that need to be fitted to the data) and gives very good probability estimates. In any experimental setting, this makes using logistic regression a very cost effective and widely applicable model.

In technical terms, logistic regression is a so-called Generalized Linear Model (GLM). The model specification of a logistic regression looks like:

logit(P(y = 1 | X, T)) = β₀ + β₁·x₁ + … + βₙ·xₙ + βₜ·T, where logit(p) = log(p / (1 − p))

This means that the number of degrees of freedom of the model grows only linearly with the number of independent variables, so you only need a limited amount of data to fit it. If the feature distributions in the control and test groups are independent and identically distributed, and if the experimental data is representative of the true population, we quickly have so much data relative to the number of parameters that it becomes extremely difficult for the model to overfit: it simply doesn’t have the complexity to do so.
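
As a minimal sketch of what fitting such a model can look like in practice (the data file and column names below are hypothetical, not our production setup):

```python
import pandas as pd
import statsmodels.formula.api as smf

# One row per customer with hypothetical columns:
# y (1 = converted), T (1 = received communication), age, n_products, clicks
df = pd.read_csv("experiment_data.csv")  # placeholder for your own data source

# logit(P(y = 1)) = b0 + bT*T + b1*age + b2*n_products + b3*clicks
model = smf.logit("y ~ T + age + n_products + clicks", data=df)
result = model.fit()

print(result.summary())        # coefficients, standard errors, Wald z-tests, p-values
print(result.aic, result.bic)  # information criteria for comparing candidate models
```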

Furthermore, logistic regression has very convenient statistical properties.

Firstly, we evaluate the performance of the model using statistical tools like the Widely Applicable Information Criterion (WAIC), the Bayesian Information Criterion (BIC), the Likelihood-Ratio test, the Wald test and the calibration curve. These are all evaluated on the original dataset, so we don’t require a train-test split, meaning we need less data to construct and compare different models.

Secondly, we can use Wald and Likelihood-Ratio tests to determine whether covariates have a statistically significant influence on the dependent variable. Furthermore, because we are using a generalized linear model (GLM), we don’t need tools like SHAP to explain why our model behaves the way it does. We can simply read and interpret the coefficient of each covariate after fitting the model to the data, which gives us direct insight into how the different covariates affect the dependent variable y.

Lastly, logistic regression gives very well calibrated probability estimates in a wide range of cases, straight out of the box. This means that, on average, the probabilities estimated by the model closely match the frequencies observed in the true data distribution. This is incredibly important if you want to estimate quantities based on those probabilities, such as expected values of certain populations or risk assessments, or if you want to compare the output scores of several different models with each other.
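
One quick way to check this, reusing the hypothetical fit and data from the sketch above, is scikit-learn’s calibration_curve:

```python
from sklearn.calibration import calibration_curve

# Predicted probabilities from the fitted model versus observed 0/1 outcomes
p_hat = result.predict(df)
frac_observed, mean_predicted = calibration_curve(
    df["y"], p_hat, n_bins=10, strategy="quantile"
)

# For a well calibrated model, the observed fraction is close to the mean
# predicted probability in every bin.
for observed, predicted in zip(frac_observed, mean_predicted):
    print(f"predicted ≈ {predicted:.4f}, observed ≈ {observed:.4f}")
```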

Contrast this to more complex models such as random forests or neural networks. While these models are much more flexible and can fit complicated patterns in the data, the degrees of freedom in these models are generally very large. This means that you either need large amounts of data, or make use of regularization to prevent overfitting your models. If you don’t have a solid theoretical reason as to why you actually need a more flexible model, then choosing a simple model like logistic regression is often a good choice. In essence this statement is an example of Occam’s Razor: you should always strive to simplify your analysis as much as possible while at the same time not oversimplifying your analysis to the point where it becomes meaningless. On the other hand, it is also common sense: start simple and adjust if necessary.

If you still want to use a more flexible model, the use of regularization may suppress overfitting. However, regularization can also obscure and distort the influence of covariates on the dependent variable or skew variable importance. This means that your models may become good predictive models, but they become less useful as causal models to understand the data generating process or to estimate the true population probabilities. As a sidenote, regularization does not only affect complex models: using techniques like LASSO in logistic regression may also bias the covariate estimates. Luckily, in our case regularization is not necessary: we aren’t looking for great classification, only good statistics.

Lastly, the estimated probabilities from many machine learning models are often very far from being calibrated. This may not be a problem for classification tasks, but it does mean that you cannot directly compare scores outputted by model A to those of model B, or use the model output as an actual probability. There are ways to compensate by using techniques such as Isotonic Regression or Platt Scaling (see for instance https://arxiv.org/abs/1207.1403), but using these methods effectively requires gathering more data, and even then you may not achieve perfect calibration in practice. Logistic regression does not require any extra attention and takes care of all of this for you.
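
For completeness, here is a sketch of what such a recalibration step can look like for a more flexible classifier; the random forest is used purely as an example, and the feature columns are the hypothetical ones from before:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier

X = df[["T", "age", "n_products", "clicks"]]
y = df["y"]

# Wrap the classifier in a calibrator: method="sigmoid" is Platt scaling,
# method="isotonic" is isotonic regression. Internally this uses
# cross-validation, which effectively costs extra data.
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=200), method="sigmoid", cv=5
)
calibrated.fit(X, y)
p_hat_rf = calibrated.predict_proba(X)[:, 1]
```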

2. Logistic Regression can help give deep insight into the data generating process

Logistic regression allows you to get a deeper understanding of the process you are modelling by using statistical tools such as the WAIC score, and tests such as the Wald test or the Likelihood-Ratio test, to analyze your models and data.

Consider these different candidate models with increasing degrees of freedom and interaction:
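
To make this concrete, a nested family of this kind could be specified as follows. The formulas below are illustrative, built on the hypothetical column names from the earlier sketches, and not our exact production models:

```python
from scipy.stats import chi2
import statsmodels.formula.api as smf

# Four nested candidate models with increasing degrees of freedom (illustrative forms)
formulas = {
    "M1": "y ~ T",                                # treatment only
    "M2": "y ~ T + age + n_products + clicks",    # treatment + main effects
    "M3": "y ~ T * (age + n_products + clicks)",  # + treatment-covariate interactions
    "M4": "y ~ T * age * n_products * clicks",    # + higher-order interactions
}
fits = {name: smf.logit(f, data=df).fit(disp=0) for name, f in formulas.items()}

for name, fit in fits.items():
    print(name, "AIC:", round(fit.aic, 1), "BIC:", round(fit.bic, 1))

# Likelihood-ratio test between two nested models, e.g. M1 versus M2
lr_stat = 2 * (fits["M2"].llf - fits["M1"].llf)
df_diff = fits["M2"].df_model - fits["M1"].df_model
print("LR test M1 vs M2: p =", chi2.sf(lr_stat, df_diff))
```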

When we tested these models in our actual business case, it turned out that models of the form of model M2 minimize the WAIC score, while at the same time LR-tests showed a statistically significant improvement from M1 to M2, but not from M2 to M3 or M4.

These findings can be interpreted to mean that there is no strong statistical evidence that more complex models, with various possible interactions, explain the data better than a simple model does. When you add more degrees of freedom, you may be overfitting your data instead of truly improving your model.

It is important to note that this evidence does not necessarily mean that model M2 is objectively the one and only “true model”. It just means that, given our collected data, we have no strong evidence that more complex models explain our data better, or are in fact required to do so. It is possible that if we gathered more and more data, our conclusions could change. But with our current variables and current data, the simple model M2 seems to be “the best” we can do if we want to do inference on the drivers of conversion.

For us, this conclusion made sense from a theoretical perspective too. Why should we be able to explain more of the variance with complicated interaction terms, given our available variables?

We only observe a set of basic variables per customer. We therefore assume it is very likely that the true reasons for choosing to purchase a certain product lie outside what we know about each individual customer; for instance, buying a new house or getting a new job. Because of this, we think we are better off simply averaging over a large part of the variance we see in our data, and being able to say something about the mean effect of, say, age on the conversion rate, rather than trying to explain this variance away with a more complicated model.

Lastly, because our model is a generalized linear model, we can use all the classical statistical machinery to do inference on the coefficients of the covariates in the model. With these inference methods you can easily analyze whether certain covariates contribute (much) to the decision to buy, not only in a statistical sense but also in an absolute sense. We can give a good confidence interval for the estimated effect of our marketing communication T and our covariates X, and use these findings in discussions with the marketing team.
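
As a small sketch, continuing with the hypothetical fit from earlier, the confidence interval for the coefficient of T translates directly into an interval for the odds ratio associated with the marketing communication:

```python
import numpy as np

# 95% confidence intervals for the coefficients, on the log-odds scale
conf_int = result.conf_int(alpha=0.05)
print(conf_int.loc["T"])

# Exponentiating gives the multiplicative effect on the odds of conversion
odds_low, odds_high = np.exp(conf_int.loc["T"])
print(f"The communication multiplies the odds of conversion by a factor of "
      f"{odds_low:.2f} to {odds_high:.2f} (95% CI).")
```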

Being able to reason about data this way is enormously powerful. Not only does it give you a deeper insight into your data, but it can also save you from hunting for ‘the best model’ when you were already doing the best you could reasonably hope for given the data you have available.

3. Development and deployment are both fast and cheap

Lastly, the computational cost of logistic regression is only linear in the number of parameters and the number of datapoints, it doesn’t require hyperparameter tuning (as long as you don’t use LASSO, which we don’t), and very efficient estimation algorithms are available. In short, this means that logistic regression can fit and predict on millions of datapoints, using cheap hardware, at low cost and with low latency. This makes logistic regression a potentially excellent tool for a wide range of data science applications, and highly scalable in cases where fast response times are needed, such as bidding auctions for advertising.

Furthermore, nearly every major programming language supports logistic regression through external libraries or packages. This makes logistic regression an easy and safe choice from a development perspective as well.

Possible caveats of Logistic Regression

Of course, there are also situations where logistic regression is not a good fit. Knowing these can help you decide whether logistic regression is a good candidate for your business case. I will quickly go over three common cases.

1. Non-linear relationship between dependent and independent variables

Logistic regression is a (generalized) linear model. If your data generating process is inherently non-linear, or you expect the relationship between logit(y) and your independent variables not to be linear, logistic regression may not be a good match. Luckily, many statistical problems can reasonably be assumed to be (approximately) linear near a point of interest, so logistic regression can be used in those cases and still give good results. But it is important to at least be aware of the limitation.

2. Collinearity in independent variables

If your independent variables are correlated, logistic regression can run into trouble, resulting in biased inference estimates, skewed probability estimates, or failure to converge to a solution altogether. Multicollinearity “dilutes” the relative importance of a coefficient (for a specific variable) over the variables it correlates with, leading to a misleading interpretation of the resulting model.

When using variables in logistic regression, you should always make sure that collinearity is well controlled. A good way to do this is by constructing a correlation matrix of the variables and inspecting it, or by calculating the variance inflation factor (VIF) per variable and ensuring it stays below a threshold value.
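
Here is a quick sketch of the VIF check with statsmodels, again using the hypothetical feature columns from earlier; a common rule of thumb is to be wary of values above roughly 5 to 10:

```python
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

features = df[["age", "n_products", "clicks"]]
X_vif = sm.add_constant(features)  # include an intercept when computing VIFs

for i, name in enumerate(X_vif.columns):
    if name == "const":
        continue
    print(f"{name}: VIF = {variance_inflation_factor(X_vif.values, i):.2f}")
```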

If you have correlation among your variables yet still want to create a probabilistic model of your data, you may benefit from finding a way to include instrumental variables in your model. This topic is way too broad to cover in this article and would deserve an article of its own, but if you encounter this problem I advise you to look into this yourself. It’s a really valuable technique!

3. Complex relationships between independent variables

Because logistic regression is a linear model, you need to include feature-engineered variables in your model if your data contains complicated interaction patterns. An advantage of algorithms such as XGBoost or neural networks is that they do this heavy lifting for you. Creating features manually can also be seen as an advantage, in a certain way: by doing the work by hand, you fully understand which variables are being used in the model, and the created features can be discussed with domain experts to make sure they make sense, increasing the robustness of your model. The downside is, of course, that you have to do it yourself.
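
As a small sketch, an explicit interaction or a transformed feature can simply be written into the model formula, which keeps it visible and easy to discuss with domain experts (again using the hypothetical columns from before):

```python
import numpy as np
import statsmodels.formula.api as smf

# Hand-crafted features: an explicit treatment-by-age interaction (T:age)
# and a log transform of click behaviour, both visible in the formula itself.
model = smf.logit("y ~ T + age + T:age + np.log1p(clicks) + n_products", data=df)
result_fe = model.fit()
print(result_fe.summary())
```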

Closing thoughts

I believe, much as van Hoytema argued for celluloid film, that when new methods become available, we can sometimes forget that the classical methods already have fantastic properties.

Yes, you may have to be careful with collinearity, and yes, you may need to do some feature engineering, but if you are looking for causal inference or a probabilistic model, the results from logistic regression are, in my opinion, superior in many cases to those from more complex methods.

Logistic regression is a great and practical tool with a bunch of desirable features. The technique is fast, it gives you deep insight into the data generating process, it allows you to use statistical inference, and it gives you very nicely calibrated probability estimates that are hard to overfit on your data. If you have come this far, thank you for reading my article, and I hope you give the model a try!

About the author

Christian te Riet is a senior data scientist at Customer Digital Engagement at ABN AMRO N.V. Christian is passionate about applying traditional, practical statistics to solve research questions in the customer experience domain.
