DATA SCIENCE THEORY | LOGISTIC REGRESSION | KNIME ANALYTICS PLATFORM

Easy Interpretation of a Logistic Regression Model with Delta-p Statistics

Easily understand and assess the individual effects that make a credit application succeed or fail

Maarit Widmann
Low Code for Data Science

As first published in InfoQ. Co-author: Alfredo Roccato

Key Takeaways

  • With Delta-p statistics, predictions from a logistic regression model become easy for non-technical decision-makers to understand.
  • Learn how to calculate the Delta-p statistics from the coefficients of a logistic regression model for credit application processing.
  • The data workflow covers the steps from accessing the raw data and training the logistic regression model to evaluating the effects of individual predictor columns with Delta-p statistics.
  • Keep in mind that logistic regression might not be the best choice for high-dimensional data with many correlated predictor columns.

Imagine a situation where a customer applies for credit, the bank collects data about the customer (demographics, existing funds, and so on) and predicts the customer’s creditworthiness with a machine learning model. The application is rejected, but the banker doesn’t know exactly why. Or a bank wants to advertise its credit products, and the target group should be those who can eventually get credit. But who are they?

In these kinds of situations, we would prefer a model that is easy to interpret, such as the logistic regression model. Delta-p statistics make interpretation of the coefficients even easier. With Delta-p statistics at hand, the banker doesn’t need a data scientist to be able to inform the customer, for example, that the credit application was rejected because all applicants who apply for credit for educational purposes have a very low chance of getting credit. The decision is justified, the customer is not personally hurt, and he or she might come back in a few years to apply for a mortgage.

In this article, we explain how to calculate the Delta-p statistics based on the coefficients of a logistic regression model. We demonstrate the process from raw data to model training and model evaluation with a KNIME workflow where each intermediate step has a visual representation. However, the process could be implemented in any tool.

Assessing the Effect of a Single Predictor with the Delta-p Statistics

We will describe the functioning of the logistic regression model and the interpretation of its coefficients. After that, we will explain how to calculate and interpret the Delta-p statistics based on the coefficients. Finally, we will introduce a real-world example of comparing two predictor columns of a credit scoring model with the Delta-p statistics.

Logistic Regression Model

When we use the logistic regression algorithm for classification, we model the probability of the target class, for example, the probability of a bad credit rating, with a logistic function. Let’s say we have a binomial logistic regression model with a target column y, credit rating, with two classes that are represented by 0 (good credit rating) and 1 (bad credit rating). The log odds of the target class (y=1) vs. the reference class (y=0) is a linear combination βx of the predictor columns x (account balance, credit duration, credit purpose, etc.). A logistic function of βx transforms the log odds into a probability of the target class:

$$P(y=1 \mid x) = \frac{1}{1 + e^{-\beta x}}$$

where β is the vector of coefficients for predictor columns x in the logistic regression model that predicts target class y.
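
To make the mapping between log odds and probability concrete, here is a minimal Python sketch. The coefficient and predictor values are hypothetical and chosen only for illustration; they are not taken from the credit model.

```python
import math

def logistic(log_odds):
    """Map log odds (the linear combination beta*x) to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))

def logit(probability):
    """Inverse of the logistic function: probability back to log odds."""
    return math.log(probability / (1.0 - probability))

# Hypothetical coefficients and predictor values, for illustration only.
beta = [-1.2, 0.8, 0.5]   # intercept followed by two coefficients
x = [1.0, 1.0, 2.0]       # the leading 1.0 multiplies the intercept

beta_x = sum(b * v for b, v in zip(beta, x))   # -1.2 + 0.8 + 1.0 = 0.6
p_target = logistic(beta_x)
print(round(beta_x, 3), round(p_target, 3))    # 0.6 0.646
```

Log odds of 0 correspond to a probability of 0.5; positive log odds push the probability of the target class above 0.5, negative log odds below it.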

The target and reference classes can be arbitrarily chosen. In our case, the target class is “bad credit rating,” and the reference class is “good credit rating.”

Delta-p Statistics

If a single predictor column x_i is continuous, the coefficient β_i corresponds to the change in the log odds of the target class when x_i increases by 1. If x_i is a binomial column, the coefficient value β_i is the change in the log odds when x_i changes from 0 to 1. The change in the probability of the target class is provided by the logistic function, as shown in Figure 1.

Figure 1. Logistic function modeling the probability of the target class y=1 as a function of one continuous predictor column x_i.

The Delta-p statistic transforms the coefficient values β_i into percentage-point effects on the probability of the target class, compared to an average data point, e.g., an average credit applicant.

By definition, the Delta-p statistic is a measure of the discrete change in the estimated probability of the occurrence of an outcome given a one-unit change in the independent variable of interest, with all other variables held constant at their mean values. For example, if the Delta-p value of a predictor column x_i is 0.2, then a unit increase in this column (or a change from 0 to 1 in a binomial column) increases the probability of the target class by 20 percentage points. The probabilities before and after the unit increase are called the prior and post probabilities, respectively. The following formulas show how to calculate the prior and post probabilities of the target class and, finally, the Delta-p statistics as their difference [1]:

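$$L_{\text{prior}} = \ln\frac{P_{\text{prior}}}{1 - P_{\text{prior}}}, \qquad P_{\text{post}} = \frac{1}{1 + e^{-(L_{\text{prior}} + \beta_i)}}, \qquad \Delta p = P_{\text{post}} - P_{\text{prior}}$$

where P_prior is the probability of the target class for an average data point, estimated by the share of the target class in the sample, and L_prior is the corresponding log odds.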
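
A minimal Python sketch of this calculation follows. It assumes that the prior probability is the share of the target class in the sample and that the post probability is obtained by adding β_i to the prior log odds. The function name delta_p and the example values are illustrative, not part of the original workflow.

```python
import math

def delta_p(beta_i, prior):
    """Change in the target-class probability for a unit increase in x_i.

    prior is the target-class probability of the average data point,
    assumed here to be the share of the target class in the sample.
    """
    prior_log_odds = math.log(prior / (1.0 - prior))
    post = 1.0 / (1.0 + math.exp(-(prior_log_odds + beta_i)))
    return post - prior

# Hypothetical values: coefficient 1.0, half of the sample in the target class.
print(round(delta_p(1.0, 0.5), 3))   # 0.231
```

A coefficient of 0 gives a Delta-p of 0. The same coefficient gives a smaller Delta-p when the prior is close to 0 or 1, because the logistic curve is flatter there.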

Use Case: The Effect of Credit Purpose and Current Account Balance on Credit Rating

Let’s now demonstrate this with an example, and check how the credit purpose and the balance of an existing account improve or worsen the credit rating. We use the German credit data provided by the UCI Machine Learning Repository. The dataset contains 21 columns that provide information about the demographics and economic conditions of 1,000 credit applicants. Thirty percent of the applicants have a bad credit rating, and 70 % have a good rating. You can download the data in .data format by clicking “Data Folder” at the top of the page, and selecting the “german.data” item on the next page. The german.data file can be opened in a text editor and saved, for example, in CSV format. The column names and descriptions of the values in the categorical columns are provided in the german.doc file, accessible via the same page.

The workflow in Figure 2 shows the process from accessing the raw data to training the logistic regression model and evaluating the effects of individual predictor columns with Delta-p statistics.

The process is divided into the following steps, each implemented within a separate colored box: accessing data (1), preprocessing data as required by the logistic regression algorithm (2), training the model (3), and calculating the Delta-p statistics based on the model coefficients (4). In the preprocessing step, we convert the target column from the 1/2 notation to “good”/“bad.” We also transform two originally multinomial predictor columns into binomial columns: we encode the “checking” column into the two values “negative”/“some funds or no account” based on the status of the existing bank account, and we encode the “purpose” column into the values “education”/“no education” to assess the effect of education as a credit purpose. Finally, we handle missing values and normalize the numeric columns in the data.
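
Outside KNIME, the binomial encoding could be sketched in plain Python as below. The category codes (A11 for a negative balance, A46 for education, 2 for a bad rating) and the column positions are assumptions based on our reading of german.doc and should be checked against the file. The rows are made up and shortened, and missing-value handling and normalization are left out.

```python
# Category codes as we read them in german.doc; treat them as assumptions
# and verify them against the file before relying on them.
NEGATIVE_BALANCE = "A11"   # checking account status: balance below 0 DM
EDUCATION = "A46"          # credit purpose: education
BAD_RATING = "2"           # target column: 1 = good, 2 = bad

def encode_row(fields):
    """Turn one whitespace-split line of german.data into 0/1 dummy values."""
    checking, purpose, rating = fields[0], fields[3], fields[-1]
    return {
        "checking_some_funds_or_no_account": int(checking != NEGATIVE_BALANCE),
        "purpose_education": int(purpose == EDUCATION),
        "rating_bad": int(rating == BAD_RATING),
    }

# Two made-up, shortened rows: checking, duration, history, purpose, rating.
rows = ["A11 6 A34 A43 1", "A12 48 A32 A46 2"]
encoded = [encode_row(line.split()) for line in rows]
print(encoded[1])
```

Coding the target class and the category of interest as 1 keeps the later interpretation simple: a coefficient then describes the change from the reference category (0) to the category of interest (1).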

Figure 2. The process from accessing raw credit customer data to training a credit rating model, and to evaluating the effects of predictor columns on the credit rating with Delta-p statistics. This solution was built in KNIME Analytics Platform, and the Assessing Effects of Single Predictors with Delta-p workflow can be inspected and downloaded on the KNIME Hub.

Figure 3 shows the coefficient statistics of the logistic regression model, which can be reproduced in any tool. The “Coeff.” column shows the coefficient values for the different predictor columns, for example 0.683 for purpose=education. The “P>|z|” column shows the p-values of the coefficients, 0.055 for purpose=education. Because the coefficient is positive, education as a credit purpose increases the probability of a bad credit rating, and because the p-value is smaller than 0.1, this effect is significant at the 10 % significance level (90 % confidence level).

Figure 3. Coefficient statistics of a logistic regression model that predicts the credit rating good/bad of a credit applicant.

By looking at the coefficient statistics of the logistic regression model, we find out that education as a credit purpose increases the probability of a bad credit rating compared to other credit purposes. In addition, the coefficient value 0.683 tells us that the log odds ratio for getting a bad credit rating with/without education as the credit purpose is 0.683, so the odds ratio of the two groups is e^0.683=1.979. What would this mean, for example, in a group of 100 credit applicants, let’s say 20 of them with education as the purpose (group 1) and the remaining 80 with another purpose (group 2)? If 10 out of the 80 applicants in group 2 have a bad credit rating, their odds of a bad rating are 10/70≈0.143. According to the odds ratio 1.979, the odds for group 1 must be ~2 times the odds of group 2, so about 0.283 in this case. Odds of 0.283 correspond to a probability of 0.283/(1+0.283)≈0.22, so roughly 4 of the 20 applicants in group 1 would have a bad credit rating, compared with 12.5 % of the applicants in group 2.
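
The odds arithmetic can be checked with a few lines of Python. This is only a sketch: the group sizes are the hypothetical ones from the example, and odds are computed as the number of bad ratings divided by the number of good ratings.

```python
import math

coefficient = 0.683                  # purpose=education, from Figure 3
odds_ratio = math.exp(coefficient)   # about 1.98

# Hypothetical group 2: 80 applicants with another purpose, 10 of them rated bad.
bad_2, total_2 = 10, 80
odds_2 = bad_2 / (total_2 - bad_2)   # odds = bad / good
odds_1 = odds_ratio * odds_2         # implied odds for group 1
prob_1 = odds_1 / (1.0 + odds_1)     # convert odds back to a probability

print(round(odds_2, 3), round(odds_1, 3), round(prob_1, 3))
print(round(20 * prob_1, 1))         # expected bad ratings among 20 applicants
```

The conversion from odds back to a probability is the step that is easy to overlook: odds and probabilities are close only when both are small.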

The coefficient statistics have a universal scale, and we can use them to compare the magnitude and the effect of different predictor columns. However, to understand the effect of a single predictor, the Delta-p statistics provide an easier way! Let’s take a look:

In Figure 4 you can see the Delta-p statistics and the intermediate results in calculating it, also shown below for the purpose=education variable:

$$P_{\text{prior}} = 0.3, \qquad P_{\text{post}} = \frac{1}{1 + e^{-(\ln(0.3/0.7) + 0.683)}} \approx 0.459, \qquad \Delta p \approx 0.459 - 0.3 = 0.159$$
Figure 4. Delta-p statistics, its intermediate results, and the corresponding coefficient statistics of a logistic regression model that predicts the credit rating good/bad of a credit applicant.

The Delta-p value of 0.159 indicates that education as a credit purpose increases the probability of a bad credit rating by 15.9 percentage points compared to an average credit application.

If we wanted to compare the effect to the opposite situation, i.e., the credit purpose is not education, instead of to an average credit applicant, we would need to recalculate the prior probability and also mean-center the binomial values of the predictor column of interest x_i. In our data, 5 % of the people apply for credit for education purposes, so the mean of the “purpose” column x_i is 0.05.

The Delta-p value of 0.158 indicates that applying for credit for educational purposes increases the probability of a bad credit rating by 15.8 percentage points compared to applying for other purposes. There’s hardly any difference from the previous situation, where we compared against an average applicant and obtained the Delta-p value of 0.159 (Figure 4). This means that the credit applicants with purposes other than education are very close to the sample average in terms of their credit rating, apparently because they make up 95 % of the total sample.
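
The comparison against the reference group can be sketched in Python as follows. It assumes that the log odds of the average applicant follow from the 30 % share of bad ratings, and that mean-centering shifts these log odds by β_i times (x_i minus the column mean). The variable names are illustrative.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

share_bad = 0.30   # share of bad ratings in the sample
beta = 0.683       # coefficient of purpose=education (Figure 3)
mean_x = 0.05      # share of applicants with purpose=education

# Log odds of the average applicant, i.e. at the mean of the purpose column.
avg_log_odds = math.log(share_bad / (1.0 - share_bad))

# Move from the column mean to x_i = 0 (prior) and to x_i = 1 (post).
prior = logistic(avg_log_odds + beta * (0 - mean_x))
post = logistic(avg_log_odds + beta * (1 - mean_x))

print(round(prior, 3), round(post, 3), round(post - prior, 3))   # 0.293 0.451 0.158
```

Because the column mean is only 0.05, the prior barely moves away from 0.3, which is why this result is so close to the Delta-p computed against the average applicant.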

Now we know that applying for credit for education purposes has a negative effect on the credit rating. Which column could have a positive effect? Let’s check the effect of the other dummy column that we created, the “checking” column that tells if the balance of the existing account is negative. The coefficient value of checking=some funds or no account is -1.063 with a p-value of 0, as you can see in the first row in Figure 3.

As the Delta-p value of -0.171 in the first row in Figure 4 shows, credit applicants whose account balance is not negative tend to have a probability of a bad credit rating that is 17.1 percentage points lower than that of an average credit applicant. Interestingly, we found two columns, purpose and checking, that have effects of almost the same size but in opposite directions. If we look at the odds ratios of these two variables in Figure 4, we don’t get the same information at first glance: the odds ratio is 0.345 for checking=some funds or no account and 1.979 for purpose=education.
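
The two predictors can be put side by side with a short Python sketch. It assumes the prior probability is the 30 % share of bad ratings in the sample and uses the rounded coefficients from Figure 3. The helper delta_p is illustrative.

```python
import math

def delta_p(beta_i, prior):
    """Post probability minus prior probability for a unit increase in x_i."""
    prior_log_odds = math.log(prior / (1.0 - prior))
    post = 1.0 / (1.0 + math.exp(-(prior_log_odds + beta_i)))
    return post - prior

share_bad = 0.30   # 30 % of the applicants have a bad credit rating
coefficients = {
    "purpose=education": 0.683,
    "checking=some funds or no account": -1.063,
}

for name, beta in coefficients.items():
    print(f"{name}: odds ratio {math.exp(beta):.3f}, "
          f"Delta-p {delta_p(beta, share_bad):+.3f}")
```

This prints Delta-p values of +0.159 and -0.171. The odds ratio for purpose=education comes out as 1.980 rather than 1.979 because the coefficient 0.683 is itself a rounded value.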

Conclusions

In this article, we have introduced Delta-p statistics as a straightforward way of interpreting the coefficients of a logistic regression model. With Delta-p statistics, predictions from a logistic regression model become easy for non-technical decision-makers to understand.

We used Delta-p statistics to assess the individual effects that make a credit application succeed or fail. Of course, Delta-p statistics have many more use cases. For example, we could use them to determine the individual touchpoints that decrease or increase customer satisfaction the most, or to find the symptoms with the highest relevance when detecting a disease. Also, notice that the whole process from raw data to model training and model evaluation does not always need to be completed: Delta-p statistics can also be used to re-evaluate the coefficients of a previously trained logistic regression model.

Delta-p statistics can only be used to assess the individual effects of predictor columns in a logistic regression model. Logistic regression might not be the best choice when working with high-dimensional data that contains many correlated predictor columns or columns not correlated with the target column. The model also assumes a linear relationship between the predictors and the log odds of the target class, so it works best when the classes can be separated reasonably well by a linear decision boundary in the feature space.

If you want to replicate the procedure described in the article, one option is to install the open-source KNIME Analytics Platform on your laptop and download the KNIME workflow attached to the article for free. A visual representation of the workflow is available on the KNIME Hub without installing KNIME Analytics Platform. Other options are to implement the calculations in another programming tool, or even to perform them manually with a calculator.

References

[1] Cruce, T. M. (2009). A Note on the Calculation and Interpretation of the Delta-p Statistic for Categorical Independent Variables. Research in Higher Education, 50(6), 608–622.

Maarit Widmann
Low Code for Data Science

I am a data scientist in the evangelism team at KNIME; the author behind the KNIME self-paced courses and a teacher at KNIME.