Generalized Linear Model in Python

Sarka Pribylova
10 min read · Oct 10, 2022


The Generalised Linear Model (GLM) is one of many models used to describe a linear relationship between a dependent variable and its predictors. A GLM has three components: a random component, a systematic component and a link function. There are many GLM families: binomial, Poisson, gamma, quasi, Gaussian, Tweedie, etc. GLMs are a very good and easy-to-understand starting point for advanced statistical methodologies. We can use GLMs for extrapolation, for example, where gradient boosting or random forests do not perform well. We can use GLMs in the financial industry to evaluate loans or stocks, in the health care industry, or in any other case with multiple independent variables. The concept of a GLM is straightforward compared to ML methodologies, which usually require further transformations and understanding before making a conclusion.
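
In statsmodels, which is used in the code below, each family comes with a default canonical link that can be overridden. A minimal sketch of a few family constructors (the variable names are only for illustration and are not part of the article's code):

import statsmodels.api as sm

# A few common GLM families and their default links (illustration only)
binomial = sm.families.Binomial()    # logit link by default
poisson  = sm.families.Poisson()     # log link by default
gaussian = sm.families.Gaussian()    # identity link by default
# Tweedie with an explicit log link and a variance power between 1 and 2
tweedie  = sm.families.Tweedie(link = sm.families.links.Log(), var_power = 1.5)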

The purpose of this Python code is to create a simple binomial (or Tweedie) GLM and predict/forecast Default values for a credit department in a bank. Default is the response/dependent variable: a failure to fulfil a loan obligation. The higher the Default, the higher the chance the loan will not be repaid. The independent variables are: RiskLevel, YOB (years on book), LGD (loss given default), EAD (exposure at default), ID (primary key), Year, DJX Return (Dow Jones index) and GDP (gross domestic product).

Data sets are located here:

Outstanding loans — Columns: RiskLevel, YOB, LGD, EAD
Old loans — Columns: ID, RiskLevel, YOB, Year, Default
Simplified Fed history (years 2000 – 2017) — Columns: Year, DJX Return, GDP
Simplified adverse (years 2018, 2019, 2020) — Columns: Year, DJX, GDP

1. Python code

The full Python code is available here. Install and import libraries in Google Colaboratory.

%matplotlib inline

import numpy as np
import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf

Prepare the data for work.

# Read credit portfolio. Portfolio of loans to test
myPortfolio = pd.read_excel("OutstandingLoans.xlsx")
# Read historical loans
myLoanHistory = pd.read_excel("OldLoans.xlsx")
# Read economic and financial data from the Fed
FedHistory = pd.read_csv("Simplified_FedHistory.csv")

Combine both datasets, create the GLM model and display the results. From the Statsmodels library you can choose the GLM family Binomial, Tweedie or Poisson; these three will all provide a very good fit. The difference is not critical, however Tweedie is currently considered the family designed most for collective risk questions (a Tweedie variant is sketched after the code below). The Tweedie link function is not logit, but log.

# Combine both datasets to get the historical variables into one DataFrame
combinedHistory = pd.merge(myLoanHistory, FedHistory, on = "Year")

# Create the GENERALIZED LINEAR MODEL
# formula: Default is the dependent variable we predict
# The intercept is the predicted value of the dependent variable when all the
# independent variables (RiskLevel, YOB, DJX_Return, GDP) are 0.
# You can also try the Tweedie family with its log link function in this case

model = smf.glm(formula = "Default ~ RiskLevel + YOB + DJX_Return + GDP",
                data = combinedHistory,
                family = sm.families.Binomial())

# Fit the model
result = model.fit()
# Display and interpret results
print(result.summary())
# Estimated default probabilities
predictions = result.predict()
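
As mentioned above, the Binomial family can be swapped for a Tweedie family with a log link. A minimal sketch of this variant, using the same combinedHistory data (the rest of the article interprets the Binomial fit):

# Alternative: Tweedie family with a log link (sketch only)
tweedie_model = smf.glm(formula = "Default ~ RiskLevel + YOB + DJX_Return + GDP",
                        data = combinedHistory,
                        family = sm.families.Tweedie(link = sm.families.links.Log()))
tweedie_result = tweedie_model.fit()
print(tweedie_result.summary())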

Interpret the results.

Variables — the independent variables are those we have selected in the formula above: RiskLevel, YOB, DJX_Return and GDP. The dependent variable, the one we predict, is Default. The first row of the coefficient table is the Intercept, the predicted value of the dependent variable when all the independent variables are 0.

Coefficients — from the regression output we can see that the regression coefficient for RiskLevel is 0.6735. Because the binomial GLM uses a logit link, this means that, on average, each additional unit of RiskLevel is associated with an increase of 0.6735 in the log-odds of default, assuming the other predictor variables YOB, DJX_Return and GDP are held constant.
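
Because the logit link puts coefficients on the log-odds scale, exponentiating them gives odds ratios, which are often easier to read. A short sketch using the fitted result above:

# Exponentiated coefficients = multiplicative change in the odds of default
# for a one-unit increase in each predictor (other predictors held constant)
odds_ratios = np.exp(result.params)
print(odds_ratios)   # e.g. exp(0.6735) is roughly 1.96 for RiskLevel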

Deviance — the variation not explained by the model. Deviance is a goodness-of-fit metric used for GLMs. It measures the difference between the saturated (perfect) model and the proposed model, i.e. how much variation in the data the proposed model leaves unexplained. The lower the deviance, the better the model.
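
The fitted result exposes both the model deviance and the null (intercept-only) deviance; a large drop from the null deviance indicates the predictors explain a meaningful part of the variation. A quick check:

# Deviance of the fitted model vs. the intercept-only model
print(result.deviance)
print(result.null_deviance)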

Df Model — the number of predictor variables in the model. We have 4 predictors here: RiskLevel, YOB, DJX_Return and GDP.

Df Residuals — the sample size minus the number of parameters being estimated: df(Residual) = n - (k + 1) = n - k - 1, where n is the number of observations and k the number of predictors. Df Residuals is another name for the residual degrees of freedom of our model.

Link function (canonical function) — it links the mean of the dependent variable to the linear predictor. The normal distribution has the identity link function, the Poisson distribution has the log link function and the binomial distribution has the logit link function. A good description is provided in the statsmodels documentation.

Scale (true scale) — for Binomial and Poisson families the scale is 1 under the assumption of correct specification. For Gaussian and other families there is an additional scale parameter that is not fixed at one. In an overdispersed Poisson or Binomial model we also have an additional scale.

[0.025 and 0.975] — these are the lower and upper bounds of a symmetric 95% confidence interval for each coefficient (the 2.5th and 97.5th percentiles of its sampling distribution). Values outside this range can generally be considered unlikely for the true coefficient. For example, 95% of the area of the sampling distribution of the RiskLevel coefficient lies between 0.637 and 0.710; in other words, we are 95% confident that the YOB coefficient has a value between -0.327 and -0.303.
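
The same 95% interval shown in the summary, or an interval at another level, can be requested directly from the fitted result:

# 95% confidence intervals for the coefficients (the [0.025, 0.975] columns)
print(result.conf_int())
# A wider 99% interval, if preferred
print(result.conf_int(alpha = 0.01))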

Std err — the standard error (standard deviation) of the coefficient point estimate in the GLM. It is a measure of uncertainty about this estimate. A very large standard error means that a coefficient is estimated with a lot of imprecision. E.g. the DJX_Return coefficient is estimated more precisely than the GDP coefficient.

P>|z| — one of the most important statistics in the GLM summary. It uses the z statistic to produce the p-value, a measure of how likely it is that your coefficient estimate arose by chance. A p-value of 0.000 for RiskLevel says there is essentially no chance that RiskLevel has no effect on the dependent Default variable, so the result is not produced by chance. A p-value of, say, 0.256 for RiskLevel would say there is a 25.6% chance of observing such an estimate even if RiskLevel had no effect on the dependent variable, Default, i.e. that the result is produced by chance. P>|z| tells us whether our point estimate has been estimated "well" enough to distinguish it from zero. We define "well" using p < 0.05, because a common alpha is 0.05: the p-value is compared to a previously established alpha level (or equivalently a z threshold) to decide whether the coefficient is significant.
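
The z statistics and p-values from the summary table are also available as attributes of the fitted result, which is convenient for filtering significant predictors programmatically:

# z statistics and p-values for each coefficient
print(result.tvalues)    # for a GLM these are the z statistics
print(result.pvalues)
# Predictors significant at alpha = 0.05
print(result.pvalues[result.pvalues < 0.05])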

Pearson Chi2 — the larger the Chi-square value, the greater the probability that there really is a significant difference. Chi2 is a function of the degrees of freedom and the true scale. It is a test of independence, and it makes sense to use it e.g. to compare two GLM models (choose the one with the lower Chi2) or to compare the observed counts with their expected values under the multinomial setting.

Iterations — how many steps it took to maximise the log-likelihood when fitting the model. Usually IRLS (iteratively reweighted least squares) does not take more than 25 iterations, and you want to minimise the computational time. With each further iteration the log-likelihood gets slightly higher/better. Many other iterative methodologies try to remove redundant calculations, handle tabular matrix calculations better, include more functions for automated cross-validation, etc.

Log likelihood — the natural logarithm of the likelihood, a numerical measure of how likely the observed data are under the fitted model. The likelihood is the product of the densities evaluated at the observations. Usually the density takes values smaller than one, so its logarithm is negative, as in this case: -27334. Higher (less negative) values correspond to a better fit, because we want to maximize the log-likelihood of the model; for example, a log-likelihood of -3 is better than -7. The logarithm is a monotonically increasing function of its argument, so maximizing the log of a function is equivalent to maximizing the function itself.
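
The value reported in the summary can be read directly from the fitted result:

# Log-likelihood of the fitted model
print(result.llf)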

Z-score — a numerical measurement that describes the relationship of a value to the mean of a group of values, expressed in standard deviations from the mean. If a Z-score is 0, the data point's value is identical to the mean.

Revise the AIC and the BIC. The AIC (Akaike information criterion) has a lower value if the model fits better. Like BIC or MDL, AIC provides a basic score for choosing the best model. The BIC (Bayesian information criterion) likewise has a lower value for a better model fit; its calculation is similar to MDL. There is usually a high correlation between AIC and BIC. MDL (minimum description length) is another criterion to score a model: the lower the value, the better the model fit.

# Calculate the Akaike information criterion
print(result.aic)
# Calculate the Bayesian information criterion
print(result.bic)
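
One simple way to put AIC/BIC to work is to fit a reduced candidate model and keep whichever has the lower criterion. A sketch only; the reduced formula below is an example, not a recommendation:

# Fit a smaller candidate model and compare information criteria
reduced = smf.glm(formula = "Default ~ RiskLevel + YOB",
                  data = combinedHistory,
                  family = sm.families.Binomial()).fit()
print("Full model:    AIC =", result.aic,  "BIC =", result.bic)
print("Reduced model: AIC =", reduced.aic, "BIC =", reduced.bic)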

Compute the historical Default rates for years 2000–2017.

# Compute historical portfolio default rates
Default = []
for i in range(np.shape(combinedHistory["YOB"].unique())[0]):
    YOB = combinedHistory.loc[combinedHistory["YOB"] == i + 1]
    DefaultFreq = YOB["Default"].value_counts()
    DefaultRate = (DefaultFreq.values[1]/(DefaultFreq.values[0] + DefaultFreq.values[1]))*100
    Default.append(DefaultRate)

# DataFrame containing years on books and predictive model probabilities
myModel = pd.DataFrame({"YOB": np.array(combinedHistory["YOB"]), "estDefault": predictions})

Compute the estimated Default rates.

estDefault = []
for i in range(np.shape(myModel["YOB"].unique())[0]):
    YOB = myModel.loc[myModel["YOB"] == i + 1]
    estDefaultFreq = YOB["estDefault"].mean()*100
    estDefault.append(estDefaultFreq)

The adverse economic scenario file has DJX values and GDP totals for the years 2018, 2019 and 2020. The file functions somewhat like a holdout group. Between 2000 and 2017 the economy is expanding, and between 2018 and 2020 the economy is contracting.

AdverseScenario = pd.read_csv("Simplified_Adverse.csv")

# Portfolio under adverse economic conditions
AdversePortfolio = myPortfolio.assign(Year = AdverseScenario.iloc[0, 0],
                                      DJX_Return = AdverseScenario.iloc[0, 1],
                                      GDP = AdverseScenario.iloc[0, 2])

# Predicted default probabilities under adverse economic conditions, using the created GLM
PD = result.predict(AdversePortfolio)

# DataFrame containing years on books and predicted probabilities under adverse
# economic conditions. PD is calculated for each YOB; PD and myPortfolio are merged.
predPD = pd.DataFrame({"YOB": np.array(myPortfolio["YOB"]), "PD": PD})

See the DJX and GDP for years 2000–2020.

combined = pd.concat([FedHistory, AdverseScenario])

Compute the predicted Default rates and expected loss.

# Compute predicted default rates
predDefault = []
for i in range(np.shape(predPD["YOB"].unique())[0]):
    YOB = predPD.loc[predPD["YOB"] == i + 1]
    predDefaultFreq = YOB["PD"].mean()*100
    predDefault.append(predDefaultFreq)

# Compute the expected loss of the loan portfolio under adverse economic conditions
ExpectedLoss = sum(AdversePortfolio["EAD"]*AdversePortfolio["LGD"]*PD)
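
The portfolio figure above is the sum of EAD * LGD * PD over the individual loans. A per-loan breakdown (an illustration only, not part of the original code) can help locate the riskiest exposures:

# Expected loss per loan = EAD * LGD * PD; the portfolio value above is the sum
lossBreakdown = AdversePortfolio.assign(
    PD = np.asarray(PD),
    ExpectedLoss = AdversePortfolio["EAD"] * AdversePortfolio["LGD"] * np.asarray(PD))
print(lossBreakdown[["RiskLevel", "YOB", "EAD", "LGD", "PD", "ExpectedLoss"]].head())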

Plot the Default rate by YOB. If the YOB is the smallest (the most recent observations), there is a high chance of a high Default rate. Default rates are much higher in the contracting economy (adverse, 2018–2020) than in the expanding economy (historical, 2000–2017). The created GLM is a very good fit for the historical data: the black line is fully covered by the grey line.

# Visualization with matplotlib
from matplotlib.pyplot import figure
figure(figsize=(8, 3), dpi=80)
# original file, historical values, basis for GLM creation, train group, years until 2017, expanding economy
plt.plot(combinedHistory["YOB"].unique(), Default, "o-", color = "black", label = "Historical Data (2000-2017)")
plt.ylabel("Default Rate (%)")
plt.legend()
# result of GLM prediction from historical data, similar to validation group, the GLM model is the best fit
plt.plot(myModel["YOB"].unique(), estDefault, "o-", color = "grey", label = "Fitted Data (2000-2017)")
plt.xlabel("YOB (Years on Books)")
plt.legend()
# result of prediction from adverse portfolio, similar to holdout, years 2018, 2019 and 2020, contracting economy
plt.plot(predPD["YOB"].unique(), predDefault, "o-", color = "tan", label = "Predicted Adverse Data (2018,2019,2020)")
plt.legend()
plt.show()

2. Column definitions

YOB (years-on-books) — in this data set, the YOB information is the same as the age of the loan. All loans start with a YOB of 1. Another frequently used variable is the amount of time each loan was observed (years observed), which is the final value of the years-on-books (YOB) variable. The years observed is usually the number of years until default, until the end of the observation period (12 years), or until the loan is removed from the sample due to prepayment.

Risk level — portfolio products have different risk levels: low, medium or high. The riskiest loans are bad credit personal loans, bad credit consolidation loans, payday loans and auto title loans.

LGD — loss given default — the estimated amount of money a bank or other financial institution loses when a borrower defaults on a loan.

EAD — exposure at default — EAD is the predicted amount of loss a bank may be exposed to when a debtor defaults on a loan.

Default — failure to fulfil an obligation, failure to repay a loan. The higher the Default rate, the greater the chance the loan will not be repaid. It is the response/dependent variable, the default indicator.

DJX Return — Dow Jones index — a stock market index of 30 prominent companies listed on stock exchanges in the United States. To calculate the DJIA, the sum of the prices of all 30 stocks is divided by a divisor, the Dow Divisor.

GDP — Gross domestic product growth year over year.

3. References

https://www.casact.org/sites/default/files/2022-07/RM9_AtlernativestoTweedieDistributioninGLM.pdf
https://clas.ucdenver.edu/marcelo-perraillon/sites/default/files/attached-files/week_7_glm_and_costs_perraillon.pdf
https://www.sagepub.com/sites/default/files/upm-binaries/21121_Chapter_15.pdf
https://www.researchgate.net/post/When-and-in-what-situation-gamma-GLM-is-recommended
https://stats.stackexchange.com/questions/77579/log-linked-gamma-glm-vs-log-linked-gaussian-glm-vs-log-transformed-lm
https://www.jstatsoft.org/article/download/v033i01/361
https://scikit-learn.org/stable/modules/linear_model.html
http://www.science.smith.edu/~jcrouser/SDS293/labs/lab4-py.html
https://towardsdatascience.com/scikit-learns-generalized-linear-models-4899695445fa
https://www.mathworks.com/matlabcentral/fileexchange/67771-stress-testing-predicting-loss-under-adverse-economic-conditions
https://github.com/KhalilBelghouat/StressTestingLoanPortfolio
https://assets.kpmg/content/dam/kpmg/cn/pdf/en/2020/02/stress-testing-loan-portfolios-in-times-of-crisis.pdf
https://www.bis.org/bcbs/events/rtf08bjrspres.pdf
https://cms.rmau.org/uploadedFiles/Credit_Risk/Library/RMA_Journal/Other_Topics_(1998_to_present)/Stress%20Testing%20the%20Commercial%20Loan%20Portfolio%20-%20Why%20and%20How.pdf
https://cmup.fc.up.pt/cmup/engmat/2012/seminario/artigos2012/alvaro/jrmv_assouan_web.pdf
https://www.researchgate.net/publication/23755208_Stress_Testing_of_Real_Credit_Portfolios
https://publications.gc.ca/Collection/FB3-2-106-47E.pdf
https://www.investopedia.com/terms/d/defaultrate.asp
https://www.investopedia.com/terms/a/adversely-classified-asset.asp
https://www.datarobot.com/wiki/training-validation-holdout/
https://www.cepal.org/sites/default/files/publication/files/41254/RVI120_MarquesMotta.pdf
https://www.statsmodels.org/dev/glm.html
https://www.investopedia.com/articles/basics/03/050203.asp
https://towardsdatascience.com/python-for-finance-an-implementation-of-the-modern-portfolio-theory-39cdbaeefbd4
https://medium.com/codex/creating-a-diversified-portfolio-with-correlation-matrix-in-python-7d7825255a2d
https://www.researchgate.net/publication/252914218_Pearson%27s_Idea_to_test_fitting_in_GLM
https://builtin.com/data-science/portfolio-optimization-python
https://www.kaggle.com/code/amlgroupproject/stock-portfolio-diversification-using-ml
https://www.statology.org/interpret-prz-logistic-regression-output-r/
https://jorgepit-14189.medium.com/portfolio-optimization-in-python-5c442df56ac4
https://www.iexcloud.io/community/blog/portfolio-risk-management-with-python-from-correlation-to-diversification
https://towardsdatascience.com/cryptocurrency-analysis-with-python-buy-and-hold-c3b0bc164ffa
https://link.springer.com/book/10.1007/978-3-030-53743-2
https://www.youtube.com/watch?v=KYm01d2hr6g
https://www.rairo-ro.org/articles/ro/pdf/2021/01/ro200347.pdf
https://developers.refinitiv.com/en/article-catalog/article/portfolio-optimisation-ii
https://www.listendata.com/2019/08/datasets-for-credit-risk-modeling.html
https://www.mathworks.com/help/risk/portfolioecl.html
https://www.bis.org/basel_framework/chapter/CRE/32.htm
https://en.wikipedia.org/wiki/Generalized_linear_model
https://glum.readthedocs.io/en/latest/motivation.html
https://www.foxbusiness.com/personal-finance/high-risk-loans
https://github.com/statsmodels/statsmodels/issues/3101
https://www.mathworks.com/help/risk/compare-pd-using-ttc-and-pit-models.html
https://www.mathworks.com/help/risk/stress-testing-retail-credit-default-probabilities-using-panel-data-1.html
https://www.mathworks.com/help/risk/modeling-probabilities-of-default-with-cox-proportional-hazards.html
https://core.ac.uk/download/48544724.pdf
https://en.wikipedia.org/wiki/Iteratively_reweighted_least_squares
https://en.wikipedia.org/wiki/Dow_Jones_Industrial_Average
https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a
https://www.fmx.nfkatzke.com/Projects/HRP.pdf
https://towardsdatascience.com/generalized-linear-models-9cbf848bb8ab
https://arxiv.org/abs/2204.02735
https://www.statlect.com/glossary/log-likelihood
