Unleashing the Potential of Generalized Linear Models: Delving into GLM Theory and Application in R

Dr Shikhar Tyagi
6 min read · Mar 22, 2024

--

The Generalized Linear Model (GLM) is a powerful extension of the linear regression model that allows for modelling relationships between a dependent variable and a set of predictor variables when the distribution of the dependent variable is not necessarily normal. GLM is a flexible framework that accommodates a wide range of response variable distributions, making it suitable for various types of data. In this discussion, we will explore the theory behind GLMs and demonstrate their application in R using a real-life example.

Example: Predicting Diabetes

Consider the Pima Indians Diabetes dataset, a classic benchmark dataset in predictive modelling. This dataset comprises various health-related predictors such as the number of pregnancies, glucose levels, blood pressure, and more, along with a binary response indicating the presence or absence of diabetes in Pima Indian women.

Mathematical Expression:

The general form of the GLM can be expressed as:

g(μ) = η = β0 + β1X1 + β2X2 + … + βpXp,

where μ = E(Y) is the mean of the response, η is the linear predictor, and g(·) is the link function. The link function connects the linear predictor to the mean of the response variable, and its choice depends on the nature of the response variable and its distribution.

Components of GLM:

1. Random Component:

- The response variable Y follows a distribution from the exponential family (e.g., normal, binomial, Poisson).

2. Systematic Component:

- The predictors combine linearly to form the linear predictor η = β0 + β1X1 + … + βpXp.

3. Link Function:

- The link function connects the linear predictor to the mean of the response variable. Different link functions are used based on the nature of the response variable. Common link functions include:

- Identity link for the Gaussian (normal) distribution.

- Logit link for the binomial distribution.

- Log link for the Poisson distribution.
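In R, each distribution-and-link pairing above is represented by a family object. A minimal sketch (base R only) of how the link functions behave:

```r
# Family objects bundle a distribution with its link function.
fam_gauss <- gaussian(link = "identity")
fam_binom <- binomial(link = "logit")
fam_pois  <- poisson(link = "log")

# linkfun maps the mean of the response onto the linear-predictor scale.
fam_gauss$linkfun(2.5)  # identity link: returns 2.5 unchanged
fam_binom$linkfun(0.5)  # logit link: log(0.5 / (1 - 0.5)) = 0
fam_pois$linkfun(1)     # log link: log(1) = 0

# linkinv is the inverse link: it maps the linear predictor back to the mean.
fam_binom$linkinv(0)    # inverse logit of 0 is 0.5
```

These are the same family objects passed to `glm()` through its `family` argument, so the link used during fitting can always be inspected afterwards via `model$family$linkfun`.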

Assumptions:

1. Linearity: The relationship between predictors and the linear predictor is linear.

2. Independence: Observations are independent of each other.

3. Variance structure: The variance of the response is a known function of its mean, determined by the chosen exponential-family distribution. Constant variance (homoscedasticity) holds only in the special case of the Gaussian family.
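Each family object in R also carries the variance function that ties the variance of the response to its mean, which makes this assumption easy to inspect. A small sketch:

```r
mu <- 0.3  # an example mean value

gaussian()$variance(mu)  # constant variance: always 1 (the homoscedastic case)
poisson()$variance(mu)   # variance equals the mean: 0.3
binomial()$variance(mu)  # variance is mu * (1 - mu): 0.21
```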

Real-Life Example:

Using the Pima Indians Diabetes dataset, we will fit a GLM to predict diabetes status based on available predictors.

R-code
library(mlbench)                       # provides the PimaIndiansDiabetes2 dataset
data("PimaIndiansDiabetes2")
Data <- na.omit(PimaIndiansDiabetes2)  # drop rows with missing values
model <- glm(diabetes ~ ., data = Data, family = binomial)  # logistic regression
summary(model)

Explanation:

- We load the dataset and remove missing values.

- Then, we fit a binomial GLM with `diabetes` as the response variable and all other variables as predictors.

- The summary provides coefficients, standard errors, z-values, and p-values for each predictor.
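Because the model uses the logit link, exponentiating a coefficient turns it into an odds ratio, which is often easier to interpret than the raw log-odds scale. A sketch, assuming the `model` object fitted above:

```r
# Odds ratios: a value above 1 means the predictor raises the odds of diabetes.
OR <- exp(coef(model))

# Approximate 95% Wald confidence intervals on the odds-ratio scale.
CI <- exp(confint.default(model))

round(cbind(OR, CI), 3)
```

`confint()` would instead profile the likelihood, which is slower but usually preferred in small samples.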
R-code
Pred.Prob <- predict(model, type = "response")    # predicted probabilities
Pred.Y <- ifelse(Pred.Prob >= 0.5, "pos", "neg")  # classify at a 0.5 threshold
Accuracy <- sum(Pred.Y == Data$diabetes) / length(Data$diabetes) * 100
Accuracy                                          # percentage correctly classified

Explanation:

Upon fitting a binomial GLM to the dataset, we obtain a summary of the model, which includes coefficients, standard errors, z-values, and p-values for each predictor. These coefficients indicate the strength and direction of the relationship between each predictor and the likelihood of diabetes. Additionally, the model’s goodness-of-fit statistics provide insights into its overall performance and significance.

After predicting the probability of diabetes using the fitted model, we classify the predictions using a threshold of 0.5. Comparing these predicted values with the actual diabetes status allows us to assess the accuracy of our model. In this example, the model achieves an accuracy of approximately 78.32%, indicating its ability to effectively discriminate between individuals with and without diabetes.
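Overall accuracy can hide the balance between the two kinds of error, so it is worth cross-tabulating predictions against the observed classes. A sketch, assuming the `Pred.Y` vector and `Data` from above:

```r
# Confusion matrix: rows are predicted classes, columns are observed classes.
conf <- table(Predicted = Pred.Y, Observed = Data$diabetes)
conf

# Accuracy is the proportion of observations on the diagonal.
sum(diag(conf)) / sum(conf)
```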

Graphical Representation

plot(model, which = 1)  # residuals vs. fitted values
plot(model, which = 2)  # normal Q-Q plot of residuals
plot(model, which = 3)  # scale-location plot
plot(model, which = 4)  # Cook's distance
plot(model, which = 5)  # residuals vs. leverage
plot(model, which = 6)  # Cook's distance vs. leverage
library(car)            # for partial residual (component + residual) plots
crPlots(model)

Residual Plot:

Explanation: The residual plot displays the residuals (the differences between observed and predicted values) against the fitted values. It helps detect patterns or trends in the residuals, indicating potential issues with the model’s assumptions, such as non-linearity or heteroscedasticity.

Interpretation: Ideally, residuals should be randomly scattered around zero with no discernible pattern. Patterns or trends in the residuals suggest that the model may be misspecified.

Normal Q-Q Plot of Residuals:

Explanation: The Normal Q-Q plot compares the distribution of the residuals to a theoretical normal distribution. Deviations from a straight line indicate departures from normality.

Interpretation: Ideally, the points on the Q-Q plot should fall along a straight line. Deviations from this line suggest departures from normality, which may impact the validity of statistical inference.

Scale-Location Plot (Squared Residuals vs. Fitted Values):

Explanation: The scale-location plot (also known as the spread-location plot) displays the square root of the standardized residuals against the fitted values. It helps assess whether the variance of the residuals is constant across the range of fitted values.

Interpretation: Ideally, the points should be randomly scattered around a horizontal line with constant variance. A trend or pattern in the plot suggests heteroscedasticity, indicating that the variance of the residuals varies with the level of the response variable.

Cook’s Distance Plot (Influential Observations):

Explanation: Cook’s distance measures the influence of each observation on the model’s coefficients. The plot displays Cook’s distances for each observation, highlighting potentially influential points.

Interpretation: Points with large Cook’s distances are considered influential and may disproportionately affect the model’s coefficients. These points warrant further investigation to determine whether they are genuine outliers or influential observations.
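The observations flagged in the Cook's distance plot can also be listed programmatically. A minimal sketch, assuming the `model` object from above and using the common 4/n rule-of-thumb cutoff:

```r
cd <- cooks.distance(model)          # one distance per observation
cutoff <- 4 / length(cd)             # a common rule-of-thumb threshold
influential <- which(cd > cutoff)    # indices of potentially influential rows

# The most influential observations, largest first.
head(sort(cd[influential], decreasing = TRUE))
```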

Deviance Residuals vs. Fitted Values Plot:

Explanation: The deviance residuals plot displays the deviance residuals against the fitted values. Deviance residuals are a measure of the model’s goodness of fit.

Interpretation: Ideally, the points should be randomly scattered around zero with no discernible pattern. Patterns or trends in the plot suggest that the model may be misspecified or that the assumed distribution may not be appropriate.

Leverage vs. Residuals Plot:

Explanation: This plot displays leverage values (measures of how much an observation's predictor values differ from the mean) against the residuals. It helps identify influential observations.

Interpretation: Points with high leverage and large residuals are considered influential and may disproportionately affect the model’s coefficients. These points warrant further investigation to understand their impact on the model.

Partial Residual Plot:

Explanation: Partial residual plots show the relationship between an individual predictor and the response variable while accounting for the effects of other predictors in the model. They help visualize the relationship between a predictor and the response, considering the influence of other predictors.

Interpretation: The slope of the partial residual plot represents the effect of the corresponding predictor on the response, after adjusting for the effects of other predictors. Patterns or non-linear relationships in the plot suggest potential issues with the predictor’s relationship with the response.

Conclusion:

In conclusion, the application of Generalized Linear Models (GLMs) presents a powerful approach to predictive modelling, particularly in scenarios where traditional linear regression may be insufficient. By leveraging GLMs, we can effectively analyse data with non-normal distributions and non-constant variance, as demonstrated in the prediction of diabetes status using the Pima Indians Diabetes dataset.

Furthermore, this example underscores the importance of understanding the theoretical underpinnings of GLMs, including the linear predictor, link function, and probability distribution, in order to derive meaningful insights and make accurate predictions from our data. As we continue to delve into the realm of data science and predictive analytics, GLMs remain a valuable tool for uncovering patterns, informing decision-making, and advancing our understanding of complex phenomena.


Dr. Shikhar Tyagi, Assistant Professor at Christ Deemed to be University, specializes in Probability Theory, Frailty Models, Survival Analysis, and more.