Ordinal Logistic Regression and its Assumptions — Full Analysis

A detailed Ordinal Logistic Regression analysis of the UN’s 2019 World Happiness Report.

Evangeline Lee
evangelinelee
10 min read · May 25, 2019


Introduction

The United Nations Sustainable Development Solutions Network has published the 2019 World Happiness Report. Its dataset, named “Chapter 2: Online Data”, can be found and downloaded from their website. The dataset contains data for 136 countries from 2008 to 2018, with 23 predictor variables and one response variable, the Happiness Score.

The purpose of the analysis is to discover which variable(s) have the greatest effect on the Happiness Score rating. To do this, we can collapse the Happiness Score (a 0 to 10 continuous variable, named Life Ladder in the original dataset) into three ordered categorical groups, Dissatisfied, Content, and Satisfied, for simplicity. We can also eliminate some variables if they have many missing values or are similar in nature. Below are the predictor variables selected for the analysis, along with brief descriptions:

1. GDP — Gross Domestic Product per capita
2. Social Support — having someone to count on in times of trouble
3. Healthy Life Expectancy — healthy life expectancies at birth
4. Freedom — freedom to make life choices
5. Generosity — average response of whether made monetary donation to charity in the past month
6. Corruption — average response of perception on corruption spread throughout the government or business
7. Confidence in Government — confidence in national government
8. Household Income — household income in international dollars

A more detailed description of the variables can be found in Statistical Appendix 1 for Chapter 2 on the World Happiness Report website.
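As a sketch of the collapsing step, base R’s cut() can turn the continuous Life Ladder score into an ordered factor. The breakpoints 4.5 and 6 below are illustrative assumptions, not the cutoffs used in the actual analysis.

```r
# Illustrative sketch: collapse the continuous 0-10 score into three
# ordered groups. The breakpoints 4.5 and 6 are assumed for this example.
ladder <- c(3.2, 4.8, 5.9, 6.4, 7.1)

group <- cut(ladder,
             breaks = c(0, 4.5, 6, 10),
             labels = c("Dissatisfied", "Content", "Satisfied"),
             ordered_result = TRUE)  # keep the levels ordered

group  # ordered factor: Dissatisfied < Content < Satisfied
```

Setting ordered_result = TRUE matters later, because the ordinal regression function requires the response to be an ordered factor.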

Method

Hypothesis

Since the outcome variable is categorical and ranked, we can perform an Ordinal Logistic Regression analysis on the dataset. We set alpha = 0.05 and the hypotheses as follows:
H0: none of the variables has a statistically significant effect on the Happiness Score
H1: at least one variable has a statistically significant effect on the Happiness Score

Preliminary Analysis on the Dataset

Below is a short preview of the dataset after some cleaning and wrangling. Only the first five countries’ data are shown here.

Below is a boxplot summarizing the distribution (median, quartiles, range, etc.) of each variable. There were 136 countries in the original dataset, but 26 were removed because they had missing values in one or more predictor variables. If these countries were not removed prior to fitting the model, the results could be distorted and thus invalid. This leaves 110 countries in the dataset. Although 26 observations were dropped, the remaining sample size of 110 should be sufficient for the analysis.

From the boxplot above, we see that Happiness Score, GDP, Freedom, Generosity, and Confidence in Government are approximately normally distributed while Social Support, Healthy Life Expectancy, Corruption, and Household Income are a bit skewed.

We can also examine the differences in each variable between each group with a boxplot.

From the above boxplot, it is clear that:

  • The Satisfied group has higher values in the GDP, Social Support, Healthy Life Expectancy, and Freedom variables and lower values in the Corruption and Household Income variables.
  • The Dissatisfied group has a higher value in the Household Income variable and lower values in the GDP, Social Support, Healthy Life Expectancy, and Freedom variables.
  • The Content group falls in between on most variables; however, it has the highest value in Corruption and the lowest in Confidence in Government.

From the general observations above, we can make an educated guess that GDP, Social Support, Healthy Life Expectancy, and Freedom are the most influential factors for the happiness rating. However, there is no sound statistical support behind this guess. Therefore we should perform the Ordinal Logistic Regression analysis on this dataset to find which factor(s) have a statistically significant effect on the happiness rating.

Ordinal Logistic Regression

The reason for using Ordinal Logistic Regression is that the dependent variable is categorical and ordered. The dependent variable of the dataset is Group, which has three ranked levels: Dissatisfied, Content, and Satisfied. Ordinal Logistic Regression takes this ordering into account and returns the contribution of each independent variable.

One could fit a Multinomial Logistic Regression model to this dataset; however, Multinomial Logistic Regression does not preserve the ranking information in the dependent variable when estimating the contribution of each independent variable.

Another method that comes to mind when talking about “most important variables” is Principal Component Analysis (PCA). However, PCA does not take the response variable into account; it only considers the variance of the independent variables. We will not use it here, as the result could be meaningless.

Before fitting the Ordinal Logistic Regression model, one would want to normalize each variable first, since some variables are on very different scales from the rest (e.g. GDP and Healthy Life Expectancy). Normalizing here means that all variables are standardized so that each has a mean of 0 and a standard deviation of 1. In other words, all variables are converted to the same scale. No changes are made to the variables except for rescaling, and this will make the later interpretation much easier.
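A minimal sketch of the standardization step using base R’s scale(); the numbers here are made up, and in practice this would be applied to each predictor column of the dataset.

```r
# Standardize a variable to mean 0 and standard deviation 1.
gdp <- c(9.5, 10.2, 8.7, 11.1, 10.4)  # made-up values for illustration
gdp_std <- as.numeric(scale(gdp))     # computes (x - mean(x)) / sd(x)

round(mean(gdp_std), 10)  # 0
round(sd(gdp_std), 10)    # 1
```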

Below is the R code for fitting the Ordinal Logistic Regression model and obtaining its coefficient table with p-values. There is a great tutorial written by UCLA’s IDRE (see References) that nicely explains the concept of Ordinal Logistic Regression and the steps to perform it in R.

# load MASS for polr()
library(MASS)

# fit the proportional odds logistic regression model
fit <- polr(Group ~ GDP + Social.Support + Healthy.Life + Freedom +
              Generosity + Corruption + Confidence.in.Govt + Household.Income,
            data = happy, Hess = TRUE)

# store the coefficient table
ctable <- round(coef(summary(fit)), 4)
# calculate two-sided p-values from the t values
p <- pnorm(abs(ctable[, "t value"]), lower.tail = FALSE) * 2
# combine the coefficient table with the p-values
(ctable <- cbind(ctable, "p value" = round(p, 4)))

The last two rows in the coefficient table are the intercepts, or cutpoints, of the Ordinal Logistic Regression. These cutpoints indicate where the latent variable is cut to form the three groups observed in the data. The cutpoints are generally not used in the interpretation of the analysis; they simply represent the thresholds between groups, so they will not be discussed further here.
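To illustrate how the cutpoints work (with made-up numbers, not the fitted values): in the proportional odds model, the cumulative probability of being at or below group j is the logistic CDF evaluated at zeta_j minus the linear predictor.

```r
# Made-up cutpoints and linear predictor, for illustration only.
zeta <- c(-1.0, 1.5)  # assumed cutpoints: Dissatisfied|Content, Content|Satisfied
eta  <- 0.4           # assumed linear predictor for one observation

# P(Y <= j) = plogis(zeta_j - eta); group probabilities are differences
p_dissatisfied <- plogis(zeta[1] - eta)
p_content      <- plogis(zeta[2] - eta) - plogis(zeta[1] - eta)
p_satisfied    <- 1 - plogis(zeta[2] - eta)

round(c(p_dissatisfied, p_content, p_satisfied), 4)  # the three sum to 1
```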

Because the variables were standardized, we can compare the absolute values of the coefficients in the table to gauge influence: the variable with the largest absolute coefficient is the most influential. In this case, those variables are Social Support (1.4721), Corruption (1.0049), and GDP (0.8619). These variables also have smaller p-values compared to the others. However, with alpha = 0.05, only Social Support (p = 0.0254) and Corruption (p = 0.0328) have p-values below 0.05, so only these two variables are statistically significant. Since at least one variable is statistically significant, the null hypothesis (H0) is rejected in favor of the alternative hypothesis (H1).

One thing to note is that the coefficients in the table are on the log-odds scale and read as “for a one unit increase in GDP, the log odds of having higher satisfaction increase by 0.8619”. This is difficult to interpret, so it is recommended to convert the log odds into odds ratios for easier comprehension. One can also calculate the 95% confidence intervals for each coefficient.

# profiled 95% confidence intervals for the coefficients
ci <- round(confint(fit), 4)
# log-odds coefficients
or <- round(coef(fit), 4)
# convert the coefficients and CIs into odds ratios
round(exp(cbind(OR = or, ci)), 4)

The output above shows the coefficients converted to proportional odds ratios along with their 95% confidence intervals. The interpretation is “for a one unit increase in GDP, the odds of moving from Dissatisfied to Content or Satisfied are 2.3677 times greater, given that the other variables in the model are held constant”.

The two statistically significant variables have proportional odds ratios of 4.3584 (Social Support) and 0.3661 (Corruption). These read as “for a one unit increase in Social Support, the odds of moving from Dissatisfied to Content or Satisfied are 4.3584 times greater, given that the other variables in the model are held constant”, and “for a one unit increase in Corruption, the odds of moving from Dissatisfied to Content or Satisfied are multiplied by 0.3661, i.e. they decrease, given that the other variables in the model are held constant”. In other words, the higher the Social Support, the higher the Happiness Score; the higher the Corruption, the lower the Happiness Score.
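These odds ratios are simply the exponentiated log-odds coefficients from the earlier table, which is easy to verify directly:

```r
# Odds ratio = exp(log-odds coefficient), using the values reported
# in the coefficient table above.
coefs <- c(Social.Support = 1.4721, Corruption = -1.0049, GDP = 0.8619)
round(exp(coefs), 4)
# Social.Support: 4.3584, Corruption: 0.3661, GDP: 2.3677
```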

Ordinal Logistic Regression Assumptions

Now that the Ordinal Logistic Regression model has been fitted, we need to check its assumptions to ensure that it is a valid model. The assumptions of Ordinal Logistic Regression are as follows and should be tested in order:

  1. The dependent variable is ordered.
  2. One or more of the independent variables are continuous, categorical, or ordinal.
  3. No multi-collinearity.
  4. Proportional odds.

Multi-Collinearity

We know that our dataset satisfies assumptions 1 and 2 (see the dataset preview earlier). We will now check assumption 3, multi-collinearity, beginning with the correlation plot between each pair of variables.

# load GGally for ggpairs()
library(GGally)

# correlation plot
happy.var <- happy[, c(3:10)]
ggpairs(happy.var, title = "Correlation Plot between each Variable")

From the correlation plot one can see that GDP, Healthy Life Expectancy, and Social Support are more strongly correlated with one another, at around 0.8. Although a correlation coefficient of 0.8 indicates a strong linear relationship between two variables, it is not high enough on its own to establish collinearity. Therefore a Variance Inflation Factor (VIF) test should be performed to check whether multi-collinearity exists.

Since an Ordinal Logistic Regression model has a categorical dependent variable, VIF might not be directly applicable. Normally we would need to convert the categorical variable into numeric dummy variables. However, because the original numeric Happiness Score is available, no dummy variable is needed: we can fit a multiple linear regression on the Happiness Score and calculate the VIF directly.

# load car for vif()
library(car)

# check VIF
fit2 <- lm(scale(Happiness.Score) ~ GDP + Social.Support + Healthy.Life + Freedom +
             Generosity + Corruption + Confidence.in.Govt + Household.Income,
           data = happy)
vif(fit2)

The general rule of thumb for the VIF test is that if a VIF value is greater than 10, there is multi-collinearity. Since none of the VIF values in the output above are greater than 10 (or even close to it), we conclude that there is no multi-collinearity in the dataset and assumption 3 is met.
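For reference, the VIF of a predictor is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. A small hand-rolled sketch with simulated data (not the happiness dataset):

```r
# VIF_j = 1 / (1 - R^2_j), where R^2_j is from regressing predictor j
# on the remaining predictors. Simulated data for illustration.
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.8 * x1 + rnorm(100)  # moderately correlated with x1 by construction
x3 <- rnorm(100)             # independent of both

r2_x1  <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1 <- 1 / (1 - r2_x1)
vif_x1  # well below the rule-of-thumb threshold of 10
```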

Proportional Odds

Now we conduct the Brant test for the last assumption, proportional odds. This assumption means that the relationship between each pair of outcome groups is the same. If the relationship between all pairs of groups is the same, then there is only one set of coefficients, which means there is only one model. If this assumption is violated, different models are needed to describe the relationship between each pair of outcome groups.

# load brant for brant()
library(brant)

# test the parallel regression (proportional odds) assumption with the Brant test
brant(fit)

Above is the Brant test result for this dataset. We conclude that the parallel regression assumption holds, since the p-values for all variables are greater than alpha = 0.05. The output also contains an Omnibus row, which tests the model as a whole; its p-value is also greater than 0.05. Therefore the proportional odds assumption is not violated and the model is valid for this dataset.

Conclusion

The preliminary analysis and the Ordinal Logistic Regression analysis were conducted on the 2019 World Happiness Report dataset. Based on the results, we can conclude that Social Support and Corruption are the main factors affecting the Happiness Score rating in 2018. For a one unit increase in Social Support, the odds of moving from Dissatisfied to Content or Satisfied are 4.3584 times greater; for a one unit increase in Corruption, the odds of moving from Dissatisfied to Content or Satisfied are multiplied by 0.3661, a substantial decrease. Another variable worth noting, though not statistically significant, is GDP: for a one unit increase in GDP, the odds of moving from Dissatisfied to Content or Satisfied are 2.3677 times greater.

References

ORDINAL LOGISTIC REGRESSION | R DATA ANALYSIS EXAMPLES. (n.d.). Retrieved May 09, 2019, from <https://stats.idre.ucla.edu/r/dae/ordinal-logistic-regression/>

Rawat, A. (2018, February 20). Ordinal Logistic Regression. Retrieved May 09, 2019, from <https://towardsdatascience.com/implementing-and-interpreting-ordinal-logistic-regression-1ee699274cf5>

ORDINAL REGRESSION. (n.d.). Retrieved May 09, 2019, from <https://www.st-andrews.ac.uk/media/capod/students/mathssupport/ordinal logistic regression.pdf>

Blissett, R. (2017, November 26). Logistic Regression in R. Retrieved May 09, 2019, from <https://rpubs.com/rslbliss/r_logistic_ws>


Data Analyst at National Debt Relief; MS Applied Statistics from University of Kansas. Get in touch: https://www.linkedin.com/in/evangelinelee