Interpreting results from logistic regression in R using the Titanic dataset

YS Koh
Jul 25, 2020

Logistic regression is a statistical model that is commonly used, particularly in epidemiology, to identify the predictors that influence a binary outcome. The coefficients are on the log-odds scale, so odds ratios are obtained by exponentiating them. While it is easy to find code or program manuals for fitting the model on the internet, few tutorials focus on how to interpret the output. In R, the summary of the model does not directly give the outputs we usually want to report, namely the odds ratios and their 95% confidence intervals (95% CI). Additional steps are required to generate them, and these are often not covered in tutorials. Hence, in this article, I will focus on how to fit a logistic regression model and obtain odds ratios (with 95% confidence intervals) in R, as well as how to interpret the R output.

For the dataset, we will use the training dataset from the Titanic competition on Kaggle (https://www.kaggle.com/c/titanic/data?select=train.csv) as an example. In this dataset, survival status (Survived) is the outcome, with 0 = No and 1 = Yes. We will look at the predictors that affect the survival status of passengers.
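The code in the rest of this article assumes a data frame called titanic. A minimal sketch for creating it, assuming train.csv has been downloaded from the Kaggle page above into the working directory (the helper name load_titanic is my own and not part of any package):

```r
# Hypothetical helper: read the Kaggle training file into a data frame.
# Assumption: the file contains the columns used later (Survived, Sex, Age, Parch, Fare).
load_titanic <- function(path = "train.csv") {
  titanic <- read.csv(path, stringsAsFactors = FALSE)
  # Make "female" the reference level, so odds ratios compare males with females
  titanic$Sex <- factor(titanic$Sex, levels = c("female", "male"))
  titanic
}

# Usage, once train.csv is in the working directory:
# titanic <- load_titanic()
# str(titanic)
```

Setting the factor levels explicitly is optional here, since R would pick "female" as the reference level alphabetically, but it makes the comparison group visible in the code.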

Univariate analysis with categorical predictor

We will first generate a simple logistic regression to determine the association between sex (a categorical variable) and survival status.

model <- glm(Survived ~ Sex, data = titanic, family = binomial)
summary(model)

Interpretation of the model: Sex is a significant predictor of survival status (p < 0.05).

However, we would like to have the odds ratio and 95% confidence interval instead of the coefficient on the log-odds scale. Hence, we use the following code to exponentiate the coefficient and its confidence limits:

exp(coefficients(model))
exp(confint(model))

Interpretation: From the result, the odds ratio is 0.0810, with a 95% CI of 0.0580 to 0.112. This means that the odds of surviving for males are 91.9% lower than the odds for females.
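The 91.9% figure follows directly from the odds ratio: an odds ratio below 1 corresponds to a (1 - OR) x 100% reduction in the odds. A small sketch of that arithmetic, using the values reported above (the function name or_to_percent is only for illustration):

```r
# Convert an odds ratio into a percentage change in the odds.
# Negative values mean lower odds, positive values mean higher odds.
or_to_percent <- function(or) (or - 1) * 100

or_sex <- 0.0810            # odds ratio for males vs females, from the output above
ci_sex <- c(0.0580, 0.112)  # 95% CI reported above

round(or_to_percent(or_sex), 1)  # -91.9, i.e. 91.9% lower odds for males
round(or_to_percent(ci_sex), 1)  # -94.2 -88.8
```

Applying the same conversion to the confidence limits gives a range for the percentage reduction, which can be easier to report than the odds ratio alone.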

Univariate analysis with a continuous predictor

We will now generate a simple logistic regression to determine the association between age (a continuous variable) and survival status.

model <- glm(Survived ~ Age, data = titanic, family = binomial)
summary(model)

Interpretation of the model: Age is a significant predictor of survival status (p = 0.0397).

We use the following code to exponentiate the coefficient:

exp(coefficients(model))
exp(confint(model))

Interpretation: From the result, the odds ratio is 0.989, with a 95% CI of 0.979 to 0.999. This means that for every 1-year increase in age, the odds of surviving decrease by 1.1%.
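Because this odds ratio applies per one-year increase, the effect of a larger age difference is obtained by raising it to the corresponding power (equivalently, multiplying the coefficient before exponentiating). A short sketch using the odds ratio reported above; the 10-year gap is just an example value:

```r
or_age_1yr <- 0.989   # odds ratio per 1-year increase in age, from the output above
years <- 10           # example age difference, chosen for illustration

# Odds ratio for a 10-year increase: OR^10 = exp(10 * log(OR))
or_age_10yr <- or_age_1yr^years
round(or_age_10yr, 3)               # about 0.895

# Percentage change in the odds of surviving for a 10-year increase in age
round((or_age_10yr - 1) * 100, 1)   # about -10.5
```

Note that the 10-year effect is not simply 10 times 1.1%, because odds ratios multiply rather than add.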

Multivariable logistic regression

The table below shows the results of the univariate analysis for some of the variables in the dataset. Based on these results, the following predictors are significant (p value < 0.05): Sex, Age, number of parents/children aboard the Titanic (Parch) and Passenger fare (Fare). We will use these variables in the multivariable logistic regression. Selecting variables for the multivariable model based on their univariate p-values is usually called univariate screening; it is related to, but not the same as, stepwise forward selection, in which predictors are added to the model one at a time.

To generate the multivariable logistic regression model, we use the following code:
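A univariate screen of this kind can be scripted rather than run one variable at a time. A minimal sketch, using simulated toy data instead of the Titanic file so that it runs on its own (the data frame toy and the variable Noise are invented for illustration and are not part of the Kaggle dataset):

```r
# Simulated toy data (NOT the Titanic data), only so the sketch is self-contained
set.seed(3)
n <- 300
toy <- data.frame(
  Sex   = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age   = round(runif(n, 1, 80)),
  Noise = rnorm(n)   # a predictor unrelated to the outcome
)
toy$Survived <- rbinom(n, 1, plogis(1 - 2 * (toy$Sex == "male") - 0.01 * toy$Age))

# Fit one logistic regression per predictor and keep its p-value.
# Row 2 of the coefficient table is the predictor (row 1 is the intercept);
# this works here because each predictor contributes a single coefficient.
predictors <- c("Sex", "Age", "Noise")
p_values <- sapply(predictors, function(v) {
  fit <- glm(reformulate(v, response = "Survived"), data = toy, family = binomial)
  coef(summary(fit))[2, "Pr(>|z|)"]
})
round(p_values, 4)
```

With the real data, the same loop could be pointed at the titanic data frame and the column names of interest.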

model <- glm(Survived ~ Sex + Age + Parch + Fare, data = titanic, family = binomial)
summary(model)

Interpretation of the model: All predictors remain significant after adjusting for other factors.

We then use the following code to exponentiate the coefficients:

exp(coefficients(model))
exp(confint(model))

Interpretation: Taking sex as an example, after adjusting for all the confounders (Age, number of parents/children aboard the Titanic and Passenger fare), the odds ratio is 0.0832, with a 95% CI of 0.0558 to 0.122. This means that the odds of surviving for males are 91.7% lower than the odds for females. Looking at Passenger fare, after adjusting for all the confounders (Sex, Age and number of parents/children aboard the Titanic), the odds ratio is 1.02, with a 95% CI of 1.01 to 1.02. This means that the odds of surviving increase by about 2% for every 1-unit increase in Passenger fare.
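When there are several predictors, it can be convenient to put the odds ratios and their confidence limits in one table instead of reading two separate outputs. A minimal sketch of one common way to do this, using simulated toy data rather than the Titanic file so that it runs on its own (toy, toy_model and or_table are names chosen for illustration):

```r
# Simulated toy data (NOT the Titanic data), only so the sketch is self-contained
set.seed(1)
n <- 500
toy <- data.frame(
  Sex  = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age  = round(runif(n, 1, 80)),
  Fare = round(rexp(n, rate = 1 / 30), 2)
)
# Outcome generated from an arbitrary logistic model
log_odds <- 1 - 2.5 * (toy$Sex == "male") - 0.01 * toy$Age + 0.01 * toy$Fare
toy$Survived <- rbinom(n, size = 1, prob = plogis(log_odds))

toy_model <- glm(Survived ~ Sex + Age + Fare, data = toy, family = binomial)

# Odds ratios and 95% CIs side by side, one row per coefficient
or_table <- exp(cbind(OR = coef(toy_model), confint(toy_model)))
round(or_table, 3)
```

The same cbind() call should work on the multivariable model fitted above, giving one row per predictor with its odds ratio and confidence limits.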

There are also some concepts related to logistic regression that I would like to explain:

  • AIC (Akaike Information Criterion): This metric measures the relative quality of a model and depends on two factors: the number of parameters in the model and how well the model fits the data (its likelihood). The lower the AIC value, the better the model. Comparing the model with only sex as the predictor and the multivariable model, the AIC values are 921.8 and 717.4 respectively, which suggests that the multivariable model is the better of the two. Note that AIC values are strictly comparable only when both models are fitted to the same observations; because Age is missing for some passengers, the multivariable model is fitted to fewer passengers than the sex-only model.
  • Power of the model: Comparing the model with only sex as the predictor and the multivariable model, the 95% CI for sex in the multivariable model (0.0558 to 0.122) is wider than in the univariate model (0.0580 to 0.112). This illustrates that adding more predictors to the model tends to widen the 95% CI, and if there are too many predictors, the power of the model to detect a significant difference may be reduced.
  • Hosmer-Lemeshow goodness-of-fit test: This test examines how well the model fits the data. It can be run using the code below:

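library(ResourceSelection)  # provides hoslem.test()
library(dplyr)
# Keep only passengers with complete data, matching the rows used to fit the model
survived_1 <- titanic %>%
  filter(!is.na(Sex) & !is.na(Age) & !is.na(Parch) & !is.na(Fare))
# Compare observed outcomes with the fitted probabilities of the multivariable model
hoslem.test(survived_1$Survived, fitted(model))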

Interpretation: The p-value is 0.1185, so there is no significant evidence that the model fits the data poorly. (survived_1 is created to drop all passengers with missing data, as the test cannot be performed when there are missing values.)
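One way to make sure two models are fitted to the same passengers before comparing their AIC values is to restrict both to the complete cases. A minimal sketch, again using simulated toy data rather than the Titanic file so that it runs on its own (all object names here are my own):

```r
# Simulated toy data (NOT the Titanic data) with some missing ages
set.seed(2)
n <- 400
toy <- data.frame(
  Sex = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age = round(runif(n, 1, 80))
)
toy$Survived <- rbinom(n, 1, plogis(1 - 2 * (toy$Sex == "male") - 0.02 * toy$Age))
toy$Age[sample(n, 80)] <- NA   # make 80 ages missing

# Keep only rows with complete data for every variable used in either model
toy_cc <- toy[complete.cases(toy[, c("Survived", "Sex", "Age")]), ]

m_sex  <- glm(Survived ~ Sex,       data = toy_cc, family = binomial)
m_both <- glm(Survived ~ Sex + Age, data = toy_cc, family = binomial)

# Both models now use the same observations, so their AIC values are comparable
nobs(m_sex) == nobs(m_both)
AIC(m_sex, m_both)
```

With the real data, the survived_1 data frame created for the Hosmer-Lemeshow test could play the role of toy_cc.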

In this article, I have looked at how to obtain odds ratios and 95% confidence intervals from logistic regression, as well as concepts such as AIC, the power of the model and the goodness-of-fit test.

YS Koh

I am interested in using R programming for the field of epidemiology and biostatistics.