Statistical Analysis on Cancer Patients Data in Python

Koketso Mangwale
7 min read · Jan 13, 2024

During a four-week Data Analysis Industry Training program, I had the opportunity to collaborate on a team project that involved analyzing a dataset containing information about cancer patients, each uniquely identified by a patient ID.

As highlighted by the World Health Organization (WHO), lung cancer was the leading cause of cancer-related deaths in 2020, accounting for approximately 1.80 million deaths worldwide. Given that cancer is influenced by a complex interplay of genetic, lifestyle, and environmental factors, it becomes crucial to pinpoint elements that strongly correlate with different levels of cancer.

This analysis aims to uncover valuable insights to facilitate more effective and tailored approaches to lung cancer prevention and intervention.

Data Description

In the given dataset there is a Level column indicating the level of cancer, categorized as low, medium, or high. There are also 21 risk factors, which are ordinal in nature and are scored on a scale of 1 to 9, with 1 indicating the least severe and 9 the most severe for a particular factor.

Level of Cancer and Risk Factors Data for 5 Patients

After cleaning the data to address data quality issues, I explored the Level column along with the 8 appropriate risk factors for 899 patients. I used Python libraries for the statistical tests and regression analysis (statsmodels).

Summary Statistics for Risk Scores for the 8 Cancer Symptoms
Plot Showing Cancer Risk Factors Score Descriptive Statistics: the box plots are not symmetric, indicating that the score distributions are skewed.
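As a rough illustration of this step, the sketch below computes summary statistics and skewness with pandas. The data frame is synthetic (the real dataset is not reproduced here), the two column names are only borrowed from the article, and the score weights are an assumption chosen to give a right-skewed distribution.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 899 ordinal scores on a 1-9 scale,
# with low scores more common than high ones (an assumption for illustration).
rng = np.random.default_rng(42)
weights = np.array([0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02])
scores = pd.DataFrame({
    "Fatigue": rng.choice(np.arange(1, 10), size=899, p=weights),
    "Obesity": rng.choice(np.arange(1, 10), size=899, p=weights),
})

summary = scores.describe()   # count, mean, std, min, quartiles, max
skewness = scores.skew()      # a positive value means a longer right tail

print(summary)
print(skewness)
```

A mean that sits well above the median, together with a positive skewness, tells the same story as an asymmetric box plot.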

Checking for Normality

I then performed a Kolmogorov-Smirnov test on the 899 patients to check whether the data follow a normal distribution, and obtained the results below:

Plots Showing the Cancer Risk Factors Score Frequencies: the histograms show skewed data.

The above visuals and statistical tests (Kolmogorov-Smirnov: N = 899, test statistic D of approximately 1, p < 0.01) indicate that the risk factor scores are not approximately normally distributed across the levels of cancer.
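A minimal sketch of such a test with `scipy.stats.kstest`, run on synthetic scores rather than the article's data. It compares two variants: raw scores against a standard normal, and standardized scores against a standard normal. Because every score is at least 1, the raw comparison forces the statistic to at least Φ(1) ≈ 0.84, so a statistic near 1 may reflect the scale of the data rather than its shape. Which variant was used in the article is not stated, so both are shown. The p-value is only approximate when the mean and standard deviation are estimated from the same data.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed 1-9 scores (an assumption for illustration).
rng = np.random.default_rng(0)
weights = np.array([0.25, 0.20, 0.15, 0.12, 0.10, 0.08, 0.05, 0.03, 0.02])
scores = rng.choice(np.arange(1, 10), size=899, p=weights)

# Raw scores against N(0, 1): every value is >= 1, so the statistic is
# large regardless of the shape of the data.
raw = stats.kstest(scores, "norm")

# Standardized scores against N(0, 1): this tests the shape of the distribution.
z = (scores - scores.mean()) / scores.std(ddof=1)
standardized = stats.kstest(z, "norm")

print(f"raw:          D = {raw.statistic:.3f}, p = {raw.pvalue:.3g}")
print(f"standardized: D = {standardized.statistic:.3f}, p = {standardized.pvalue:.3g}")
```

In this sketch both variants reject normality, but only the standardized statistic says anything about skewness.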

Statistical Analysis

I investigated whether the combined risk factor scores significantly influenced the likelihood of different levels of lung cancer risk.

Null hypothesis: There is no significant relationship between the combined set of risk factor scores and the probabilities of the different categories of the dependent variable (Level).

Create the regression model

Assumptions for Regression Model:

  • The response variable, ‘Level,’ is an ordered categorical variable ranging from 1 to 3.
  • The predictor variables, representing symptom (risk factor) scores ranging between 1 and 9, are also ordinal. A score of 1 indicates less severity, while 9 denotes the most severe condition.
  • Linearity assumption/Proportional odds assumption: i.e. the effect of the predictors on the odds of a higher level is assumed to be constant across all levels of the response variable.
  • No multicollinearity: As seen on the heatmap, the symptoms below are highly correlated with each other, so not all of them are included in modelling.
    • Dust Allergy with Occupational Hazards (0.88) and Alcohol use (0.84). Alcohol use will be included in modelling.
    • Occupational Hazards with Alcohol use (0.85).
    • Chest Pain with chronic Lung Disease (0.80) and Balanced diet (0.80). Balanced diet will be excluded from modelling.
    • Occupational Hazards with chronic Lung Disease (0.86), Alcohol use (0.85) and Dust Allergy (0.88). chronic Lung Disease will be included in modelling.
Heatmap showing the correlation between predictors and the response variable (Level) — selecting relevant independent variables
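The screening for highly correlated predictors described above can be sketched as follows. The data are synthetic, the 0.8 cut-off is taken from the values quoted in the list, and the use of Spearman correlation is an assumption (it suits ordinal scores, but the article does not say which method produced the heatmap).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
dust = rng.integers(1, 10, size=n)
# Built to track `dust` closely so that this pair is flagged below.
hazards = np.clip(dust + rng.integers(-1, 2, size=n), 1, 9)
fatigue = rng.integers(1, 10, size=n)

df = pd.DataFrame({
    "Dust Allergy": dust,
    "Occupational Hazards": hazards,
    "Fatigue": fatigue,
})

corr = df.corr(method="spearman")
threshold = 0.8
cols = list(corr.columns)

# Walk the upper triangle so each pair is reported once.
high_pairs = [
    (cols[i], cols[j], round(float(corr.iloc[i, j]), 2))
    for i in range(len(cols))
    for j in range(i + 1, len(cols))
    if abs(corr.iloc[i, j]) > threshold
]
print(high_pairs)
```

From each flagged pair, one variable would then be kept and the other dropped before fitting the model.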

Here are the relevant selected variables:

Dependent Variable: ‘Level’

Independent Variables: ‘Obesity’, ‘Coughing of Blood’, ‘Alcohol use’, ‘Passive Smoker’, ‘chronic Lung Disease’, ‘Fatigue’, ‘Chest Pain’, ‘Shortness of Breath’

Logistic Ordinal Regression

Then, I modeled the relationship between multiple independent variables represented by risk factor scores and the probabilities of different categories of the dependent variable, Level. Ordinal logistic regression was used to account for the ordered nature of the dependent variable.

Results:

Goodness of fit of model:

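  • The log-likelihood value is -197.35. A higher (less negative) value indicates a better fit of the OrderedModel to the observations, which is mainly useful when comparing models fitted to the same data.
  • The Likelihood Ratio (LLR) test gives p < 0.05, so the model fits significantly better than a null model without predictors.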
#llr p-value = 0.0
ordinal_model.llr_pvalue

Coefficients:

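  • Chest pain has a negative coefficient, indicating that, with the other factors held constant, an increase in its score is associated with lower odds of being in a higher level of cancer.
  • Coughing of Blood, Passive Smoker, Fatigue and Shortness of Breath have coefficients greater than 1, so each one-point increase in these scores multiplies the odds of a higher level by more than e ≈ 2.72. The remaining positive coefficients (Obesity, Alcohol use, chronic Lung Disease) point in the same direction with a smaller effect.
  • A difference in sign between coefficients means the factors are associated with Level in opposite directions, and a coefficient that is not statistically significant gives no clear evidence of an association. Neither by itself indicates a non-linear relationship.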
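To make the coefficients easier to read, they can be exponentiated into odds ratios. The sketch below uses the coefficient values that appear in the article's predicted score formula for the sample patients; the dictionary and variable names are illustrative.

```python
import math

# Coefficients as they appear in the article's predicted score formula.
coefficients = {
    "Obesity": 0.7855,
    "Coughing of Blood": 1.0432,
    "Alcohol use": 0.3849,
    "Passive Smoker": 1.3283,
    "chronic Lung Disease": 0.7303,
    "Fatigue": 2.0481,
    "Chest Pain": -1.0752,
    "Shortness of Breath": 1.1060,
}

# exp(coefficient) is the multiplicative change in the odds of a higher
# level for a one-point increase in that score, other scores held fixed.
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in sorted(odds_ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {ratio:6.2f}")
```

An odds ratio above 1 corresponds to a positive coefficient and one below 1 to a negative coefficient, which is why Chest Pain is the only factor that lands below 1.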

P-values and Significance:

  • All the factors have coefficients with p < 0.01 (and therefore also p < 0.05), so each risk factor has a highly significant relationship with the cancer level.

Overall, the output suggests that the predictors in the model have statistically significant relationships with the ordered categorical dependent variable (Level) at the 95% confidence level.

Intercepts/Thresholds:

Exponentiating each threshold gives the baseline odds of being at or below a level, i.e. the odds when every predictor is held at zero:

  • The odds of being in the combined low level (=1) and medium level (=2) vs the high level (=3) are exp(19.9573) ≈ 464,884,711.
  • The odds of being in the low level vs the higher levels of cancer are exp(2.3972) ≈ 10.99.

With the predictor variables held at zero, the odds of being in the lower levels of cancer (Low and Medium combined) rather than the High level are approximately 464.9 million to 1.

  • This indicates that, at baseline, the high level is extremely unlikely, and the risk factor scores must contribute a large combined effect before it becomes probable.

Meanwhile, the odds of being in the Low level rather than the higher levels of cancer (Medium and High combined) are approximately 10.99 to 1.

  • This is a far less extreme gap, so a much smaller combined score is enough to move a patient out of the low level.
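The baseline odds can be checked with a few lines of Python. This is a sketch under the article's reading of the two values as cumulative-logit cut points, where odds(Level ≤ j) = exp(threshold_j) when all predictors are zero; the variable names are illustrative.

```python
import math

# Thresholds as interpreted in the article (cumulative-logit cut points).
threshold_low_vs_higher = 2.3972     # between Low and {Medium, High}
threshold_lower_vs_high = 19.9573    # between {Low, Medium} and High

# With every predictor at zero, odds(Level <= j) = exp(threshold_j).
odds_low = math.exp(threshold_low_vs_higher)
odds_low_or_medium = math.exp(threshold_lower_vs_high)

print(f"odds(Low vs Medium/High) = {odds_low:.2f}")
print(f"odds(Low/Medium vs High) = {odds_low_or_medium:,.0f}")
```

Since the scores run from 1 to 9, a patient with every predictor at zero does not exist in the data, so these odds are reference points rather than predictions for a real patient.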

The probability that a patient has high levels of lung cancer

Using the cumulative logistic function to calculate the probabilities:

  • P( ≤ 2) = 1 / (1 + exp(-19.9573)) ≈ 1. With all predictors at zero, the probability of being in level 1 or 2 is approximately 1, so P( 3 ) = 1 - P( ≤ 2) ≈ 0.
  • P( ≤ 1) = 1 / (1 + exp(-2.3972)) ≈ 0.9166. With all predictors at zero, the probability of being in level 1, given the threshold for the transition from level 1 to level 2, is approximately 0.917 or 91.7%.
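In a cumulative logit model the logistic function of a threshold gives P(Level ≤ j), and the probability of a single level is the difference between neighbouring cumulative probabilities. The sketch below applies that to the two thresholds with all predictors at zero, assuming 2.3972 is the Low/Medium cut point and 19.9573 the Medium/High cut point.

```python
import math


def logistic(x: float) -> float:
    """Standard logistic function 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


thresholds = [2.3972, 19.9573]  # assumed order: Low/Medium, Medium/High

# Cumulative probabilities P(Level <= 1) and P(Level <= 2) at baseline.
p_le_1, p_le_2 = (logistic(t) for t in thresholds)

# Per-level probabilities are differences of the cumulative ones.
probabilities = [p_le_1, p_le_2 - p_le_1, 1.0 - p_le_2]

for label, p in zip(["Low", "Medium", "High"], probabilities):
    print(f"P({label}) = {p:.6g}")
```

Under this reading the value of about 0.917 is the baseline probability of the low level, and the baseline probability of the high level is close to zero.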

Does a high cumulative score on risk factors significantly influence the probability that a patient exhibits high levels of lung cancer?

Calculating the predicted probability of high level (level = 3) category using cumulative logistic regression:

Patient 1

  • Obesity: 8
  • Coughing of Blood: 7
  • Alcohol use: 5
  • Passive Smoker: 2
  • chronic Lung Disease: 3
  • Fatigue: 7
  • Chest Pain: 4
  • Shortness of Breath: 8

Predicted score for Patient 1 = 0.7855(8) + 1.0432(7) + 0.3849(5) + 1.3283(2) + 0.7303(3) + 2.0481(7) - 1.0752(4) + 1.1060(8) = 39.2423

P( 3 ) = 1 - P( ≤ 2) = 1 / (1 + exp(-(39.2423 - 19.9573))) = 1 / (1 + exp(-19.2850)) ≈ 1

Patient 2

  • Obesity: 1
  • Coughing of Blood: 1
  • Alcohol use: 5
  • Passive Smoker: 2
  • chronic Lung Disease: 1
  • Fatigue: 5
  • Chest Pain: 1
  • Shortness of Breath: 1

Predicted score for Patient 2 = 0.7855(1) + 1.0432(1) + 0.3849(5) + 1.3283(2) + 0.7303(1) + 2.0481(5) - 1.0752(1) + 1.1060(1) = 17.4114

P( 3 ) = 1 - P( ≤ 2) = 1 / (1 + exp(-(17.4114 - 19.9573))) = 1 / (1 + exp(2.5459)) ≈ 0.073

Findings

  • The combined, coefficient-weighted risk score has to exceed the upper threshold (19.9573) before the high level (3) of cancer becomes more likely than the lower levels.
  • Patient 1, who scores high on most risk factors, has a predicted probability of being in level 3 that is very close to 1, a strong prediction of the high level of cancer according to the features in the model.
  • Patient 2, who scores low on most risk factors, has a predicted probability of only about 0.07 of being in level 3, so a high cumulative score on the risk factors strongly raises the probability of the highest category.
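The two patient calculations can be reproduced with a small helper. This sketch uses the cumulative-logit form P(Level ≤ j) = logistic(threshold_j − score), so the threshold is subtracted from the predicted score before the logistic function is applied. The coefficients and patient scores are taken from the article; treating 2.3972 and 19.9573 as the two cut points, and all function names, are assumptions.

```python
import math

COEFFICIENTS = {
    "Obesity": 0.7855, "Coughing of Blood": 1.0432, "Alcohol use": 0.3849,
    "Passive Smoker": 1.3283, "chronic Lung Disease": 0.7303,
    "Fatigue": 2.0481, "Chest Pain": -1.0752, "Shortness of Breath": 1.1060,
}
THRESHOLDS = (2.3972, 19.9573)  # assumed cut points: Low/Medium, Medium/High


def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def predicted_score(scores: dict) -> float:
    """Coefficient-weighted sum of a patient's risk factor scores."""
    return sum(COEFFICIENTS[name] * value for name, value in scores.items())


def level_probabilities(scores: dict) -> list:
    """Return [P(Low), P(Medium), P(High)] for one patient."""
    eta = predicted_score(scores)
    p_le_1 = logistic(THRESHOLDS[0] - eta)
    p_le_2 = logistic(THRESHOLDS[1] - eta)
    return [p_le_1, p_le_2 - p_le_1, 1.0 - p_le_2]


patient_1 = {"Obesity": 8, "Coughing of Blood": 7, "Alcohol use": 5,
             "Passive Smoker": 2, "chronic Lung Disease": 3, "Fatigue": 7,
             "Chest Pain": 4, "Shortness of Breath": 8}
patient_2 = {"Obesity": 1, "Coughing of Blood": 1, "Alcohol use": 5,
             "Passive Smoker": 2, "chronic Lung Disease": 1, "Fatigue": 5,
             "Chest Pain": 1, "Shortness of Breath": 1}

for name, patient in [("Patient 1", patient_1), ("Patient 2", patient_2)]:
    probs = level_probabilities(patient)
    print(name, round(predicted_score(patient), 4), [round(p, 4) for p in probs])
```

Applying the logistic function to the score alone, without a threshold, would push every patient with a positive score toward a probability of 1, which is why the threshold term matters here.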

Conclusion

The analysis suggests that the identified risk factors significantly influence the likelihood of different levels of lung cancer. Reduction in these risk factors may potentially lower the risk and severity of lung cancer, as indicated by the statistical model and probability calculations.

So, programs aimed at reducing risk factors associated with symptoms, particularly those with coefficients greater than 1, may contribute to lowering the overall risk and severity of lung cancer.

Implementing individualized assessments of patients based on their specific risk factor scores can help identify high-risk individuals who may benefit from closer monitoring, early detection, and personalized intervention strategies.

Public health campaigns can focus on educating the population about modifiable risk factors such as exposure to secondhand smoke (passive smoking), obesity, alcohol use and fatigue. Promoting behavior changes through targeted recommendations can contribute significantly to reducing the overall burden of lung cancer. Such recommendations may include:

  • Avoiding secondhand smoke
  • Maintaining a healthy body weight
  • Avoiding or reducing consumption of alcohol
  • Getting enough sleep to avoid fatigue

“If You Torture the Data Long Enough, It Will Confess”. — Ronald Coase

