Extending the Analysis: Predicting Lung Cancer Risk through Data Analysis

Larissakimberly
INST414: Data Science Techniques
May 16, 2024
  • Introduction:

In my previous Medium post, “Exploring Medical Data: Predicting Diabetes Risk through Clustering Analysis,” I applied clustering analysis to predict diabetes risk. This post extends that analysis to another critical health concern: lung cancer. Diabetes and lung cancer differ significantly in etiology and manifestation, but both underscore the importance of data-driven approaches for understanding and mitigating serious health risks.

  • Motivating Question:

The central question is how clustering analysis can help predict an individual's lung cancer risk from demographic and lifestyle factors. By extending the exploration from diabetes to lung cancer, I aim to provide actionable insights that can inform healthcare decisions and interventions targeted at reducing the burden of this life-threatening disease.

  • Data Description:

The dataset used in this analysis is a lung cancer prediction dataset from Kaggle. It includes attributes such as age, gender, smoking habits, and symptoms associated with lung cancer. Before clustering, missing values were handled and categorical variables were encoded as numbers, since KMeans works on numeric features only.
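To make the preprocessing step concrete, here is a minimal sketch of one way to fill missing values and encode categorical variables with pandas. The toy rows and the column names (`AGE`, `GENDER`, `SMOKING`) are assumptions for illustration, not the actual Kaggle schema, and median/mode imputation is just one reasonable choice rather than necessarily what the project used.

```python
import pandas as pd

# Hypothetical toy rows; the column names are assumptions, not the Kaggle schema.
raw = pd.DataFrame({
    "AGE": [69, 74, float("nan"), 59, 63],
    "GENDER": ["M", "M", "F", "M", None],
    "SMOKING": ["YES", "NO", "YES", "NO", "YES"],
})


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing values, then one-hot encode the categorical columns."""
    out = df.copy()
    num_cols = list(out.select_dtypes(include="number").columns)
    cat_cols = [c for c in out.columns if c not in num_cols]

    # Numeric gaps: fill with the column median.
    out[num_cols] = out[num_cols].fillna(out[num_cols].median())

    # Categorical gaps: fill with the most frequent value in the column.
    for col in cat_cols:
        out[col] = out[col].fillna(out[col].mode().iloc[0])

    # With drop_first=True a two-valued category becomes a single 0/1 column.
    return pd.get_dummies(out, columns=cat_cols, drop_first=True, dtype=int)


clean = preprocess(raw)
print(clean)
```

The result contains only numeric columns and no missing values, which is the form KMeans expects.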

  • Course Methods:

As in the diabetes risk analysis, this project uses clustering, specifically KMeans. The objective is to segment the dataset into distinct clusters based on selected features related to lung cancer risk factors, in order to uncover patterns and risk profiles that can inform targeted interventions.

  • Analysis:

After preprocessing, KMeans clustering was used to group individuals with similar lung cancer risk profiles. KMeans refines its cluster centroids iteratively: it assigns each record to the nearest centroid, recomputes each centroid as the mean of the records assigned to it, and repeats until the assignments stop changing. The resulting clusters describe distinct risk profiles characterized by demographic and lifestyle factors, which reveal the underlying structure of the data and point to actionable insights for healthcare decision-making.
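The clustering step can be sketched with scikit-learn as follows. This is an illustrative sketch, not the project's actual code: the data is synthetic (three made-up groups standing in for the preprocessed dataset), and the feature layout, the scaling step, `n_init`, and `random_state` are all assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the preprocessed data: three made-up groups with
# columns [age, smoking score, symptom count]. This is not real patient data.
rng = np.random.default_rng(0)
centers = np.array([[40.0, 0.2, 1.0], [58.0, 0.5, 3.0], [75.0, 0.9, 6.0]])
spread = np.array([3.0, 0.05, 0.5])
X = np.vstack([rng.normal(c, spread, size=(30, 3)) for c in centers])

# Standardize so that age (tens of years) does not dominate the distances.
X_scaled = StandardScaler().fit_transform(X)

# n_clusters=3 is an illustrative choice; n_init reruns the algorithm from
# several random starting centroids and keeps the best result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X_scaled)

print("records per cluster:", np.bincount(labels))
print("iterations until convergence:", kmeans.n_iter_)
```

Scaling matters here because KMeans relies on Euclidean distance, so a feature with a large numeric range would otherwise outweigh the 0/1 indicators.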

  • Insights:

The clustering analysis produced three distinct clusters, each representing a different lung cancer risk profile. Examining the centroids of these clusters, which are the per-feature averages of their members, shows which demographic and lifestyle characteristics are associated with each level of risk. These insights can guide healthcare professionals in tailoring preventive measures and interventions to individual risk profiles.
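As a rough illustration of what examining the centroids can look like, the sketch below fits KMeans on synthetic data and converts the centroids back to the original units so they can be read as profiles. The feature names (`age`, `smoking`, `symptom_count`) and the data are hypothetical, and the numbers it prints are not the results of this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

features = ["age", "smoking", "symptom_count"]  # hypothetical feature names

# Synthetic stand-in data: three made-up groups, not real patient records.
rng = np.random.default_rng(1)
centers = np.array([[40.0, 0.2, 1.0], [58.0, 0.5, 3.0], [75.0, 0.9, 6.0]])
spread = np.array([3.0, 0.05, 0.5])
X = np.vstack([rng.normal(c, spread, size=(30, 3)) for c in centers])

scaler = StandardScaler()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
kmeans.fit(scaler.fit_transform(X))

# Centroids live in standardized space; map them back to the original units.
centroids = pd.DataFrame(
    scaler.inverse_transform(kmeans.cluster_centers_), columns=features
)
centroids["size"] = np.bincount(kmeans.labels_, minlength=3)

# Sort by symptom count so the rows read from lowest to highest burden.
profile = centroids.sort_values("symptom_count").reset_index(drop=True)
print(profile.round(2))
```

Note that KMeans assigns cluster numbers arbitrarily, so the ordering of risk levels has to come from reading the centroid values, not from the labels themselves.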

  • Findings:

The three clusters differ in their risk profiles. Cluster 0 consists of individuals with moderate risk factors, including older age and a higher prevalence of smoking. Cluster 1 comprises individuals with relatively lower risk factors. Cluster 2 represents the highest-risk group, characterized by older age, a higher prevalence of smoking, and more symptoms related to lung cancer.

  • Implications:

Healthcare practitioners can use the insights from clustering analysis to tailor preventive strategies and interventions for individuals based on their specific risk profiles. For example, individuals in Cluster 2, with the highest risk profile, may benefit from targeted screening programs and lifestyle interventions to reduce their risk of developing lung cancer.

  • Conclusion & Limitations:

In conclusion, extending the analysis to lung cancer shows how clustering can surface risk-related patterns in medical data, and building on the diabetes analysis demonstrates that the same methodology carries over to a different disease. The study has limitations, however: the data may be incomplete, the sample may not be representative of the wider population, and KMeans results are sensitive to choices such as the number of clusters and the initial centroid positions. Addressing these limitations through further research and data refinement will be crucial for enhancing the accuracy and applicability of predictive models in healthcare decision-making.
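One common way to probe the sensitivity to the number of clusters is to compare silhouette scores across several values of k. The sketch below does this on synthetic data. It is a suggestion for follow-up work under assumed settings, not something the original analysis is known to have done, and the data and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three made-up groups, not real patient records.
rng = np.random.default_rng(2)
centers = np.array([[40.0, 0.2, 1.0], [58.0, 0.5, 3.0], [75.0, 0.9, 6.0]])
spread = np.array([3.0, 0.05, 0.5])
X = np.vstack([rng.normal(c, spread, size=(30, 3)) for c in centers])
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for several candidate k values and score each partition.
# Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)

best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("best k by silhouette:", best_k)
```

On real data the scores are usually closer together than on clean synthetic groups, so the choice of k is best treated as a judgment informed by this kind of check rather than decided by it.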

  • Github:

https://github.com/larissakimberly4/INST-414-spr22.git

  • Data source:

https://www.kaggle.com/datasets/erdemtaha/cancer-data
