Survival After Cancer Diagnosis

Jonathan Mendoza
Can It Be Predicted?
Oct 25, 2019

According to the CDC, the number of people diagnosed with cancer and the number of deaths due to cancer have both steadily increased over the last few decades. Can we use machine learning to identify the individuals at higher risk of an early death?

Every year, about 0.45% of the United States’ ever-growing population is diagnosed with cancer. This means that by the end of 2020 there will be roughly 1.5 million new cases of cancer and about 600 thousand cancer-related deaths in the U.S. The uncertainty that comes with a cancer diagnosis leaves families in disarray. “How long do I have?” is one of the questions that a doctor may not be able to answer precisely, due to many unknown factors and the limits of human cognition. Over-treatment and over-medication of patients are also serious problems, but they could potentially be reduced by tracking and prioritizing treatment based on data instead of convention.

Photo by Kendal on Unsplash

Using machine learning algorithms, it may be possible to help doctors make decisions based on the many observations they obtain from a patient. By using these techniques we can chip away at the human error inherent in a diagnosis, and provide doctors the means to set data-backed expectations for their patients. Using data to flag individuals who are at high risk of mortality could help doctors reinforce and/or guide their decision making and recommendations to patients.

The goal of this research is to see if it’s possible to estimate the life-expectancy range of a patient. The dataset used in this study was obtained from the National Cancer Institute (a division of the National Institutes of Health), and records from 1985 to 2005 were used in the analysis. Using an ensemble of machine learning algorithms, I achieved a prediction accuracy of 70%. When referring to cancer statistics it is common to talk about the five-year survival rate, so the target ranges for prediction were: less than five years (“Low”), 5–15 years (“Medium”), and 15+ years (“High”).
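To make the three target ranges concrete, here is a minimal sketch of how survival time could be mapped to the “Low”, “Medium”, and “High” labels. The function name `survival_bucket` and the example values are hypothetical, not taken from my project code, and the sketch assumes that a patient who survives exactly 15 years falls in “High”, since the ranges above overlap at that boundary.

```python
def survival_bucket(years_survived: float) -> str:
    """Map survival time in years to one of the three target classes."""
    if years_survived < 0:
        raise ValueError("survival time cannot be negative")
    if years_survived < 5:
        return "Low"
    if years_survived < 15:  # assumption: exactly 15 years counts as "High"
        return "Medium"
    return "High"


# Hypothetical survival times in years, for illustration only.
example_years = [0.5, 4.9, 5.0, 12.0, 15.0, 22.3]
example_labels = [survival_bucket(y) for y in example_years]
print(example_labels)
```

Turning a continuous survival time into three classes is what makes this a classification problem, which is why the result above is reported as an accuracy.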

I found that age is a big factor in the survivability of a patient. The older a patient is at the time of diagnosis, the lower their life expectancy. There is clearly an inverse relationship between age and life expectancy, so it is important that we attempt to collect as much data as possible on a person during regular checkups. Gathering and analyzing this data at an earlier age will allow for the early detection and treatment of cancer.

This heatmap shows that people aged 40–60 are diagnosed with cancers of all types more often than any other age group. Conventional knowledge states that you should be screened for certain cancers starting at age 50, but by that time, if cancer is present, it may have progressed much further than it would have if screening had started at ages 30–40. Many of the factors that helped my model predict the survivability of a patient are related to the extent of the cancer’s progression, which leads me to believe that cancer screening should be performed earlier and more consistently. Our vigilance against cancer is the best defense; the earlier a cancer is caught and treated, the better the outlook for the patient’s quality of life and life expectancy.
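For readers curious about what sits behind a heatmap like this, the sketch below counts diagnoses per age band and cancer site. It is only an illustration: the records are made up, the ten-year band width is an assumption, and the names `band_start`, `records`, and `grid` are hypothetical rather than taken from my project.

```python
from collections import Counter


def band_start(age: int, width: int = 10) -> int:
    """Return the lower edge of the age band that contains `age`."""
    return (age // width) * width


# Made-up (age at diagnosis, cancer site) pairs, for illustration only.
records = [
    (45, "Breast"),
    (52, "Lung"),
    (58, "Lung"),
    (47, "Breast"),
    (33, "Colon"),
    (71, "Lung"),
]

# Count how many diagnoses fall into each (age band, site) cell.
counts = Counter((band_start(age), site) for age, site in records)

bands = sorted({band for band, _ in counts})
sites = sorted({site for _, site in counts})

# One row per age band, one column per cancer site; empty cells count as 0.
grid = [[counts[(band, site)] for site in sites] for band in bands]

for band, row in zip(bands, grid):
    print(f"{band}-{band + 9}", dict(zip(sites, row)))
```

A grid of counts like this is the kind of input a plotting library needs to draw a heatmap, with age bands on one axis and cancer sites on the other.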

Interesting findings:

— Females have a slightly higher average survival

— More than 1 tumor significantly decreases life expectancy

— Patients on whom cancer-directed surgery was performed had a higher life expectancy, although surgery at some cancer sites may still correlate with a lower life expectancy
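Comparisons like the ones above come down to averaging survival within groups. The following is a minimal sketch of that idea, not my actual analysis code: the field names (`sex`, `surgery`, `years`) and the helper `mean_survival_by` are hypothetical, and the four records are invented purely to exercise the function, so the numbers they produce say nothing about the real data.

```python
from collections import defaultdict
from statistics import mean


def mean_survival_by(records, key):
    """Average the 'years' field of the records within each value of `key`."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["years"])
    return {group: mean(values) for group, values in groups.items()}


# Invented patient records, for illustration only.
records = [
    {"sex": "F", "surgery": True, "years": 12.0},
    {"sex": "M", "surgery": True, "years": 10.0},
    {"sex": "F", "surgery": False, "years": 4.0},
    {"sex": "M", "surgery": False, "years": 2.0},
]

by_sex = mean_survival_by(records, "sex")
by_surgery = mean_survival_by(records, "surgery")
print(by_sex)
print(by_surgery)
```

Group averages like these describe correlations only. A difference between groups does not by itself show that surgery, for example, is what extended a patient's life.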

In the not-too-distant future, we will be using machine learning and artificial intelligence to help us make smarter decisions in every facet of life, including improving our healthcare system. Given enough data points, I believe it is possible to predict whether a person might be diagnosed with cancer, but that’s for another “Can It Be Predicted?”

You can find the code for my project on GitHub.
