Predicting Lung Cancer

Michael Campbell
INST414: Data Science Techniques
May 12, 2022

Have you ever wondered whether self-diagnosis ever gives correct results? Since the emergence of the internet, everyone's first thought after feeling sick is to look up their symptoms online. I thought it would be interesting to try to predict whether someone has lung cancer based on their symptoms. To preface this: a model like this will probably never replace a doctor, but it could be useful for encouraging people to see one. While looking through Kaggle, I found a survey dataset that asks people whether they have certain symptoms and whether they do things that are said to cause lung cancer. The dataset has 15 features and 309 entries. With this data, I hope to fit a logistic regression to see if we can predict whether someone has lung cancer from those features.

To start, we want to check for null values and duplicates, and convert the object (string) columns to integers.

print(df.isnull().sum())
df = df.dropna()
print('Duplication Count:', df.duplicated().sum())
df = df[~df.duplicated()]
df['GENDER'] = df['GENDER'].replace({'M': 1, 'F': 0})
df['LUNG_CANCER'] = df['LUNG_CANCER'].replace({'YES': 1, 'NO': 0})
print(df['GENDER'].value_counts())
print(df['LUNG_CANCER'].value_counts())

After running this, we find that there are no null values in the dataset, but there are 33 duplicate rows. Once those rows have been dropped, we can see the general distribution of all the features in the dataset.

In order to predict lung cancer, we need to create a new DataFrame that contains only the symptom columns. The symptoms listed in this dataset are allergy, wheezing, chest pain, difficulty swallowing, yellow fingers, coughing, shortness of breath, and fatigue. With these features, we can now split the data and run the logistic regression from sklearn.
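As a minimal sketch of that step: the toy rows below stand in for the real survey (which encodes answers numerically across 15 columns), and the column names are my assumptions, so they should be matched against the actual file headers.

```python
import pandas as pd

# Toy stand-in rows for the Kaggle survey; column names are assumptions.
df = pd.DataFrame({
    'GENDER':      [1, 0, 1, 0],
    'WHEEZING':    [2, 1, 2, 1],
    'COUGHING':    [2, 2, 1, 1],
    'FATIGUE':     [1, 2, 2, 1],
    'LUNG_CANCER': [1, 1, 0, 0],
})

# Keep only the symptom columns as features; LUNG_CANCER is the target.
symptom_cols = ['WHEEZING', 'COUGHING', 'FATIGUE']  # extend with the rest
X = df[symptom_cols]
y = df['LUNG_CANCER']
print(X.shape)  # one row per survey response, one column per symptom
```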

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)
Logistic = LogisticRegression(max_iter=2000)
Logistic.fit(X_train, y_train)
train_score = Logistic.score(X_train, y_train)
test_score = Logistic.score(X_test, y_test)
print("Training Data Score:", train_score)
print("Testing Data Score:", test_score)

This gives us the following results:

Training Data Score: 0.8985507246376812

Testing Data Score: 0.855072463768116

After running this, I wondered whether a KNeighborsClassifier would give me the same answer, so I ran it with the same data split and got:

Training Data Score: 0.9178743961352657

Testing Data Score: 0.8115942028985508
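For reference, the KNN run was along these lines. Synthetic data from make_classification stands in for the survey features here, so the scores it prints won't match the numbers above; n_neighbors is left at sklearn's default of 5.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the 276-row, 8-symptom feature matrix.
X, y = make_classification(n_samples=276, n_features=8, random_state=5)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5)

knn = KNeighborsClassifier()  # default n_neighbors=5
knn.fit(X_train, y_train)
print("Training Data Score:", knn.score(X_train, y_train))
print("Testing Data Score:", knn.score(X_test, y_test))
```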

This led me to believe that there was a problem with the data, since that was a pretty big gap between the training and test scores. Looking back at the dataset, I found that the classes are not evenly distributed: 238 people have cancer and only 38 don't. After some thought, I decided it would be better to proceed with the logistic regression rather than the classifier. One way to combat the uneven distribution would be to reduce the dataset in some way, but my concern is that the dataset is already too small. Looking at the cases where the logistic regression got the wrong answer, the top five misclassifications were all people who didn't have cancer.
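The kind of reduction I had in mind is undersampling the majority class. A minimal sketch, using a toy target column that mirrors the dataset's 238-vs-38 split (pandas `sample` does the downsampling):

```python
import pandas as pd

# Toy target column mirroring the 238 cancer / 38 non-cancer split.
df = pd.DataFrame({'LUNG_CANCER': [1] * 238 + [0] * 38})

# Undersample the majority class down to the minority-class size.
minority = df[df['LUNG_CANCER'] == 0]
majority = df[df['LUNG_CANCER'] == 1].sample(n=len(minority), random_state=5)
balanced = pd.concat([majority, minority])
print(balanced['LUNG_CANCER'].value_counts())  # 38 of each class
```

An alternative that avoids shrinking an already-small dataset is passing class_weight='balanced' to LogisticRegression, which reweights the classes instead of dropping rows.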

My conclusion is that this is due to the low number of non-cancer participants in the dataset. Models like this could be used to predict whether someone has lung cancer, but there is always the risk of false negatives. Doctors could use a survey like this to decide whether a patient should come in sooner for a checkup, or someone could use it to self-diagnose, but I don't think that would be as viable a use case.
