Using Classification to Predict Breast Cancer Diagnoses
Detecting cancer effectively is an extremely important task that can drastically affect people’s lives. Machine learning could help improve how accurately cancer is detected. This article focuses on breast cancer, which is one of the most common types of cancer. The question we are trying to answer is: can we predict breast cancer diagnoses using numerical features of the tumor? Classification will be used to answer this question.
The main stakeholders of this project are doctors and the people being tested for cancer. For a doctor’s office or hospital, accurate cancer diagnoses are essential. Getting these diagnoses wrong could have ramifications for both the practice and its patients. If a practice keeps misdiagnosing cancer, people will be hesitant to go there for help. For the patients, the result of a cancer diagnosis could mean life or death. If something is missed, they could go on with their lives while having cancer, and it could be too late by the time the missed diagnosis is eventually found. The decision this analysis will influence is whether a patient should be treated for cancer. The goal is an accurate way to tell whether a patient may have cancer, and whether more testing should be done or treatment should be started.
The data that would help answer this question is patient data from people who were diagnosed with cancer and people who were not. The fields needed are things like the size and shape of the tumor, symptoms, and other measurable aspects of the cancer. We would then want to know whether each patient was diagnosed with cancer or not. This type of dataset would answer our question because we can use the numerical features and the final diagnosis to train our detection model. The model should then be able to tell whether a new set of measurements indicates cancer or not.
To get this specific data, I was able to find a breast cancer study from the University of Wisconsin. This dataset, which was donated to the UC Irvine Machine Learning Repository in 1995, includes more than enough data to complete my analysis. The researchers measured numerical attributes of cell nuclei in images of breast tissue samples. In total, there are 30 numerical features for each of the 569 patients. Ten characteristics were measured, including radius, texture, perimeter, area, and smoothness. Each characteristic is reported in three ways: the mean, the standard error, and the “worst” (largest) value, which is why there are several measurements for each one. The dataset also includes whether the cells were cancerous or not, recorded as either malignant or benign.
This dataset did not require any cleaning. There are no features that seem extra or unneeded, and the repository I got it from states that there are no missing values, so we don’t have to worry about removing those. I ran into no errors or problems with the values or structure of the dataset. The dataset is also part of sklearn, so I didn’t need to download any files; I could load it straight from the sklearn library. The only problem I ran into was adjusting the max_iter value, which is the maximum number of iterations the model is allowed to run while fitting. A higher value gives the model more room to converge, but it can also increase the computational resources the model uses. Once I adjusted this, I got no errors or warnings.
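As a quick check of the points above, here is a minimal sketch of loading the dataset from sklearn and confirming its size and the absence of missing values. The variable names are my own choices for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

# Load the copy of the Wisconsin diagnostic dataset bundled with scikit-learn.
data = load_breast_cancer()
X, y = data.data, data.target

# Number of samples and features.
print(X.shape)

# Count of missing (NaN) values across the whole feature matrix.
missing = int(np.isnan(X).sum())
print(missing)

# Class names, listed in the order of their integer labels (0, then 1).
print(list(data.target_names))
```

Printing target_names is worth doing before interpreting any predictions, since it shows which class each integer label stands for.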
To do this analysis, I will be using a classification model because we are predicting whether a person has cancer or not. This is a binary classification. In terms of the dataset, we are determining whether the given cells are benign or malignant. In sklearn’s copy of the dataset, malignant is encoded as 0 and benign as 1.
We will be using all of the features corresponding to each patient in the dataset. As described above, each patient’s cells are described by a set of 30 features: ten characteristics, each reported as a mean, a standard error, and a worst (largest) value. The ten characteristics are radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension. These will be used in our supervised model.
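A minimal sketch of training on all 30 features might look like the following. The article does not name the classifier, so LogisticRegression is an assumption here (suggested only by the max_iter setting), and the split size, random seed, and max_iter value are illustrative choices rather than the ones actually used.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# All 30 features and the benign/malignant labels.
X, y = load_breast_cancer(return_X_y=True)

# Hold out part of the data for evaluation (size and seed are illustrative).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Assumed model. max_iter is raised above the default of 100 so the solver
# has room to converge on these unscaled features.
model = LogisticRegression(max_iter=10000)
model.fit(X_train, y_train)

# Predictions on the held-out patients, and the share predicted correctly.
y_pred = model.predict(X_test)
test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

Scaling the features first (for example with StandardScaler) usually lets the solver converge in far fewer iterations, which is another way to deal with a max_iter warning.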
After training the model on the data, I noticed that it still makes a small number of errors. Below are five examples where the model was wrong:
- Patient 20: Predicted cancer, but no cancer diagnosis
- Patient 58: Predicted cancer, but no cancer diagnosis
- Patient 77: Predicted cancer, but no cancer diagnosis
- Patient 82: Predicted cancer, but no cancer diagnosis
- Patient 112: Predicted no cancer, but cancer diagnosis
In the above examples, four of the five errors are false positives. This may seem bad, but in cancer diagnosis a false positive is generally less harmful than a false negative, where a patient is left with undetected cancer. These false positive cases most likely had measurements that resembled those of actual positive cases. The same goes for the false negative, which most likely had measurements similar to those of a non-cancerous sample.
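To show how errors like these can be separated into false positives and false negatives, here is a small sketch using sklearn’s confusion_matrix. The label arrays are toy values made up for illustration, not the results from this analysis, and they use 1 for cancer and 0 for no cancer.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels for illustration only: 1 = cancer, 0 = no cancer.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([0, 1, 1, 0, 0, 1, 1, 0])

# With labels=[0, 1], ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
print(tn, fp, fn, tp)

# Positions where the prediction and the diagnosis disagree.
wrong = np.flatnonzero(y_true != y_pred)
for i in wrong:
    kind = "false positive" if y_pred[i] == 1 else "false negative"
    print(f"Sample {i}: {kind}")
```

Which class counts as “positive” depends on the label encoding, so it is worth checking the dataset’s target names before reading a confusion matrix this way.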
Overall, this model does answer my question. With an f1-score of 0.97, we can conclude that it can predict the diagnosis of breast cancer using numerical cell measurements. The caveat is that a score of 0.97 is not perfect. The f1-score combines precision and recall, so it is not the same as saying that exactly 3% of people are misdiagnosed, but it does mean that a small share of patients still receive a wrong prediction. A false diagnosis could be extremely impactful to a person’s life, so people using this model would need to weigh the importance and impact of those errors.
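Since the f1-score is easy to confuse with accuracy, here is a short worked example of how it is computed. The counts below are hypothetical and chosen only to show the arithmetic; they are not the results from this analysis.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and f1 from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical counts for illustration only.
tp, fp, fn, tn = 50, 4, 1, 88

precision, recall, f1 = precision_recall_f1(tp, fp, fn)
accuracy = (tp + tn) / (tp + fp + fn + tn)

print(f"precision={precision:.3f} recall={recall:.3f}")
print(f"f1={f1:.3f} accuracy={accuracy:.3f}")
```

In this made-up case the f1-score and the accuracy come out as different numbers, because the f1-score ignores true negatives. The share of patients misdiagnosed is one minus the accuracy, not one minus the f1-score.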
The biggest limitation of this analysis is the size of the study. The dataset has 569 samples, and only some of those are malignant. That is a meaningful amount, but a truly accurate cancer predictor would need many more samples. The data also comes from a study done in 1993, so the data collection may not be as accurate as it would be today. Technology has improved tremendously, so there could be new and more accurate ways to measure these types of data.
Data Sources: https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html and https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic
GitHub Link to Code: https://github.com/ltwalsh/walshINST414module6