Impact of Tumor Characteristics on Breast Cancer Types

Anapeshku
INST414: Data Science Techniques
12 min read · May 16, 2024

Question Motivation and Stakeholders

Breast cancer is the most common form of cancer in the world. According to the Centers for Disease Control and Prevention, roughly 240,000 cases of breast cancer are diagnosed annually in women and 2,100 in men in the United States. The American Cancer Society predicts that this number will rise to 310,720 cases in 2024, with an estimated 42,250 cases resulting in death. The most effective way to lower the mortality rate for breast cancer is to monitor and treat the affected area during the earliest stages. To do so, patients should take notice of any changes or discomfort in their chest area and report this information to their primary care provider. From this point, if the abnormality could be cancer, a mammogram and further testing can be done to reach a diagnosis. Sometimes a lump found in that area is just a benign growth that will not spread and has a low likelihood of causing serious issues. However, if a growth is malignant, more intensive treatment is typically needed.

To inform this process, several questions are examined through different analytical techniques: basic exploratory data analysis, similarity metrics, and classification. The exploratory analysis asks whether there is a correlation between the size of a lump and the chance of it being malignant. Since the first step in determining whether a lump is cancerous is typically self-observation, some women may disregard a small lump and prioritize larger abnormalities in the breast tissue. However, this mentality could allow a malignant tumor to progress to a later stage, which results in a higher chance of death. Additionally, breast cancer can affect people regardless of gender and has multiple stages that are typically categorized by the size of the tumor and other factors. Exploring whether there is a relationship between size and malignancy can help inform people about when to get a lump or abnormality in the breast checked and when to be concerned about finding such growths. With this knowledge, people can make more informed health decisions, especially when doing self-assessments before visiting an oncologist.

Suppose someone decides to take the next step and visit their oncologist. In that case, the similarity metric and classification portions of the analysis are meant to make the doctor’s diagnostic process more efficient. The exploratory analysis focused on tumor size, so to inform oncologists and other medical professionals about potential correlations between tumor characteristics and tumor type, these portions incorporate additional features to produce more accurate and comprehensive results. The similarity metric question asks: which tumors, based on their characteristics, are closest in Euclidean distance to a malignant observation and to a benign observation? By examining the tumors that are most similar to a malignant example, their characteristics can be studied to see whether there is any correlation between tumor type and a specific feature. For example, if all of the tumors most similar to the malignant case have similar roughness metrics, then tumor texture may have an impact on the type of tumor that a patient is suffering from. An oncologist examining a tumor with that roughness metric could then factor it into their judgment of the likelihood of malignancy. The same process applies to the benign example: if all the cases most similar to the benign reference example share a characteristic, then finding that characteristic in a tumor could support a benign diagnosis.

By incorporating this characteristic information into the diagnostic process, an accurate diagnosis can be reached more quickly. Speed and efficiency are crucial here: if a tumor is quickly diagnosed as malignant, treatment can begin earlier, which leads to a higher chance of it being effective. The classification portion is meant to assist oncologists in a similar way. It asks: can tumor type be correctly classified given characteristics such as smoothness, compactness, concavity, and texture? If tumor type can be accurately predicted from these metrics, then oncologists would be able to monitor those characteristics in a patient’s tumor to help determine the type.

Data Type, Collection, and Cleaning

Answering the above questions requires a dataset that includes various tumor characteristics along with the diagnosis of each tumor (benign or malignant), because every question studies some relationship between characteristic and type. For the first question, the information comes from this Kaggle dataset: https://www.kaggle.com/datasets/yasserh/breast-cancer-dataset. It includes columns stating whether the case was malignant or benign, along with the radius of the lobes, the mean surface texture (represented numerically), the perimeter, area, smoothness (represented numerically), compactness, concavity, and symmetry of the tumor. For the purposes of this question, size is measured using radius, which keeps the analysis consistent with pathological testing, where tumors are typically measured by diameter. The dataset includes 570 cases and showed no signs of mis-entered values or corruption, so no cleaning was necessary. All of the values matched their variable type, and none showed extreme variation that would require removing a row or column from the table.

To expand on the exploratory analysis and answer the similarity metric and classification questions, the breast cancer dataset from sklearn was used. Its source can be found here: https://archive.ics.uci.edu/dataset/17/breast+cancer+wisconsin+diagnostic. This dataset was used for the extended analysis because it includes characteristics beyond those in the Kaggle file, such as concave points. As with the Kaggle dataset, no cleaning was necessary: the file was examined for extreme outlier values that would indicate corruption or mislogged values, and none were found. The data could be imported and analyzed directly.

Methods Used

The project builds upon methods that particularly focus on surface-level exploratory analysis of data. To dive deeper, we are incorporating KNN classification and similarity metric analysis.

KNN classification will aid in predicting tumor diagnosis based on various characteristics. By utilizing this method, we aim to understand how different tumor attributes correlate with malignancy probabilities. Additionally, similarity metrics will be employed to compare benign and malignant tumors, seeking defining characteristics that distinguish the two groups.

Furthermore, we plan to integrate a dataset from sklearn, adding real-life cancer cases to the original dataset and incorporating additional characteristics. This expansion enables a more comprehensive analysis, exploring correlations between tumor features and malignancy probabilities.

Analysis, Models, and Evaluation

As mentioned before, we employed multiple analytical techniques for our analysis. First, we conducted a basic exploratory data analysis, which involved calculating the mean, standard deviation, and minimum and maximum values of the radius for both malignant and benign tumors. The purpose of doing this initial analysis was to help us understand what the characteristics of malignant and benign tumors are, to help us see patterns in the data, and also to determine whether there is a relationship between tumor size and diagnosis. To conduct this analysis, we first added the malignant and benign tumors into two separate lists. Then, we iterated through each list to calculate the mean, SD, minimum, and maximum and display those statistics in a dataframe.
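
The split-and-summarize step described above can be sketched roughly as follows. This is a minimal illustration rather than the project's actual code: the rows are made-up toy values, the keys `diagnosis` and `radius_mean` are assumed column names, and the sample standard deviation (`statistics.stdev`) is an assumption, since the write-up does not say which variant was used.

```python
from statistics import mean, stdev

# Toy rows standing in for the CSV records. The keys and values are
# hypothetical and chosen only to make the example easy to check by hand.
cases = [
    {"diagnosis": "M", "radius_mean": 18.0},
    {"diagnosis": "M", "radius_mean": 20.0},
    {"diagnosis": "M", "radius_mean": 13.0},
    {"diagnosis": "B", "radius_mean": 12.0},
    {"diagnosis": "B", "radius_mean": 9.0},
    {"diagnosis": "B", "radius_mean": 15.0},
]

def radius_summary(rows):
    """Return mean, sample SD, min, and max of the radius for a list of cases."""
    radii = sorted(row["radius_mean"] for row in rows)
    return {
        "mean": mean(radii),
        "sd": stdev(radii),   # sample standard deviation (assumption)
        "min": radii[0],      # sorted ascending, so the first value is the smallest
        "max": radii[-1],
    }

# Split the cases into two lists by diagnosis label.
malignant = [row for row in cases if row["diagnosis"] == "M"]
benign = [row for row in cases if row["diagnosis"] == "B"]

summary = {
    "malignant": radius_summary(malignant),
    "benign": radius_summary(benign),
}
for label, stats in summary.items():
    print(label, stats)
```

Filtering on the diagnosis key in one place keeps the two lists from accidentally sharing cases, which is the main failure mode for this kind of split.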

Next, to address our motivating question regarding the correlation between tumor characteristics and the likelihood of malignancy, we conducted an analysis employing a K-Nearest Neighbors (KNN) classification approach alongside similarity metric analysis. Initially, we prepared the data using the breast cancer dataset from sklearn, which contains various features related to tumor characteristics such as radius, texture, perimeter, and more.

This dataset was divided into training and testing sets to help with model training and evaluation. With KNN classification, we predicted tumor diagnoses based on the features provided. For each test data point, we calculated the Jaccard similarity coefficient with all training data points to assess the similarity between tumors. Then, we selected the top K most similar cases as neighbors and predicted the tumor type for each test data point based on the most common tumor type among its neighbors.
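
The neighbor-voting procedure can be illustrated with a short sketch. The write-up does not say how the Jaccard coefficient was applied to continuous features, so the generalized (weighted) Jaccard below, defined as the sum of element-wise minimums over the sum of element-wise maximums, is an assumption, as are the function names and the toy feature vectors.

```python
from collections import Counter

def weighted_jaccard(a, b):
    """Generalized Jaccard similarity for non-negative numeric vectors.

    Assumption: this is one common way to extend Jaccard to continuous
    features; it may differ from what the project actually computed.
    """
    numerator = sum(min(x, y) for x, y in zip(a, b))
    denominator = sum(max(x, y) for x, y in zip(a, b))
    return numerator / denominator if denominator else 1.0

def knn_predict(train_X, train_y, query, k=3, similarity=weighted_jaccard):
    """Predict the label of `query` by majority vote among its k most similar cases."""
    scored = sorted(
        zip(train_X, train_y),
        key=lambda pair: similarity(pair[0], query),
        reverse=True,  # highest similarity first
    )
    top_labels = [label for _, label in scored[:k]]
    return Counter(top_labels).most_common(1)[0][0]

# Toy training data: [radius, concavity] pairs with made-up values.
train_X = [
    [18.0, 0.15], [20.0, 0.18], [19.0, 0.16],
    [11.0, 0.03], [12.0, 0.02], [10.0, 0.04],
]
train_y = ["M", "M", "M", "B", "B", "B"]

print(knn_predict(train_X, train_y, [19.5, 0.17]))
print(knn_predict(train_X, train_y, [11.5, 0.03]))
```

In this toy example the radius values are far larger than the concavity values, so they dominate the similarity score. With unscaled features, any sum-based similarity will behave this way, which is worth keeping in mind when interpreting the accuracy of such a model.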

Following the prediction phase, we evaluated the model’s performance by comparing the predicted tumor types with the actual tumor types in the testing set. This evaluation allowed us to calculate the accuracy of the predictions, which gives a measure of the predictive power of the selected features and their relationship to whether the tumor is malignant or benign. Ultimately, our analysis yielded a prediction accuracy of approximately 0.40. This indicates weak performance, since it falls below the 0.50 that random guessing between two equally likely classes would be expected to achieve. Even so, the analysis offered some insight into the correlation between tumor characteristics and malignancy, which can inform future research on tumor characteristics and classification.
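
As a rough sketch of this evaluation step, accuracy can be computed alongside a simple baseline that always predicts the most common training label. The label lists below are made up for illustration, and `majority_baseline` is a hypothetical helper name, not something from the project's repository.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)

def majority_baseline(y_train, y_test):
    """Accuracy obtained by always predicting the most common training label."""
    most_common_label = Counter(y_train).most_common(1)[0][0]
    return accuracy(y_test, [most_common_label] * len(y_test))

# Toy labels (hypothetical): "M" is malignant, "B" is benign.
y_train = ["B", "B", "B", "M", "M"]
y_test = ["B", "M", "B", "B", "M"]
y_pred = ["B", "B", "M", "B", "B"]

print(accuracy(y_test, y_pred))
print(majority_baseline(y_train, y_test))
```

Comparing the model against a baseline like this makes it easier to tell whether an accuracy figure reflects real predictive signal.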

Additionally, we loaded the breast cancer dataset from sklearn into a pandas DataFrame and made sure any missing values were appropriately handled. Then, we used Euclidean distance as the similarity metric, calculating the distance between a target case and every other case in the dataset. This approach allowed us to identify the top 10 most similar cases to the target case based on their feature profiles. The top 10 most similar cases to a malignant example are patient IDs 337, 254, 56, 70, 300, 24, 218, 252, and 256. The top 10 most similar cases to a benign example are patient IDs 220, 423, 523, 308, 93, 49, 526, 224, and 442.
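
A nearest-case lookup of this kind can be sketched with the standard library alone. The feature vectors below are made-up toy values and `most_similar` is a hypothetical helper name. The project itself worked on a pandas DataFrame, so this is only an illustration of the ranking logic.

```python
import math

def most_similar(cases, target_index, k=10):
    """Return the indices of the k cases closest to the target in Euclidean distance."""
    target = cases[target_index]
    distances = [
        (math.dist(target, features), index)
        for index, features in enumerate(cases)
        if index != target_index  # do not compare the target with itself
    ]
    # Sorting the (distance, index) pairs puts the smallest distances first.
    return [index for _, index in sorted(distances)[:k]]

# Toy feature vectors: [radius, smoothness, concavity] (hypothetical values).
cases = [
    [17.9, 0.118, 0.300],
    [20.5, 0.085, 0.087],
    [12.0, 0.090, 0.030],
    [18.2, 0.110, 0.250],
    [11.5, 0.095, 0.025],
]

print(most_similar(cases, target_index=0, k=2))
```

Excluding the target from its own ranking matters, because it would otherwise always appear first with a distance of zero.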

We uncovered relationships and patterns inherent in the dataset. By analyzing similarities between cases, we figured out some of the underlying structure of tumor characteristics and their implications for malignancy likelihood. This approach provided an understanding of the data’s inherent similarities, laying the groundwork for further analysis.

In summary, our utilization of similarity metric analysis provided valuable insights into the relationships and patterns present within the breast cancer dataset. By quantifying similarities between tumor cases, we gained a deeper understanding of the dataset’s structure and its relevance to tumor diagnosis. Through this analysis, we contributed to the exploration of tumor characteristics and their correlation with malignancy, which then helps in further investigation and potential advancements in clinical decision-making processes.

Outcome

It was found that the average radius of malignant tumors was ~15.55, while benign tumors had an average of 13.28. Malignant tumors also had a larger standard deviation, 4.131, compared with 2.79 for benign tumors. The largest malignant tumor in this dataset had a radius of 28.11, while the largest benign tumor had a radius of 21.16. The smallest sizes found were 7.691 for malignant and 6.981 for benign.

These values were found by creating two lists of dictionaries, one for benign cases and one for malignant cases, with each dictionary representing a different breast cancer case. After these lists were created, only the radius value was taken into account, and the statistics above were calculated from it. More information on the code can be found at this github link: https://github.com/anapetsmart/INST414/blob/main/module1.py

An issue that may occur when performing this analysis is creating the lists incorrectly. If the data is loaded incorrectly, or someone accidentally splits on the wrong dictionary value, any subsequent analysis would be inaccurate because it would include cases that were not relevant to it. Additionally, the data is sorted to easily find the highest and lowest sizes in each category. Someone unfamiliar with sorting by dictionary values in this manner could run into issues when comparing the largest and smallest tumors. The easiest way to avoid these issues is to practice manipulating lists of dictionaries before dealing with larger datasets such as this one. A table of all the values mentioned is posted below:

Statistic            Malignant   Benign
Mean radius          ~15.55      13.28
Standard deviation   4.131       2.79
Maximum radius       28.11       21.16
Minimum radius       7.691       6.981

These values show that malignant tumors, on average, tend to be larger than benign tumors. In every category measured, the malignant value was larger. The most striking gap was between the largest cases: the largest malignant tumor had a radius 6.95 greater than the largest benign tumor (28.11 versus 21.16).

When examining the similarity metrics, the cases most similar to the malignant example had an average smoothness of 0.102629, compared with a mean smoothness of 0.092758 for the cases most similar to the benign example. This is a difference of 0.009871, suggesting that smoothness may be a factor in malignancy diagnosis. Concavity showed a larger gap between the two groups: the malignant-similar cases averaged 0.17038, while the benign-similar cases averaged 0.0291466, a difference of 0.1412334. Texture averages did not differ notably between the two groups.
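
The group comparison behind these averages can be sketched as below. The rows are made-up toy values rather than the actual neighbor sets, and `feature_gap` and the key names are hypothetical.

```python
from statistics import mean

def feature_gap(group_a, group_b, feature):
    """Return the mean of `feature` in each group and the difference between them."""
    mean_a = mean(row[feature] for row in group_a)
    mean_b = mean(row[feature] for row in group_b)
    return mean_a, mean_b, mean_a - mean_b

# Toy neighbor sets (hypothetical values): the cases nearest to a malignant
# example and the cases nearest to a benign example.
malignant_neighbors = [
    {"smoothness": 0.10, "concavity": 0.16},
    {"smoothness": 0.11, "concavity": 0.18},
]
benign_neighbors = [
    {"smoothness": 0.09, "concavity": 0.02},
    {"smoothness": 0.10, "concavity": 0.04},
]

for feature in ("smoothness", "concavity"):
    mean_m, mean_b, gap = feature_gap(malignant_neighbors, benign_neighbors, feature)
    print(f"{feature}: malignant={mean_m:.4f} benign={mean_b:.4f} gap={gap:.4f}")
```

A raw difference in means depends on each feature's scale, so gaps for different features are not directly comparable without some form of normalization.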

The KNN classification process tended to classify tumors as benign when the case was actually malignant. This is a large cause for concern because a malignant tumor mislabeled as benign could lead to the illness progressing due to lack of appropriate treatment. When examining the model output, none of the characteristics appeared to be outliers that would confuse the model or explain the inaccuracy. In future iterations of the classification model, the accuracy rate of less than 50% would need to be addressed, along with the model’s over-classification of benign types.
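
One way to quantify this kind of error is to count false negatives separately from overall accuracy. The sketch below uses made-up labels, and the helper names `confusion_counts` and `recall` are hypothetical. It treats malignant ("M") as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="M"):
    """Count true/false positives and negatives, treating `positive` as the target class."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(y_true, y_pred):
        if truth == positive:
            # A malignant case predicted benign is a false negative.
            counts["tp" if guess == positive else "fn"] += 1
        else:
            counts["fp" if guess == positive else "tn"] += 1
    return counts

def recall(counts):
    """Share of truly positive (malignant) cases that the model caught."""
    positives = counts["tp"] + counts["fn"]
    return counts["tp"] / positives if positives else 0.0

# Toy labels (hypothetical values).
y_true = ["M", "M", "M", "B", "B", "B"]
y_pred = ["B", "M", "B", "B", "B", "M"]

counts = confusion_counts(y_true, y_pred)
print(counts)
print(recall(counts))
```

For a screening setting, recall on the malignant class is arguably more informative than accuracy alone, because it directly measures how many malignant cases were missed.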

Decision Insights

This analysis can help oncologists, and patients doing self-assessments, judge whether a tumor is likely to be malignant. The exploratory results imply that people should be more concerned about a larger tumor because there is a higher likelihood of it being malignant. This does not mean that smaller abnormalities in breast tissue should be ignored. As the data shows, there are malignant tumors on the smaller end of the spectrum, so a small lump should be examined early, before a malignant growth has the chance to progress. Also, even though benign growths tend to need less extensive treatment than malignant ones, someone should still seek treatment if the result is benign.

The similarity metric results suggest that doctors should pay attention to concavity and smoothness. The averages for those metrics differed noticeably between the two groups of similar cases, so during the diagnostic process an oncologist could monitor tumor concavity and smoothness as potential indicators of malignancy. Since texture did not show a comparable difference in averages between the two groups, an oncologist could choose to spend less time examining tumor texture, as it is less likely to be an indicator of tumor type.

Additionally, the classification example emphasized the importance of accurate tumor classification and highlighted that additional features may need to be taken into account in the diagnostic process. The KNN classification algorithm reached only 40% accuracy when basing the classification solely on tumor characteristics and no other attributes of the patient.

Limitations

There were some limitations to this analysis. Radius was used to stay consistent with how tumors are measured in pathology labs, but other factors were not considered because the exploratory question was focused on size. When doing both self-assessments and clinical testing for tumors, other factors that contribute to overall health should be taken into account. The Kaggle and sklearn datasets did not include information about vital signs, whether a patient participated in activities that would increase their risk for breast cancer, whether the patient had breast cancer previously, or whether the patient had any other medical conditions. Stage four cancer is also characterized by the ability to metastasize and affect other organs in the body, which can be hard to determine from size alone. Even with these limitations, it is still important to get a breast abnormality checked before it gets larger, because larger tumors are correlated with malignancy.

The additional dataset and characteristics carried similar limitations. The stage of each cancer was not included in the dataset, which only indicated whether malignancy was present. This could bias the results, because a metastasized tumor could have different characteristics from other malignant tumors, leading to some inaccuracy in the conclusions.

Github repo:

https://github.com/anapetsmart/INST414-Final

Appendix on work performed