Predicting Breast Cancer Diagnoses

Michela Tjan
5 min read · May 2, 2024


Photo by Angiola Harry on Unsplash

As the Spring 2024 semester comes to an end, I would like to share one of my projects. One of the requirements for the CSC642: Statistical Learning with Applications class by Prof. Vanessa Aguiar-Pulido was a semester-long group project. I was partnered with Michelle Manfrini and we decided to focus our project on health.

Ideation

Breast cancer is one of the leading causes of death among women, so early screening matters: the earlier a case is found, the better the chances of successful treatment. A robust predictive model could help medical professionals identify likely cases sooner.

Datasets

While searching for datasets, we came across the Breast Cancer Coimbra dataset from the UCI Machine Learning Repository. It contains 116 instances based on clinical observations from 64 patients with breast cancer and 52 healthy controls. The dataset is small, and plenty of research has already been conducted on it. To mitigate these issues, we used a synthetic dataset from Kaggle that was derived from the Breast Cancer Coimbra dataset, and we worked with both datasets in our project.

Methods

We pre-processed the data and carried out exploratory data analysis (EDA) before fitting any models. This comprised data type conversion, plotting histograms and correlation matrices, and principal component analysis (PCA).

The synthetic data was split 70:30 into training and validation sets, while the original data was held back as the test set. We fit four classical machine learning models to the data:

  • K-Nearest Neighbors Classifier (KNN)
  • Linear Discriminant Analysis (LDA)
  • Quadratic Discriminant Analysis (QDA)
  • Logistic Regression
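For readers who want to try a similar setup, here is a minimal sketch in Python with scikit-learn. It is not the project's actual code: the data comes from `make_classification` as a stand-in for the synthetic dataset, and the variable names, random seed, scaling step, and default hyperparameters are all assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): 9 numeric predictors and a binary target.
# These are generated values, not the Coimbra or Kaggle data.
X, y = make_classification(
    n_samples=1000, n_features=9, n_informative=5, n_redundant=0, random_state=0
)

# 70:30 split into training and validation sets, stratified by class.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# The four classical models. Scaling before KNN and logistic regression
# is an illustrative choice, not something stated in the write-up.
models = {
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
}

val_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    val_accuracy[name] = model.score(X_val, y_val)
    print(f"{name}: validation accuracy = {val_accuracy[name]:.3f}")
```

In a setup like ours, the original dataset would then be passed to each fitted model as a separate test set rather than being part of this split.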

We tuned hyperparameters and resampled with 5-fold cross-validation before evaluating the models with the following methods and metrics:

  • Hold-out approach
  • Confusion matrix
  • Cross-validation
  • Precision
  • Recall
  • Accuracy
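As an illustration of how tuning and evaluation can fit together, the sketch below tunes the number of neighbors for KNN with 5-fold cross-validation and then reports the confusion matrix, precision, recall, and accuracy on a hold-out set. The data is generated with `make_classification`, and the parameter grid, seed, and variable names are hypothetical; this is one possible way to do it in scikit-learn, not a record of what we ran.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption), not the project datasets.
X, y = make_classification(
    n_samples=600, n_features=9, n_informative=5, n_redundant=0, random_state=1
)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=1
)

pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])

# Hypothetical grid of candidate values for k.
param_grid = {"knn__n_neighbors": [3, 5, 7, 9, 11]}

# 5-fold cross-validation on the training set picks the best k.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Hold-out evaluation with the refitted best model.
y_pred = search.predict(X_hold)
cm = confusion_matrix(y_hold, y_pred)
metrics = {
    "precision": precision_score(y_hold, y_pred),
    "recall": recall_score(y_hold, y_pred),
    "accuracy": accuracy_score(y_hold, y_pred),
}

print("best k:", search.best_params_["knn__n_neighbors"])
print("confusion matrix:\n", cm)
print(metrics)
```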

Results

Histograms and Bar Plots

Histograms of quantitative variables. Left: original dataset. Right: synthetic dataset.

The histograms show that none of the numeric variables in either dataset follows a normal distribution. This may affect the predictions of models trained and tested on these datasets, particularly LDA and QDA, which assume that the predictors are normally distributed within each class.

Target variable barplot. Left: original dataset. Right: synthetic dataset.

The binary target variable looks reasonably balanced in both datasets; healthy controls are coded as 1 and patients with breast cancer as 2. The two classes are not exactly equal in size, but the imbalance does not appear to be problematic.

Summary Statistics

The summary statistics of the numeric variables reveal that the means and medians are similar across the two datasets. The difference lies in the minimum, maximum, first quartile, and third quartile values. The synthetic dataset's variables have a narrower range: the gap between their minimum and maximum values is smaller than in the original dataset.
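To make this kind of comparison concrete, here is a small sketch that computes the same summary statistics for one variable in two samples using only Python's standard library. The numbers are made up to mimic the pattern described above (similar centers, narrower synthetic range); they are not values from either dataset, and the function name is hypothetical.

```python
import statistics


def five_number_summary(values):
    """Return min, quartiles, mean, max, and range for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median,
        "mean": statistics.fmean(values),
        "q3": q3,
        "max": max(values),
        "range": max(values) - min(values),
    }


# Hypothetical measurements for a single variable (illustration only).
original_like = [18.0, 21.5, 23.0, 26.5, 27.0, 30.0, 34.5, 38.0]
synthetic_like = [23.5, 25.0, 26.0, 27.0, 27.5, 28.0, 29.5, 31.0]

original_summary = five_number_summary(original_like)
synthetic_summary = five_number_summary(synthetic_like)

for key in original_summary:
    print(f"{key:>6}: original={original_summary[key]:.2f} "
          f"synthetic={synthetic_summary[key]:.2f}")
```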

Correlation Matrices

Correlation matrices. Left: original dataset. Right: synthetic dataset.

Based on the correlation matrices, the variables in the synthetic dataset have low pairwise correlations, with coefficients close to zero. The opposite is true for the original dataset, whose variables exhibit medium to strong correlations. This result is surprising, considering that the synthetic dataset was generated by a deep learning model trained on the original dataset.
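One simple way to put a number on this difference is to average the absolute off-diagonal entries of each correlation matrix. The sketch below does this with NumPy on placeholder data that I deliberately constructed to show the contrast: one sample whose columns share a common factor, and one whose columns are drawn independently. The sample sizes, the helper name `mean_abs_offdiag_corr`, and the data itself are assumptions for illustration and say nothing about how the Kaggle dataset was actually generated.

```python
import numpy as np


def mean_abs_offdiag_corr(data):
    """Average absolute pairwise Pearson correlation between columns."""
    corr = np.corrcoef(data, rowvar=False)
    off_diagonal = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diagonal)))


rng = np.random.default_rng(0)

# Placeholder "original-like" sample: every column is a shared latent
# factor plus noise, so the columns are strongly correlated.
latent = rng.normal(size=(116, 1))
original_like = latent + 0.5 * rng.normal(size=(116, 4))

# Placeholder "synthetic-like" sample: columns drawn independently,
# so pairwise correlations should sit near zero.
synthetic_like = rng.normal(size=(2000, 4))

original_score = mean_abs_offdiag_corr(original_like)
synthetic_score = mean_abs_offdiag_corr(synthetic_like)

print(f"original-like:  {original_score:.3f}")
print(f"synthetic-like: {synthetic_score:.3f}")
```

A single summary number like this hides which pairs of variables differ, so it complements the heatmaps rather than replacing them.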

Principal Component Analysis (PCA)

PCA results. Left: scatterplot. Right: histogram.

If the two datasets were homogeneous, their points would be distributed similarly in a scatterplot of the principal components, without forming distinct clusters. By this definition, the PCA scatterplot suggests that the datasets are homogeneous: there are no clusters, and points from both datasets are spread across the same region.

The histograms for the synthetic and original datasets are also very similar. They share the same overall shape, their centers are close or aligned, and their spreads mimic each other, which further indicates homogeneity. That said, no principal component has a noticeably large variance explained or proportion of variance explained.
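A homogeneity check along these lines can be sketched by stacking both datasets, standardizing them, and projecting onto the first two principal components while keeping track of where each row came from. The example below uses scikit-learn and random placeholder arrays; the array shapes, labels, and variable names are assumptions, and the real analysis would load the two datasets instead.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Placeholder arrays (assumption): 9 numeric features per row.
original_like = rng.normal(size=(116, 9))
synthetic_like = rng.normal(size=(500, 9))

# Stack both datasets and remember the source of each row.
combined = np.vstack([original_like, synthetic_like])
source = np.array(
    ["original"] * len(original_like) + ["synthetic"] * len(synthetic_like)
)

# Standardize, then project onto the first two principal components.
scaled = StandardScaler().fit_transform(combined)
pca = PCA(n_components=2)
scores = pca.fit_transform(scaled)

# Compare where each dataset sits along PC1: similar centers and spreads
# are what one would expect from homogeneous data.
for label in ("original", "synthetic"):
    pc1 = scores[source == label, 0]
    print(f"{label}: PC1 mean = {pc1.mean():.3f}, PC1 std = {pc1.std():.3f}")

print("proportion of variance explained:", pca.explained_variance_ratio_)
```

Plotting `scores` colored by `source` gives the scatterplot, and plotting each component's values per source gives the histograms.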

Evaluation of Models

Cross-Validation Summary of Model Metrics on Test Set After Hyperparameter Tuning
Confusion Matrix of Models After Hyperparameter Tuning: Test Set

The KNN model outperforms the LDA model in terms of accuracy, recall, and test error. The logistic regression model seems to have a lower accuracy rate than all the other models, but achieved the highest precision and F1 score.

Although LDA appears to be the best performing model on the metrics, its confusion matrix tells a different story. The model has a strong tendency to predict that patients have breast cancer, and the same is true of the logistic regression model. The KNN and QDA models did not outperform LDA, but their predictions are spread more evenly across the two classes.
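To see why headline metrics and a confusion matrix can disagree, it helps to compare how often a model predicts the positive class with how often that class actually occurs. The sketch below does this in plain Python. The class totals (52 healthy controls, 64 patients) match the original dataset, but the individual cell counts are invented for illustration and are not our results; the function and model labels are hypothetical as well.

```python
def summarize(tn, fp, fn, tp):
    """Compute basic metrics and prediction balance from confusion matrix counts."""
    total = tn + fp + fn + tp
    predicted_positive = tp + fp
    actual_positive = tp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / predicted_positive if predicted_positive else 0.0,
        "recall": tp / actual_positive if actual_positive else 0.0,
        "predicted_positive_rate": predicted_positive / total,
        "actual_positive_rate": actual_positive / total,
    }


# Invented counts (illustration only): a model that leans heavily toward
# predicting cancer, and one whose predictions are more evenly spread.
skewed = summarize(tn=10, fp=42, fn=4, tp=60)
balanced = summarize(tn=30, fp=22, fn=26, tp=38)

for name, result in (("skewed", skewed), ("balanced", balanced)):
    print(name)
    for key, value in result.items():
        print(f"  {key}: {value:.3f}")
```

In this made-up case the skewed model wins on accuracy and recall, yet it labels almost everyone as having cancer, which is the kind of pattern a confusion matrix exposes.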

Based on these observations, no model performs significantly better than the others. Selecting the “best” model involves trade-offs, but KNN seems to be the strongest overall once the confusion matrix is considered alongside the other evaluation metrics, since it does not strongly favor one class over the other.

Future Work

Previous studies trained their models solely on the Breast Cancer Coimbra dataset, which is relatively small. In contrast, we trained our models on the larger synthetic dataset and tested them on the original dataset. This approach has the potential to improve model performance, since training on more data may help a model generalize to a wider range of cases. Even so, future studies should assess additional prediction models, feature selection methods, and more expansive clinical datasets. More research is also needed to verify the relationship between these quantitative attributes and a true diagnosis of breast cancer.

Conclusion

I truly enjoyed collaborating with Michelle on this project. It was my first time working with both synthetic and original datasets in a single project, and I found that the method used to generate a synthetic dataset is very important. Even though the synthetic dataset was derived from the original dataset, that does not mean a model trained on it can predict the original well. Also, more data does not always yield better results!

If you would like to view the source code and the final project report, check out my GitHub repository.

Please feel free to follow my Medium, LinkedIn, and Portfolio to be updated on my work!

