Leveraging Machine Learning to Predict Breast Cancer Diagnosis

AIMS Next Einstein Initiative
AIMS Community Digest
Aug 3, 2020

According to medical researchers, breast cancer is the most common invasive cancer in women worldwide. It develops when abnormal breast cells divide uncontrollably and invade other body tissues through the blood and lymphatic systems. Globally, it affects 1 in 7 (14%) of women. The United States of America recorded 2.8 million breast cancer cases in 2015. According to experts, younger women are more vulnerable and tend to develop more aggressive breast cancers than older women. A National Cancer Institute report covering 2012 to 2016 found that breast cancer death rates among women between the ages of 20 and 49 were more than double those of other cancers among women across the world. Breast cancer thus remains a significant health issue even after five decades, and its incidence has increased significantly in recent years.

Thanks to advances in technology and medicine, health systems are rapidly improving cancer detection and diagnosis. A myriad of preventive screening and diagnostic approaches have been put in place to reduce new cancer cases in the community. Among these techniques, artificial intelligence (AI) and machine learning (ML) could not be left behind: AI and ML now support cancer detection and prediction to a degree that exceeds earlier expectations. For this reason, our work employs a machine learning ensemble approach to classify, detect, and predict the breast cancer diagnosis decision. An ensemble combines multiple machine learning techniques to improve and boost prediction performance.

Our method

In our work, we conducted an experimental comparison of twelve individual supervised learning models trained on a breast cancer data set that is publicly available in the UCI Machine Learning Repository. The models were trained with 10-fold cross-validation repeated five times as the resampling method. Each model was tuned to find its optimal configuration; we set one hundred tuning iterations and fixed the random seed that initializes the pseudo-random number generator, so that results are reproducible rather than subject to run-to-run variability. The best-performing models were then selected as candidates for the ensemble. Finally, the performance results were analyzed with the stacking technique (sometimes called stacked generalization), in which the trained sub-models contribute equally to a combined prediction.
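The resampling setup described above can be sketched in a few lines with scikit-learn. This is a minimal illustration, not the paper's exact pipeline: it assumes the UCI Wisconsin breast cancer data (bundled with scikit-learn), a single kNN model as a stand-in for the twelve compared, and a placeholder seed of 42.

```python
# Sketch of the resampling method: 10-fold cross-validation repeated
# five times, with a fixed random seed for reproducible results.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# 10 folds x 5 repeats = 50 train/test splits; the seed pins the splits.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=42)

# One candidate model (kNN with feature scaling) shown for illustration.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
print(f"mean accuracy: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Each of the twelve candidate models would be evaluated on the same 50 splits, which makes their mean accuracies directly comparable.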

Outcome

The experimental results show that the stacking ensemble outperforms the best individual classification method. The proposed model improves both classification accuracy and Cohen's kappa statistic compared with simply selecting the single best classifier from the model combination. Among the individual models, kernel k-Nearest Neighbor (kernel kNN) and AdaBoost were the strongest, with mean accuracies of 99.55% and 99.32% and average Cohen's kappa of 99.11% and 98.64%, respectively. In the second stage, we combined the individual models' predictions through the stacked generalization approach, which yielded results significantly better than using either kernel kNN or AdaBoost alone: an accuracy of 99.90% and a Cohen's kappa of 99.81%, a rise of 0.35% in accuracy and 0.70% in kappa over the best single model.
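The shape of such a stacked model can be sketched with scikit-learn's `StackingClassifier`. This is a hedged illustration only: it uses plain kNN and AdaBoost base learners with a logistic-regression meta-learner on the bundled Wisconsin data, whereas the published model used kernel kNN and its own combination rule, so the hyperparameters and scores here are not the paper's.

```python
# Illustrative stacking ensemble: base learners' out-of-fold predictions
# are fed to a meta-learner that produces the combined prediction.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier(n_neighbors=5)),
        ("ada", AdaBoostClassifier(n_estimators=50, random_state=42)),
    ],
    final_estimator=LogisticRegression(max_iter=1000),
    cv=10,  # internal folds that generate the meta-learner's training data
)
scores = cross_val_score(stack, X, y, scoring="accuracy", cv=5)
print(f"stacked accuracy: {scores.mean():.4f}")
```

The key design point is that the meta-learner is trained only on out-of-fold base predictions, which prevents it from simply memorizing base-learner overfitting.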

Future work

In conclusion, our work analyzed the performance of a popular ensemble method, stacked generalization. We found that it outperforms a simple selection of the best single classifier among the algorithms trained on the breast cancer data set used in our experiment. We tested our stacking model mainly on a two-class data set, on which it performed admirably. In the future, we can extend the combination rules and employ other ensemble techniques, such as bagging and boosting, so that they work even better for both binary and multi-class problems.
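Bagging, one of the extensions mentioned above, differs from stacking in that it trains copies of one model on bootstrap resamples and averages their votes rather than learning a combiner. A minimal sketch, again assuming scikit-learn and the bundled Wisconsin data with placeholder hyperparameters:

```python
# Illustrative bagging ensemble: many decision trees (the default base
# estimator) trained on bootstrap samples, combined by majority vote.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

bag = BaggingClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(bag, X, y, scoring="accuracy", cv=10)
print(f"bagging accuracy: {scores.mean():.4f}")
```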

Written by Murera Gisa, AIMS Alumni’19 (Rwanda)

Murera Gisa is a Data Scientist and Economist. His fields of practice include data analytics and machine learning. He enjoys making various types of data speak for themselves and communicating them to diverse audiences.



The African Institute for Mathematical Sciences (AIMS) is a pan-African network of centres of excellence for post-graduate training and research.