Analytics Vidhya

Analytics Vidhya is a community of Generative AI and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Automatic Model Selection and Parameter Optimization with Hyperopt-Sklearn for Breast Cancer Classification

Photo by Angiola Harry on Unsplash

Breast cancer is the most common cancer in the United States and the second leading cause of cancer death after lung cancer (2017 data). Accurate diagnosis is imperative for proper breast cancer management, including treatment selection. Pathologists evaluate the lesion’s overall architecture and morphological features such as the nucleus-to-cytoplasm ratio and various cytoplasmic and nuclear features. The morphological features of the nuclei are essential in making the cancer diagnosis.

Thanks to researchers from the University of Wisconsin, we have a great dataset of nuclear features from benign and malignant cells. The breast cancer dataset is included in the sklearn.datasets module. Let’s have a quick look at it.

The dataset contains 569 instances, 30 numeric predictive attributes, and two classes. There is a slight class imbalance with the benign class containing 357 entries and the malignant class containing 212 entries.
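The figures above can be checked with a few lines of Python. This is a minimal sketch, assuming scikit-learn and NumPy are installed; it relies only on `load_breast_cancer` and on the label encoding that scikit-learn ships with this dataset.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)  # (569, 30)

# In scikit-learn's encoding, 0 = malignant and 1 = benign.
for label, count in zip(data.target_names, np.bincount(y)):
    print(label, count)
```

Note the encoding: the benign class is labeled 1, which matters when reading the sign of any correlation with the target.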

Now let’s see how these predictive attributes correlate with each other and with the target.
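One way to compute these correlations is sketched below. It is not necessarily how the original figure was produced. It uses NumPy's Pearson correlation on the features and the target stacked side by side, and the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Stack the 30 features and the target as columns, then correlate the columns.
table = np.column_stack([data.data, data.target])
corr = np.corrcoef(table, rowvar=False)  # 31 x 31 correlation matrix

# The last column holds each feature's correlation with the target (1 = benign).
target_corr = dict(zip(data.feature_names, corr[:-1, -1]))

# Print the five features most negatively correlated with the target.
for name, r in sorted(target_corr.items(), key=lambda kv: kv[1])[:5]:
    print(f"{name:25s} {r:+.2f}")
```

The full `corr` matrix can be passed to any heatmap function to visualize how the attributes correlate with each other.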

Because of the large number of attributes, a bar chart of each attribute’s correlation with the target may be easier to read.
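A bar chart of this kind could be drawn as follows. This is a sketch that assumes Matplotlib is available; the figure size and axis label are arbitrary choices, not taken from the original plot.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

# Pearson correlation of each feature column with the target (1 = benign).
target_corr = np.array(
    [np.corrcoef(column, data.target)[0, 1] for column in data.data.T]
)

# Sort so the most negative correlations come first.
order = np.argsort(target_corr)

fig, ax = plt.subplots(figsize=(8, 10))
bars = ax.barh(data.feature_names[order], target_corr[order])
ax.set_xlabel("Correlation with target (0 = malignant, 1 = benign)")
fig.tight_layout()
```

Finish with `plt.show()` in a notebook, or `fig.savefig(...)` with a file name of your choice in a script.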

As we can see, “smoothness error” correlates positively with the target, but far more weakly than the negative correlations of “worst concave points”, “worst perimeter”, and “mean concave points.” Because scikit-learn encodes malignant as 0 and benign as 1, a negative correlation means that higher values of the attribute point toward malignancy. This makes sense from a pathologist’s perspective: the larger and more irregular the nuclei are, the more likely they are malignant.

With this short introduction to the breast cancer dataset, we can now look for the best model and the best parameters to classify breast lesions as benign or malignant based on the above 30 nuclear attributes.

To accomplish this goal, we are going to use hyperopt-sklearn. Hyperopt-sklearn is a software project that provides automatic algorithm configuration for the Scikit-learn machine learning library. The best description of the library comes from Brent Komer’s article: “Following Scikit-learn’s convention, hyperopt-sklearn provides an Estimator class with a fit method and a predict method. The fit method of this class performs hyperparameter optimization, and after it has completed, the predict method applies the best model to test data. Each evaluation during optimization performs training on a large fraction of the training set, estimates test set accuracy on a validation set, and returns that validation set score to the optimizer. At the end of search, the best configuration is retrained on the whole data set to produce the classifier that handles subsequent predict calls.”

Let’s see how it works in practice.
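The sketch below shows what such a search might look like. It is not the exact code behind the results that follow. It assumes the `hpsklearn` and `hyperopt` packages are installed. The 80/20 split, the random seed, `max_evals=50`, `trial_timeout=60`, and the helper name `search_best_model` are all illustrative assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set; the 80/20 split and the seed are arbitrary choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)


def search_best_model(X_train, y_train, X_test, y_test, max_evals=50):
    """Let hyperopt-sklearn pick a classifier, preprocessing, and parameters."""
    # Imported here so the data preparation above runs without hpsklearn.
    from hpsklearn import HyperoptEstimator, any_classifier, any_preprocessing
    from hyperopt import tpe

    estimator = HyperoptEstimator(
        classifier=any_classifier("clf"),        # search over all classifiers
        preprocessing=any_preprocessing("pre"),  # and all preprocessing steps
        algo=tpe.suggest,      # Tree-structured Parzen Estimator search
        max_evals=max_evals,   # number of configurations to try (assumed)
        trial_timeout=60,      # seconds allowed per configuration (assumed)
    )
    estimator.fit(X_train, y_train)
    return estimator.best_model(), estimator.score(X_test, y_test)
```

Calling `search_best_model(X_train, y_train, X_test, y_test)` returns the winning configuration together with its accuracy on the held-out test set. The search is stochastic, so the selected model and score will vary from run to run. To restrict the search to a single algorithm, `any_classifier` can be replaced with one of the library's classifier-specific search spaces; their names differ between hpsklearn versions.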

I trained the model several times, and these are the best results I achieved.

SGDClassifier with a score of 0.97.

GradientBoostingClassifier with a score of 0.94.

AdaBoostClassifier and ExtraTreesClassifier both scored 0.93.

KNeighborsClassifier got the worst score of 0.87.

In summary, the Hyperopt-sklearn Python package enables automatic model selection and parameter search. It supports the standard machine learning algorithms provided by Scikit-learn. It is easy to use, and it dramatically speeds up model selection and initial parameter setup. Further parameter tuning may improve the results. The only drawback is the lack of support for K-fold cross-validation, which limits its use for markedly imbalanced datasets.
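One possible workaround is to re-evaluate a selected model with stratified K-fold cross-validation in plain Scikit-learn. The sketch below is an assumption-laden illustration: the pipeline is a stand-in with default parameters, not the tuned model from the search above, and the five folds and the seed are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Stand-in for whatever model the search returned; default parameters only.
model = make_pipeline(StandardScaler(), SGDClassifier(random_state=0))

# Stratified folds keep the benign/malignant ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```

For a markedly imbalanced dataset, a scoring option such as `"balanced_accuracy"` or `"f1"` may be more informative than plain accuracy.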

Thank you for taking the time to read this post.

Best wishes in these difficult times.
Andrew
@tampapath


Published in Analytics Vidhya


Written by Andrew A Borkowski

Pathologist and Deep Learning Enthusiast