Bioinformatics, the Data Science of Biotechnology: Cancer Detection Analysis

Mustaffa Hussain
Published in TheCyPhy
6 min read · Feb 12, 2020

Yeah, you read that right: bioinformatics is data science for the biological field. A lot of research is under way, and a lot of data is being generated, for medical diagnosis, DNA-RNA-metabolite sequence isolation, drug discovery, and more.

Different ways in which data is generated for Bioinformatics.

If you are into data science and want to see what can be done beyond natural language processing, computer vision, time series analysis, and the like, voilà, here we are.

In this blog, we shall try our hand at the Breast Cancer Wisconsin (Diagnostic) Data Set.

Two different types of cells present in the data set. Features are derived for these cells from the FNA images.

About the dataset: features are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe characteristics of the cell nuclei present in the image... and more … *biology.

As I understand it, breast cancer is the most common invasive cancer in women and the second leading cause of cancer death in women after lung cancer. Advances in screening and treatment for breast cancer have improved survival rates dramatically since 1989. Breast cancer is diagnosed in three ways:

  • Breast exam: usually a routine physical examination by a doctor.
  • Imaging tests: a set of imaging techniques such as breast tomosynthesis, mammograms, and breast MRI.
  • Biopsy: a procedure to remove a piece of tissue or a sample of cells from the body so that it can be analyzed in a laboratory.

Coming back to the data set, it contains features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. These features are derived by subject experts from the images. Now that we have the basic background and understanding of the data, we can jump into coding.

That's how it works here.

So let's code, let's DETECT

We need to load a few libraries before we can get started. These are pretty much the prerequisites for any Python script dealing with data.
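The original import cell did not survive in this copy of the post, so here is a rough sketch of what a typical set of imports for this kind of analysis might look like. The exact libraries are an assumption on my part, not necessarily the ones used in the notebook.

```python
# Core data tooling
import numpy as np
import pandas as pd

# Splitting, scaling and scoring helpers from scikit-learn
# (assumption: the analysis is scikit-learn based)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score

print("numpy", np.__version__, "| pandas", pd.__version__)
```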

Task: predict whether a tumor is benign or malignant.

Now that we have loaded the tools, we need to load the dataset itself. The data set used in this article is available on Kaggle [3].
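A minimal loading sketch, assuming scikit-learn's bundled copy of the Wisconsin Diagnostic data as a stand-in for the Kaggle CSV (so the example runs without a download). The `diagnosis` column is rebuilt here to mimic the 'M'/'B' labels; the bundled copy has no ID column, so its shape differs slightly from the Kaggle file.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: the bundled scikit-learn copy stands in for the Kaggle CSV.
bunch = load_breast_cancer(as_frame=True)
df = bunch.data.copy()

# scikit-learn encodes the target as 0 = malignant, 1 = benign;
# map it back to 'M'/'B' so the frame resembles the Kaggle file.
df.insert(0, "diagnosis", bunch.target.map({0: "M", 1: "B"}))

print(df.shape)
print(df.head())
```

With the Kaggle file downloaded, a `pd.read_csv` call on wherever you saved it would replace the loader.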

The loaded data looks like this

Basic operations

We perform some basic operations on the loaded data. The dataset is of size 569x32: 569 patients (rows), each with 32 columns. Those columns comprise an ID, the diagnosis label, and 30 numeric features.

Furthermore, none of the columns has any missing or NaN values. All feature columns are of type float64; the exception is the diagnosis column, which serves as our class variable. Entries with ‘M’ represent malignant tumors and those with ‘B’ stand for benign tumors.
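These checks can be sketched in a few lines of pandas. Again this assumes the scikit-learn copy of the data, which has 31 columns (no ID column) instead of the 32 reported above.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Assumption: scikit-learn's bundled copy stands in for the Kaggle CSV.
bunch = load_breast_cancer(as_frame=True)
df = bunch.data.copy()
df.insert(0, "diagnosis", bunch.target.map({0: "M", 1: "B"}))

n_rows, n_cols = df.shape

# Total number of missing cells across the whole frame
missing_total = int(df.isnull().sum().sum())

# Columns that are not float64 (only the class label is expected here)
non_float_cols = df.select_dtypes(exclude="float64").columns.tolist()

print(f"rows={n_rows}, columns={n_cols}")
print(f"missing values: {missing_total}")
print(f"non-float64 columns: {non_float_cols}")
```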

Moreover, it is always advisable to check for class imbalance before feeding data into models. Many models learn skewed representations or favour the dominant class in their predictions. Our simple bar plot analysis shows no major imbalance between the two labels. One can always upsample or downsample in case of severe imbalance. You can read more about this in [1] and [2].
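The numbers behind such a bar plot can be computed directly. This is a sketch on the scikit-learn copy of the data; the 20% cut-off for "severe" imbalance is a rule of thumb I picked for illustration, not a value from the analysis.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer(as_frame=True)

# scikit-learn encodes the target as 0 = malignant, 1 = benign
diagnosis = bunch.target.map({0: "M", 1: "B"}).rename("diagnosis")

counts = diagnosis.value_counts()                          # absolute counts
shares = diagnosis.value_counts(normalize=True).round(3)   # class proportions
print(counts)
print(shares)

# Assumption: 0.2 is an illustrative threshold, not a standard
minority_share = shares.min()
print("severe imbalance" if minority_share < 0.2 else "no severe imbalance")
```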

Handling the class labels: we have a categorical class label, which needs to be converted to a numeric one for easier handling by the models.
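Two common ways to do this conversion are sketched below on a tiny made-up label column. The choice of M = 1 and B = 0 is an assumption; the post does not say which encoding the notebook used.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy labels for illustration only (not rows from the dataset)
diagnosis = pd.Series(["M", "B", "B", "M", "B"], name="diagnosis")

# Option 1: explicit mapping, so the meaning of 1 and 0 is visible in code
encoded_map = diagnosis.map({"M": 1, "B": 0})

# Option 2: LabelEncoder assigns integers in sorted label order (B -> 0, M -> 1)
encoder = LabelEncoder()
encoded_le = encoder.fit_transform(diagnosis)

print(encoded_map.tolist())
print(list(encoded_le), list(encoder.classes_))
```

The explicit mapping is a bit more verbose, but it leaves no doubt about which class is the positive one.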

We check the correlations between the features for insight into patterns and redundant features. The resulting correlation plot:

Correlations between the different features.
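Besides eyeballing a plot, the strongly correlated (and so possibly redundant) feature pairs can be listed directly. A sketch on the scikit-learn copy of the data; the 0.9 threshold is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

X = load_breast_cancer(as_frame=True).data

# Pairwise Pearson correlations between the 30 features
corr = X.corr()

# Keep only the upper triangle so each pair appears once (and no self-pairs)
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(upper).stack().dropna()

# Assumption: 0.9 is an illustrative cut-off for "strongly correlated"
strong = pairs[pairs.abs() > 0.9].sort_values(ascending=False)
print(strong.head(10))
```

Pairs such as radius and perimeter show up here, which makes sense geometrically: one is close to a function of the other.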

The basic analysis is done.

Now we need to prepare the data for training and testing. We split the data into train and test splits of 70:30.
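A 70:30 split might look like this with scikit-learn. The `random_state` and `stratify` arguments are my additions for reproducibility and to keep the class ratio equal in both splits; the post does not say whether the notebook used them.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)

# 70% train / 30% test.
# Assumptions: fixed seed and stratification are illustrative choices.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)
```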

Models in use

I intend to use a set of linear as well as non-linear classifiers for this data and see what performs well. Moreover, we can make some observations about these models based on how the train and test metrics behave. The models on which the data was trained and tested include

  • Logistic regression
  • K nearest neighbors
  • Support vector machine (linear)
  • Support vector machine(kernel)
  • Naive Bayes
  • Decision trees
  • Random forest
  • Artificial neural network
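One way to train and score all of these in a single loop is sketched below with scikit-learn. The hyperparameters are defaults or guesses, and `MLPClassifier` stands in for the artificial neural network (the notebook may well have used a different framework), so treat the numbers it prints as illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)


def scaled(model):
    """Wrap a model so features are standardized before fitting."""
    return make_pipeline(StandardScaler(), model)


# Assumption: all hyperparameters below are illustrative, not the author's.
models = {
    "Logistic regression": scaled(LogisticRegression(max_iter=1000)),
    "K nearest neighbors": scaled(KNeighborsClassifier(n_neighbors=5)),
    "SVM (linear)": scaled(SVC(kernel="linear")),
    "SVM (RBF kernel)": scaled(SVC(kernel="rbf")),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
    "Random forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Neural network (MLP)": scaled(
        MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=42)
    ),
}

# Fit each model and record (train accuracy, test accuracy)
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:22s} train={results[name][0]:.3f} test={results[name][1]:.3f}")
```

Wrapping the scale-sensitive models in a pipeline means the scaler is fit on the training split only, which avoids leaking test statistics into training.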

Observations made from the results:

  • A model that performs well on training data will not necessarily fare well on test data. That's why it is advised not to pick a model on the basis of a quick preliminary test. We can see the decision tree model perform perfectly on the train data but not fare as well on the test data. This is because the model overfits the training data.
  • It is essential to standardize the data before feeding it to the models. SVM and KNN are based on distances, and logistic regression, when regularized or fit with a gradient-based solver, is also sensitive to feature scale. So these models are heavily affected by non-standardized data.
  • Non-linear models do not always perform better than linear models. We can see that the linear SVM performs really well and removes unnecessary complexity from the process, resulting in simplicity and easy interpretability.
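The first two observations are easy to probe yourself. The sketch below compares train and test accuracy for an unpruned decision tree, then fits KNN and an RBF SVM with and without standardization. The models' settings and the split seed are assumptions, so the exact gaps you see may differ from the ones in the notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)


def train_test_accuracy(model):
    """Fit on the training split and return (train accuracy, test accuracy)."""
    model.fit(X_train, y_train)
    return model.score(X_train, y_train), model.score(X_test, y_test)


# 1) Overfitting: an unpruned tree can memorize the training split.
tree_train, tree_test = train_test_accuracy(DecisionTreeClassifier(random_state=42))
print(f"Decision tree: train={tree_train:.3f} test={tree_test:.3f}")

# 2) Scaling: the same model on raw vs standardized features.
factories = {
    "KNN": lambda: KNeighborsClassifier(n_neighbors=5),
    "SVM (RBF kernel)": lambda: SVC(kernel="rbf"),
}
comparison = {}
for name, make_model in factories.items():
    _, raw_test = train_test_accuracy(make_model())
    _, scaled_test = train_test_accuracy(
        make_pipeline(StandardScaler(), make_model())
    )
    comparison[name] = {"raw": raw_test, "scaled": scaled_test}
    print(f"{name}: raw test={raw_test:.3f} scaled test={scaled_test:.3f}")
```

A gap between the tree's train and test scores is the overfitting signature; limiting `max_depth` is one common way to narrow it.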

It is very important to evaluate the performance of a model with an appropriate metric. There are a lot of metrics for classification. The metric used here is accuracy: the ratio of correct predictions to the total number of predictions. Classification accuracy usually works well when datasets are balanced, and our earlier look at the data revealed no major imbalance. To read more about metrics, please refer to [4].
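To make the definition concrete, here is accuracy computed by hand and with scikit-learn on a small made-up set of labels (toy values, not results from the experiment). The confusion matrix is included because it shows which kind of mistake was made, something a single accuracy number hides.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Toy labels for illustration: 1 = malignant, 0 = benign
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy by hand: correct predictions / total predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
manual_accuracy = correct / len(y_true)

# The same number via scikit-learn
sk_accuracy = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes (label order 0, 1)
cm = confusion_matrix(y_true, y_pred)

print(manual_accuracy, sk_accuracy)
print(cm)
```

In a medical setting the off-diagonal cells matter a lot: a malignant tumor predicted as benign is usually the costlier error.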

The results of the experiment are compiled as:

The notebook and analysis presented in this blog can be found in detail on GitHub [5]. Let's spread awareness and beat the shit out of this bitch.

References

[1] https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data

[2] https://elitedatascience.com/imbalanced-classes

[3] https://www.kaggle.com/uciml/breast-cancer-wisconsin-data

[4] https://towardsdatascience.com/metrics-to-evaluate-your-machine-learning-algorithm-f10ba6e38234

[5] https://github.com/mustaffa-hussain

Mustaffa Hussain
TheCyPhy

M.Sc Computer Science from South Asian University. I write to understand. Portfolio link- mustaffa-hussain.github.io/Portfolio