Introduction to Machine Learning — Breast Cancer Case Study

Karan Arya
Published in NLP Gurukool
Dec 27, 2018

Loading a Sample Dataset

import pandas as pd
import numpy as np
from sklearn import datasets
cancer = datasets.load_breast_cancer()

The input data can be accessed by:

cancer.data

The target variable data can be accessed by:

cancer.target

Column headers of input variables can be accessed by:

cancer.feature_names

We can check the type of the cancer dataset with type(cancer). It is a sklearn.utils.Bunch, a dictionary-like object, so we convert it to a Pandas dataframe to make it easier to inspect and slice.
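Before converting, it can help to look at what the Bunch actually holds. The following is a minimal sketch that uses only the data, target and target_names attributes of the loaded dataset; the exact type name printed may differ between scikit-learn versions.

```python
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# The loader returns a dictionary-like Bunch object
print(type(cancer))

# Feature matrix: 569 samples with 30 numeric features each
print(cancer.data.shape)

# Target vector: one label per sample
print(cancer.target.shape)

# Class names in label order: label 0 is malignant, label 1 is benign
print(cancer.target_names)
```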

Converting the Sklearn Dataset to Pandas Dataframe

cancer.keys()

We need to make sure that all the columns, including the target, are added to our dataframe.

Therefore, we concatenate the input data and the target data column-wise with NumPy's np.c_, and append the name 'target' to the feature names to label the last column.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
df.head()
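Once the dataframe exists, a few quick checks confirm that the conversion worked as intended. This is a sketch rather than part of the original walkthrough: it rebuilds df the same way as above, and the name class_counts is introduced here only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn import datasets

cancer = datasets.load_breast_cancer()
df = pd.DataFrame(
    np.c_[cancer["data"], cancer["target"]],
    columns=np.append(cancer["feature_names"], ["target"]),
)

# 30 feature columns plus the appended target column
print(df.shape)

# Samples per class; np.c_ stores the labels as floats (0.0 = malignant, 1.0 = benign)
class_counts = df["target"].value_counts().sort_index()
print(class_counts)

# Total number of missing cells in the dataframe
print(df.isnull().sum().sum())
```

The classes are not balanced (212 malignant versus 357 benign samples), which is worth keeping in mind when reading accuracy figures later on.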

Machine Learning and Predicting

We load a support vector machine learning model to perform binary classification.

from sklearn import svm
clf = svm.SVC(gamma = 0.001, C = 100)

We select every column except the last as the input features X, and the last column as the target y.

X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]

Train and Test Data Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
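Because the two classes are not equally frequent, one optional variation is a stratified split, which keeps the class ratio roughly equal in the training and test sets. The sketch below is an assumption on our part, not what the post uses: passing stratify=y changes which rows land in the test set, so the numbers reported later would not be reproduced exactly.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_breast_cancer(return_X_y=True)

# stratify=y keeps the benign/malignant ratio similar in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(X_train.shape, X_test.shape)

# Labels are 0/1, so the mean is the share of benign samples
print("benign share in train:", y_train.mean())
print("benign share in test:", y_test.mean())
```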

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
# Fit the scaler on the training data only, then reuse it on the test data
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
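The scaler is fitted on the training data only so that no information from the test set leaks into preprocessing. One way to make this hard to get wrong is to bundle the scaler and the classifier in a scikit-learn pipeline. This is a sketch of that alternative, with the same SVC settings as the post; the names svm_pipeline and test_accuracy are ours.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# fit() learns the scaling from X_train only;
# score() and predict() apply that same scaling to new data
svm_pipeline = make_pipeline(StandardScaler(), SVC(gamma=0.001, C=100))
svm_pipeline.fit(X_train, y_train)

test_accuracy = svm_pipeline.score(X_test, y_test)
print(test_accuracy)
```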

Fit SVM Model to Train Data

model = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

The first 10 predictions on the test set can be seen by:

predictions[0:10]

SVM Model Results

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))

Output:

[[41  2]
 [ 0 71]]
             precision    recall  f1-score   support

        0.0       1.00      0.95      0.98        43
        1.0       0.97      1.00      0.99        71

avg / total       0.98      0.98      0.98       114

0.9824561403508771
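The numbers in the report follow directly from the confusion matrix, where rows are the true classes and columns are the predicted classes. As a worked check, this sketch recomputes accuracy, precision, recall and F1 from the SVM matrix printed above using plain NumPy.

```python
import numpy as np

# SVM confusion matrix from the output above
cm = np.array([[41, 2],
               [0, 71]])

# Correct predictions sit on the diagonal: (41 + 71) / 114
accuracy = np.trace(cm) / cm.sum()

# Precision divides by column sums (everything predicted as that class)
precision = np.diag(cm) / cm.sum(axis=0)

# Recall divides by row sums (everything that truly is that class)
recall = np.diag(cm) / cm.sum(axis=1)

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(accuracy)
print(precision.round(2), recall.round(2), f1.round(2))
```

For class 0, for example, precision is 41 / 41 = 1.00 and recall is 41 / 43 ≈ 0.95, which matches the report.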

Fit Random Forest Model to Train Data

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf_model = rf.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

Random Forest Model Results

print(confusion_matrix(y_test, rf_predictions))  
print(classification_report(y_test, rf_predictions))
print(accuracy_score(y_test, rf_predictions))

Output:

[[40  3]
 [ 1 70]]
             precision    recall  f1-score   support

        0.0       0.98      0.93      0.95        43
        1.0       0.96      0.99      0.97        71

avg / total       0.97      0.96      0.96       114

0.9649122807017544

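So, the SVM performs slightly better on this test split (98.2% versus 96.5% accuracy). This does not necessarily mean that an SVM will always outperform a random forest; the ranking can change with a different split, dataset, or choice of hyperparameters.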
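A single train/test split gives only one estimate of each model's accuracy. One way to get a steadier comparison is k-fold cross-validation. The sketch below makes several assumptions that are not in the post: 5 folds, the whole dataset as input, and a random forest with 100 trees instead of 1000 to keep the run short.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)

models = {
    # Scaling lives inside the pipeline so it is refitted on each training fold
    "svm": make_pipeline(StandardScaler(), SVC(gamma=0.001, C=100)),
    # 100 trees rather than 1000 is our assumption, chosen for speed
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, estimator in models.items():
    # Accuracy on each of the 5 held-out folds
    scores = cross_val_score(estimator, X, y, cv=5, scoring="accuracy")
    cv_scores[name] = scores
    print(name, scores.mean().round(3), scores.std().round(3))
```

If the mean scores are within a standard deviation or so of each other, the difference seen on one split is probably not meaningful.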

Saving an ML Model / Model Persistence

It is possible to save a model in scikit-learn by using Python’s built-in persistence module, pickle:

import pickle
s = pickle.dumps(model)
model2 = pickle.loads(s)
model2.predict(X_test)

Let us test whether the restored SVM model gives the same results as the original one, to verify that it was saved correctly.

predictions2 = model2.predict(X_test)
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test, predictions2))
print(accuracy_score(y_test, predictions2))

Output:

[[41  2]
 [ 0 71]]
             precision    recall  f1-score   support

        0.0       1.00      0.95      0.98        43
        1.0       0.97      1.00      0.99        71

avg / total       0.98      0.98      0.98       114

0.9824561403508771

The results are exactly the same, so the model was saved and restored correctly!
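pickle.dumps keeps the model in memory as a bytes object. To reuse the model in a later session it has to be written to a file. The sketch below shows one way to do that and to compare predictions directly instead of by eye; the file name svm_model.pkl and the variable names are hypothetical, and for brevity the model is fitted on the unscaled full dataset.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np
from sklearn import datasets
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
model = SVC(gamma=0.001, C=100).fit(X, y)

# Hypothetical file location inside the system temp directory
model_path = Path(tempfile.gettempdir()) / "svm_model.pkl"

# Write the fitted model to disk, then read it back
with open(model_path, "wb") as f:
    pickle.dump(model, f)
with open(model_path, "rb") as f:
    restored = pickle.load(f)

# The restored model should reproduce the original predictions element by element
same = np.array_equal(model.predict(X), restored.predict(X))
print(same)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code, and prefer loading with the same scikit-learn version that saved the model.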
