Introduction to Machine Learning — Breast Cancer Case Study

Loading a Sample Dataset

import pandas as pd
import numpy as np
from sklearn import datasets
cancer = datasets.load_breast_cancer()

The input data can be accessed by:

cancer.data

The target variable data can be accessed by:

cancer.target

Column headers of input variables can be accessed by:

cancer.feature_names


We can check the type of the cancer dataset with type(cancer). It is a sklearn.utils.Bunch, which we convert to a pandas DataFrame so that it is easier to work with.
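
To make the Bunch container less abstract, here is a minimal sketch that inspects it. It reloads the dataset so that it runs on its own; only the printed fields are chosen for illustration.

```python
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# A Bunch behaves like a dictionary whose keys are also attributes.
print(type(cancer).__name__)        # class name of the container
print(sorted(cancer.keys()))        # fields available on the Bunch
print(cancer.data.shape)            # (n_samples, n_features)
print(cancer.target.shape)          # one label per sample
print(list(cancer.target_names))    # label names, indexed by target value
```

Because keys and attributes are interchangeable, cancer['data'] and cancer.data refer to the same array.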

Converting the Sklearn Dataset to Pandas Dataframe


We need to make sure that all the columns, including the target, are added to our dataframe.

Therefore, we concatenate the input data and the target data column-wise using the numpy library.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
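
As an alternative sketch, recent scikit-learn releases can build the dataframe directly. This assumes a version that supports the as_frame argument (0.23 or later); the names cancer_frame and df_alt are only illustrative.

```python
from sklearn import datasets

# as_frame=True asks scikit-learn to return pandas objects.
cancer_frame = datasets.load_breast_cancer(as_frame=True)
df_alt = cancer_frame.frame   # 30 feature columns plus a 'target' column

print(df_alt.shape)
print(df_alt.columns[-1])
```

One difference from the numpy approach: np.c_ stores everything as floats, so the target appears as 0.0 and 1.0, while the frame built here keeps the target as integers.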

Machine Learning and Predicting

We load a support vector machine learning model to perform binary classification.

from sklearn import svm
clf = svm.SVC(gamma = 0.001, C = 100)

We select X as the set of input variable columns and y as the target variable column.

X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]

Train and Test Data Split

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
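
The split above is purely random. As a hedged variation, not part of the original workflow, the sketch below adds stratify so that both subsets keep roughly the same class ratio. It reloads the data to stay self-contained, and the variable names are illustrative.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)

# stratify=y_demo keeps the class ratio (nearly) equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
# Share of class 1 in each subset; the two values should be close.
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```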

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
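
The scaler is fitted on the training data only and then reused on the test data, which keeps test-set statistics out of training. One way to bundle that pattern is a scikit-learn Pipeline. The sketch below is an illustration under that assumption, not the workflow used in this article; it reloads the data and uses its own variable names.

```python
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The pipeline fits the scaler on the training data only, then reuses
# those statistics whenever it transforms new data before prediction.
pipe = make_pipeline(StandardScaler(), svm.SVC(gamma=0.001, C=100))
pipe.fit(X_tr, y_tr)

scaler = pipe.named_steps["standardscaler"]
print(scaler.mean_[:3])          # per-feature means learned from the training set
print(pipe.score(X_te, y_te))    # accuracy on the held-out test set
```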

Fit SVM Model to Train Data

model = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

The first 10 predictions on the test set can be seen by:

predictions[:10]

SVM Model Results

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))
print(accuracy_score(y_test, predictions))


[[41  2]
 [ 0 71]]
             precision    recall  f1-score   support

        0.0       1.00      0.95      0.98        43
        1.0       0.97      1.00      0.99        71

avg / total       0.98      0.98      0.98       114
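
To show how the report follows from the confusion matrix, here is a small worked sketch that recomputes the per-class metrics from the four counts above. It uses only numpy and assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions.
cm = np.array([[41, 2],
               [0, 71]])

true_pos = np.diag(cm)                    # correctly classified samples per class
precision = true_pos / cm.sum(axis=0)     # divide by column totals (predicted counts)
recall = true_pos / cm.sum(axis=1)        # divide by row totals (true counts)
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print(np.round(precision, 2))   # [1.   0.97]
print(np.round(recall, 2))      # [0.95 1.  ]
print(np.round(f1, 2))          # [0.98 0.99]
print(round(accuracy, 4))       # 0.9825
```

For example, the recall for class 0 is 41 / 43, because 2 of the 43 true class 0 samples were predicted as class 1.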


Fit Random Forest Model to Train Data

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf_model = rf.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)

Random Forest Model Results

print(confusion_matrix(y_test, rf_predictions))  
print(classification_report(y_test, rf_predictions))
print(accuracy_score(y_test, rf_predictions))


[[40  3]
 [ 1 70]]
             precision    recall  f1-score   support

        0.0       0.98      0.93      0.95        43
        1.0       0.96      0.99      0.97        71

avg / total       0.97      0.96      0.96       114


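On this test split, the SVM misclassifies 2 of the 114 samples while the random forest misclassifies 4, so the SVM performs slightly better in this case. This does not necessarily mean that the SVM will always perform better; the outcome depends on the data, the particular split, and the hyperparameters.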
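
A single split can favour one model by chance. As a hedged sketch of a more robust comparison, the code below scores both models with 5-fold cross-validation. The fold count, the smaller forest, and the variable names are assumptions made for illustration; no particular scores are claimed.

```python
from sklearn import datasets, svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)

candidates = {
    # Scaling lives inside the pipeline so each fold is scaled on its own training part.
    "svm": make_pipeline(StandardScaler(), svm.SVC(gamma=0.001, C=100)),
    # Fewer trees than in the text, only to keep this sketch quick to run.
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_scores = {}
for name, estimator in candidates.items():
    # 5-fold cross-validation: each fold serves once as the held-out set.
    cv_scores[name] = cross_val_score(estimator, X_demo, y_demo, cv=5)
    print(name, round(cv_scores[name].mean(), 3), round(cv_scores[name].std(), 3))
```

Comparing the mean and the spread of the fold scores gives a better sense of whether the difference between the two models is meaningful.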

Saving an ML Model / Model Persistence

It is possible to save a model in scikit-learn by using Python's built-in persistence module, pickle:

import pickle
s = pickle.dumps(model)
model2 = pickle.loads(s)

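Let us test whether the restored SVM model gives the same results as the original one, to verify that it was saved correctly.

predictions2 = model2.predict(X_test)
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test, predictions2))
print(accuracy_score(y_test, predictions2))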


[[41  2]
 [ 0 71]]
             precision    recall  f1-score   support

        0.0       1.00      0.95      0.98        43
        1.0       0.97      1.00      0.99        71

avg / total       0.98      0.98      0.98       114


The results are exactly the same. The model runs well and was saved correctly!
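
Matching summary metrics are a good sign, but a stricter check compares the predictions element by element. The sketch below does this with a small stand-in model; the names original and restored are illustrative, and the model is trained on unscaled data only to keep the example short.

```python
import pickle

import numpy as np
from sklearn import datasets, svm

X_demo, y_demo = datasets.load_breast_cancer(return_X_y=True)

# Small stand-in model, trained on unscaled data for brevity.
original = svm.SVC(gamma=0.001, C=100).fit(X_demo, y_demo)

# Serialise to bytes and restore, as in the text.
restored = pickle.loads(pickle.dumps(original))

# True only if every single prediction is identical.
same = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print(same)   # True
```

Note that unpickling can execute arbitrary code, so only load pickled models from sources you trust.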

