Introduction to Machine Learning — Breast Cancer Case Study
Loading a Sample Dataset
import pandas as pd
import numpy as np
from sklearn import datasets
cancer = datasets.load_breast_cancer()
The input data can be accessed by:
cancer.data
The target variable data can be accessed by:
cancer.target
Column headers of input variables can be accessed by:
cancer.feature_names
We can check the type of the cancer dataset with type(cancer). It is a sklearn.utils.Bunch, which needs to be converted to a Pandas dataframe.
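To get a feel for what the Bunch object holds before converting it, a short inspection like the following can help. This is a minimal sketch; the variable name same_object is only for illustration.

```python
from sklearn import datasets

cancer = datasets.load_breast_cancer()

# A Bunch behaves like a dictionary whose keys are also available as attributes.
print(type(cancer).__name__)       # Bunch
print(cancer.data.shape)           # (569, 30): 569 samples, 30 input features
print(cancer.target.shape)         # (569,)
print(len(cancer.feature_names))   # 30
print(cancer.target_names)         # ['malignant' 'benign'], so 0 = malignant, 1 = benign

# Attribute access and key access return the same underlying array.
same_object = cancer.data is cancer["data"]
print(same_object)                 # True
```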
Converting the Sklearn Dataset to Pandas Dataframe
cancer.keys()
We need to make sure that all the columns are added to our dataframe.
Therefore, we concatenate the input data and the target data using the numpy library.
import pandas as pd
import numpy as np
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']], columns = np.append(cancer['feature_names'], ['target']))
df.head()
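As an alternative to concatenating the arrays by hand, recent scikit-learn versions (0.23 or newer, which is an assumption about your environment) can return the dataset as a dataframe directly. The names cancer_frame and df_alt below are illustrative only. One difference worth noting: np.c_ casts the target to float, which is why the reports later show classes 0.0 and 1.0, while this route keeps the target as integers.

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects instead of numpy arrays.
cancer_frame = load_breast_cancer(as_frame=True)

# .frame holds the 30 feature columns plus a final "target" column.
df_alt = cancer_frame.frame
print(df_alt.shape)
print(df_alt["target"].dtype)
```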
Machine Learning and Predicting
We load a support vector machine (SVM) model to perform binary classification.
from sklearn import svm
clf = svm.SVC(gamma = 0.001, C = 100)
We select X as the set of input variable columns and y as the target variable column.
X = df.iloc[:, 0:-1]
y = df.iloc[:, -1]
Train and Test Data Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state = 42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
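The classes in this dataset are not perfectly balanced, so it can be worth checking that the split preserves the class ratio. The sketch below uses the stratify parameter of train_test_split, which the split above does not use; it is shown as an optional variation, with illustrative variable names (X_demo, X_tr, and so on).

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X_demo, y_demo = load_breast_cancer(return_X_y=True)

# stratify=y_demo keeps the malignant/benign ratio roughly equal in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print(X_tr.shape, X_te.shape)
print(np.bincount(y_demo) / len(y_demo))  # class shares in the full data
print(np.bincount(y_tr) / len(y_tr))      # class shares in the training set
print(np.bincount(y_te) / len(y_te))      # class shares in the test set
```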
Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
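Note that the scaler is fitted on the training data only and then applied to the test data, so no information from the test set leaks into training. One way to make this harder to get wrong is to bundle the scaler and the classifier in a scikit-learn pipeline. The following is a sketch of that idea, not the approach used in this post; svm_pipeline and the other names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# fit() scales with statistics from X_tr only; score() reuses those statistics on X_te.
svm_pipeline = make_pipeline(StandardScaler(), SVC(gamma=0.001, C=100))
svm_pipeline.fit(X_tr, y_tr)

pipeline_accuracy = svm_pipeline.score(X_te, y_te)
print(pipeline_accuracy)
```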
Fit SVM Model to Train Data
model = clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
The first 10 predictions on the test set can be seen by:
predictions[0:10]
SVM Model Results
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
print(confusion_matrix(y_test,predictions))
print(classification_report(y_test,predictions))
print(accuracy_score(y_test, predictions))
Output:
[[41  2]
 [ 0 71]]
             precision    recall  f1-score   support
        0.0       1.00      0.95      0.98        43
        1.0       0.97      1.00      0.99        71
avg / total       0.98      0.98      0.98       114
0.9824561403508771
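To see where the numbers in the report come from, we can recompute them by hand from the SVM confusion matrix printed above. In scikit-learn's convention, rows are the true classes and columns are the predicted classes. This is a small worked example, with the matrix values copied from the output.

```python
import numpy as np

# Confusion matrix from the SVM output: rows = true class, columns = predicted class.
cm = np.array([[41, 2],
               [0, 71]])

# Precision: correct predictions of a class divided by all predictions of that class.
precision = np.diag(cm) / cm.sum(axis=0)
# Recall: correct predictions of a class divided by all true members of that class.
recall = np.diag(cm) / cm.sum(axis=1)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
# Accuracy: all correct predictions divided by all samples.
accuracy = np.trace(cm) / cm.sum()

print(np.round(precision, 2))  # [1.   0.97]
print(np.round(recall, 2))     # [0.95 1.  ]
print(np.round(f1, 2))         # [0.98 0.99]
print(accuracy)                # 112 / 114, about 0.9825
```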
Fit Random Forest Model to Train Data
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators = 1000, random_state = 42)
rf_model = rf.fit(X_train, y_train)
rf_predictions = rf_model.predict(X_test)
Random Forest Model Results
print(confusion_matrix(y_test, rf_predictions))
print(classification_report(y_test, rf_predictions))
print(accuracy_score(y_test, rf_predictions))
Output:
[[40  3]
 [ 1 70]]
             precision    recall  f1-score   support
        0.0       0.98      0.93      0.95        43
        1.0       0.96      0.99      0.97        71
avg / total       0.97      0.96      0.96       114
0.9649122807017544
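So, the SVM performs slightly better in this case: 98.2% versus 96.5% accuracy on the 114 test samples, a difference of two predictions. This does not necessarily mean that the SVM will always perform better; the ranking can change with a different split, dataset, or choice of hyperparameters.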
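Because a single train/test split can favor one model by chance, a comparison over several folds gives a steadier picture. The sketch below uses 5-fold cross-validation, which is not part of the original workflow. It assumes 100 trees for the random forest instead of 1000 to keep the run short, and the names candidates and cv_results are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)

candidates = {
    # The scaler sits inside the pipeline so it is refitted on each training fold.
    "svm": make_pipeline(StandardScaler(), SVC(gamma=0.001, C=100)),
    # Assumption: 100 trees (not the 1000 used above) to keep this sketch quick.
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

cv_results = {}
for name, estimator in candidates.items():
    # One accuracy score per fold; the spread hints at how stable the estimate is.
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```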
Saving an ML Model / Model Persistence
It is possible to save a model in scikit-learn by using Python's built-in persistence module, pickle:
import pickle
s = pickle.dumps(model)
model2 = pickle.loads(s)
predictions2 = model2.predict(X_test)
Let us test whether the reloaded SVM model gives the same results as the original one, to verify that it was saved correctly.
print(confusion_matrix(y_test, predictions2))
print(classification_report(y_test, predictions2))
print(accuracy_score(y_test, predictions2))
Output:
[[41  2]
 [ 0 71]]
             precision    recall  f1-score   support
        0.0       1.00      0.95      0.98        43
        1.0       0.97      1.00      0.99        71
avg / total       0.98      0.98      0.98       114
0.9824561403508771
The results are exactly the same. The model runs well and is saved correctly!
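A more direct way to check a pickle round trip is to compare the predictions of the original and the restored model element by element. The sketch below does this for a pipeline that contains both the scaler and the SVM, on the assumption that you want to reuse the model on new, unscaled data later; pickling only the classifier would leave the fitted scaler behind. The names model_demo and restored are illustrative.

```python
import pickle

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit scaler and classifier together so both are stored in one object.
model_demo = make_pipeline(StandardScaler(), SVC(gamma=0.001, C=100)).fit(X_tr, y_tr)

# Serialize to bytes and load back, as with pickle.dumps / pickle.loads above.
restored = pickle.loads(pickle.dumps(model_demo))

# The restored model should reproduce every prediction of the original.
round_trip_ok = np.array_equal(model_demo.predict(X_te), restored.predict(X_te))
print(round_trip_ok)  # True
```

Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.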