SVM — Breast Cancer — Start to Finish
A Complete Colab Notebook Using the Breast Cancer Data Set from UCI — AISeries — Episode #06
Hi! We are going to apply the SVM (Support Vector Machine) to the UCI breast cancer dataset that ships with scikit-learn: colab notebook link.
sklearn.datasets.load_breast_cancer
This is a copy of the University of California, Irvine (UCI) Machine Learning Repository dataset.
Let’s get started!
01#Step — Open your Google Colab and type this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
02#Step — Let’s download the Breast Cancer Dataset from Scikit-Learn:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.DESCR)

[Basically, this dataset has 569 instances with 30 numeric attributes, and the prediction task is to classify whether a given breast tumor is malignant or benign.]
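As a quick sanity check (my addition, not a step in the original notebook), you can confirm the class balance before training:

import numpy as np

# Count how many samples fall into each class (0 = malignant, 1 = benign)
counts = np.bincount(cancer.target).tolist()
print(dict(zip(cancer.target_names, counts)))
# Expected: {'malignant': 212, 'benign': 357}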
03#Step — Let’s see the dictionary keys available:
cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
04#Step — Let’s create a pandas DataFrame to work with:
df_feat = pd.DataFrame(data=cancer['data'], columns=cancer['feature_names'])
df_feat.head(2)
df_feat.info()
cancer.target_names

array(['malignant', 'benign'], dtype='<U9')
05#Step — Train Test Split:
from sklearn.model_selection import train_test_split

X = df_feat
y = cancer['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
06#Step — Let’s grab & train the support vector classifier model:
from sklearn.svm import SVC

model = SVC()
model.fit(X_train, y_train)  # Fit the model to the training data

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
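One aside (an assumption of mine, not a step in the original notebook): SVMs are sensitive to feature scale, so it is often worth trying the classifier inside a pipeline with StandardScaler:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean / unit variance before fitting the SVC
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X_train, y_train)
print(scaled_model.score(X_test, y_test))  # accuracy on the held-out test set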
07#Step — Prediction:
predictions = model.predict(X_test)
08#Step — Confusion Matrix & Classification Report:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, predictions))
print('\n')
print(classification_report(y_test, predictions))
[[ 56  10]
 [  3 102]]

              precision    recall  f1-score   support

           0       0.95      0.85      0.90        66
           1       0.91      0.97      0.94       105

    accuracy                           0.92       171
   macro avg       0.93      0.91      0.92       171
weighted avg       0.93      0.92      0.92       171
Let’s get the graph:
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(model, X_test, y_test, values_format='d', display_labels=['malignant', 'benign'])
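Note that plot_confusion_matrix was removed in scikit-learn 1.2. If you are on a newer version, the equivalent call is:

from sklearn.metrics import ConfusionMatrixDisplay

# Same confusion-matrix plot via the newer API (available since scikit-learn 1.0)
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, values_format='d', display_labels=['malignant', 'benign'])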
Analyzing the Confusion Matrix:
Of the 56 + 10 = 66 people with malignant cancer, 10 were misclassified (15%).
Of the 3 + 102 = 105 people with benign cancer, 3 were misclassified (3%).
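Both rates can be read straight off the confusion matrix; here is a minimal sketch (my addition) that computes them:

cm = confusion_matrix(y_test, predictions)
# Per-class error rate = off-diagonal count / row total
for i, name in enumerate(['malignant', 'benign']):
    total = cm[i].sum()
    wrong = total - cm[i, i]
    print(f'{name}: {wrong}/{total} misclassified ({wrong/total:.0%})')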
Let’s see if we can do better…
09#Step — Let’s use GridSearch & Train:
A grid search lets you find good hyperparameters, such as which C and gamma values to use; picking them by hand is usually tricky.
But luckily we can be a little lazy, try a bunch of combinations, and see what works best.
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
grid_model = GridSearchCV(SVC(), param_grid, verbose=3)
grid_model.fit(X_train, y_train)
grid_model.best_params_

{'C': 1, 'gamma': 0.0001}
A large C value gives you low bias and high variance in the model (weak regularization), and a small C the opposite.
Gamma controls the width of the RBF kernel: a large gamma means a narrow kernel, so each support vector only influences points close to it (it does not have a widespread influence).
That produces a very wiggly decision boundary, i.e. low bias and high variance (overfitting); a small gamma gives each support vector widespread influence and a smoother boundary, i.e. higher bias and lower variance.
So tuning C and gamma is all about the bias-variance tradeoff.
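A quick illustration of the gamma effect (mine, not from the original notebook), comparing train vs. test accuracy for a small and a large gamma:

# Small gamma -> smooth boundary; large gamma -> memorizes the training set
for g in [0.0001, 1]:
    svc = SVC(C=1, gamma=g).fit(X_train, y_train)
    print(f'gamma={g}: train={svc.score(X_train, y_train):.2f}, test={svc.score(X_test, y_test):.2f}')
# Expect the gamma=1 model to score near 1.00 on train but much worse on test.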
grid_model.best_estimator_

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
10#Step — Predictions and Confusion Matrix & Classification Report:
grid_predictions = grid_model.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print('\n')
print(classification_report(y_test, grid_predictions))

[[ 59   7]
 [  4 101]]

              precision    recall  f1-score   support

           0       0.94      0.89      0.91        66
           1       0.94      0.96      0.95       105

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171
Now, the graph:
plot_confusion_matrix(grid_model, X_test, y_test, values_format='d', display_labels=['malignant', 'benign'])
Analyzing the Confusion Matrix:
Of the 59 + 7 = 66 people with malignant cancer, 7 were misclassified (11%).
Of the 4 + 101 = 105 people with benign cancer, 4 were misclassified (4%).
OK! That’s all!
I hope you enjoyed that lecture.
If you find this post helpful, please click the applause button and subscribe to the page for more articles like this one.
Until next time!
I wish you an excellent day!
Download The File For This Project
Credits & References
Based on: Python for Data Science and Machine Learning Bootcamp by Jose Portilla
sklearn.datasets.load_breast_cancer — The breast cancer dataset is a classic and very easy binary classification dataset. Download: scikit-learn page
Related Posts
00#Episode — AISeries — ML — Machine Learning Intro — What Is It and How It Evolves Over Time?
01#Episode — AISeries — Huawei ML FAQ — How do I get an HCIA certificate?
02#Episode — AISeries — Huawei ML FAQ Again — More annotation from Huawei Mock Exam
03#Episode — AISeries — AI In Graphics — Getting Intuition About Complex Math & More
04#Episode — AISeries — Huawei ML FAQ — Advanced — Even More annotation from Huawei Mock Exam
05#Episode — AISeries — SVM — Credit Card — Start to Finish — A Complete Colab Notebook Using the Default of Credit Card Clients Data Set from UCI
06#Episode — AISeries — SVM — Breast Cancer — Start to Finish — A Complete Colab Notebook Using the Breast Cancer Data Set from UCI (this one)
07#Episode — AISeries — SVM — Cupcakes or Muffins? — Start to Finish — Based on Alice Zhao post
Pros & Cons of SVMPros
- Good at dealing with high dimensional data;
- Works well on small data sets.Cons
- Picking the right kernel and paramenters can be computationally intensive.
Classification Techniques
SVM is one of many classification techniques:
- Logistic Regression
- K Nearest Neighbors
- Decision Tree
- Naive Bayes
- Neural Networks
Advice: try multiple techniques on your data set, as in the sketch below.
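Following that advice, here is a minimal sketch (my addition) that cross-validates a few of these techniques on the same data:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validated accuracy for each candidate classifier
models = [SVC(), LogisticRegression(max_iter=10000), KNeighborsClassifier(), DecisionTreeClassifier(), GaussianNB()]
for clf in models:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f'{clf.__class__.__name__}: {scores.mean():.3f}')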
Take the road less traveled!