SVM — Breast Cancer — Start to Finish
A Complete Colab Notebook Using the Breast Cancer Data Set from UCI — AISeries — Episode #06
Hi! We are going to apply the SVM (Support Vector Machine) to the UCI breast cancer dataset that ships with scikit-learn: colab notebook link.
sklearn.datasets.load_breast_cancer
This is a copy of the University of California, Irvine (UCI) Machine Learning Repository dataset.
Let’s get started!
01#Step — Open your Google Colab and type this:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
02#Step — Let’s download the Breast Cancer Dataset from Scikit-Learn:
from sklearn.datasets import load_breast_cancer

cancer = load_breast_cancer()
print(cancer.DESCR)

[Basically, this dataset has 569 instances with 30 numeric attributes, and the prediction task is to classify whether a given breast tumor is malignant or benign.]
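As a quick sanity check (my addition, not a step in the original notebook), you can confirm the class balance before training:

import numpy as np

# Count how many samples fall into each class (0 = malignant, 1 = benign)
counts = np.bincount(cancer.target).tolist()
print(dict(zip(cancer.target_names, counts)))
# Expected: {'malignant': 212, 'benign': 357}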
03#Step — Let’s see the dictionary keys available:
cancer.keys()

dict_keys(['data', 'target', 'target_names', 'DESCR', 'feature_names', 'filename'])
04#Step — Let’s create a pandas DataFrame to work with:
df_feat = pd.DataFrame(data=cancer['data'], columns=cancer['feature_names'])
df_feat.head(2)
df_feat.info()
cancer.target_names

array(['malignant', 'benign'], dtype='<U9')
05#Step — Train Test Split:
from sklearn.model_selection import train_test_split

X = df_feat
y = cancer['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
06#Step — Let’s grab & train the support vector classifier model:
from sklearn.svm import SVC

model = SVC()
model.fit(X_train, y_train)  # Fit the model to the training data

SVC(C=1.0, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma='scale', kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
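One aside (an assumption of mine, not a step in the original notebook): SVMs are sensitive to feature scale, so it is often worth trying the classifier inside a pipeline with StandardScaler:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean / unit variance before fitting the SVC
scaled_model = make_pipeline(StandardScaler(), SVC())
scaled_model.fit(X_train, y_train)
print(scaled_model.score(X_test, y_test))  # accuracy on the held-out test set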
07#Step — Prediction:
predictions = model.predict(X_test)
08#Step — Confusion Matrix & Classification Report:
from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, predictions))
print('\n')
print(classification_report(y_test, predictions))
[[ 56  10]
 [  3 102]]

              precision    recall  f1-score   support

           0       0.95      0.85      0.90        66
           1       0.91      0.97      0.94       105

    accuracy                           0.92       171
   macro avg       0.93      0.91      0.92       171
weighted avg       0.93      0.92      0.92       171
Let’s get the graph:
from sklearn.metrics import plot_confusion_matrix

plot_confusion_matrix(model, X_test, y_test, values_format='d', display_labels=['malignant', 'benign'])
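Note that plot_confusion_matrix was removed in scikit-learn 1.2. If you are on a newer version, the equivalent call is:

from sklearn.metrics import ConfusionMatrixDisplay

# Same confusion-matrix plot via the newer API (available since scikit-learn 1.0)
ConfusionMatrixDisplay.from_estimator(model, X_test, y_test, values_format='d', display_labels=['malignant', 'benign'])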
Analyzing the Confusion Matrix:
Of the 56 + 10 = 66 people with malignant cancer, 10 were misclassified (15%).
Of the 3 + 102 = 105 people with benign cancer, 3 were misclassified (3%).
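Both rates can be read straight off the confusion matrix; here is a minimal sketch (my addition) that computes them:

cm = confusion_matrix(y_test, predictions)
# Per-class error rate = off-diagonal count / row total
for i, name in enumerate(['malignant', 'benign']):
    total = cm[i].sum()
    wrong = total - cm[i, i]
    print(f'{name}: {wrong}/{total} misclassified ({wrong/total:.0%})')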
Let’s see if we can do better…
09#Step — Let’s use GridSearch & Train:
A grid search lets you find good hyperparameters, such as which C and gamma values to use; picking them by hand is usually tricky.
But luckily we can be a little lazy, try a bunch of combinations, and see what works best.
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100, 1000], 'gamma': [1, 0.1, 0.01, 0.001, 0.0001]}
grid_model = GridSearchCV(SVC(), param_grid, verbose=3)
grid_model.fit(X_train, y_train)
grid_model.best_params_

{'C': 1, 'gamma': 0.0001}
A large C value gives you low bias and high variance in the model (weak regularization), and a small C the opposite.
Gamma controls the width of the RBF kernel: a large gamma means a narrow kernel, so each support vector only influences points close to it (it does not have a widespread influence).
That produces a very wiggly decision boundary, i.e. low bias and high variance (overfitting); a small gamma gives each support vector widespread influence and a smoother boundary, i.e. higher bias and lower variance.
So tuning C and gamma is all about the bias-variance tradeoff.
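A quick illustration of the gamma effect (mine, not from the original notebook), comparing train vs. test accuracy for a small and a large gamma:

# Small gamma -> smooth boundary; large gamma -> memorizes the training set
for g in [0.0001, 1]:
    svc = SVC(C=1, gamma=g).fit(X_train, y_train)
    print(f'gamma={g}: train={svc.score(X_train, y_train):.2f}, test={svc.score(X_test, y_test):.2f}')
# Expect the gamma=1 model to score near 1.00 on train but much worse on test.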
grid_model.best_estimator_

SVC(C=1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0, decision_function_shape='ovr', degree=3, gamma=0.0001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
10#Step — Predictions and Confusion Matrix & Classification Report:
grid_predictions = grid_model.predict(X_test)
print(confusion_matrix(y_test, grid_predictions))
print('\n')
print(classification_report(y_test, grid_predictions))

[[ 59   7]
 [  4 101]]

              precision    recall  f1-score   support

           0       0.94      0.89      0.91        66
           1       0.94      0.96      0.95       105

    accuracy                           0.94       171
   macro avg       0.94      0.93      0.93       171
weighted avg       0.94      0.94      0.94       171
Now, the graph:
plot_confusion_matrix(grid_model, X_test, y_test, values_format='d', display_labels=['malignant', 'benign'])
Analyzing the Confusion Matrix:
Of the 59 + 7 = 66 people with malignant cancer, 7 were misclassified (11%).
Of the 4 + 101 = 105 people with benign cancer, 4 were misclassified (4%).
OK! That’s all!
I hope you enjoyed that lecture.
If you find this post helpful, please click the applause button and subscribe to the page for more articles like this one.
Until next time!
I wish you an excellent day!
Download The File For This Project
Credits & References
Based on: Python for Data Science and Machine Learning Bootcamp by Jose Portilla
sklearn.datasets.load_breast_cancer — The breast cancer dataset is a classic and very easy binary classification dataset. Download: scikit-learn page
Related Posts
00#Episode — AISeries — ML — Machine Learning Intro — What Is It and How It Evolves Over Time?
01#Episode — AISeries — Huawei ML FAQ — How do I get an HCIA certificate?
02#Episode — AISeries — Huawei ML FAQ Again — More annotation from Huawei Mock Exam
03#Episode — AISeries — AI In Graphics — Getting Intuition About Complex Math & More
04#Episode — AISeries — Huawei ML FAQ — Advanced — Even More annotation from Huawei Mock Exam
05#Episode — AISeries — SVM — Credit Card — Start to Finish — A Complete Colab Notebook Using the Default of Credit Card Clients Data Set from UCI
06#Episode — AISeries — SVM — Breast Cancer — Start to Finish — A Complete Colab Notebook Using the Breast Cancer Data Set from UCI (this one)
07#Episode — AISeries — SVM — Cupcakes or Muffins? — Start to Finish — Based on Alice Zhao post
Pros & Cons of SVMPros
- Good at dealing with high dimensional data;
- Works well on small data sets.Cons
- Picking the right kernel and paramenters can be computationally intensive.
Classification Techniques
SVM is one of many classification techniques:
- Logistic Regression
- K Nearest Neighbors
- Decision Tree
- Naive Bayes
- Neural Networks
Advice: try multiple techniques on your data set, as in the sketch below.
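Following that advice, here is a minimal sketch (my addition) that cross-validates a few of these techniques on the same data:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB

# 5-fold cross-validated accuracy for each candidate classifier
models = [SVC(), LogisticRegression(max_iter=10000), KNeighborsClassifier(), DecisionTreeClassifier(), GaussianNB()]
for clf in models:
    scores = cross_val_score(clf, X, y, cv=5)
    print(f'{clf.__class__.__name__}: {scores.mean():.3f}')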
Take the road less traveled!