Breast Cancer Classification: A Simple Guide with Scikit-Learn and Support Vector Machine (SVM)

Empowering Breast Cancer Diagnosis — A Practical Walkthrough Using Scikit-Learn and SVM

Simran Kaushik
9 min read · Nov 24, 2023

Breast cancer, a prevalent form of cancer primarily affecting women, emerges from the abnormal growth of cells within the breast tissue. Statistically, it is a widespread health concern, with millions of new cases reported globally each year. Early detection through routine screenings significantly improves the chances of successful treatment and recovery.

In breast cancer classification, tumors are categorized as either malignant or benign. Malignant tumors are cancerous: they can invade surrounding tissue and spread to other parts of the body (metastasis), and they are associated with higher morbidity. Because breast cancer ranks among the most frequently diagnosed cancers worldwide, accurately distinguishing malignant from benign tumors is critical.

Benign tumors, while not cancerous, are also prevalent and contribute to the spectrum of breast abnormalities. A nuanced understanding of these distinctions is vital for effective healthcare decision-making. Leveraging statistical insights and data-driven approaches in breast cancer diagnosis guides medical professionals in tailoring treatment plans, ensuring optimal outcomes for patients. The convergence of medical expertise and statistical analysis plays a pivotal role in navigating the complexities of breast cancer management, contributing to advancements in personalized and targeted therapies.

Support Vector Machine (SVM)

Support Vector Machine (SVM) is a machine learning algorithm used for classification and regression tasks. It works by finding the optimal hyperplane that best separates different classes in a high-dimensional space. The goal is to maximize the margin between classes, with data points closest to the hyperplane called support vectors. SVM is effective in handling complex datasets and is widely used in various applications, including image recognition, text classification, and bioinformatics.

Mathematics of SVM

Support Vector Machines (SVM) operate on the principle of finding the optimal hyperplane to separate different classes in a dataset. The key mathematical concept involves transforming the input data into a higher-dimensional space, where a hyperplane can be defined to maximize the margin between classes. The margin is the distance between the hyperplane and the nearest data points of each class. SVM aims to find the hyperplane that maximizes this margin, making it more robust to new, unseen data. The optimization process involves solving a mathematical problem to identify the support vectors, which are the data points crucial for determining the hyperplane. By effectively classifying data points based on their position in relation to this hyperplane, SVM achieves accurate and reliable predictions in classification tasks.
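To make the objective concrete, here is the standard hard-margin formulation (a textbook sketch; the notation below is the usual one, not taken from this article). Given training points x_i with labels y_i ∈ {−1, +1}, SVM finds the weight vector w and bias b by solving:

\min_{w,\,b} \; \frac{1}{2}\lVert w \rVert^2 \quad \text{subject to} \quad y_i \left( w^\top x_i + b \right) \ge 1 \quad \text{for all } i

The distance between the two margin boundaries works out to 2/‖w‖, so minimizing ‖w‖ is exactly maximizing the margin; the training points for which the constraint holds with equality are the support vectors.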

Practical Example in Daily Life

A practical example of Support Vector Machines (SVM) can be found in the field of medical diagnosis, particularly in distinguishing between patients with and without a particular medical condition based on various features.

Imagine a scenario where SVM is applied to classify whether a patient has a specific disease, like diabetes, using features such as blood glucose levels, BMI, age, and blood pressure. In the training phase, the SVM algorithm learns to create a hyperplane that effectively separates patients with diabetes from those without, considering the relationships and patterns within the given feature space.

Once trained, the SVM model can be used to evaluate new patient data. By examining the position of the patient’s features in relation to the learned hyperplane, the system can predict whether the individual is likely to have diabetes or not. This application of SVM demonstrates its effectiveness in medical decision-making, where accurate classification is crucial for timely and appropriate patient care.
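As a minimal sketch of this idea (the patients, feature values, and labels below are synthetic, invented purely for illustration):

import numpy as np
from sklearn.svm import SVC

# Hypothetical training data: [glucose (mg/dL), BMI, age, blood pressure]
X_train = np.array([
    [148, 33.6, 50, 72],
    [85,  26.6, 31, 66],
    [183, 23.3, 32, 64],
    [89,  28.1, 21, 66],
    [137, 43.1, 33, 40],
    [116, 25.6, 30, 74],
])
y_train = np.array([1, 0, 1, 0, 1, 0])   # 1 = diabetic, 0 = non-diabetic

model = SVC()
model.fit(X_train, y_train)

# Classify a new patient by their position relative to the learned boundary
new_patient = np.array([[160, 35.0, 45, 80]])
print(model.predict(new_patient))        # predicted class for the new patient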

Advantages and Disadvantages of Support Vector Machine

Advantages

  1. Effective in High-Dimensional Spaces: SVM performs well in situations where the number of dimensions (features) is high, making it suitable for applications such as image recognition and text classification. Its ability to find a hyperplane in a high-dimensional space allows it to handle complex datasets effectively.
  2. Robust to Overfitting: SVM is less prone to overfitting, especially in high-dimensional spaces. The margin maximization concept helps generalize well to new, unseen data. This is particularly beneficial when working with limited datasets or when dealing with noisy data.
  3. Versatility with Kernels: SVM can handle non-linear decision boundaries through the use of kernel functions. By transforming the input data into a higher-dimensional space, SVM can effectively deal with complex relationships in the data, providing flexibility in capturing intricate patterns.

Disadvantages

  1. Sensitivity to Parameter Tuning: SVM performance is sensitive to the choice of hyperparameters, including the kernel type, the cost parameter (C), and kernel settings such as gamma. Selecting appropriate values can be challenging and usually calls for systematic search, such as cross-validated grid search (see the sketch just after this list).
  2. Computational Intensity: Training an SVM model can be computationally intensive, especially when dealing with large datasets. As the size of the dataset increases, the training time and memory requirements also grow, which can be a limitation in applications where real-time processing is essential.
  3. Limited Interpretability: The resulting hyperplane from SVM may be difficult to interpret, particularly in high-dimensional spaces. While SVM provides an effective means of classification, understanding the decision-making process in terms of the original features can be challenging, limiting the model’s interpretability in some cases.
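To illustrate the first point, here is a minimal cross-validated grid search over the kernel, C, and gamma (the candidate values are arbitrary choices for demonstration, not tuned recommendations):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Candidate values picked arbitrarily for illustration
param_grid = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 0.001],
}

grid = GridSearchCV(SVC(), param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)   # the combination with the best cross-validated accuracy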

Different types of kernel used in SVM

Support Vector Machines (SVM) use kernels to transform input data into a higher-dimensional space, allowing the algorithm to find nonlinear decision boundaries. Here are some common types of kernels used in SVM:

  1. Linear Kernel: The linear kernel is the simplest: it is just the dot product of the input features in the original space. While it is suitable for linearly separable data, its real strength lies in its computational efficiency, making it a good choice for large datasets.
  2. Polynomial Kernel: The polynomial kernel extends the linear kernel with polynomial terms, allowing the SVM to capture nonlinear relationships. It introduces a user-defined degree parameter that sets the order of the polynomial.
  3. Radial Basis Function (RBF) or Gaussian Kernel: The RBF kernel is commonly used for handling nonlinear data. It implicitly maps data into an infinite-dimensional space and is characterized by the parameter gamma, which controls how far the influence of a single training point reaches. It is a popular choice due to its flexibility in capturing complex patterns.
  4. Sigmoid Kernel: The sigmoid kernel is based on the hyperbolic tangent function. It is suitable for data that may not be linearly separable and is often used in neural network applications.

Choosing the appropriate kernel depends on the characteristics of the data and the problem at hand, and it often involves some experimentation to determine the best-performing kernel for a specific task.

In many libraries, including Scikit-Learn, the Support Vector Machine classifier uses the Radial Basis Function (RBF) kernel, also known as the Gaussian kernel, by default. The RBF kernel is versatile and can effectively handle a wide range of data distributions, making it a suitable starting point for various classification tasks.
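In Scikit-Learn, the kernel is chosen via the kernel argument of SVC (the parameter values below are illustrative placeholders, not tuned settings):

from sklearn.svm import SVC

svc_linear  = SVC(kernel='linear')              # dot product in the original feature space
svc_poly    = SVC(kernel='poly', degree=3)      # degree sets the order of the polynomial
svc_rbf     = SVC(kernel='rbf', gamma='scale')  # the default kernel; gamma sets its reach
svc_sigmoid = SVC(kernel='sigmoid')             # based on the hyperbolic tangent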

Libraries Utilized in This Project

For this project, the following libraries will be used to facilitate various aspects of our work (tested versions in parentheses):

  1. NumPy: For numerical operations and array handling (1.23.5).
  2. Pandas: To manipulate and analyze structured data efficiently (1.5.3).
  3. Matplotlib: For creating visualizations and plots (3.3.4).
  4. Seaborn: To enhance the aesthetics of our visualizations built on top of Matplotlib (0.11.1).
  5. Scikit-learn: A comprehensive machine learning library for model building and evaluation (1.2.2).
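If any of these are not yet installed, a terminal command pinned to the versions above should work (adjust the versions to your environment as needed):

pip install numpy==1.23.5 pandas==1.5.3 matplotlib==3.3.4 seaborn==0.11.1 scikit-learn==1.2.2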

Breast Cancer Classification using Support Vector Machine (SVM)

This tutorial walks through the nuanced process of classifying breast cancer with SVM, following a carefully planned sequence of steps:

  • Data Collection
  • Exploratory Data Analysis (EDA)
  • Model Selection
  • Model Training
  • Model Prediction
  • Model Evaluation

Data Collection

To demonstrate SVM, the dataset has been taken from scikit-learn, a versatile machine learning library in Python that offers a collection of “toy datasets” serving as invaluable resources for learning, testing, and prototyping machine learning algorithms.

In this tutorial, the Breast Cancer Wisconsin dataset is used, which can be loaded directly with scikit-learn.

Step 1: Import the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Jupyter magic to render plots inline in the notebook
%matplotlib inline

Step 2: Load the dataset

from sklearn.datasets import load_breast_cancer
cancer = load_breast_cancer()

The loaded data is a Bunch object, a dictionary-like container comprising eight keys, each with its corresponding set of values. This organization gives a systematic representation of the information: each key serves as a unique identifier, making specific details of the dataset easy to navigate and retrieve.

To get the keys present in the dataset, use the .keys() method.

cancer.keys()
# In scikit-learn 1.2.2 this returns:
# dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR',
#            'feature_names', 'filename', 'data_module'])

To have a look at the target classes, the feature names, and the shape of the data, enter the following commands.

print(cancer['target_names'])    # ['malignant' 'benign']
print(cancer['feature_names'])   # the 30 feature names: 'mean radius', 'mean texture', ...
print(cancer['data'].shape)      # (569, 30)

Step 3: Creation of the dataset in the required format

df_cancer = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                         columns = np.append(cancer['feature_names'], ['target']))

Since the dataframe now includes the target variable as an additional column, its shape changes. Use the .shape attribute (note: an attribute, not a method) to check the size of the dataframe; it is now (569, 31).

Step 4: Use the .head() and .tail() functions to have a glimpse of the dataset

The .head() method returns the first 5 rows and .tail() returns the last 5 rows. To look at more rows, pass the number explicitly: for example, df_cancer.head(10) returns the first 10 rows of the dataframe.
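To produce the preview shown below, run:

df_cancer.head()   # displays the first 5 rows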

The head of the dataframe

Exploratory Data Analysis (EDA)

EDA is a crucial initial phase where practitioners examine and visualize the dataset to understand its key characteristics and patterns. EDA involves summarizing the main features of the data, identifying potential outliers, and uncovering relationships between variables. By visualizing distributions, correlations, and trends, EDA informs subsequent steps in the machine learning pipeline, helping to select appropriate models and preprocessing techniques. Ultimately, EDA acts as a compass, guiding data scientists in making informed decisions and ensuring a deeper comprehension of the data before diving into more complex machine learning tasks.

Let’s start by plotting a pairplot using seaborn to get a comprehensive overview of the relationships across multiple features.

sns.pairplot(df_cancer, hue = 'target', vars = ['mean radius', 'mean texture', 'mean area', 'mean perimeter', 'mean smoothness'])

Additionally, let’s delve deeper into the data using a countplot, a scatterplot, and a heatmap to better understand the relationships.

sns.countplot(x = 'target', data = df_cancer)   # class balance: 0 = malignant, 1 = benign
plt.show()
sns.scatterplot(x = 'mean area', y = 'mean smoothness', hue = 'target', data = df_cancer)
plt.show()
plt.figure(figsize=(20,10))                     # larger canvas for the 31x31 correlation matrix
sns.heatmap(df_cancer.corr(), annot=True)       # pairwise feature correlations

Model Selection

Support Vector Machines (SVM) are often preferred for breast cancer classification due to their effectiveness in handling complex, high-dimensional datasets. In the context of breast cancer, where the number of features can be substantial, SVM excels in finding an optimal hyperplane to separate malignant and benign cases. The ability of SVM to handle non-linear relationships through kernel functions is crucial when dealing with intricate patterns in cancer data. Moreover, SVM’s robustness and ability to generalize well make it a suitable choice for medical applications, where accurate classification is paramount. The versatility of SVM in capturing subtle distinctions in data, coupled with its strong performance on diverse datasets, makes it a favored algorithm in breast cancer classification scenarios.

Model Training

In the context of breast cancer classification using SVM, the model training process involves utilizing a portion of the dataset to train the model on the relationships between the input features and the target variable.

Step 1: Get the values of independent and dependent variables

X = df_cancer.drop(['target'], axis=1)   # the 30 input features
y = df_cancer['target']                  # labels: 0 = malignant, 1 = benign

Step 2: Splitting the Data

from sklearn.model_selection import train_test_split

# Hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state=5)

Step 3: Importing and Training the Model

from sklearn.svm import SVC

svc_model = SVC()   # default hyperparameters: RBF kernel, C=1.0, gamma='scale'
svc_model.fit(X_train, y_train)

Model Prediction

Once the model is trained, use it to classify the samples in the testing set.

y_predict = svc_model.predict(X_test)

Model Evaluation

Evaluate the model’s performance using appropriate measures, such as accuracy, precision, and recall, to assess how well it classifies tumors as ‘malignant’ or ‘benign’ on the testing set.

from sklearn.metrics import classification_report, confusion_matrix

cm = confusion_matrix(y_test, y_predict)
sns.heatmap(cm, annot=True)                       # visualize true vs. predicted classes
print(classification_report(y_test, y_predict))   # precision, recall, and f1-score per class

🥳 Congratulations 🥳, the Breast Cancer Classification project is now ready!

Thank you for exploring this tutorial! If you found it helpful, please consider liking, sharing, and subscribing for more blogs in the future. Stay tuned for additional insights and guides! For updates, you can also follow me on LinkedIn.

