ML Series: Day 14 — Logistic Regression (Part 3 — implementation)

Ebrahim Mousavi
13 min read · May 9, 2024

In this blog post, we will explore the implementation of logistic regression using Python and the scikit-learn library (sklearn).

Notice:

For the implementation of logistic regression for the ‘breast cancer wisconsin’ dataset, you can find the complete code in my GitHub repository. Please visit the repository to explore the code and refer to the detailed description provided.

First, we load the ‘breast cancer wisconsin’ dataset into a pandas DataFrame using the `read_csv` function from the pandas library. The dataset is stored in a CSV file named ‘breast-cancer-wisconsin.csv’. To examine the first 5 rows of the DataFrame and get a glimpse of the dataset, we use the `head(5)` function.

import pandas as pd

df = pd.read_csv('Datasets/classification/breast-cancer-wisconsin.csv')
df = df.iloc[:, 1:]  # drop the first column (id), which is not a predictive feature
df.head(5)
Fig 1. Examine the first 5 rows of the DataFrame

Exploratory Data Analysis

In this section, we will perform an exploratory data analysis on the breast cancer Wisconsin dataset. This will involve examining the data for any null values and performing necessary data preprocessing steps.

1. Looking for null values

To ensure the integrity of our data, we will first investigate the presence of any null values. By carefully examining the dataset, we can identify missing values and decide on appropriate strategies for handling them.

df.isnull().sum()

Calling the isnull() function on the DataFrame generates a boolean mask where True indicates a null value and False a non-null value. The sum() function then adds up the True values in each column, giving the count of null values per column.

The DataFrame `df` is free from any null values, ensuring that the dataset is complete and does not require any further handling or imputation for missing values.
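No action is needed in our case, but if the check had revealed missing values, a common strategy is imputation, i.e., filling the gaps with a statistic of the column. A minimal sketch using scikit-learn's SimpleImputer, applied only to the numeric columns since 'diagnosis' is still categorical at this point:

from sklearn.impute import SimpleImputer

# Replace each missing value with the mean of its column
# (strategy could also be 'median' or 'most_frequent')
numeric_cols = df.select_dtypes(include='number').columns
imputer = SimpleImputer(strategy='mean')
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])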

2. Data Preprocessing

After identifying any null values, we will proceed with data preprocessing. The ‘diagnosis’ feature in the dataset is categorical, and it is necessary to convert it into numerical format to enable compatibility with various machine learning algorithms. By converting it to numerical values, we can perform mathematical and statistical operations on the feature.

# Replace 'M' (malignant) with 1 and 'B' (benign) with 0
print("Malignant=1, Benign=0")
df["diagnosis"] = df["diagnosis"].map(lambda row: 1 if row == 'M' else 0)
df.head()

The provided code accomplishes this conversion by replacing ‘M’ with 1 for malignant cases and ‘B’ with 0 for benign cases.

3. Unique Values

To gain insights into the dataset, we will examine the unique values present in different columns. This step allows us to understand the distinct categories or levels within each feature and identify any potential data inconsistencies.

print("The unique number of data values are")
df.nunique()

4. Data Spread between the two types (Malignant, Benign)

We will assess the distribution of data between malignant and benign cases in the ‘diagnosis’ column. By analyzing the count or proportion of each type, we can gain an understanding of the class imbalance and determine if any further steps, such as data balancing techniques, are required.

import seaborn as sns
import matplotlib.pyplot as plt

diagnosis_counts = df["diagnosis"].value_counts()
mean_diagnosis = df["diagnosis"].mean()
total_data_points = len(df)

print("Total number of data points =", total_data_points)
print("Malignant (diagnosis = 1) = {:.2f}%".format(mean_diagnosis * 100))
print("Benign (diagnosis = 0) = {:.2f}%".format((1 - mean_diagnosis) * 100))

sns.countplot(data=df, x="diagnosis")
plt.ylabel("Number of data points")
plt.title("Malignant (1) vs Benign Data(0) points")
plt.show()

First, the value counts of each unique diagnosis (malignant or benign) are calculated using df["diagnosis"].value_counts(). This provides the count of data points for each diagnosis category.

The mean of the “diagnosis” column is computed using df["diagnosis"].mean(), which represents the proportion of malignant cases in the dataset.

Then, a countplot is generated using sns.countplot() to visualize the distribution of data points for each diagnosis category.

Fig 2. Malignant (1) vs Benign (0)
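For this dataset, the split comes out to roughly 37% malignant and 63% benign, a mild imbalance that logistic regression usually tolerates without resampling. If the skew were more severe, one simple option (not used in this post) is to weight the classes inversely to their frequency:

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights each class by n_samples / (n_classes * class_count),
# so mistakes on the minority class are penalized more heavily
clf_weighted = LogisticRegression(class_weight='balanced')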

5. Feature Selection and dimensionality reduction

Feature selection and dimensionality reduction are two common techniques used in machine learning and data analysis.

Feature selection involves selecting a subset of relevant features from the original set of features. It aims to eliminate irrelevant or redundant features, which can improve model performance, reduce overfitting, and enhance interpretability.

Dimensionality reduction, on the other hand, focuses on transforming the original high-dimensional dataset into a lower-dimensional space while retaining the most important information. This can help address the curse of dimensionality, reduce computational complexity, and visualize data in a more manageable manner.

In a real project, either feature selection or dimensionality reduction can be chosen based on the specific goals and requirements. Both techniques contribute to improving model performance and reducing complexity, but the selection depends on the nature of the dataset and the problem at hand.

5–1. Feature Selection

In order to identify relevant features for our analysis, we will utilize a correlation matrix. This matrix provides a measure of the linear relationship between each pair of features. By examining the correlations, we can identify highly correlated features and potentially select a subset of features that are most informative for our logistic regression model.

corr = df.corr()
corr[['diagnosis']].abs().sort_values(by='diagnosis', ascending=False)

The variable `corr` stores the correlation matrix computed using `df.corr()`. This matrix contains the correlation coefficients between all pairs of columns in the DataFrame.

By selecting the column ‘diagnosis’ from the correlation matrix using `corr[['diagnosis']]`, and applying the `abs()` function to calculate the absolute values of the correlations, we obtain a DataFrame with the absolute correlation values between the “diagnosis” column and the other features.

The `sort_values()` function is used to sort the DataFrame in descending order based on the absolute correlation values of the “diagnosis” column.

Fig 3. Descending order based on the absolute correlation values of the “diagnosis” column

Feature Selection and Correlation Matrix Visualization for Breast Cancer Diagnosis

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectKBest, f_regression

# Separate the features (X) and the target variable (y)
X = df.drop('diagnosis', axis=1)
y = df['diagnosis']

# Perform feature selection
'''
`SelectKBest` method is used for feature selection.
It selects the top k features based on a scoring function
'''
selector = SelectKBest(score_func=f_regression, k=3)
selected_features = selector.fit_transform(X, y)

# Get the indices of the selected features
selected_indices = selector.get_support(indices=True)

# Get the names of the selected features
selected_feature_names = X.columns[selected_indices]

# Print the selected feature names
print("Selected Features:")
print(selected_feature_names) # 'concave points_mean', 'perimeter_worst', 'concave points_worst'

# Create a DataFrame with the selected features
selected_df = X[selected_feature_names]

# Calculate the correlation matrix between selected features and target
correlation_matrix = selected_df.join(y).corr()
new_df = selected_df.join(y)

# mask the upper triangle of the correlation matrix, as it is redundant.
mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))

# Create a heatmap to visualize the correlation matrix
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', mask=mask)
plt.title('Correlation Matrix between Selected Features and Target')
plt.show()
Fig 4. Correlation matrix between the selected features and the target variable.

For more information:

mask = np.triu(np.ones_like(correlation_matrix, dtype=bool))
print(mask)
Fig 5. Upper triangular matrix
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', mask=mask)

This is the documentation of seaborn library for plotting heatmap [Link]

mask: bool array or DataFrame, optional

If passed, data will not be shown in cells where mask is True.

This is the new DataFrame that we obtain after the feature selection stage:

new_df.head()
Fig 6. Examine the first 5 rows of the new DataFrame

5–2. Dimensionality reduction (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms high-dimensional data into a lower-dimensional space while preserving the most significant patterns and variability in the data.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = df.drop(['diagnosis'], axis=1)
y = df['diagnosis']

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca_scaled = pca.fit_transform(X_scaled)

plt.figure(figsize=(10, 8))
plt.scatter(X_pca_scaled[:, 0], X_pca_scaled[:, 1], c=df['diagnosis'], alpha=0.7, s=30, cmap='plasma')
plt.xlabel('PCA component 1')
plt.ylabel('PCA component 2')
plt.title('PCA projection of Breast Cancer Dataset')
plt.show()

First, the features (X) and the target variable (y) are separated from the DataFrame. To prepare the data for PCA, the features are scaled using the StandardScaler from scikit-learn. `StandardScaler().fit_transform(X)` scales the features to have zero mean and unit variance, resulting in X_scaled.

PCA is initialized with `PCA(n_components=2)`, indicating that we want to project the data onto two principal components. The `pca.fit_transform(X_scaled)` line fits the PCA model to the scaled features and performs the dimensionality reduction, resulting in X_pca_scaled, which contains the projected data.

Finally, a scatter plot is created to visualize the projected data. The x and y coordinates of the scatter plot are taken from `X_pca_scaled[:, 0]` and `X_pca_scaled[:, 1]`, respectively. The color of each data point is determined by the ‘diagnosis’ column using `c=df['diagnosis']`, with different colors representing different diagnosis categories. Additional parameters such as transparency (alpha), marker size (s), and color map (cmap) customize the plot appearance.

Fig 7. Visualization of the data in a two-dimensional space
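Since we keep only two components, it is worth checking how much of the total variance the projection actually retains. PCA exposes this through its explained_variance_ratio_ attribute; for this dataset the first two components typically retain roughly 60% of the variance:

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print("Variance retained by 2 components: {:.2f}%".format(
    pca.explained_variance_ratio_.sum() * 100))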

6. Select the two most important features

We calculate the correlation matrix (corr) for the DataFrame new_df and extract the top three rows based on the absolute correlation values for the 'diagnosis' column.

corr = new_df.corr()
top_three_corr = corr[['diagnosis']].abs().sort_values(by='diagnosis', ascending=False).head(3)

print(top_three_corr)
'''
diagnosis
diagnosis 1.000000
concave points_worst 0.793566
perimeter_worst 0.782914
'''
new_df.head()

Before removing the ‘concave points_mean’ feature:

Fig 8. Before removing ‘concave points_mean’ feature

Removing the referred feature:

new_df = new_df.drop(columns=['concave points_mean'])
new_df.head()

Some rows of the dataset:

Fig 9. After removing ‘concave points_mean’ feature

7. Rename the column names so that they are code-friendly

new_df.rename(columns={'concave points_worst': 'concave_points_worst'}, inplace=True)
new_df.head()

Some rows of the dataset after renaming:

Fig 10. After renaming

Data Normalization

Data normalization is a crucial preprocessing step in machine learning that transforms the numerical variables in a dataset to a common scale. This avoids bias towards variables with larger magnitudes and allows fair comparisons between different variables. We will explore two common methods: normalization and standardization.

1. Normalization

Normalization, also known as min-max scaling, rescales the data to a range between 0 and 1. This method is useful when the distribution of the data is unknown; note, however, that it is sensitive to outliers, since a single extreme value stretches the range. Here are two ways to implement normalization:

Method 1: Custom Function

import numpy as np

def normalization(data, train_data):
    min_value = np.min(train_data, axis=0)
    max_value = np.max(train_data, axis=0)
    normalized_data = (data - min_value) / (max_value - min_value)
    return normalized_data

# Assuming train_df and test_df are the respective train and test sets
normalized_train = normalization(train_df, train_df)
normalized_test = normalization(test_df, train_df)

Within the function, the minimum and maximum values are computed column-wise using np.min and np.max functions along the specified axis (axis=0), ensuring that the minimum and maximum values are obtained for each feature independently. The dataset data is then normalized by subtracting the minimum values and dividing by the range (difference between maximum and minimum values) for each feature.

To apply the normalization to separate train and test sets, the function is called twice: once for the training set (train_df) with train_df passed as both data and train_data, and once for the test set (test_df) with test_df passed as data and train_df passed as train_data. This ensures that the normalization is performed based on the minimum and maximum values calculated from the training set.

The min, max, and std of the train dataset before applying normalization:

print("Minimum values:")
print(train_df.min())
print()

print("Maximum values:")
print(train_df.max())
print()

print("Standard deviation:")
print(train_df.std())
Fig 11. Min, max, and std of the train dataset before applying normalization

The min, max, and std of the train dataset after applying normalization:

print("Minimum values:")
print(normalized_train.min())
print()

print("Maximum values:")
print(normalized_train.max())
print()

print("Standard deviation:")
print(normalized_train.std())
Fig 12. Min, max, and std of the train dataset after applying normalization

Method 2: MinMaxScaler

Alternatively, scikit-learn provides the `MinMaxScaler` class, which simplifies the process of normalization. It scales the data to a specified range, typically between 0 and 1, by subtracting the minimum value and dividing by the range (maximum minus minimum). This is particularly convenient when applying the same scaling to multiple datasets or when using pipelines in scikit-learn (a short pipeline sketch follows at the end of this subsection).

from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

# Fit the scaler on the train DataFrame
scaler.fit(train_df)

# Normalize the train DataFrame
normalized_train = scaler.transform(train_df)

# Normalize the test DataFrame using the fitted scaler
normalized_test = scaler.transform(test_df)

print(normalized_train.min(), normalized_train.max(), normalized_test.min(), normalized_test.max())
# 0.0 1.0 -0.0207411926185756 1.0

The `fit` method is applied to train_df to learn the per-feature minimum and maximum values required for normalization. The `transform` method then normalizes both the train and test DataFrames using the same scaler object, returning NumPy arrays. This ensures that the normalization is applied consistently across both datasets, based only on statistics from the training set.
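As mentioned above, the scaler also composes cleanly with scikit-learn pipelines, which guarantee that the scaling statistics are learned from the training data only and reused at prediction time. A minimal sketch (the choice of estimator here is illustrative):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression

# fit() learns the min/max on the training data and trains the model;
# predict() reuses the same statistics on new data automatically
pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('clf', LogisticRegression()),
])
# pipe.fit(x_train, y_train)
# pipe.predict(x_test)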

2. Standardization

Standardization, also known as z-score normalization, transforms each feature to have a mean of 0 and a standard deviation of 1. Note that it rescales the data but does not change the shape of its distribution. It is commonly used with neural networks and other gradient-based models to give the input features a consistent scale and to ease the learning process. Here are two ways to implement standardization:

Method 1: Custom Function

The code below defines a custom function, `standardization(data, train_data)`, which standardizes `data` using statistics computed from `train_data`. It applies the formula `(data - mean_value) / std_value`, where the mean and standard deviation are computed column-wise from the training set, so that the test set is standardized with the training statistics.

def standardization(data, train_data):
    mean_value = np.mean(train_data, axis=0)
    std_value = np.std(train_data, axis=0)
    standardized_data = (data - mean_value) / std_value
    return standardized_data

# Standardize the train and test sets using statistics from the training set
standardized_train = standardization(train_df, train_df)
standardized_test = standardization(test_df, train_df)

Visualization of the minimum and maximum standardized values for each feature

import matplotlib.pyplot as plt

# Calculate the minimum and maximum values for each feature
min_values = standardized_train.min()
max_values = standardized_train.max()

# Create a bar plot
plt.figure(figsize=(8, 6))
plt.bar(min_values.index, min_values, label='Min')
plt.bar(max_values.index, max_values, label='Max')
plt.xlabel('Features')
plt.ylabel('Normalized Values')
plt.title('Minimum and Maximum Values for Each Feature')
plt.legend()
plt.show()
Fig 13. Minimum and maximum values for each feature

Method 2: StandardScaler

Alternatively, scikit-learn provides the `StandardScaler` class, which simplifies the process of standardization. It standardizes the data by subtracting the mean and dividing by the standard deviation.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

standardized_train = scaler.fit_transform(train_df)
standardized_test = scaler.transform(test_df)

Hyperparameter Tuning and Training with GridSearchCV

GridSearchCV is a technique in machine learning that systematically searches for the best combination of hyperparameters for a given model.

I will use the normalization method here:

from sklearn.model_selection import train_test_split

# Drop 'concave points_mean' if it is still present (it was already removed in step 6)
new_df = new_df.drop(columns=['concave points_mean'], errors='ignore')
train_df, test_df = train_test_split(new_df, test_size=0.2, random_state=42)

normalized_train = normalization(train_df, train_df)
normalized_test = normalization(test_df, train_df)

# Extract features and target variable columns from normalized_train, and normalized_test
x_train = normalized_train.drop(['diagnosis'], axis=1).values
y_train = normalized_train['diagnosis'].values

x_test = normalized_test.drop(['diagnosis'], axis=1).values
y_test = normalized_test['diagnosis'].values

print(x_train.shape, y_train.shape, x_test.shape, y_test.shape)
# (455, 2) (455,) (114, 2) (114,)

The process of hyperparameter tuning using GridSearchCV for a Logistic Regression model:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

# Define the parameter grid
param_grid = {
    'C': [0.01, 0.1, 1, 2, 5],
    'solver': ['lbfgs', 'liblinear', 'newton-cg', 'sag', 'saga']
}

clf = LogisticRegression()
model = GridSearchCV(clf, param_grid)
model.fit(x_train, y_train)
best_clf = model.best_estimator_

print(best_clf)
# LogisticRegression(C=5, solver='liblinear')

A parameter grid, `param_grid`, is defined which specifies the different values to be explored for the hyperparameters of the Logistic Regression model. In this example, the hyperparameters being tuned are ‘C’ (the inverse of regularization strength) and ‘solver’ (the algorithm used for optimization). Various values for ‘C’ and ‘solver’ are provided in the parameter grid.

The `GridSearchCV` object performs an exhaustive search over the specified parameter grid, evaluating the model’s performance using cross-validation.

The `fit()` method triggers the grid search process, where the model is trained and evaluated for each combination of hyperparameters in the grid. After the grid search is complete, `model.best_estimator_` returns the best estimator, i.e., the Logistic Regression model with the optimal combination of hyperparameters. This best model can be assigned to `best_clf` for further use.
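Besides `best_estimator_`, the fitted `GridSearchCV` object also exposes the winning parameter combination and its mean cross-validated score, which are useful for sanity-checking the search:

print(model.best_params_)  # e.g. {'C': 5, 'solver': 'liblinear'}
print(model.best_score_)   # mean cross-validated accuracy of the best combination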

Classification Report for Model Evaluation

The classification_report function in the scikit-learn library is a powerful tool for evaluating the performance of a classification model. It provides a comprehensive summary of various evaluation metrics, including precision, recall, F1-score, and support, for each class in the classification task.

The classification report takes the true labels (ground truth) and predicted labels as inputs and calculates the evaluation metrics for each class.

Precision measures the proportion of correctly predicted positive instances out of all instances predicted as positive, while recall measures the proportion of correctly predicted positive instances out of all actual positive instances. The F1-score is the harmonic mean of precision and recall, offering a balanced measure of the model’s accuracy.

The support metric represents the number of instances belonging to each class in the true labels. It can help identify imbalanced classes or provide insights into the distribution of the data.

from sklearn.metrics import classification_report

y_pred = best_clf.predict(x_test)
report = classification_report(y_test, y_pred)

print("Classification Report:")
print(report)
Fig 14. Classification Report
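To tie the report back to the definitions above, the same numbers for the positive (malignant) class can be recomputed by hand from the confusion matrix. A minimal sketch:

from sklearn.metrics import confusion_matrix

# For binary labels the matrix unravels as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

precision = tp / (tp + fp)  # of all predicted malignant, the fraction truly malignant
recall = tp / (tp + fn)     # of all truly malignant, the fraction we caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print("precision={:.3f} recall={:.3f} f1={:.3f}".format(precision, recall, f1))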

Visualization of Decision Boundary

The plot_decision_regions function from the mlxtend.plotting module provides a convenient way to visualize the decision boundary of a classification model. It helps in understanding how the model separates different classes based on the provided features.

import numpy as np
import matplotlib.pyplot as plt
from mlxtend.plotting import plot_decision_regions

fig, ax = plt.subplots(figsize=(6, 6))
plot_decision_regions(x_train, y_train.astype(np.int_), clf=best_clf)

ax.set_xlabel("Feature 1")
ax.set_ylabel("Feature 2")
ax.set_title("Decision Boundary")
plt.show()

The clf parameter is set to best_clf, which represents the trained classifier model. This allows the function to use the trained model to plot the decision boundary based on the provided feature data.

Fig 15. Decision boundary

In Day 14, we walked through the implementation of logistic regression for a classification task and described all the steps involved: data preparation, model training, hyperparameter tuning, evaluation, and visualization. In my upcoming post, Machine Learning Series: Day 15 — KNN — Part 1, we will discuss KNN.
