Binary Classification for a Kaggle Competition: SVM, LightGBM, Decision Tree, Gradient Boosting, Feature Engineering, and CatBoost.

Shivam Sharma · The Deep Hub · Nov 11, 2023

In this article, I will work through a Kaggle Playground Series competition, a recurring contest in which Kaggle releases a new dataset and teams compete for the best score on the leaderboard. Participating in these competitions is a great way to learn machine learning. This article mixes code with explanations of the ML algorithms involved, and I will explain everything in as much detail as possible. I would appreciate any feedback.

The goal of this article is to practice data analysis, feature engineering, and Exploratory Data Analysis (EDA), and then to find the algorithm that gives the best accuracy. I apologize for any mistakes and welcome any feedback I can get.

This article is structured into the following subsections:

  1. Data Loading and Analysis
  2. Support Vector Machine(SVM)
  3. Data Preprocessing and Visualization
  4. LightGBM
  5. Decision Tree Classifier
  6. Gradient Boosting Algorithm
  7. Feature Engineering
  8. Principal Component Analysis
  9. CatBoost Classifier

Here we go…

1. Data Loading and Analysis

I will start with importing the necessary libraries.

from sklearn.model_selection import train_test_split
import seaborn as sns
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler

# Evaluation Metrics
from sklearn.metrics import f1_score, roc_auc_score, log_loss, confusion_matrix, accuracy_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

Data Analysis
We will analyze our data to figure out what types of features we have and how many rows there are.

train_df_path = "/kaggle/input/playground-series-s3e24/train.csv"

train_df = pd.read_csv(train_df_path)
train_df_head = train_df.head(10)
train_df_head = train_df_head.transpose()
train_df_head
Output of the above code

The output shows the first ten entries of the data frame; because there are too many columns to fit on the screen, I transposed it. There is an id column holding the row index. Some column names are self-explanatory, such as age, height, weight, waist, eyesight, hearing, and dental caries, while others are more scientific and require some domain knowledge. Let us look at those columns in more detail:

systolic — the pressure in the arteries when the heart beats.
Fasting blood sugar — a fasting blood sugar level of 99 mg/dL or lower is normal; 100 to 125 mg/dL indicates prediabetes, and 126 mg/dL or higher indicates diabetes.
Cholesterol — a total cholesterol level of less than 200 mg/dL (5.17 mmol/L) is considered normal.
triglyceride — triglycerides are a type of fat (lipid) that circulates in the blood.
HDL — HDL (high-density lipoprotein) cholesterol, sometimes called “good” cholesterol, absorbs cholesterol in the blood and carries it back to the liver.
LDL — LDL (low-density lipoprotein) cholesterol, sometimes called “bad” cholesterol, makes up most of your body’s cholesterol. High levels of LDL cholesterol raise your risk of heart disease and stroke.
hemoglobin — the protein in red blood cells that is responsible for delivering oxygen to the tissues.
Urine protein — a urine protein test measures the amount of protein in the urine; a large amount indicates kidney problems.
serum creatinine — a waste product produced by the muscles and filtered out by the kidneys. Large amounts indicate kidney problems.
AST — aspartate aminotransferase; elevated levels indicate liver damage.
ALT — alanine aminotransferase; an increased ALT level is often a sign of liver disease.

Now that we have a basic understanding of the columns, let us check the summary statistics of the data.
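
The notebook does not show this step as code; here is a minimal sketch of the kind of summary I look at here, using standard pandas calls:

# Shape, dtypes, and non-null counts
print(train_df.shape)
train_df.info()

# Count, mean, std, min, quartiles, and max for each numeric column
train_df.describe().transpose()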

2. Support Vector Machine(SVM)

I will first train a very simple model, without any analysis or data engineering, as a baseline against which to compare the approaches I take later in this notebook.

I am going to train an SVM for this baseline. A little background on Support Vector Machines: SVM is a supervised machine learning algorithm that can be used for both classification and regression. In the classic illustration, there are two classes (circles and squares) separated by a hyperplane; SVM finds the optimal hyperplane that separates the data points into the two classes.

SVM's goal is to maximize the margin while minimizing classification errors, which is why SVMs are also known as maximum margin classifiers. The data points that lie closest to the hyperplane, on the edges of the margin, are called support vectors, as they "support" the decision boundary.

There are two variants: the hard margin classifier and the soft margin classifier. A hard margin classifier is strict and does not tolerate any violations of the margin, while a soft margin classifier allows some outliers inside the margin. The amount of tolerance is controlled by the regularization hyperparameter C.

SVM also offers kernels (kernel functions, or kernel methods) that let a linear classifier solve nonlinear problems. A kernel is a similarity function supplied to the learning algorithm to measure the similarity between pairs of inputs.

The most common types of kernels are:
- Linear kernel
- Polynomial kernel
- Gaussian kernel
- Exponential kernel
- Anova radial basis kernel
- Hyperbolic or sigmoid kernel
- Laplacian kernel
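
In scikit-learn, both the kernel and the regularization strength C are parameters of SVC. A minimal sketch (the kernel and C values here are illustrative, not the settings used later in this notebook):

from sklearn.svm import SVC

# Illustrative only: an RBF-kernel SVM with a moderately soft margin.
# Smaller C -> wider margin, more tolerance for margin violations;
# larger C -> narrower margin, fewer violations.
soft_margin_clf = SVC(kernel="rbf", C=1.0, gamma="scale")

# A (near) hard-margin variant: a very large C heavily penalizes violations.
hard_margin_clf = SVC(kernel="rbf", C=1e6, gamma="scale")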

X = train_df.drop(columns=["id", "smoking"])
y = train_df["smoking"]

train_X, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2)
clf = SVC()
clf.fit(train_X, train_y)
svc_p = clf.predict(test_x)
print(f"Accuracy Score - {accuracy_score(test_y, svc_p)}, Area Under Curve - {roc_auc_score(test_y, svc_p)}")

We are getting 0.76 accuracy without any data preprocessing or feature engineering. We will have to experiment with multiple algorithms and techniques to get better results.

3. Data Preprocessing and Visualization

Let us start analyzing the data; to do that, we first need to look at the column names. As we can see, some column names contain spaces and brackets. We need to change them so that we do not run into errors when working with those columns. I have written a loop that goes through the column names and builds a dictionary mapping old names to new ones.

# Formatting the column names

col = train_df.columns
print(f"Column names before changing - {col}")

columns = {}
for c in col:
    if "(" in c:
        new_c = c.replace("(", "_")
        new_c = new_c.replace(")", "")
        columns[c] = new_c
    elif " " in c:
        columns[c] = c.replace(" ", "_")

train_df = train_df.rename(columns=columns)

updated_col = train_df.columns
print("-"*100)
print(f"Column names after changing - {updated_col}")

Now we will analyze the number of unique values in each column, and for columns with 20 or fewer unique values, I will print those values so we can inspect them. I am looping over each column to collect its unique values.

# Printing unique values
for c in updated_col:
    print("{:<50} - {:15}".format("Total Unique values in "+c, len(train_df[c].unique())))
    if len(train_df[c].unique()) < 21:
        print(train_df[c].unique())
    print("_"*70)

Now I will visualize the data with some plots for further analysis. 'weight' and 'height' are familiar features, so I will create a relational plot with Seaborn to visualize the relationship between height and weight. The plot is color-coded by the "smoking" variable using a blue-red palette, and it is faceted into columns by "Urine_protein" and into rows by "dental_caries". The marker sizes are drawn from the specified range (40 to 400).

sns.relplot(
    x="height_cm",
    y="weight_kg",
    hue="smoking",
    data=train_df,
    palette=["b", "r"],
    col="Urine_protein",
    row="dental_caries",
    sizes=(40, 400))

I have plotted a few histograms using the seaborn library; if only one axis variable is passed, the y-axis defaults to the count of observations. I have used hue="smoking" to differentiate between smokers and non-smokers. To show the two groups side by side, the multiple parameter is set to "dodge". I have also adjusted the bar width with the shrink parameter for better visuals.

# Plotting age
fig, [[ax1, ax2], [ax3, ax4], [ax5, ax6], [ax7, ax8], [ax9, ax10], [ax11, ax12]] = plt.subplots(6, 2, figsize=(25, 25))

sns.histplot(x="age", data=train_df, hue="smoking", multiple="dodge", ax=ax1, shrink=5)
sns.histplot(x="height_cm", data=train_df, hue="smoking", multiple="dodge", ax=ax2, shrink=5)
sns.histplot(x="weight_kg", data=train_df, hue="smoking", multiple="dodge", ax=ax3, shrink=5)
sns.histplot(x="waist_cm", data=train_df, hue="smoking", multiple="dodge", ax=ax4, shrink=5, element="poly")

sns.histplot(x="systolic", data=train_df, hue="smoking", multiple="dodge", ax=ax5, shrink=5)
sns.histplot(x="relaxation", data=train_df, hue="smoking", multiple="dodge", ax=ax6, shrink=5)
sns.histplot(x="fasting_blood_sugar", data=train_df, hue="smoking", multiple="dodge", ax=ax7, shrink=5)
sns.histplot(x="Cholesterol", data=train_df, hue="smoking", multiple="dodge", ax=ax8, shrink=5)

sns.histplot(x="triglyceride", data=train_df, hue="smoking", multiple="dodge", ax=ax9, shrink=5)
sns.histplot(x="HDL", data=train_df, hue="smoking", multiple="dodge", ax=ax10, shrink=5)
sns.histplot(x="LDL", data=train_df, hue="smoking", multiple="dodge", ax=ax11, shrink=5)
sns.histplot(x="hemoglobin", data=train_df, hue="smoking", multiple="dodge", ax=ax12, shrink=5)

I will make a heatmap of the correlation matrix. A correlation matrix helps reveal how strongly the target variable depends on the independent variables, at both extremes: features that are directly proportional and features that are inversely proportional to the target. In this case, the target variable is "smoking".

sns.heatmap(train_df.corr())
# scaling data using vanilla standard scaler 
scaler = StandardScaler()
X = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
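
Beyond the heatmap, it can also help to look at each feature's correlation with the target directly. A small sketch (my addition, not in the original notebook):

# Correlation of every feature with the target, sorted from most negative to most positive
corr_with_target = train_df.corr()["smoking"].drop("smoking").sort_values()
print(corr_with_target)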

4. LightGBM

LightGBM is a gradient boosting framework that uses tree-based learning algorithms. LightGBM grows trees leaf-wise (vertically) rather than level-wise: it expands the leaf with the maximum delta loss, so each split reduces the loss as much as possible. The "Light" in LightGBM refers to its speed; it can handle large amounts of data while using less memory. LightGBM largely focuses on accuracy and is not advised for small datasets because of its tendency to overfit.

(Figure: LightGBM leaf-wise node expansion)

Important parameters for LightGBM:
max_depth — limit the max depth of the tree, used to deal with overfitting
min_data_in_leaf — minimal number of data in one leaf, used to deal with overfitting
feature_fraction — random selection of a subset of features on each iteration, used to speed up training and deal with overfitting
early_stopping_round — stop training if there is no improvement in the metric of validation data
lambda — regularization parameter (lambda_l1, lambda_l2)

Core Parameters:
Task — can be train, predict, convert_model, etc
Objective — can be regression, regression_l1, huber, fair, poisson, quantile, mape, gamma, Tweedie, binary, multiclass, multiclassova, cross_entropy, cross_entropy_lambda, lambdarank, rank_xendcg, etc.
boosting — can be gbdt (gradient boosted decision trees), rf (random forest), or dart (Dropouts meet Multiple Additive Regression Trees)
num_boost_round — Number of boosting iterations
learning_rate — shrinkage rate
num_leaves — number of leaves in tree

Metric parameters:
Metrics specify the loss function of the model and are among the most important parameters for model training. Some of the most common metrics are Mean Absolute Error (MAE), Mean Squared Error (MSE), binary_logloss, and multi_logloss.

IO Parameters:
max_bin — maximum number of bins that feature value will bucket in
categorical_feature — denote the index of the column with categorical features in datasets(example — [0, 1, 2] means index 0, 1, and 2 contain categorical features.)
ignore_column — indices of columns which need to be ignored
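
The code below jumps straight to prediction, so here is the training step it assumes: importing lightgbm, defining a params dict, and training with the native API so that gbm_clf exists. The parameter values are illustrative, not the ones used to produce the reported scores.

import lightgbm as gbm

# Illustrative parameters for a binary classifier; tune these for real use.
params = {
    "objective": "binary",
    "metric": "binary_logloss",
    "boosting": "gbdt",
    "num_leaves": 31,
    "learning_rate": 0.05,
    "feature_fraction": 0.8,
}

# Wrap the training split in a LightGBM Dataset and train for 100 boosting rounds.
train_dataset = gbm.Dataset(X_train, label=y_train)
gbm_clf = gbm.train(params, train_dataset, 100)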

y_pred = gbm_clf.predict(X_test)

# convert into binary values
for i in range(len(y_pred)):
    if y_pred[i] >= 0.5:
        y_pred[i] = 1
    else:
        y_pred[i] = 0

print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))

5. Decision Tree Classifier

The Decision Tree algorithm classifies data by repeatedly splitting it according to decision rules. It is a supervised machine-learning algorithm and can be used for both classification and regression. A decision tree creates its nodes (split conditions) on the basis of information gain.

Information Gain = Entropy before split - Entropy after split

Entropy can be defined as the average level of uncertainty in the outcome; a good split separates the labels as cleanly as possible. This approach is used by the ID3 algorithm, which relies on the entropy function and information gain as its metrics, while the other common approach, Classification and Regression Trees (CART), uses the Gini index as its classification metric. A small worked example of these quantities is sketched below.
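
To make these quantities concrete, here is a small worked example (my own illustration, not part of the original notebook) that computes the entropy of a toy node and the information gain of a candidate split:

import numpy as np

def entropy(labels):
    # Shannon entropy in bits: -sum(p * log2(p)) over the label proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Toy node with 10 samples: 5 non-smokers (0) and 5 smokers (1) -> entropy = 1.0 bit
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# A candidate split sends 6 samples left and 4 right
left, right = np.array([0, 0, 0, 0, 0, 1]), np.array([1, 1, 1, 1])

# Entropy after the split is the size-weighted average of the child entropies
entropy_after = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
information_gain = entropy(parent) - entropy_after
print(information_gain)  # about 0.61: the split removes most of the uncertainty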

from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(criterion="entropy", min_samples_split=2)
dt.fit(X_train, y_train)
dt_p = dt.predict(X_test)

print(confusion_matrix(y_test, dt_p))
print(accuracy_score(y_test, dt_p))

6. Gradient Boosting Algorithm

Gradient Boosting is a supervised ML algorithm used for classification and regression problems. It is an ensemble technique that combines multiple weak learners to make a prediction. The idea behind gradient boosting is to start from a simple initial prediction (for example, the mean of the target) and compute the residuals, i.e. the differences between the actual target values and that prediction; these residuals become the targets for the next model. A weak learner is trained on them, its predictions are added to the ensemble, new residuals are computed, and the process repeats so that each new model shrinks the gap between the actual and predicted values. At the final stage, the method combines the predictions from all the models.
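
To illustrate that loop, here is a bare-bones regression version written out with decision stumps (my own sketch under simplified assumptions, not how scikit-learn implements it):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_gradient_boost(X, y, n_rounds=50, learning_rate=0.1):
    # Start from the mean of the target; the residuals are what is left to explain.
    prediction = np.full(len(y), y.mean(), dtype=float)
    trees = []
    for _ in range(n_rounds):
        residuals = y - prediction
        # Fit a weak learner (a depth-1 tree, i.e. a stump) to the current residuals...
        tree = DecisionTreeRegressor(max_depth=1)
        tree.fit(X, residuals)
        # ...and add a damped version of its prediction to the running ensemble.
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees, prediction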

from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_p = gbc.predict(X_test)

print(confusion_matrix(y_test, gbc_p))
print(accuracy_score(y_test, gbc_p))

7. Feature Engineering

In this section, our primary goal is to increase the accuracy through feature engineering and hyperparameter tuning. I have created a function to reduce redundancy: it splits the data into training and testing sets, trains the LightGBM model with the specified parameters, and makes predictions on the test set. To enhance interpretability, the continuous predictions are thresholded at 0.5 to convert them into binary values, and the function prints the confusion matrix and accuracy score for the LightGBM model. Additionally, for comparative analysis, it trains Decision Tree and Gradient Boosting classifiers and prints their respective confusion matrices and accuracy scores.

def predict_fun(X, y, params):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # LightGBM
    train_dataset = gbm.Dataset(X_train, label=y_train)
    gbm_clf = gbm.train(params, train_dataset, 100)

    y_pred = gbm_clf.predict(X_test)

    # convert into binary values
    for i in range(len(y_pred)):
        if y_pred[i] >= 0.5:
            y_pred[i] = 1
        else:
            y_pred[i] = 0

    print("_"*100)
    print("LightGBM Confusion Matrix - ", confusion_matrix(y_test, y_pred))
    print("LightGBM Accuracy Score - ", accuracy_score(y_test, y_pred))
    print("_"*100)

    dt = DecisionTreeClassifier(criterion="entropy", min_samples_split=2)
    dt.fit(X_train, y_train)
    dt_p = dt.predict(X_test)
    print("Decision Trees Confusion Matrix - ", confusion_matrix(y_test, dt_p))
    print("Decision Trees Accuracy Score - ", accuracy_score(y_test, dt_p))
    print("_"*100)

    gbc = GradientBoostingClassifier()
    gbc.fit(X_train, y_train)
    gbc_p = gbc.predict(X_test)
    print("Gradient Boosting Confusion Matrix - ", confusion_matrix(y_test, gbc_p))
    print("Gradient Boosting Accuracy Score - ", accuracy_score(y_test, gbc_p))
    print("_"*100)

Even after doing Principal Component Analysis, we can see that there is no change in accuracy. I will now start doing more advanced feature engineering by going deeper into the domain. We are given height and weight, which can be combined into the Body Mass Index (BMI).

BMI = Weight(kg) / (height(m) * height(m)). Since height is given in centimeters, the code below computes weight_kg * 10000 / height_cm**2.

A fasting blood sugar level of 99 mg/dL or lower is normal, 100 to 125 mg/dL indicates prediabetes, and 126 mg/dL or higher indicates diabetes, so I bin this feature into those three categories.

def feature_eng(X):
    X["bmi"] = (X['weight_kg']*10000) / (X['height_cm']**2)

    X.loc[X["fasting_blood_sugar"] < 100, "fasting_blood_sugar"] = 1
    X.loc[(X["fasting_blood_sugar"] >= 100) & (X["fasting_blood_sugar"] <= 125), "fasting_blood_sugar"] = 2
    X.loc[X["fasting_blood_sugar"] > 125, "fasting_blood_sugar"] = 3

    # Compute the average eyesight and hearing values from 'eyesight_left', 'eyesight_right', 'hearing_left', and 'hearing_right'.
    X["avg_eyesight"] = (X["eyesight_left"] + X["eyesight_right"]) / 2
    X["avg_hearing"] = (X["hearing_left"] + X["hearing_right"]) / 2

    # Create a new feature representing the blood pressure index using 'systolic' and 'relaxation'.
    X["blood_pressure_index"] = X["systolic"] / X["relaxation"]

    # Compute ratios like Total Cholesterol to HDL ratio, LDL to HDL ratio, etc.
    X['cholesterol_hdl_ratio'] = X['Cholesterol'] / X['HDL']
    X['ldl_hdl_ratio'] = X['LDL'] / X['HDL']

    # Create a new feature indicating kidney function using 'serum_creatinine' and 'Urine_protein'.
    X['kidney_function_indicator'] = X['serum_creatinine'] * X['Urine_protein']

    # Liver Enzyme Ratio:
    # Explore the relationship between 'AST', 'ALT', and 'Gtp'. You might consider creating a ratio or interaction term.
    X['alt_ast_ratio'] = X['ALT'] / X['AST']
    X['gtp_alt_ratio'] = X['Gtp'] / X['ALT']

    X = X.drop(columns=["ALT", "AST", "Gtp", "weight_kg", "height_cm",
                        "serum_creatinine", "Urine_protein", "Cholesterol", "HDL", "LDL",
                        "systolic", "relaxation", "hearing_left", "hearing_right", "eyesight_right", "eyesight_left"])
    return X

8. Principal Component Analysis (PCA)

  • PCA reduces the dimensionality of the data
  • A best practice is to drop the redundant features and keep only the most essential features.

Let us try to see the difference by just dropping the non-essential features.

# Dropping non-essential features (PCA itself is applied further below)
X = train_df.drop(columns=["id", "smoking", "age", "waist_cm", "eyesight_left",
                           "eyesight_right", "hearing_left", "hearing_right", "Urine_protein", "dental_caries"])
y = train_df["smoking"]


scaler = StandardScaler()
X = scaler.fit_transform(X)

predict_fun(X, y, params)

From the above, we can see that dropping columns is not effective in improving accuracy. Now let us try Principal Component Analysis on our feature set; we have 24 features, which I will reduce to 15 components.

from sklearn.decomposition import PCA

X = train_df.drop(columns=["smoking"])
pca = PCA(n_components=15)
X_pca = pca.fit_transform(X)
X_pca.shape

predict_fun(X_pca, y, params)
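
One way to sanity-check the choice of 15 components (my addition, not in the original notebook) is to look at how much of the total variance they retain:

# Fraction of the total variance captured by the 15 retained components
print(pca.explained_variance_ratio_.sum())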

Let us now apply the feature engineering function we defined above and test the accuracy.

X_feat_eng = feature_eng(X)
predict_fun(X_feat_eng, y, params)

9. CatBoost Classifier

CatBoost is based on gradient-boosted decision trees built sequentially during training, with each successive tree reducing the loss compared to the previous one. CatBoost supports numerical, categorical, text, and embedding features. It also provides an overfitting detector, which can stop training before the configured number of iterations is reached.
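
A minimal sketch of that overfitting detector in use (illustrative values, assuming a held-out evaluation set; this is separate from the tuning run below):

from catboost import CatBoostClassifier

# Stop if the eval-set metric has not improved for 50 consecutive iterations
cat_es = CatBoostClassifier(iterations=1000, learning_rate=0.05, depth=6, verbose=False)
cat_es.fit(X_train, y_train, eval_set=(X_test, y_test), early_stopping_rounds=50)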

from catboost import CatBoostClassifier
from sklearn.model_selection import RandomizedSearchCV

params = {
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
    "depth": [4, 6, 8, 10, 12],
    "iterations": [100, 150, 200, 250, 300, 350]
}

cat_clf = CatBoostClassifier()

random_search = RandomizedSearchCV(cat_clf, param_distributions=params, n_iter=10, cv=5, scoring="accuracy", random_state=42)
random_search.fit(X_train, y_train)

best_params = random_search.best_params_
best_estimator = random_search.best_estimator_

cat_p = best_estimator.predict(X_test)
accuracy_score(y_test, cat_p)

That is all for this notebook. Please let me know if I made any mistakes, and if you liked this article, give it some claps. Thank you. Until next time 👋.
