Body Performance Project-2.5

Supervised Learning: Classification approach

Daniel Chiebuka Ihenacho
6 min read · Jan 2, 2024
Photo by Acton Crawford on Unsplash

In the previous post, we explored Cluster Analysis. We now continue our routine with Supervised Learning: Classification approach.

What is supervised learning?

This is simply a machine learning paradigm where the algorithm is trained on labeled data, which means that the input data is paired with corresponding output labels. The primary goal of supervised learning is to learn how to map inputs to outputs based on the provided labeled examples.

In supervised learning, the objective is to develop a model using a training dataset and subsequently make precise predictions on new, unseen data sharing similar traits with the training set. The capability of a model to provide accurate predictions on unseen data indicates its ability to generalise from the training set to the test set. The aim is to develop a model with the highest possible level of generalisation accuracy.
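To make the idea of generalisation concrete, here is a minimal sketch that compares accuracy on the training set with accuracy on a held-out test set. It assumes synthetic data from make_classification and an unpruned DecisionTreeClassifier, neither of which is part of this project; the names X_demo, y_demo and tree are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the body performance dataset)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=42
)

# An unpruned tree can memorise the training set
tree = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)

train_acc = tree.score(X_tr, y_tr)  # accuracy on data the model has seen
test_acc = tree.score(X_te, y_te)   # accuracy on unseen data

print(f"Train accuracy: {train_acc:.2f}")
print(f"Test accuracy:  {test_acc:.2f}")
```

The gap between the two numbers is what we care about: a model that scores much higher on the training set than on the test set has not generalised well.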

Kinds of supervised learning

There are two main types of supervised machine learning problems:

  1. Classification: The goal is to predict a class label, which is a choice from a predefined list of possibilities, as we will see later in this article. Classification tasks can be divided into two kinds: binary classification (exactly two classes) and multiclass classification (more than two classes).
  2. Regression: The goal is to predict a continuous value, or a floating point value (a real number in mathematical terms). We will see the use of this approach in another article.
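As a quick illustration of the distinction above, scikit-learn ships a helper, type_of_target, that reports how it interprets a target vector. The three toy target lists below are made up for this sketch and are not taken from the project data.

```python
from sklearn.utils.multiclass import type_of_target

# Toy targets (hypothetical values, for illustration only)
binary_labels = [0, 1, 1, 0]          # two classes
multiclass_labels = [0, 1, 2, 3, 1]   # more than two classes
continuous_values = [1.5, 2.7, 3.14]  # real numbers, i.e. a regression target

binary_kind = type_of_target(binary_labels)
multiclass_kind = type_of_target(multiclass_labels)
continuous_kind = type_of_target(continuous_values)

print(binary_kind, multiclass_kind, continuous_kind)
```

The target in this project has four classes (0 to 3), so it falls into the multiclass case.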

Applications of supervised learning

Classification task

  • Diabetes prediction
  • Lung cancer prediction
  • Penguins species prediction

Regression task

  • Corn yield production
  • Height prediction
  • Score prediction
  • House price prediction

Having learned a bit about supervised learning, it’s time to dive into the project itself. The dataset and code can be found in my repo.

Here’s our classification process:

  1. Load the data
  2. Split the data into training and test data
  3. Create a Pipeline with the needed classification algorithms
  4. Pass the Pipeline into the created GridSearchCV
  5. Choose the best estimator found, together with its parameters
  6. Evaluate the classification algorithm on both the training and test data using the confusion matrix and the ROC curve with its AUC

Sampling data

import pandas as pd

# Load the data and preview five randomly sampled rows
df = pd.read_csv("classification_dataset_two.csv")
df.sample(random_state=42,n=5)
Sampled data

Import needed libraries for modelling and data split

from sklearn.pipeline import Pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier
from sklearn.model_selection import GridSearchCV,StratifiedShuffleSplit,train_test_split

X = df.drop('encoded_class',axis=1)
y = df['encoded_class']
X_train,X_test, y_train,y_test = train_test_split(X,y,random_state=42,stratify=y,test_size=.20)
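The stratify=y argument in the split above keeps the class proportions the same in the training and test sets. The sketch below checks this on a made-up, imbalanced four-class target; X_toy and y_toy are hypothetical stand-ins, not the project data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 40/30/20/10 samples for classes 0..3
y_toy = np.array([0] * 40 + [1] * 30 + [2] * 20 + [3] * 10)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # dummy single feature

X_a, X_b, y_a, y_b = train_test_split(
    X_toy, y_toy, test_size=0.20, stratify=y_toy, random_state=42
)

train_counts = np.bincount(y_a)  # samples per class in the training part
test_counts = np.bincount(y_b)   # samples per class in the test part

print("train:", train_counts)
print("test: ", test_counts)
```

Both parts keep the 4:3:2:1 ratio of the full target, which matters when some classes are rarer than others.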

Pipeline and GridSearchCV Creation

my_pipe = Pipeline(
    [
        ('clf',KNeighborsClassifier(n_jobs=-2))
    ]
)

my_params = [
    {
        # KNeighborsClassifier
        "clf":[KNeighborsClassifier(n_jobs=-2)],
        'clf__n_neighbors':[3,4,5,6],
        "clf__weights":['uniform','distance'],
        "clf__p":[1,2]
    },

    {
        # RandomForestClassifier
        'clf':[RandomForestClassifier(random_state=42,warm_start=True,n_jobs=-2)],
        "clf__n_estimators":[50,100,150,200],
        "clf__max_depth":[3,5,7],
        'clf__min_samples_split':[2,3,5]
    },

    {
        # LogisticRegression
        # Note: not every solver/penalty pair is valid (e.g. 'newton-cg' with 'l1').
        # GridSearchCV scores such failed fits as NaN by default and carries on.
        "clf":[LogisticRegression(random_state=42,n_jobs=-2)],
        'clf__solver':['newton-cg','sag','saga','lbfgs'],
        'clf__penalty':['l2','l1','elasticnet','none'],
        "clf__C":[0.5, 0.7,1]
    },

    {
        # SGDClassifier
        "clf":[SGDClassifier(random_state=42,warm_start=True,n_jobs=-2)],
        'clf__early_stopping':[True,False],
        'clf__loss':['hinge','log_loss'],
        "clf__alpha":[0.0001, 0.0005,0.001,0.005,0.01]
    },

    {
        # DecisionTreeClassifier
        "clf":[DecisionTreeClassifier(random_state=42)],
        'clf__min_samples_split':[2,3,5],
        "clf__max_depth":[3,5,7],
    },

    {
        # AdaBoostClassifier
        'clf':[AdaBoostClassifier(random_state=42)],
        "clf__n_estimators":[50,100,150,200],
    },
]

my_cv = StratifiedShuffleSplit(n_splits=5,test_size=.20,random_state=42)
mygrid = GridSearchCV(my_pipe,param_grid=my_params,cv=my_cv)
mygrid.fit(X_train,y_train)

Why Grid Search?

Grid search is a hyperparameter tuning technique in machine learning where a predefined set of hyperparameter values is exhaustively tested to find the combination that yields the best model performance. The hyperparameters are parameters that are not learned from the data but are set before the training process. In grid search, a grid of hyperparameter values is specified, and the algorithm is trained and evaluated for each combination of these values using cross-validation.

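As seen from the code above, a Pipeline can also be combined with GridSearchCV. Because the pipeline's 'clf' step is itself listed in the parameter grid, the search covers several classification algorithms as well as their hyperparameters, which leads to a bigger search space.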
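To get a feel for how big that search space is, the sketch below counts the candidates with scikit-learn's ParameterGrid. It mirrors the value lists from my_params but, as a simplification, uses short strings such as "knn" in place of the estimator objects; grid_sketch is an illustrative name only.

```python
from sklearn.model_selection import ParameterGrid

# Same value lists as my_params; strings stand in for the estimators (assumption)
grid_sketch = [
    {"clf": ["knn"], "clf__n_neighbors": [3, 4, 5, 6],
     "clf__weights": ["uniform", "distance"], "clf__p": [1, 2]},
    {"clf": ["rf"], "clf__n_estimators": [50, 100, 150, 200],
     "clf__max_depth": [3, 5, 7], "clf__min_samples_split": [2, 3, 5]},
    {"clf": ["logreg"], "clf__solver": ["newton-cg", "sag", "saga", "lbfgs"],
     "clf__penalty": ["l2", "l1", "elasticnet", "none"], "clf__C": [0.5, 0.7, 1]},
    {"clf": ["sgd"], "clf__early_stopping": [True, False],
     "clf__loss": ["hinge", "log_loss"],
     "clf__alpha": [0.0001, 0.0005, 0.001, 0.005, 0.01]},
    {"clf": ["tree"], "clf__min_samples_split": [2, 3, 5],
     "clf__max_depth": [3, 5, 7]},
    {"clf": ["ada"], "clf__n_estimators": [50, 100, 150, 200]},
]

n_candidates = len(ParameterGrid(grid_sketch))  # one per hyperparameter combination
n_splits = 5                                    # matches my_cv above
n_fits = n_candidates * n_splits                # model fits during the search

print(f"{n_candidates} candidates x {n_splits} splits = {n_fits} fits")
```

Each dictionary contributes the product of its list lengths, so adding one more value to a single list can add many fits.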

Once mygrid.fit(X_train,y_train) has run, the following output is displayed:

GridSearchCV

GridSearchCV(cv=StratifiedShuffleSplit(n_splits=5, random_state=42, test_size=0.2,
train_size=None),
estimator=Pipeline(steps=[('clf',
KNeighborsClassifier(n_jobs=-2))]),
param_grid=[{'clf': [KNeighborsClassifier(n_jobs=-2)],
'clf__n_neighbors': [3, 4, 5, 6], 'clf__p': [1, 2],
'clf__weights': ['uniform', 'distance']},
{'clf': [RandomForestClassifier(max_depth=7,
min_sa...
{'clf': [SGDClassifier(n_jobs=-2, random_state=42,
warm_start=True)],
'clf__alpha': [0.0001, 0.0005, 0.001, 0.005, 0.01],
'clf__early_stopping': [True, False],
'clf__loss': ['hinge', 'log_loss']},
{'clf': [DecisionTreeClassifier(random_state=42)],
'clf__max_depth': [3, 5, 7],
'clf__min_samples_split': [2, 3, 5]},
{'clf': [AdaBoostClassifier(random_state=42)],
'clf__n_estimators': [50, 100, 150, 200]}])

GridSearchCV inspection

print(f"Best params: {mygrid.best_params_}\n")
print(f"Best estimator: {mygrid.best_estimator_}\n")
print(f"Best validation score: {mygrid.best_score_}")

Inspection output

Best params: {'clf': RandomForestClassifier(max_depth=7, min_samples_split=5, n_jobs=-2,
random_state=42, warm_start=True), 'clf__max_depth': 7, 'clf__min_samples_split': 5, 'clf__n_estimators': 100}

Best estimator: Pipeline(steps=[('clf',
RandomForestClassifier(max_depth=7, min_samples_split=5,
n_jobs=-2, random_state=42,
warm_start=True))])

Best validation score: 0.7338310779281382

When GridSearchCV is applied to the dataset, the best validation score is 0.73 or 73% (accuracy score), with RandomForestClassifier being the best estimator with its hyperparameters given. Hence we can proceed to the evaluation phase using the confusion matrix and ROC curve.
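Beyond the single best candidate, the cv_results_ attribute of a fitted GridSearchCV holds the scores of every candidate, which makes it possible to compare the algorithms side by side. The sketch below shows one way to do this on a small synthetic problem with a reduced grid; demo_pipe, demo_params and the data are assumptions made for illustration, not the project's grid.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption: NOT the body performance dataset)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, n_classes=3, random_state=42
)

demo_pipe = Pipeline([("clf", KNeighborsClassifier())])
demo_params = [
    {"clf": [KNeighborsClassifier()], "clf__n_neighbors": [3, 5]},
    {"clf": [DecisionTreeClassifier(random_state=42)], "clf__max_depth": [3, 5]},
]
demo_grid = GridSearchCV(demo_pipe, param_grid=demo_params, cv=3)
demo_grid.fit(X_demo, y_demo)

# One row per candidate; label each row with the class name of its estimator
results = pd.DataFrame(demo_grid.cv_results_)
results["estimator"] = results["param_clf"].apply(lambda est: type(est).__name__)

# Best mean validation score reached by each algorithm
summary = (
    results.groupby("estimator")["mean_test_score"]
    .max()
    .sort_values(ascending=False)
)
print(summary)
```

The same three lines of pandas can be pointed at mygrid.cv_results_ to see how close the other algorithms came to the RandomForestClassifier.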

Evaluation

Not all evaluation metrics are suitable for a given task. For example, RMSE is an evaluation metric for regression tasks and is not suitable for classification tasks. Hence choosing the right evaluation metric is paramount to the task at hand.

Some evaluation metrics for classification tasks are:

  • Confusion matrix
  • AUC/ROC curve
  • Precision
  • Accuracy
  • Recall
  • Precision-recall curve

For a deep dive into these metrics, please refer to the material listed under References. For this problem set, the confusion matrix and the ROC curve with its AUC are used.
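Several of the metrics listed above can be read directly off the confusion matrix. The following sketch, using made-up labels for three classes, derives accuracy, recall and precision by hand and checks them against scikit-learn's own functions; y_true and y_hat are hypothetical values.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score, recall_score
)

# Made-up true and predicted labels (illustration only)
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_hat = [0, 0, 1, 1, 1, 0, 2, 2, 2, 1]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
correct = np.diag(cm)  # correctly classified samples per class

accuracy = correct.sum() / cm.sum()  # share of all samples that are correct
recall = correct / cm.sum(axis=1)    # per class: correct / all truly in the class
precision = correct / cm.sum(axis=0) # per class: correct / all predicted as the class

print(cm)
print("accuracy: ", accuracy)
print("recall:   ", recall)
print("precision:", precision)

# The hand-computed values agree with scikit-learn
matches_sklearn = (
    np.isclose(accuracy, accuracy_score(y_true, y_hat))
    and np.allclose(recall, recall_score(y_true, y_hat, average=None))
    and np.allclose(precision, precision_score(y_true, y_hat, average=None))
)
```

Reading along a row shows where a true class ends up, and reading down a column shows what a predicted class is made of.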

Confusion matrix setup for training data

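# Classification report and confusion matrix
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report,confusion_matrix,ConfusionMatrixDisplay

# Best pipeline found by the grid search (same object that is used for the test set later)
mygrid_trainset = mygrid.best_estimator_
y_pred_train = mygrid_trainset.predict(X_train)

sns.set_theme(style='white')
def class_report(model,y_train,y_pred_train):
    # Print per-class precision, recall and F1, then plot the confusion matrix
    print(classification_report(y_train,y_pred_train))
    cm = confusion_matrix(y_train,y_pred_train,labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot()
    plt.show()


class_report(mygrid_trainset,y_train,y_pred_train)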

Executing the code above gives the following output:

Confusion matrix of training data

From the above, we can see some misclassifications by the model, especially for class 1. To be sure that the model is not performing random guessing, we use the ROC curve and its AUC as shown below:

ROC and AUC setup for training data

from yellowbrick.classifier import ROCAUC
visualizer = ROCAUC(mygrid_trainset, classes=[0,1,2,3])

visualizer.fit(X_train, y_train) # Fit the training data to the visualizer
visualizer.score(X_train, y_train) # Evaluate the model on the training data
visualizer.show()
ROC curve for RandomForestClassifier training data

As seen, the model is not performing random guessing: an AUC score ≤ 0.5 (50%) is deemed random guessing, and the AUC score for class 1 is 0.89 (89%). Since class 1 is also where the confusion matrix shows the most misclassifications, the model would have to be retrained to improve on it.
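The 0.5 threshold can be reproduced with a small sketch. Assuming a made-up four-class target, it scores two extreme sets of predicted probabilities with roc_auc_score in one-vs-rest mode: one that assigns every class the same probability (no information) and one that puts all probability on the correct class. All names here are illustrative.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Made-up balanced target with classes 0..3 (illustration only)
y_true = np.array([0, 1, 2, 3] * 5)
n_classes = 4

# Uninformative scores: every class gets probability 0.25 for every sample
uninformative = np.full((len(y_true), n_classes), 1 / n_classes)

# Perfect scores: probability 1.0 on the true class, 0.0 elsewhere
perfect = np.eye(n_classes)[y_true]

auc_uninformative = roc_auc_score(y_true, uninformative, multi_class="ovr")
auc_perfect = roc_auc_score(y_true, perfect, multi_class="ovr")

print(f"AUC, uninformative scores: {auc_uninformative:.2f}")
print(f"AUC, perfect scores:       {auc_perfect:.2f}")
```

Real models land somewhere between these two extremes, and the closer a class's AUC is to 1, the better the model separates that class from the rest.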

The same approach is taken for the test data as shown below;

Confusion matrix setup for test data

mygrid_testset = mygrid.best_estimator_
y_pred_test = mygrid_testset.predict(X_test)

# Classification report
sns.set_theme(style='white')
def class_report(model,y_test,pred):
    print(classification_report(y_test,pred))
    cm = confusion_matrix(y_test,pred,labels=model.classes_)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=model.classes_)
    disp.plot()
    plt.show()


class_report(mygrid_testset,y_test,y_pred_test)
Confusion matrix of test data

ROC and AUC setup for test data

from yellowbrick.classifier import ROCAUC
visualizer = ROCAUC(mygrid_testset, classes=[0,1,2,3])

visualizer.fit(X_test, y_test) # Fit the test data to the visualizer
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show()
ROC curve for RandomForestClassifier test data

Conclusion

Congrats! You just finished your first classification task. What’s even better is that you performed it on a multiclass problem set. In this article, you were introduced to the following:

  • Pipeline
  • GridSearchCV
  • Evaluation metrics

What can you do to take this even further?

  • Save the model and apply it to the prediction of a new data set. Keep in mind that you performed a clustering task earlier and hence would have to incorporate the saved clustering model into the classification task.
  • You could eventually build a comprehensive data pipeline up to the classification task and serve it as a web app using Streamlit, or even as a desktop application.
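As a starting point for the first suggestion, here is a minimal sketch of saving and reloading a fitted model with joblib. It assumes a small RandomForestClassifier trained on synthetic data and a temporary directory; the file name classifier.joblib is hypothetical, and in the project you would dump mygrid.best_estimator_ instead.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data and model (assumption: NOT the project's model)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "classifier.joblib")  # hypothetical file name
    joblib.dump(model, model_path)     # write the fitted model to disk
    restored = joblib.load(model_path) # read it back, e.g. in another script

# The restored model should predict exactly like the original
same_predictions = bool((restored.predict(X_demo) == model.predict(X_demo)).all())
print("Predictions identical after reload:", same_predictions)
```

A saved model should be loaded with the same scikit-learn version it was trained with, and new data must go through the same preprocessing (including the clustering step) before prediction.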

References

  • Introduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido; Chapters 5 and 6.
