Workflow of Supervised Learning algorithms

Vishakha Ratnakar
Jan 22, 2022


Machine learning algorithms are divided into four categories: supervised, unsupervised, semi-supervised, and reinforcement learning. In my previous blog, I provided a fundamental overview of all four categories. You are welcome to take a look at it (Machine Learning Algorithms).

Let’s walk through the workflow of a Supervised Learning model, step by step.

Step 1: Data Gathering

The data for training the model can come from a variety of places. The Scikit-learn library provides built-in datasets that can be used for both training and testing. We can also use freely available datasets from the internet, such as those published on Kaggle. However, this data often cannot be used directly, since it may contain missing values, be noisy, or require conversion into a specific format.
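As a minimal sketch of the first option, the snippet below loads Iris, one of the datasets bundled with Scikit-learn. The choice of Iris is only an illustration; any other built-in dataset would work the same way.

```python
from sklearn.datasets import load_iris

# Load a built-in dataset as a feature matrix X and a target vector y.
X, y = load_iris(return_X_y=True)

print(X.shape)  # (150, 4): 150 samples, 4 features
print(y.shape)  # (150,): one class label per sample
```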

For example, the data for a regression task should be numeric. If the data you’ve imported contains categorical values, you’ll need to convert them to numeric values, which you can do with data pre-processing methods.
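One common way to do this conversion is one-hot encoding. The sketch below uses pandas on a small made-up table; the column names `city` and `rooms` and their values are hypothetical and only serve the illustration.

```python
import pandas as pd

# Hypothetical toy data: "city" is categorical, "rooms" is already numeric.
df = pd.DataFrame({
    "city": ["Galway", "Dublin", "Galway", "Cork"],
    "rooms": [2, 3, 1, 4],
})

# One-hot encode the categorical column: each city becomes its own 0/1 column.
# Numeric columns are passed through unchanged.
encoded = pd.get_dummies(df, columns=["city"], dtype=int)

print(encoded)
```

Each row now has a 1 in the column of its city and 0 in the others, so every value in the table is numeric.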

Step 2: Data pre-processing

Data pre-processing is an important part of machine learning. It is the process of cleaning raw data acquired from various sources and preparing it for use in model training. A model generally cannot produce good results without it.

Steps in data pre-processing include:

Data Cleaning: the process of filling in missing values, correcting errors, and deleting unnecessary data.

Data Transformation: techniques including aggregation, normalization, and feature selection.

Data Reduction: the process of extracting relevant data from a large amount of data to meet a specific requirement. It entails attribute selection and dimension reduction.
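The three steps above can be sketched with Scikit-learn as follows. This is only one possible combination, chosen for illustration: mean imputation for cleaning, min-max scaling as the normalization, and PCA for dimension reduction. The small matrix `X_raw` is made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw feature matrix with one missing value (np.nan).
X_raw = np.array([
    [1.0, 200.0, 3.0],
    [2.0, np.nan, 1.0],
    [3.0, 180.0, 4.0],
    [4.0, 220.0, 2.0],
])

# Data cleaning: replace the missing value with the mean of its column.
X_clean = SimpleImputer(strategy="mean").fit_transform(X_raw)

# Data transformation: rescale every column to the range [0, 1].
X_scaled = MinMaxScaler().fit_transform(X_clean)

# Data reduction: project the three columns onto two principal components.
X_reduced = PCA(n_components=2).fit_transform(X_scaled)

print(X_clean[1, 1])    # 200.0, the mean of 200, 180 and 220
print(X_reduced.shape)  # (4, 2)
```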

Step 3: Decide on a model

Many models have been developed. Some are better suited to images, while others are better suited to numerical data, text, and so on. It’s important to pick the right model. The model can be either supervised or unsupervised, depending on the requirement and the dataset.
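When several models look plausible, one way to choose between them is to compare their cross-validated scores on the same data. The sketch below does this for two arbitrary candidates on the Iris dataset; both the candidates and the dataset are assumptions made for the example, not a recommendation.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two candidate models, picked only for illustration.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
}

# Score each candidate with 5-fold cross-validation and keep the mean accuracy.
mean_scores = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    mean_scores[name] = scores.mean()
    print(f"{name}: {mean_scores[name]:.3f}")
```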

Step 4: Split the Dataset

We divide the dataset into three parts: a training set, a validation set, and a testing set. The training set is used to fit the model, the validation set is used to check it during training, and the testing set is used to measure the trained model’s accuracy.

Training set: The collection of data used to train the model. The model learns the relationship between the features and the target from this data.

Validation set: Used to check the model’s performance during training, for example when tuning its settings. Its major goal is to detect overfitting early so that it can be prevented.

Testing set: Used to evaluate the model’s accuracy once it has been trained. It should only be used after training is complete.

A validation set is frequently reused as a test set, but this is not good practice: the model has already been tuned against it, so the resulting score is too optimistic.
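A three-way split can be sketched by calling Scikit-learn's `train_test_split` twice. The 60/20/20 proportions and the Iris dataset below are assumptions chosen for the example; the right proportions depend on how much data you have.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split: hold out 20% of the data as the testing set.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Second split: 25% of the remaining 80% becomes the validation set,
# which gives 60% training, 20% validation and 20% testing overall.
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)

print(len(X_train), len(X_val), len(X_test))  # 90 30 30
```

Passing `stratify` keeps the class proportions the same in each part, which matters most for small or imbalanced datasets.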

Figure: Splitting of the dataset

Step 5: Train the model

In this step, the model is trained by feeding it the training dataset. Train the model using a machine learning algorithm that is appropriate for the dataset and the task requirements.
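In Scikit-learn, training comes down to calling `fit` on the training data. The sketch below trains a small decision tree on Iris and checks it on a held-out validation set; the algorithm, the `max_depth` value, and the dataset are all illustrative assumptions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Train the model on the training set only.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Check the trained model on data it has not seen during training.
val_accuracy = model.score(X_val, y_val)
print(f"validation accuracy: {val_accuracy:.2f}")
```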

Step 6: Evaluation

Model evaluation is an important aspect of model development. It is carried out on the testing dataset, which lets you validate the model using data that was never used for training.

Basic performance measure for a classification task:

Accuracy: Proportion of test cases classified correctly
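As a small worked example, the snippet below computes accuracy by hand and with Scikit-learn's `accuracy_score`. The label lists are made up for the illustration.

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and model predictions for 8 test cases.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy by hand: correctly classified cases divided by all cases.
manual_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
library_accuracy = accuracy_score(y_true, y_pred)

print(manual_accuracy)   # 0.75 (6 of 8 predictions are correct)
print(library_accuracy)  # 0.75
```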

Basic performance measure for a regression task:

Root mean squared error: square each error, average the squares over the number of test cases, and take the square root.
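The three operations in that definition map directly onto NumPy calls. The target and prediction values below are made up for the illustration.

```python
import numpy as np

# Hypothetical true values and model predictions for 4 test cases.
y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.0, 5.0, 4.0, 8.0])

errors = y_pred - y_true            # [-1.  0.  2.  1.]
squared = errors ** 2               # square each error: [1. 0. 4. 1.]
mean_squared = squared.mean()       # average over the test cases: 1.5
rmse = np.sqrt(mean_squared)        # take the square root

print(round(rmse, 4))  # 1.2247
```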

The confusion matrix is more informative than accuracy alone, especially for multi-class or imbalanced problems. For a binary classification task, each prediction falls into one of four outcomes:

True Positive: When the actual classification is positive and the predicted classification is also positive.

True Negative: When both the actual and predicted categories are negative.

False Positive: When the actual classification is negative but the predicted classification is positive.

False Negative: When the actual classification is positive but the predicted classification is negative.

Figure: Confusion matrix

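The confusion matrix shows which kinds of mistakes the model makes. By looking at it, we can work on reducing the false positives and false negatives, and so increase the number of true positives and true negatives.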
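The four outcomes can be counted with Scikit-learn's `confusion_matrix`. The sketch below uses made-up binary labels, with 1 as the positive class.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = positive, 0 = negative).
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes, in label order 0, 1.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
# [[3 1]
#  [1 3]]

# For binary labels, ravel() returns the counts in this order.
tn, fp, fn, tp = matrix.ravel()
print(tp, tn, fp, fn)  # 3 3 1 1
```

Here the accuracy is (3 + 3) / 8 = 0.75, but the matrix also shows that the two errors are one false positive and one false negative.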

Summary

Below is the flowchart of the Machine Learning workflow for Supervised Learning.


Vishakha Ratnakar

Masters in Data Analytics from National University of Ireland, Galway. LinkedIn: www.linkedin.com/in/vishakha-ratnakar