Workflow of Supervised Learning Algorithms
Machine learning algorithms are divided into four categories: supervised, unsupervised, semi-supervised, and reinforcement. In my previous blog, I provided a fundamental overview of all four algorithms. You are welcome to take a look at it (Machine Learning Algorithms).
Let’s walk through the workflow of a supervised learning model step by step.
Step 1: Data Gathering
The data for training the model can come from a variety of places. The Scikit-learn library provides built-in datasets that can be used for both training and testing purposes. We can also make use of freely available datasets on the internet, such as those on Kaggle. However, this data usually cannot be used directly, since it may contain missing information, be noisy, or require conversion into a specific format.
For example, the data for a regression task should be numeric. If the data you’ve imported has categorical values, you’ll need to convert them to numeric values, which you can do with data pre-processing methods.
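As a minimal sketch of such a conversion, one-hot encoding with pandas turns a categorical column into numeric indicator columns (the column names below are made up for illustration):

```python
import pandas as pd

# A tiny illustrative frame with one categorical column ("city" is hypothetical)
df = pd.DataFrame({
    "city": ["London", "Paris", "London"],
    "rooms": [2, 3, 1],
})

# One-hot encode the categorical column so every feature is numeric
encoded = pd.get_dummies(df, columns=["city"])
print(sorted(encoded.columns))  # rooms, city_London, city_Paris
```

Scikit-learn’s `OneHotEncoder` or `LabelEncoder` would do the same job inside a pipeline.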
Step 2: Data pre-processing
Data pre-processing is an important part of machine learning. It is the process of cleansing raw data acquired from various sources and preparing it for use in model training. To get good results from an applied model, data preprocessing is required.
Steps in data pre-processing include:
•Data Cleaning: filling in blanks, correcting errors, and deleting unnecessary data.
•Data Transformation: techniques such as aggregation, normalization, and feature selection.
•Data Reduction: extracting the relevant portion of a large dataset to meet a specific requirement; it entails attribute selection and dimensionality reduction.
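The cleaning and transformation steps above can be sketched with scikit-learn on a toy feature matrix (the values here are purely illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Toy feature matrix with one missing value
X = np.array([[1.0, 200.0],
              [np.nan, 300.0],
              [3.0, 400.0]])

# Data cleaning: fill the blank with the column mean
X_clean = SimpleImputer(strategy="mean").fit_transform(X)

# Data transformation: normalize each feature to zero mean, unit variance
X_scaled = StandardScaler().fit_transform(X_clean)
print(X_scaled.mean(axis=0))  # each column's mean is ~0 after scaling
```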
Step 3: Decide on a Model
Many models have been developed. Some are better suited to image data, while others work better for numerical data, text, and so on. It’s important to pick the right model. Depending on the requirement and the dataset, the model can be either supervised or unsupervised.
Step 4: Split the Dataset
We divide the dataset into three parts: a training set, a validation set, and a testing set. The training and validation sets are used to train and tune the model, while the testing set is used to measure the trained model’s accuracy.
Training set: The collection of data used to train the model. The features are learned by the model from the data.
Validation set: Used to check the model’s performance during training. Its main purpose is to detect overfitting and guide tuning decisions.
Testing set: Used to evaluate the model’s accuracy once it has been trained. It’s only used when the model has been thoroughly trained.
In practice the validation set is sometimes reused as the test set, but this is poor practice: tuning decisions leak into the final evaluation.
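The three-way split can be sketched by applying scikit-learn’s `train_test_split` twice (the 60/20/20 proportions below are an illustrative choice, not a rule):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 illustrative samples with 2 features each
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First carve off 20% as the held-out testing set...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# ...then split the remainder into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```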
Step 5: Train the model
In this step the model is trained by feeding it the training data, using a machine learning algorithm appropriate to the dataset and the task requirements.
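For instance, fitting a linear regression model on a tiny, purely illustrative dataset looks like this:

```python
from sklearn.linear_model import LinearRegression

# Toy training data following y = 2x (illustrative only)
X_train = [[1], [2], [3], [4]]
y_train = [2, 4, 6, 8]

model = LinearRegression()
model.fit(X_train, y_train)   # the model learns the feature-target mapping
print(model.predict([[5]]))   # close to 10 for this linear data
```

Any other estimator (a decision tree, an SVM, and so on) follows the same `fit`/`predict` pattern in scikit-learn.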
Step 6: Evaluation
Model evaluation is an important aspect of model development. It is carried out on the testing dataset, which lets you check the model’s performance on data it has never seen during training.
A basic performance measure for classification tasks:
•Accuracy: Proportion of test cases classified correctly
A basic performance measure for regression tasks:
•Root mean squared error: square each error, average over the number of test cases, and get the square root.
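Both measures are available in scikit-learn; a small sketch with hand-picked illustrative values:

```python
import numpy as np
from sklearn.metrics import accuracy_score, mean_squared_error

# Classification: accuracy = proportion of correct predictions
y_true_cls = [1, 0, 1, 1]
y_pred_cls = [1, 0, 0, 1]
print(accuracy_score(y_true_cls, y_pred_cls))  # 3 of 4 correct -> 0.75

# Regression: RMSE = square root of the mean of the squared errors
y_true_reg = [3.0, 5.0]
y_pred_reg = [2.0, 7.0]
rmse = np.sqrt(mean_squared_error(y_true_reg, y_pred_reg))
print(rmse)  # sqrt((1 + 4) / 2)
```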
For classification problems, the confusion matrix is more informative than accuracy alone. In the binary case there are four possible outcomes:
•True Positive: When the actual classification is positive and the predicted classification is also positive.
•True Negative: When both the actual and predicted categories are negative.
•False Positive: When the actual classification is negative but the predicted classification is positive.
•False Negative: When the actual classification is positive but the predicted classification is negative.
The confusion matrix shows which type of error dominates, so we can tune the model to increase the number of true positives and true negatives.
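The four outcomes can be read straight off scikit-learn’s `confusion_matrix` for a small made-up binary example (1 = positive, 0 = negative):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels: 1 = positive class, 0 = negative class
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# Rows are actual classes, columns are predicted classes;
# for binary labels ravel() yields (tn, fp, fn, tp)
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```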
Summary
Below is the flowchart of the Machine Learning workflow for Supervised Learning.