5 Components of a Supervised Machine Learning Model
In this blog post, we will look at the components needed to build a comprehensive machine learning model. The focus is on what is required for a ML model rather than how to achieve it.
[1] Training Data
In simple terms, training data is a labeled set of information used to create a machine learning model.
Training data is the food for all learning models: it influences your model's performance and predictions.
Use good-quality training data that is clean, sanitized, and curated for the problem you are trying to solve.
Consider the Collection -> Cleansing -> Labeling cycle for improving the quality of your training data set:
Collection: Extract the relevant feature set, ensure coverage and mitigate bias
Cleansing: De-duplicate records, identify outliers, and look for missing values and structural errors
Labeling: Ensure the labeling is accurate and consistent; use a confusion matrix for analysis, limit your label set, and iterate continuously to improve the training data
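The Cleansing step above can be sketched in plain Python. This is a minimal, illustrative example; the record fields, the z-score threshold, and the function name are assumptions, and real pipelines typically use libraries such as pandas for this work.

```python
def cleanse(records, field, z_threshold=3.0):
    """Drop duplicate records, records missing `field`, and simple z-score outliers."""
    # De-duplication: keep the first occurrence of each identical record
    seen, deduped = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(r)

    # Missing values: drop records where the field is absent or None
    complete = [r for r in deduped if r.get(field) is not None]

    # Outliers: drop values more than z_threshold standard deviations from the mean
    values = [r[field] for r in complete]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return complete
    return [r for r in complete if abs(r[field] - mean) / std <= z_threshold]
```

Each pass mirrors one bullet above; in practice you would iterate this cycle as new data arrives.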
[2] Performance Metric
The performance metric is what the model intends to optimize. It should be chosen carefully so that optimizing it does not produce unintended consequences.
There are multiple metrics for evaluating the performance of a model, some of which are highlighted below. This is by no means a comprehensive list, but it provides a solid starting point for model evaluation:
Accuracy: # of correct predictions / # of total predictions
F1 Score: Measure of model performance based on precision and recall
Log Loss/Cross Entropy Loss: Loss function in logistic regression or neural networks
Mean Squared Error: Average of the squared residuals, i.e. the differences between actual and predicted values
Confusion Matrix: Matrix of True Positives, True Negatives, False Positives and False Negatives
Area under ROC Curve (AUC): Area under the Receiver Operating Characteristic curve, a plot of True Positive Rate vs False Positive Rate
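A few of the metrics above can be computed by hand to make the formulas concrete. The toy labels and predictions below are illustrative; in practice libraries such as scikit-learn provide all of these out of the box.

```python
# Toy binary classification results (illustrative values)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: # of correct predictions / # of total predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix entries
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

# F1 score: harmonic mean of precision and recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

Note how the F1 score is built entirely from the confusion matrix entries, which is why the confusion matrix is a useful first artifact to inspect.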
[3] ML Algorithm
Choosing the right ML algorithm is paramount for a successful machine learning model. The nuances of the popular underlying models should be understood before deciding on an algorithm.
Some key considerations to keep in mind are:
- Linear Regression and Logistic Regression models are simple, high-bias models with a tendency to underfit data whose underlying relationships are non-linear
- Decision Tree models can be largely influenced by small changes in data which makes them unstable
- Random Forest or Gradient Boosting models are difficult to reconcile with Responsible AI principles because their predictions cannot be easily explained. Random Forest models are also inherently slow at prediction time (optimizations do exist), which may impact the performance of an application as a whole
- Deep Learning models or Neural Networks are very difficult to interpret and extremely slow to train
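Given these trade-offs, it often pays to compare several candidate algorithms empirically before committing to one. The sketch below assumes scikit-learn is available and uses a synthetic dataset; with real data the ranking of algorithms may well differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real labeled dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate algorithms discussed above (hyperparameter values are illustrative)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
```

Cross-validation gives a more stable comparison than a single train/test split, which matters especially for unstable models such as decision trees.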
[4] Hyperparameters
If parameters are the values a model learns from the training data, hyper-parameters are the settings, chosen before training, that control how that learning happens.
The simplest examples are the learning rate and the mini-batch size.
Hyper-parameter tuning is a complex topic: hyper-parameters can take many different forms, such as being continuous or dependent on other hyper-parameters.
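A minimal sketch of tuning the two example hyper-parameters above (learning rate and mini-batch size) via grid search, using plain mini-batch SGD on a toy 1-D linear regression. All values and the function name are illustrative assumptions.

```python
import random

def sgd_fit(xs, ys, lr, batch_size, epochs=50, seed=0):
    """Fit y = w*x with mini-batch SGD; return the final mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of MSE with respect to w over the mini-batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

xs = [i / 10 for i in range(20)]
ys = [3.0 * x for x in xs]  # true weight is 3.0

# Grid search: pick the (learning rate, batch size) pair with the lowest error
best = min(
    ((lr, bs) for lr in (0.01, 0.1) for bs in (2, 8)),
    key=lambda hp: sgd_fit(xs, ys, lr=hp[0], batch_size=hp[1]),
)
```

Note that the hyper-parameters (lr, batch_size, epochs) are fixed before training begins, while the parameter w is what the training loop learns.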
[5] Evaluation Dataset
The dataset used to evaluate the performance of a trained machine learning model. It should be a well-formed labeled dataset that the model has not seen during training, often carved out as a held-out subset of the original labeled data.
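Carving out a held-out evaluation set can be sketched as follows. The 80/20 split ratio and the fixed seed are common conventions, not requirements.

```python
import random

def train_eval_split(records, eval_fraction=0.2, seed=42):
    """Shuffle the labeled records and hold out a fraction for evaluation."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]
```

The key property is that the two sets are disjoint: evaluating on examples the model trained on would overstate its real-world performance.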
All of the above components are essential for a predictable, efficient machine learning model and should be considered during ML model design and development