5 Components of a Supervised Machine Learning Model
In this blog post, we will look at the components needed to build a comprehensive machine learning model. The focus is on what is required for a ML model rather than how to achieve it.
[1] Training Data
In simple terms, training data is a labeled set of information used to create a machine learning model.
Training data is the food for all learning models: it influences your model's performance and predictions.
Use good-quality training data that is clean, sanitized, and curated for the problem you are trying to solve.
Consider the Collection -> Cleansing -> Labeling cycle for improving the quality of your training data set:
Collection: Extract the relevant feature set, ensure coverage and mitigate bias
Cleansing: De-duplicate records, identify outliers, and look for missing values and structural errors
Labeling: Ensure the labeling is accurate and consistent; use a confusion matrix for analysis, limit your label set, and iterate continuously to improve the training data
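The Cleansing step above can be sketched in plain Python. This is a minimal, illustrative example; the record fields, the z-score threshold, and the function name are assumptions, and real pipelines typically use libraries such as pandas for this work.

```python
def cleanse(records, field, z_threshold=3.0):
    """Drop duplicate records, records missing `field`, and simple z-score outliers."""
    # De-duplication: keep the first occurrence of each identical record
    seen, deduped = set(), []
    for r in records:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            deduped.append(r)

    # Missing values: drop records where the field is absent or None
    complete = [r for r in deduped if r.get(field) is not None]

    # Outliers: drop values more than z_threshold standard deviations from the mean
    values = [r[field] for r in complete]
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return complete
    return [r for r in complete if abs(r[field] - mean) / std <= z_threshold]
```

Each pass mirrors one bullet above; in practice you would iterate this cycle as new data arrives.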
[2] Performance Metric
The performance metric is what the model intends to optimize. It should be chosen carefully so that optimizing it does not produce unintended consequences.
There are multiple metrics for evaluating the performance of a model, some of which are highlighted below. This is by no means a comprehensive list, but it provides a solid starting point for model evaluation:
Accuracy: # of correct predictions / # of total predictions
F1 Score: Measure of model performance based on precision and recall
Log Loss/Cross Entropy Loss: Loss function in logistic regression or neural networks
Mean Squared Error: Average of the squared residuals, i.e. the differences between actual and predicted values
Confusion Matrix: Matrix of True Positives, True Negatives, False Positives and False Negatives
Area under ROC Curve (AUC): Area under the Receiver Operating Characteristic curve, a plot of True Positive Rate vs False Positive Rate
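A few of the metrics above can be computed by hand to make the formulas concrete. The toy labels and predictions below are illustrative; in practice libraries such as scikit-learn provide all of these out of the box.

```python
# Toy binary classification results (illustrative values)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# Accuracy: # of correct predictions / # of total predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Confusion matrix entries
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives

# F1 score: harmonic mean of precision and recall
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
```

Note how the F1 score is built entirely from the confusion matrix entries, which is why the confusion matrix is a useful first artifact to inspect.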
[3] ML Algorithm
Choosing the right ML algorithm is paramount for a successful machine learning model. The nuances of the popular underlying models should be understood before deciding on an algorithm.
Some key considerations to keep in mind are:
- Linear Regression and Logistic Regression models are simple, high-bias models with a tendency to underfit data whose underlying relationships are non-linear
- Decision Tree models can be largely influenced by small changes in data which makes them unstable
- Random Forest or Gradient Boosting models are difficult to reconcile with Responsible AI principles because their predictions cannot be easily explained. Random Forest models are also inherently slow at prediction time (optimizations do exist), which may impact the performance of an application as a whole
- Deep Learning models or Neural Networks are very difficult to interpret and extremely slow to train
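Given these trade-offs, it often pays to compare several candidate algorithms empirically before committing to one. The sketch below assumes scikit-learn is available and uses a synthetic dataset; with real data the ranking of algorithms may well differ.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a real labeled dataset
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Candidate algorithms discussed above (hyperparameter values are illustrative)
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean 5-fold cross-validated accuracy for each candidate
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
```

Cross-validation gives a more stable comparison than a single train/test split, which matters especially for unstable models such as decision trees.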
[4] Hyperparameters
If parameters are the values a model learns from the training data, hyper-parameters are the settings, chosen before training, that control how that learning happens.
The simplest examples are the learning rate and the mini-batch size.
Hyper-parameter tuning is a complex topic: hyper-parameters can take many different forms, such as being continuous or dependent on other hyper-parameters.
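A minimal sketch of tuning the two example hyper-parameters above (learning rate and mini-batch size) via grid search, using plain mini-batch SGD on a toy 1-D linear regression. All values and the function name are illustrative assumptions.

```python
import random

def sgd_fit(xs, ys, lr, batch_size, epochs=50, seed=0):
    """Fit y = w*x with mini-batch SGD; return the final mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)
        for i in range(0, len(data), batch_size):
            batch = data[i:i + batch_size]
            # Gradient of MSE with respect to w over the mini-batch
            grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
            w -= lr * grad
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

xs = [i / 10 for i in range(20)]
ys = [3.0 * x for x in xs]  # true weight is 3.0

# Grid search: pick the (learning rate, batch size) pair with the lowest error
best = min(
    ((lr, bs) for lr in (0.01, 0.1) for bs in (2, 8)),
    key=lambda hp: sgd_fit(xs, ys, lr=hp[0], batch_size=hp[1]),
)
```

Note that the hyper-parameters (lr, batch_size, epochs) are fixed before training begins, while the parameter w is what the training loop learns.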
[5] Evaluation Dataset
The dataset used to evaluate the performance of a trained machine learning model. It should be a well-formed labeled dataset that the model has not seen during training, often carved out as a held-out subset of the original labeled data.
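Carving out a held-out evaluation set can be sketched as follows. The 80/20 split ratio and the fixed seed are common conventions, not requirements.

```python
import random

def train_eval_split(records, eval_fraction=0.2, seed=42):
    """Shuffle the labeled records and hold out a fraction for evaluation."""
    rng = random.Random(seed)  # fixed seed makes the split reproducible
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]
```

The key property is that the two sets are disjoint: evaluating on examples the model trained on would overstate its real-world performance.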
All of the above components are essential for a predictable, efficient machine learning model and should be considered during ML model design and development