Model Selection Techniques in ML/AI with Python

Shachi Kaul
Published in Analytics Vidhya
May 31, 2020

When hunting down solutions to ML problems, various models are built and evaluated. Usually, a model is chosen by training it on your dataset and then evaluating it on unseen data. But it is always wise to consider other, possibly better, options, and this is where model selection comes into the picture.
Model selection can mean choosing either:
- the best hyper-parameter configuration for a given model, or
- the best among several candidate models

This blog discusses the various ways of selecting your best model. Selecting a model means picking not only the best-performing model in terms of accuracy, AUC, etc., but also a less complex one. Some techniques focus only on a model's performance, irrespective of its complexity, which can lead to over-fitting or under-fitting. Particular attention is given to probabilistic model selection.

Choose Your Model

Multiple models are fitted and evaluated in order to choose the best one. My research led me to the chart below.

Figure 1

There are three ways of selecting your ML model, two of which come from the fields of probability and sampling. Let's get to know them.

  1. Random Train/Test Split:

The data passed to the model is split into train and test sets in a chosen ratio. This can be done with the train_test_split function of the scikit-learn Python library.

On re-running the train_test_split code, the results come out different on each run, so you cannot be sure how exactly your model will perform on unseen data, as the sketch below illustrates.
The uncertainty of model performance in this technique motivates the techniques that follow.
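A minimal sketch of this variability, assuming scikit-learn's built-in breast-cancer dataset and a logistic-regression model purely as illustrative stand-ins for your own data and model:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Repeat the split/fit/score cycle; without a fixed random_state each run
# draws a different train/test split and reports a different accuracy.
for run in range(1, 4):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print(f"Run {run}: test accuracy = {model.score(X_test, y_test):.3f}")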

2. Resampling:

  • In the resampling technique of model selection, the data is resampled into train/test sets over a number of iterations; in each iteration the model is trained on the train set and evaluated on the test set.
  • The model chosen with this technique is assessed on performance alone, not on model complexity.
  • Performance is computed on out-of-sample data: resampling techniques estimate the error by evaluating on out-of-sample, i.e. unseen, data.
    - Resampling strategies include K-Fold, Stratified K-Fold, etc., as sketched below.
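A minimal K-Fold sketch, again assuming the breast-cancer dataset and a logistic-regression model purely for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = LogisticRegression(max_iter=5000)

# 5-fold resampling: each fold serves once as the out-of-sample test set.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("Per-fold accuracy:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f}")

The mean of the per-fold scores is what you would then compare across candidate models.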

For in-depth information on resampling strategies, it is recommended to visit Deeply Explained Cross-Validation in ML/AI.

3. Probabilistic Model Selection:

Probabilistic model selection refers to statistical methods whose quality is measured by an Information Criterion (IC). These techniques use a scoring method built on a probability framework, the log-likelihood from Maximum Likelihood Estimation (MLE), to choose the best among the candidate models.

It is a very useful way of selecting your model because it considers both performance and complexity, unlike resampling techniques.

Figure 2

Let's look at it in more depth.

  • An IC is a statistical measure that produces a score. The model with the lowest score loses the least information and is considered the best. A single score is useless until you compare it with the scores of other models.
  • The model is chosen by a scoring method whose score is based on:
    - Performance on the training data, evaluated using the log-likelihood, which comes from MLE and is used to optimize the model parameters. It tells you how well your model fits the data.
    - Model complexity, evaluated using the number of parameters (or degrees of freedom) in the model.
  • Performance is computed on in-sample data, which means no test set is required; the score is computed directly on the whole training set.
  • Lower complexity means a simpler model with fewer parameters, easier to understand and maintain, but it may fail to capture variations in the data, hurting the model's performance.

The score rewards models that achieve a higher goodness of fit and penalizes them as they become overly complex.

  • The common statistical methods are:
    ~ AIC (Akaike Information Criterion), from frequentist probability
    ~ BIC (Bayesian Information Criterion), from Bayesian probability
    Both are calculated from the log-likelihood, which in practice reduces to MSE for regression and to log-loss (cross-entropy) for classification.
  • When models are fitted and compared with AIC/BIC, there is a risk that increasing the likelihood by adding more parameters leads to over-fitting. A penalty term is therefore added to the equation.
    In BIC, for instance, the penalty term grows as more parameters are added, so an over-parameterized model is heavily penalized, which can sometimes push you towards a very simple model. The second term is the likelihood, which identifies the parameters that best fit the model and measures goodness of fit, hence performance (see the sketch below).
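As a rough sketch, using the standard definitions AIC = 2k - 2·ln(L̂) and BIC = k·ln(n) - 2·ln(L̂), and an ordinary least-squares fit on synthetic data purely for illustration (neither the data nor the model comes from this blog), the criteria can be computed from the maximized log-likelihood and the number of parameters k:

import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 3))
y = X @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.5, size=n)

model = LinearRegression().fit(X, y)
mse = np.mean((y - model.predict(X)) ** 2)

# Gaussian log-likelihood of the fit, expressed through the MSE
log_likelihood = -0.5 * n * (np.log(2 * np.pi * mse) + 1)

k = X.shape[1] + 1  # estimated parameters: coefficients + intercept
aic = 2 * k - 2 * log_likelihood
bic = k * np.log(n) - 2 * log_likelihood
print(f"AIC = {aic:.2f}, BIC = {bic:.2f}")  # lower is better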
Figure 3

AIC/BIC are explained and implemented in depth in Probabilistic Model Selection with AIC/BIC.

When can you call your model the best?

Figure 4

Here is the answer to the question above.
Have you heard of the bias-variance trade-off? Well, it holds the key.
A less complex (or too simple) model has few parameters, which leads to high bias and low variance and hence to under-fitting, while too many parameters lead to low bias and high variance and hence to over-fitting. Either way, too few or too many parameters imply poor model performance. A penalty term is therefore added to keep the balance: when more parameters are added, the model is hit with a large penalty, pushing it towards a simpler form.
When your model is capable of balancing bias and variance, you are in good shape.
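A minimal sketch of this balance, assuming synthetic polynomial data, ordinary least-squares fits, and the standard BIC formula (all illustrative, not from this blog): as the polynomial degree grows, the fit improves but the penalty grows too, so the lowest BIC typically lands near the true degree.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
n = 300
x = rng.uniform(-3, 3, size=(n, 1))
y = 1.0 - 2.0 * x[:, 0] + 0.5 * x[:, 0] ** 2 + rng.normal(scale=1.0, size=n)  # true degree = 2

# Too low a degree under-fits (high bias); too high a degree is penalized
# for its extra parameters, so BIC balances the two.
for degree in range(1, 7):
    X_poly = PolynomialFeatures(degree, include_bias=False).fit_transform(x)
    model = LinearRegression().fit(X_poly, y)
    mse = np.mean((y - model.predict(X_poly)) ** 2)
    log_likelihood = -0.5 * n * (np.log(2 * np.pi * mse) + 1)
    k = X_poly.shape[1] + 1  # coefficients + intercept
    bic = k * np.log(n) - 2 * log_likelihood
    print(f"degree {degree}: BIC = {bic:.1f}")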

Feel free to follow this author if you liked the blog, as this author promises to come back with more interesting ML/AI topics.
Thanks,
Happy Reading! :)

You can get in touch via LinkedIn.
