AIC, BIC, Penalties and the Science behind Model selection
Math alert on this flight!! But much simplified. So sit back, relax and enjoy the flight.
You have been given a dataset. You are asked either to design the best model to perform regression or classification, OR to come up with a mechanism that offers a recipe to decide which of the models given to you works the best. What do you do?
On the one hand, you look at your “Confusion Matrix” and derive metrics such as Precision, Recall, Sensitivity, F1 Score, Specificity, Area under the PR Curve, Accuracy, R2, MAE, MAPE, RMSE, etc. for each of your models to decide which one predicts well. Alternatively, you compute the Log-loss or Cross-entropy loss and see which one is the best. There are umpteen references and articles on Medium that explain these metrics. If you have doubts about them, that is a different happy meal on a different flight. This flight is destined to take you closer to AIC, BIC and the science behind the best model selection.
What the methods above don’t ask is: how much did these models overfit the data, delivering a great fit at the cost of tremendously increased complexity? As an AI leader, or a Chief ML Scientist, you are behind the wheel to decide which model-road suits best and steer your project-bus in that direction.
Precisely for this purpose, we have penalized-likelihood metrics called AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) to help you decide which model suits you the best. You will fly and travel through the peaks and valleys of AIC and BIC, but for the moment, hold that thought: you need to taxi your flight and figure out what Likelihood, Log-likelihood and Maximum Likelihood Estimation are before you hit the AIC/BIC runway.
So let’s dive into Likelihood with a simple example of rolling a dice (six sides, with 1 to 6 dots).
Likelihood: When you roll an unbiased dice, the probability of getting a SIX is 1/6 and the probability of not getting a SIX is 5/6. So, P(SIX) = 1/6. Let’s call it p. And P(not SIX) = 5/6, which would then be (1-p).
Now, let’s say we would like to see a particular sequence while rolling the dice: a SIX on every alternate attempt, for a total of 5 attempts. That is SIX, not SIX, SIX, not SIX, SIX. The likelihood of observing this desired sequence would be
P(SIX, not SIX, SIX, not SIX, SIX | p) = p*(1-p)*p*(1-p)*p
Effectively, the likelihood of this sequence occurring is about 0.0032, which is much smaller than the individual probability of 1/6.
On the contrary, suppose the dice were biased such that the probability of getting a SIX is double that of the unbiased dice, which means P(SIX) = 2p = 2 * (1/6) = 1/3. The probability of getting anything other than SIX is (1-2p) = 1 - 1/3 = 2/3. So,
P(SIX, not SIX, SIX, not SIX, SIX | p) = 2p*(1-2p)*2p*(1-2p)*2p
This means the likelihood value is about 0.0165. Now you see that the biased dice has a higher likelihood of producing this sequence.
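If you want to sanity-check these numbers yourself, here is a minimal sketch in plain Python (the function name sequence_likelihood is just for illustration) that multiplies out the two products:

# Likelihood of the sequence SIX, not SIX, SIX, not SIX, SIX
# for an unbiased dice (P(SIX) = 1/6) and a biased dice (P(SIX) = 1/3)
def sequence_likelihood(p_six):
    p_not_six = 1 - p_six
    # three SIXes and two non-SIXes, multiplied in the observed order
    return p_six * p_not_six * p_six * p_not_six * p_six

print(f"Unbiased dice: {sequence_likelihood(1/6):.5f}")  # about 0.0032
print(f"Biased dice:   {sequence_likelihood(1/3):.5f}")  # about 0.0165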
So let’s define Likelihood. It is the probability of observing the given dataset under a particular model. Maximizing this probability, the likelihood of getting the data you actually observed, is the need of the hour. But these likelihoods, as you might have noticed, are products of many small probabilities and quickly become tiny floating-point numbers that lack ‘numerical stability’, as is evident in the above example. To improve the numerical stability, we compute the logarithm of the likelihood.
Log-Likelihood: Taking the natural logarithm of the likelihood transforms a product of probabilities like [ p*(1-p)*p*(1-p)*p ] or [ 2p*(1-2p)*2p*(1-2p)*2p ] into a sum of log terms. It would now be written as:
[ log(p) + log(1-p) + log(p) + log(1-p) + log(p) ] and
[ log(2p) + log(1-2p) + log(2p) + log(1-2p) + log(2p) ] respectively for the unbiased and biased dice.
Log-likelihood is defined as the natural logarithm of the likelihood function of the given model.
Maximum Likelihood Estimation: Choosing the parameter values that maximize the log-likelihood gives you the model that best suits the dataset. The resulting fit is what we refer to below as the goodness of fit of the model.
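As a small, hypothetical illustration of MLE, we can treat P(SIX) as an unknown parameter and pick the value that maximizes the log-likelihood of the observed sequence (three SIXes out of five rolls). A simple grid search is enough here; the maximum lands at P(SIX) = 3/5, which is exactly the observed fraction of SIXes:

import numpy as np

# Log-likelihood of the observed sequence (3 SIXes, 2 non-SIXes) as a function of P(SIX)
def log_likelihood(p_six):
    return 3 * np.log(p_six) + 2 * np.log(1 - p_six)

# Grid search over candidate values of P(SIX); argmax picks the maximum likelihood estimate
grid = np.linspace(0.01, 0.99, 99)
best_p = grid[np.argmax([log_likelihood(p) for p in grid])]
print(f"MLE of P(SIX): {best_p:.2f}")  # prints 0.60, i.e. 3 SIXes out of 5 rolls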
Now that we have taxied onto the runway, let us fly to AIC and BIC. Without these mathematical fundamentals, we would not have been able to intuitively understand AIC and BIC.
AIC (Akaike Information Criterion): This is a metric that uses the log-likelihood and the number of parameters of the model, and tries to strike a balance between the goodness of fit (performance) of the model and its complexity, thereby penalizing complex models. It is computed as 2 times the number of parameters minus 2 times the log-likelihood of the model. Mathematically, it is given as
(-2) * Log-likelihood + 2 * k
where k is the number of parameters of the model, meaning the number of features of the model plus the number of estimated hyperparameters. This is also called the “degrees of freedom” of the model. In our biased dice example, this would be denoted as
(-2) * [ log(2p) + log(1-2p) + log(2p) + log(1-2p) + log(2p) ] + 2*k
If k is large, meaning a large number of features or hyperparameters, the AIC of the model will be high, which is not favorable for model selection. Fewer parameters/hyperparameters and a higher log-likelihood both lower the AIC. Hence, the goal is to choose the model with the least AIC. Whether you are picking a model out of a given set with different hyperparameters, or designing different models and finding the best one, this method penalizes complex models while balancing that against the log-likelihood, i.e. the goodness of fit to the given dataset. Models that overfit tend to be the more complex ones and hence have a higher potential to get rejected.
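To make the formula concrete, here is a tiny sketch (the helper function aic is my own naming) applied to our two dice models. Treating each dice model as having a single estimated parameter, k = 1 (the probability of rolling a SIX), the comparison is driven entirely by the log-likelihood term:

import numpy as np

def aic(log_likelihood, k):
    # AIC = 2k - 2 * log-likelihood; lower is better
    return 2 * k - 2 * log_likelihood

# Log-likelihoods of the observed sequence under the two dice models
ll_unbiased = 3 * np.log(1/6) + 2 * np.log(5/6)  # about -5.74
ll_biased = 3 * np.log(1/3) + 2 * np.log(2/3)    # about -4.11

print(f"AIC (unbiased dice): {aic(ll_unbiased, k=1):.2f}")  # about 13.48
print(f"AIC (biased dice): {aic(ll_biased, k=1):.2f}")      # about 10.21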
Now let’s fly towards BIC.
BIC (Bayesian Information Criterion): This Bayesian metric is also a penalized-likelihood criterion. It is computed as the product of the number of parameters and the log of the number of datapoints in the dataset, minus 2 times the log-likelihood of the model. The key difference from AIC is the second term, k * log(N), where N is the size of the dataset, as opposed to 2 * k. Mathematically, it is given as
(-2) * Log-likelihood + k * log(N)
where k is the number of parameters of the model, meaning the number of features plus the number of hyperparameters, and N is the size of the dataset.
If you look at this expression intuitively, it takes into account how big the dataset is. The bigger the dataset, the more complex your models can become, and the more likely it is that some of the models that fit well are also complex and overfitting the data. Applying BIC filters out the most complex models, scaling the penalty with the size of the dataset. It gives you the right fit by penalizing the most complex ones, striking a balance so that the chosen model doesn’t overfit the dataset either.
In essence, this is a more penalizing technique compared to AIC (for any dataset with more than a handful of observations), and I like it specifically because it takes the size of the dataset into account. I feel that is an important factor to consider when you weigh the performance and complexity of a model, especially during these overfitting seasons in the industry.
So for our biased dice example, with N = 5 rolls, the BIC of our model would be denoted as
(-2) * [ log(2p) + log(1-2p) + log(2p) + log(1-2p) + log(2p) ] + (k * log(5))
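Here is the matching sketch for BIC, under the same assumption of one parameter per dice model and N = 5 observed rolls. With such a tiny dataset the per-parameter penalty log(5) is actually smaller than AIC’s 2; the penalty only overtakes AIC once N grows beyond about 7 datapoints:

import numpy as np

def bic(log_likelihood, k, n):
    # BIC = k * log(N) - 2 * log-likelihood; lower is better
    return k * np.log(n) - 2 * log_likelihood

ll_biased = 3 * np.log(1/3) + 2 * np.log(2/3)  # about -4.11, from the biased dice
print(f"BIC (biased dice, N=5): {bic(ll_biased, k=1, n=5):.2f}")  # about 9.82

# The per-parameter penalty grows with the dataset size
print(f"Penalty per parameter at N=5: {np.log(5):.2f}, at N=1000: {np.log(1000):.2f}")  # 1.61 vs 6.91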
Now let’s dive into some code. The code below and its output will tell you why a deep learning model is not suitable for a dataset like the famous Iris dataset, and why a simple Logistic Regression would suffice.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss
# Load Iris dataset
iris = load_iris()
X = iris.data
y = iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit logistic regression model using scikit-learn
model_sklearn = LogisticRegression(max_iter=1000)
model_sklearn.fit(X_train, y_train)
# Predictions on the test set
y_pred = model_sklearn.predict(X_test)
# Average log-likelihood on the test set: sklearn's log_loss returns the mean
# negative log-likelihood per sample, so negating it gives the mean log-likelihood
# (multiply by len(y_test) if you want the total log-likelihood instead)
log_likelihood = -log_loss(y_test, model_sklearn.predict_proba(X_test))
print(f"Log-likelihood: {log_likelihood}")
# Calculate AIC and BIC
num_params = X_train.shape[1] + 1  # Number of parameters (features + intercept; a simplification for the multiclass case)
print(f"Number of Parameters: {num_params}")
n_samples = len(y_test)
aic = 2 * num_params - 2 * log_likelihood
bic = np.log(n_samples) * num_params - 2 * log_likelihood
# Display AIC and BIC
print(f"AIC: {aic}")
print(f"BIC: {bic}")
The above code trains and tests a Logistic Regression model on the Iris dataset. Its output shows a model complexity of 5 parameters (the four features plus an intercept), a log-likelihood of -0.111283, and AIC and BIC values of 10.22 and 17.22 respectively.
Now let’s run a Multi-layer Perceptron with a hidden layer of 100 neurons, a deep learning model, and find out how it performs, how complex it is, what its AIC and BIC values are, and whether this model is suitable for this dataset given its performance.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, log_loss
from sklearn.datasets import load_iris
import numpy as np
# Load the Iris dataset
iris = load_iris()
X, y = iris.data, iris.target
# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create an MLP classifier with L2 regularization (alpha)
mlp_classifier = MLPClassifier(hidden_layer_sizes=(100,), alpha=0.01, max_iter=1000, random_state=42)
# Train the MLP classifier
mlp_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = mlp_classifier.predict(X_test)
# Evaluate the performance
accuracy = accuracy_score(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred)
# Print the results
print(f"Accuracy: {accuracy:.2f}")
print("Classification Report:\n", classification_rep)
# Average log-likelihood of the MLP on the test set (negative of the mean log-loss)
log_likelihood = -log_loss(y_test, mlp_classifier.predict_proba(X_test))
print(f"Log-likelihood: {log_likelihood}")
# Number of parameters (model complexity): all weights plus all biases
num_params = sum(w.size for w in mlp_classifier.coefs_) + sum(b.size for b in mlp_classifier.intercepts_)
print(f"Number of Parameters: {num_params}")
n_samples = len(y_test)
aic = 2 * num_params - 2 * log_likelihood
bic = np.log(n_samples) * num_params - 2 * log_likelihood
# Display AIC and BIC
print(f"AIC: {aic}")
print(f"BIC: {bic}")
The output of the model above shows a model complexity of 803 parameters (4x100 weights plus 100 biases in the hidden layer, and 100x3 weights plus 3 biases in the output layer), which is gigantic when compared with the simple Logistic Regression model. The MLP also fits Iris well, so its log-likelihood is again close to zero, and the AIC and BIC come out upwards of roughly 1,600 and 2,700 respectively, dominated almost entirely by the penalty terms.
Recall that our goal was to minimize AIC and BIC, i.e. to find the model with the lowest values. It is important to note that these values already account for both the model’s performance and its complexity. And weighing the complexity against the performance, they levy a huge penalty on the MLP model.
So, the AI leader in you helped your team decide what is the best bet. Hurray! Congratulations. Now you know how to practically deal with the problem of Model selection, given the science (and the math) behind it.
Points to note:
- AIC and BIC are not suitable for non-probabilistic models that don’t provide likelihood estimates or a well-defined parameter count. For those, you have to rely on cross-validation and a train-validation-test split to avoid overfitting, apart from the other usual metrics.
- They are uncommon even for MLP classifiers; I showed them here purely for illustrative purposes.
- You can even use AIC and BIC to determine the best number of clusters, for example the number of components in a Gaussian Mixture Model or the value of ‘k’ in k-means clustering (see the sketch after this list).
- Note that BIC’s penalty per parameter, log(N), exceeds AIC’s penalty of 2 whenever the dataset has more than about 7 datapoints, so in practice BIC almost always puts a higher penalty on complexity than AIC.
- They can be applied on various models like Linear Regression, Logistic Regression, Time-series, Generalized Linear models, Clustering models like Gaussian Mixture Models, MANOVA, etc.
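For instance, here is a minimal sketch of the clustering use case from the list above. scikit-learn’s GaussianMixture exposes aic() and bic() methods, so you can sweep the number of components and keep the one with the lowest score; a similar idea is sometimes adapted for k-means, but since k-means itself is not probabilistic, the Gaussian mixture makes the cleaner illustration:

from sklearn.datasets import load_iris
from sklearn.mixture import GaussianMixture

X = load_iris().data

# Fit a Gaussian Mixture Model for each candidate number of components and score it with BIC
bic_scores = {}
for n_components in range(1, 7):
    gmm = GaussianMixture(n_components=n_components, random_state=42).fit(X)
    bic_scores[n_components] = gmm.bic(X)

best_k = min(bic_scores, key=bic_scores.get)
print("BIC by number of components:", {k: round(v, 1) for k, v in bic_scores.items()})
print(f"Number of components with the lowest BIC: {best_k}")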
We have arrived at your destination. You have been a wonderful and kind passenger. Hope you had a great flight. See you very soon next time.