Using the Corrected Paired Student’s t-test for Comparing Machine Learning Models

Jalal Kiani · Published in Analytics Vidhya · Sep 26, 2019

Comparing the performance of machine learning (ML) methods for a given task and selecting a final method is a common operation in applied ML.

The purpose of this post is, first, to demonstrate why we need statistical methods for choosing the final model. Then, it explains why one of the most frequently used statistical hypothesis tests (i.e., the paired Student’s t-test) is inadequate for comparing the performance of ML models. Finally, it demonstrates how the corrected version of the paired Student’s t-test can be used to compare the performance of ML models.

The following figure shows the performance, in terms of F-score, of ten different classification models trained on a specific dataset.

The performance of ten different ML models (QDA: Quadratic Discriminant Analysis, LDA: Linear Discriminant Analysis, SVM: support vector machine, KNN: K-nearest neighbors)

The performance of the models is measured on a held-out (unseen) test dataset. The models themselves are trained using K-fold cross-validation, which randomly partitions the training data into folds, as shown in the following figure.

Splitting the data set into testing and training datasets. These two datasets might change each time, hence, the performance of trained ML models might change
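To make the splitting step concrete, here is a minimal sketch using scikit-learn’s train_test_split and KFold on a toy dataset; make_classification is used only as a stand-in for a real dataset, and the five-fold setup is an illustrative choice.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, KFold

# Toy data standing in for a real dataset
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out a test set; the remainder is used for training with cross-validation
X_train, X_test, y_train, y_test = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42)

# K-fold cross-validation randomly partitions the training data into folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(kf.split(X_train)):
    print('Fold {}: {} training rows, {} validation rows'.format(
        fold, len(train_idx), len(val_idx)))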

It is clear from the first figure that QDA performs much better than the other classification models on this specific dataset. The question that comes up is: do the results provide statistically convincing evidence that QDA outperforms the other ML models? In other words, we need to determine whether the differences in performance between the ML models are real and reliable or simply due to chance.

Before answering this question, we should note that, in practice, we often have a single dataset of size N, and all estimates must be obtained from it. It is common to obtain different training sets by subsampling and to use the instances not sampled for training as the test set. The performance of the classifiers may change when different subsamples are used for training and testing; in other words, the performance of an ML model can be quite sensitive to the particular random partition used. The following figure illustrates how the performance of an ML model may change across different training and test subsets.

To train an ML model, we split the dataset into separate training and test datasets. We then train the model on the training dataset using k-fold cross-validation and evaluate it on the holdout test dataset.

To see how changing the training and test sets can change the performance of a model, let’s work through a simple example with the Pima Indians Diabetes dataset. The goal is to predict whether or not a patient has diabetes based on the diagnostic measurements included in the dataset. We use two ML models, random forest and support vector machine (SVM), to predict whether a person has diabetes. We train each model twice, each time using a different subsample of the data as the training set (this can be done by changing the seed or random_state) and using the observations not sampled for training as the test set.

# Import required libraries
import pandas as pd
import numpy as np
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV
import warnings
warnings.filterwarnings("ignore")

# Download the dataset (requires the Kaggle CLI and credentials)
!kaggle datasets download -d uciml/pima-indians-diabetes-database

# Read the dataset
df_pima = pd.read_csv('pima-indians-diabetes-database.zip')
X = df_pima.drop('Outcome', axis=1)
y = df_pima['Outcome']

RFC_score = []
SVM_score = []
for random_state in [42, 193]:
    # Split the dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state)

    # Number of trees in the random forest
    n_estimators = [int(x) for x in np.linspace(start=200, stop=2000, num=10)]
    # Number of features considered at every split
    max_features = ['auto', 'sqrt']
    # Maximum depth of the trees
    max_depth = [int(x) for x in np.linspace(100, 500, num=11)]
    max_depth.append(None)
    # Create the random grid
    random_grid = {'n_estimators': n_estimators,
                   'max_features': max_features,
                   'max_depth': max_depth}

    # Random search over the random forest hyperparameters
    rfc_random = RandomizedSearchCV(estimator=RandomForestClassifier(),
                                    param_distributions=random_grid,
                                    n_iter=100, cv=3, verbose=0,
                                    random_state=42, n_jobs=-1)
    rfc_random.fit(X_train, y_train)
    best_params = rfc_random.best_params_
    rfc = RandomForestClassifier(n_estimators=best_params['n_estimators'],
                                 max_depth=best_params['max_depth'],
                                 max_features=best_params['max_features'],
                                 random_state=42).fit(X_train, y_train)
    RFC_score.append(rfc.score(X_test, y_test))

    # Train the SVM
    random_grid_svm = {'C': [0.001, 0.01, 0.1, 1, 10],
                       'gamma': [0.001, 0.01, 0.1, 1]}
    svm_random = RandomizedSearchCV(estimator=svm.SVC(kernel='rbf'),
                                    param_distributions=random_grid_svm,
                                    n_iter=100, cv=3, verbose=0,
                                    random_state=42, n_jobs=-1)
    svm_random.fit(X_train, y_train)
    best_params = svm_random.best_params_
    SVM_model = svm.SVC(kernel='rbf', C=best_params['C'],
                        gamma=best_params['gamma'],
                        random_state=42).fit(X_train, y_train)
    SVM_score.append(SVM_model.score(X_test, y_test))

    # Print the results of this iteration
    print('The accuracy of the SVM model is {}'.format(round(SVM_model.score(X_test, y_test), 2) * 100))
    print('The accuracy of the Random Forest model is {}'.format(round(rfc.score(X_test, y_test), 2) * 100))
    print('-' * 30)

We can see that in the first iteration the random forest performs better than the SVM, while the SVM performs better in the second iteration. This means that the difference in performance obtained from a single iteration is not a suitable basis for choosing the final model. Moreover, we do not know the distribution underlying the domain and therefore cannot compute the true difference exactly; we need to estimate the distribution of differences instead. Having that distribution enables us to check whether the estimated difference is likely to be a “true” difference or just due to chance. To answer this question, we can use a statistical test.

To perform a statistical test, we need the mean and variance of the differences across iterations. Obtaining an unbiased estimate of these quantities is easy if there is a sufficient supply of data: we sample a number of training and test sets from the main dataset, train the ML models on each training set, measure their performance on the corresponding holdout test set, and then compute the difference in performance for each pair of classifiers.

The estimate of the mean and variance of the differences between the performance of two classifiers
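In symbols, if a_i and b_i denote the performance scores of the two classifiers on the i-th of k train/test splits, these estimates are usually written as:

d_i = a_i - b_i, \qquad \bar{d} = \frac{1}{k}\sum_{i=1}^{k} d_i, \qquad \hat{\sigma}^2_d = \frac{1}{k-1}\sum_{i=1}^{k}\left(d_i - \bar{d}\right)^2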

Let's implement this for our example. We repeated the process of training ML models 100 times to see the effect of randomly splitting the dataset on the performance of the models. The following figure compares the performance of random forest and SVM models in terms of accuracy.

A comparison between the performance of random forest and SVM for 100 repetitions
Difference between the performance of random forest and SVM for 100 repetitions
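The repetition loop itself is not shown above, so here is a minimal sketch of how it could look. It reuses X and y from the earlier snippet and, to keep it short, fixes the hyperparameters rather than repeating the randomized search at every split; the particular values (n_estimators=600, C=1, gamma=0.01) are illustrative assumptions, not the values found by the search.

from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

n_repeats = 100
RFC_score = []
SVM_score = []
for random_state in range(n_repeats):
    # A different random train/test split at every repetition
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=random_state)

    # Illustrative fixed hyperparameters; the full experiment would take them
    # from the randomized search shown earlier
    rfc = RandomForestClassifier(n_estimators=600, random_state=42).fit(X_train, y_train)
    RFC_score.append(rfc.score(X_test, y_test))

    svc = svm.SVC(kernel='rbf', C=1, gamma=0.01, random_state=42).fit(X_train, y_train)
    SVM_score.append(svc.score(X_test, y_test))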

Statistical Hypothesis Tests

With the above information, we can now use a statistical hypothesis test to select the final model. Statistical significance tests are designed to compare the performance of ML models by quantifying the likelihood of observing the samples of performance scores under the assumption that they were drawn from the same distribution. If this assumption, or null hypothesis, is rejected, it suggests that the difference in skill scores is statistically significant.

The most common statistical hypothesis test used for comparing the performance of ML models is the paired Student’s t-test applied to the scores obtained from random subsamples of the dataset. The null hypothesis in this test is that there is no difference between the performance of the two ML models, i.e., that both models perform the same. The alternative hypothesis is that the two models perform differently.
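For reference, the standard (uncorrected) paired t-test can be run directly with SciPy. A minimal sketch, assuming RFC_score and SVM_score hold the per-repetition accuracies collected above:

from scipy.stats import ttest_rel

# Ordinary paired t-test on the per-repetition scores;
# its independence assumption is discussed below
t_stat, p_value = ttest_rel(RFC_score, SVM_score)
print('t = {:.3f}, two-sided p-value = {:.4f}'.format(t_stat, p_value))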

Although the paired Student’s t-test is a very common way to compare the performance of two ML models, we need to check its assumptions before using it. The key assumption is that the data used to carry out the paired Student’s t-test are sampled independently from the two populations being compared. This assumption is generally not testable from the data itself. However, if the data are known to be sampled dependently, the paired Student’s t-test may give misleading results. The main consequence of violating this assumption is an inflated Type I error rate (i.e., rejecting a true null hypothesis too often).

When comparing ML models, as mentioned above, the training and test sets are usually obtained from different subsamples of the original data. These subsamples overlap across iterations and are therefore not independent. This violates the independence assumption required for proper significance testing, because we re-use the data to obtain the differences. The consequence is that the Type I error rate exceeds the nominal significance level: if we use this test, we may find a “significant” difference between the performance of two ML models when there is none.

Overall, then, the paired Student’s t-test is not a valid test for comparing the performance of two ML models in this setting.

Nadeau and Bengio showed that violating the independence assumption can lead to underestimating the variance of the differences. To solve this problem with the paired Student’s t-test, they proposed correcting the variance estimate to take this dependency into account. The following figure shows how the variance estimate is modified using the method proposed by Nadeau and Bengio, and how the P-value is computed.

The corrected paired Student’s t-test proposed by Nadeau and Bengio
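In symbols, with k repetitions, n_1 training instances, and n_2 test instances per split, the corrected statistic is commonly written as:

t = \frac{\bar{d}}{\sqrt{\left(\frac{1}{k} + \frac{n_2}{n_1}\right)\hat{\sigma}^2_d}}

That is, the naive variance term \hat{\sigma}^2_d / k of the ordinary resampled t-test is replaced by \left(\frac{1}{k} + \frac{n_2}{n_1}\right)\hat{\sigma}^2_d, which accounts for the dependence introduced by the overlapping training sets.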

The t statistic computed above is compared against the Student’s t-distribution with k−1 degrees of freedom (where k is the number of repetitions) to quantify the level of confidence, or significance, in the difference between the models’ performance. This allows stronger and more reliable claims to be made as part of model selection than with the original paired Student’s t-test.

Since the null hypothesis is that nothing is going on, i.e., there is no difference between the performance of the two ML models, a P-value smaller than the chosen significance level leads us to reject the null hypothesis in favor of the alternative hypothesis, under which the models perform differently. A P-value greater than the significance level means we fail to reject the null hypothesis.

We used the above procedure to compare the performance of random forest and SVM models for our case study dataset.

# Compute the difference between the results of the two models
diff = [r - s for r, s in zip(RFC_score, SVM_score)]
# Compute the mean of the differences
d_bar = np.mean(diff)
# Compute the variance of the differences (unbiased estimate)
sigma2 = np.var(diff, ddof=1)
# Number of data points used for training
n1 = len(y_train)
# Number of data points used for testing
n2 = len(y_test)
# Number of repetitions (train/test splits)
k = len(diff)
# Compute the corrected (modified) variance
sigma2_mod = sigma2 * (1/k + n2/n1)
# Compute the t statistic
t_static = d_bar / np.sqrt(sigma2_mod)

from scipy.stats import t
# Two-sided P-value, in percent, from the Student's t-distribution with k-1 degrees of freedom
Pvalue = 2 * (1 - t.cdf(abs(t_static), k - 1)) * 100
Pvalue

The P-value for the present case study is about 1.85%, which is smaller than the chosen significance level (i.e., 5%), so we can reject the null hypothesis. Therefore, the results provide statistically convincing evidence that random forest and SVM perform differently. On average, the accuracy of the random forest model is about 4% higher than that of the SVM model.
