Elo Ranks for Visual Task Adaptation Benchmark

Katarzyna Woźnica
Published in ResponsibleML
Feb 19, 2021

Joint work with Alicja Gosiewska and Przemysław Biecek.

With the development of advanced machine learning and deep learning techniques, there is a growing need to assess differences between models. Beyond optimizing a model for one particular problem, we would like to know which machine learning algorithms perform well on a wide variety of unseen tasks. Recently, many ranking approaches have been proposed across different branches of ML. These ratings use a specified performance measure and focus on one selected aspect of the evaluation of algorithms, or take the mean value across different metrics. Another drawback is that scores are often incomparable between tasks: depending on its complexity, every data set may have different reference values of the same measure.
The EPP meta-measure builds on the assumptions of the Elo rating system and enables a more profound comparison across tasks and metrics.

Visual Task Adaptation Benchmark

The Visual Task Adaptation Benchmark (VTAB) is a suite of tasks designed to evaluate general visual representations. So far, 16 different architectures have been evaluated, involving supervised, semi-supervised, self-supervised, and generative models. All these models are pre-trained on ImageNet. For a comprehensive comparison, from-scratch models (which use no pre-training) are also included.
The VTAB benchmark consists of 19 diverse tasks coming from different domains. The authors divide them into 3 groups: natural, specialized, and structured. You may find more details about the benchmark setup and conclusions in our blog post Should you even bother with pre-trained vision models?.

VTAB leaderboard of architectures for four selected data sets. The first column shows the average results across all data sets. Source: https://google-research.github.io/task_adaptation/benchmark

We would like to draw attention to the presentation of results: for every data set, a score is provided for every architecture, but the mean value is used for aggregation. Looking carefully at the range of scores across data sets, we find that this operation may be misleading. For one task (dSpr-Loc) most models achieve the highest score of 100, so a difference of 2 points is very noticeable, while for another task (EuroSAT) the differences between consecutive models are larger because of the dispersion of the metric. When we look at the model ranking for a single data set, the ordering is clear and linear; the only questionable point is the statistical significance of the differences. But creating a consistent rating for two or more data sets may be difficult because of this nonlinear relationship.

To address these problems, we propose the Elo-based Predictive Power (EPP) ranking method.

Elo measure

The Elo ranking system is used for calculating the relative skill levels of players in games such as chess or soccer, but Elo is also popular in MOBA (multiplayer online battle arena) games.

The difference in the Elo scores of two players is a predictor of their match result. Elo is calculated based on the player’s historical wins and losses. After each match, the winner gains Elo points and the loser loses points. The number of points gained or lost depends on the strength of the opponent: winning against a better player gives more Elo points. The most important property of Elo is that the difference between two scores can be transformed into the probability of a player winning against the opponent.
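To make these rules concrete, here is a minimal sketch of a classic Elo update in Python. The starting ratings of 1500 and 1700, the K-factor of 32, and the 400-point scale are conventional chess-style defaults chosen for illustration, not values used by EPP.

def expected_score(rating_a, rating_b):
    # Probability that player A beats player B under the Elo model.
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a, rating_b, result_a, k=32):
    # result_a is 1 for a win of player A, 0 for a loss, 0.5 for a draw.
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (result_a - exp_a)
    new_b = rating_b + k * ((1 - result_a) - (1 - exp_a))
    return new_a, new_b

# Beating a stronger 1700-rated opponent moves the 1500-rated underdog up by ~24 points,
# while the expected loss costs only ~8 points.
print(update_elo(1500, 1700, result_a=1))
print(update_elo(1500, 1700, result_a=0))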

The idea of the Elo rating can be transferred to the machine learning world. EPP, a concept of Elo for ranking ML models, is presented in the diagram below. Colors represent machine learning algorithms, gradients represent sets of hyperparameters, and border styles represent data sets.

One can think of the ratings of models as ratings of players in tournaments with the Elo system. Each data set is a tournament. Each algorithm can have different sets of hyperparameters, just as a country can have many players who represent it. Sets of hyperparameters (players) are compared on different data sets (tournaments) divided into train/test splits (rounds). There might be only one split, as in VTAB. The measures of model performance on test splits (results of matches) are aggregated into Elo ratings. We call Elo for machine learning models EPP because the rankings are estimated in a different way.
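To make the analogy concrete, the sketch below turns a hypothetical table of per-data-set accuracies into pairwise “match” outcomes, the raw material of an Elo-style rating. The model names and scores are invented for illustration; EPP itself estimates the ratings jointly from all such comparisons rather than by sequential updates.

from itertools import combinations

# Hypothetical per-data-set accuracies for three models; in VTAB there is a
# single train/test split per task, so each pair plays one "match" per data set.
scores = {
    "CIFAR-100": {"Sup-100%": 0.82, "Rotation": 0.78, "From-Scratch": 0.55},
    "EuroSAT":   {"Sup-100%": 0.96, "Rotation": 0.95, "From-Scratch": 0.90},
}

matches = []
for dataset, results in scores.items():
    for model_a, model_b in combinations(results, 2):
        if results[model_a] == results[model_b]:
            outcome = "draw"
        else:
            outcome = model_a if results[model_a] > results[model_b] else model_b
        matches.append((dataset, model_a, model_b, outcome))

for match in matches:
    print(match)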

EPP Ranks for VTAB

Below, we show a comparison of the mean score and EPP for models included in the Visual Task Adaptation Benchmark. Each black dot represents one model. The overall trend for the mean score and EPP is similar; however, there are some differences in the rankings. For example, Semi-Rotation-10% has a higher mean than Rotation, but a lower EPP. This is because EPP only takes into account whether a model was better or worse than another, while the mean depends on the size of the differences in results.

As we can see, the means of the top 2 models are almost the same, but with the EPP scores we can calculate the probability that Sup-Rotation-100% will perform better than Sup-Exemplar-100% on a new data set. The probability of winning is the inverse logit of the difference of the scores. Therefore, Sup-Rotation-100% (EPP = 3.41) will obtain higher performance than Sup-Exemplar-100% (EPP = 3.16) with probability exp(3.41 - 3.16)/(1 + exp(3.41 - 3.16)) ≈ 0.56.
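This calculation can be reproduced in a few lines of Python, using only the two EPP values reported above.

import math

def win_probability(epp_a, epp_b):
    # Probability that model A beats model B on a new data set:
    # the inverse logit of the difference of their EPP scores.
    diff = epp_a - epp_b
    return math.exp(diff) / (1.0 + math.exp(diff))

print(round(win_probability(3.41, 3.16), 2))  # -> 0.56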

To read more about EPP, see our preprint: Interpretable Meta-Measure for Model Performance.

To read more about the Visual Task Adaptation Benchmark, see the preprint: A Large-scale Study of Representation Learning with the Visual Task Adaptation Benchmark.
