Benchmarking Metric Learning Algorithms the Right Way

Kevin Musgrave
Nov 25, 2019


Go here for the code and latest updates, and check out the associated paper, A Metric Learning Reality Check.

The typical metric learning paper presents a new loss function or training procedure, and then shows results on a few datasets, like CUB200, Stanford Cars, and Stanford Online Products. Every couple of months, we see the accuracy improve like clockwork.

Great, but there are a few caveats.


Some papers do not compare apples to apples

In order to claim that a new algorithm outperforms existing methods, it’s important to keep as many parameters constant as possible. That way, we can be certain that it was the new algorithm that boosted performance, and not one of the extraneous parameters. This has not been the case with metric learning papers:

  1. Network architectures have not been kept constant. Some papers use GoogleNet. Many recent papers have been using BN-Inception, sometimes referred to as “Inception with Batch Normalization”. One widely-cited paper from 2017 used ResNet50, and then claimed huge performance gains. This is questionable, because the competing methods used GoogleNet, which is a less powerful architecture. Thus, much of the performance gain likely came from the choice of network architecture, and not their proposed method.
  2. Image augmentations have not been kept constant. Most papers claim to apply the following transformations: resize the image to 256 x 256, randomly crop to 227 x 227, and do a horizontal flip with 50% chance. But the official open-source implementations of some recent papers show that they are actually using the more sophisticated cropping method described in the original GoogleNet paper (see “Training Methodology”). Both croppings are sketched in the code after this list.
  3. Performance-boosting tricks have been used without being mentioned in the paper. In the official open-source code for a recent 2019 paper, the trunk model’s BatchNorm parameters are frozen during training. This can help reduce overfitting, and the authors explain that it results in a 2-point performance boost on the CUB200 dataset. Yet this is not mentioned in their paper. (The same code sketch after this list shows this trick.)
Accuracy (Recall@1) of models pretrained on ImageNet. Output embedding sizes were reduced to 512 using PCA. For each image, the smaller side was scaled to 256, followed by a center-crop to 227x227.
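
To make points 2 and 3 above concrete, here is a small PyTorch/torchvision sketch. It is illustrative only, not taken from any paper’s official code: the “claimed” augmentation pipeline, the GoogleNet-style crop that some implementations actually use, and the BatchNorm-freezing trick.

# Illustrative sketch only; not taken from any paper's official code.
import torch.nn as nn
import torchvision.transforms as T

# The augmentation pipeline most papers claim to use:
# resize to 256 x 256, random 227 x 227 crop, horizontal flip with 50% chance.
claimed_transform = T.Compose([
    T.Resize((256, 256)),
    T.RandomCrop(227),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

# The GoogleNet-style crop that some official implementations actually use:
# sample a patch covering 8%-100% of the image area with aspect ratio in
# [3/4, 4/3], then resize the patch to 227 x 227.
googlenet_style_transform = T.Compose([
    T.RandomResizedCrop(227, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    T.RandomHorizontalFlip(p=0.5),
    T.ToTensor(),
])

def freeze_batchnorm(trunk: nn.Module):
    # Freeze all BatchNorm layers: keep running statistics fixed and stop
    # gradient updates to their affine parameters. Re-apply this after any
    # call to trunk.train(), which would otherwise switch them back.
    for m in trunk.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()
            m.weight.requires_grad = False
            m.bias.requires_grad = False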

Most papers are using a simple train/test split

They train on a portion of the data, find the model that performs best on the test set, and report that number. In other words, they do not use a validation set, so hyperparameters have been tuned and entire algorithms have been designed with direct feedback from the test set. This breaks one of the most basic rules of Machine Learning 101. Moreover, the same single train/test split has been used for years. Over time, these two factors have likely led to overfitting on the test set.
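
The fix is conceptually simple: hold out entire classes for validation and testing, and tune hyperparameters only against the validation classes. Here is a minimal sketch of a class-disjoint split (illustrative; the split sizes and procedure used by the tool described below may differ):

import random

def class_disjoint_split(labels, val_frac=0.25, test_frac=0.25, seed=0):
    # Split *classes* (not samples) into train/val/test, so that test
    # classes are never seen during training or hyperparameter tuning.
    classes = sorted(set(labels))
    random.Random(seed).shuffle(classes)
    n_test = int(len(classes) * test_frac)
    n_val = int(len(classes) * val_frac)
    test_classes = set(classes[:n_test])
    val_classes = set(classes[n_test:n_test + n_val])
    train_classes = set(classes[n_test + n_val:])
    indices_of = lambda keep: [i for i, y in enumerate(labels) if y in keep]
    return indices_of(train_classes), indices_of(val_classes), indices_of(test_classes)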

So let’s benchmark these algorithms correctly

This is where the powerful_benchmarker comes in.

Why use this tool?

Transparency. Every experiment you run comes with detailed config files that show exactly what models, losses, transforms (and more) were used. So now we can compare methods fairly.

Better performance metrics. Use metrics that are more informative than Recall@2, 4, 8, 100, 1000, and so on.

Measure accuracy the correct way. Evaluate on multiple class-based train/val/test splits, or use the old 50/50 train/test split for comparison purposes.

Detailed record keeping. View in-depth information about the training process on Tensorboard.

Flexibility with config files. Control most aspects of your experiment via config files. Extend existing config files by merging them with new ones. Here’s an example of how models are specified:

models:
  trunk:
    bninception:
      pretrained: imagenet
  embedder:
    MLP:
      layer_sizes:
        - 512

And here’s how you merge 3 config_general files:

python run.py \
--experiment_name test \
--config_general default daml train_with_classifier

Flexibility at the command line. Specify complex config options using standard Python dictionary notation:

python run.py \
--experiment_name test \
--optimizers {metric_loss_optimizer: {SGD: {lr: 0.01}}}

See this for more details.

Flexibility with algorithms. Mix and match losses, mining functions, samplers, and training methods. Want to use the multi-similarity loss with the batch-hard miner? No problem:

loss_funcs:
  metric_loss:
    MultiSimilarityLoss:
      alpha: 0.1
      beta: 40
      base: 0.5
mining_funcs:
  post_gradient_miner:
    BatchHardMiner: {}
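
For comparison, here is roughly what that combination looks like when calling pytorch_metric_learning directly. This is a sketch based on my reading of the library’s documentation, so argument names may vary between versions:

# Sketch of the equivalent direct usage of pytorch_metric_learning
# (argument names may differ between library versions).
from pytorch_metric_learning import losses, miners

loss_func = losses.MultiSimilarityLoss(alpha=0.1, beta=40, base=0.5)
miner = miners.BatchHardMiner()

# Inside the training loop, given a batch of embeddings and labels:
# hard_tuples = miner(embeddings, labels)
# loss = loss_func(embeddings, labels, hard_tuples)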

Access to all models in the torchvision and pretrainedmodels packages. In your config_models file, just specify the function name as it appears in torchvision or pretrainedmodels.

Access to all losses in torch.nn and pytorch_metric_learning. In your config_loss_and_miners file, just specify the class name as it appears in torch.nn or pytorch_metric_learning.
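
As a rough illustration, naming bninception or resnet50 in the config corresponds to constructor calls like the ones below. This is a sketch assuming the standard pretrainedmodels (Cadene) and torchvision APIs, not the tool’s internal code.

# Sketch of what the config names resolve to, assuming the standard
# pretrainedmodels and torchvision APIs (not powerful_benchmarker internals).
import pretrainedmodels
import torchvision.models as tv_models

trunk_a = pretrainedmodels.bninception(num_classes=1000, pretrained="imagenet")
trunk_b = tv_models.resnet50(pretrained=True)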

Does it matter?

In the table below are the results from a selection of metric learning papers published in CVPR 2019 and ICCV 2019. Each color represents a different model and embedding size configuration. Because there is no standard way of running experiments, it is difficult to compare the performance of the various algorithms. This impedes research progress, because we don’t know which method really works best. Hence, it’s important to have a benchmarking tool that enables us to do fair comparisons.

Green: BN-Inception, 512. Blue: ResNet50, 128. Yellow: ResNet50, 512. Red: GoogleNet, 512. The numbers for the first 8 rows come from the respective papers.

In the bottom part of the table are results that were obtained using the benchmarking tool. Both the triplet loss and contrastive loss come close to the state of the art. Yet these two are often left out of results tables, or are purported to be among the worst-performing methods. The powerful_benchmarker makes it easy to check these baseline algorithms.

(To view the config files for these experiments and others, see this spreadsheet, which I’ll be adding to over time.)

Thoughts?

Let me know what you think about this tool and the state of metric learning. If you have any questions or want to add features, visit the GitHub repos for powerful_benchmarker and pytorch_metric_learning.

Acknowledgements

Thank you to Ser-Nam Lim at Facebook AI, and my research advisor, Professor Serge Belongie. This project began during my internship at Facebook AI where I received valuable feedback from Ser-Nam, and his team of computer vision and machine learning engineers and research scientists.
