State of Representation Learning — ICLR 2022

Georgian Impact Blog · 13 min read · May 20, 2022

By: Angeline Yasodhara

Visualization vector created by vectorjuice — www.freepik.com

From BERT to CNNs, the ability to automatically encode raw data into usable representations (a.k.a. representation learning) has been groundbreaking and is integral to virtually all machine learning applications.

This past month, over a thousand papers were presented at the 10th International Conference on Learning Representations (ICLR). ICLR is one of the leading academic computing conferences, focusing on topics such as representation learning (for vision, text, time series data, and others), learning from huge volumes of data, reinforcement learning, and more.

ICLR 2022 spanned five days with 12 poster sessions, nine oral sessions, 20 workshops, and eight invited talks. While all the material is recorded online, it can be hard to get through all of that information. In this post, you can read our summary of the conference, focusing on the following topics:

  • Dealing with distribution shift
  • Representation learning advancements
  • Demystifying models
  • Survival of models in the “wild”
Continue reading our blog so you don’t end up like this person! (credit: xkcd.com)

Dealing with Distribution Shift

As machine learning models are increasingly deployed around the world, adapting models to different distributions becomes ever more important to ensure they generalize across environments.

Here, we outline a few interesting papers that revolve around adapting models to different distributions.

Leveling Up NLP Fine-tuning

BERT embeddings (and their variants) have proven very useful for encoding text in different settings. However, when working with a very specific type of text, some tuning is needed to get the most out of these embeddings.

Due to the massive size of these models, people often opt for parameter-efficient tuning, where only a small subset of the model's parameters is further trained on the target data distribution. The ability to fine-tune a model by re-training only a small number of its parameters is very attractive given the size of modern NLP models. However, these techniques usually cannot recover the performance of a fully-tuned model (where the whole model is further trained on the target data distribution).

In [1], He et al. dove deep into three parameter-efficient tuning methods: prefix tuning, adapters, and LoRA. They dissected the mathematical formulation underlying these methods and formulated a new way of fine-tuning that combines the three approaches. This unified approach is named the Mix-And-Match (MAM) Adapter (github.com/jxhe/unify-parameter-efficient-tuning). Models tuned with the MAM Adapter were able to compete with a fully-tuned model.
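
To give a flavor of what parameter-efficient tuning looks like in practice, here is a minimal sketch of a LoRA-style low-rank update wrapped around a frozen linear layer in PyTorch. The class name, rank, and scaling values are our own illustrative choices, not code from the paper or its repository.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update:
    y = W x + (alpha / r) * B A x. Only A and B are trained."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the small trainable low-rank path.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

# Example: wrap one projection layer; only ~2 * r * d parameters are tuned.
layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(4, 768))
```

Because lora_B starts at zero, the wrapped layer initially behaves exactly like the pre-trained one, which is part of what makes this kind of tuning stable.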

While this is a great achievement, some might be concerned about using their data to fine-tune, since a fine-tuned model can leak information about that data. If you’re concerned about the privacy of your data during fine-tuning, check out the paper by Li et al., which explores how to efficiently perform differentially private (DP) fine-tuning [2]. Their paper gives insights on which learning rates and batch sizes to use and which models work well in this setting, as well as further enhancements to reduce the memory consumption of the DP optimization process.
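
For intuition, the core recipe underlying DP fine-tuning is DP-SGD: clip each example's gradient, then add Gaussian noise before the update. Below is a simplified per-example sketch; it is not the authors' implementation (their paper focuses on making this efficient for large models), and the clipping norm and noise multiplier are placeholder values.

```python
import torch
import torch.nn.functional as F

def dp_sgd_step(model, batch_x, batch_y, optimizer,
                clip_norm=1.0, noise_multiplier=1.0):
    """One simplified DP-SGD step: per-example gradient clipping, Gaussian
    noise, then an averaged update. Hyperparameter values are placeholders."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]

    for x, y in zip(batch_x, batch_y):                 # per-example gradients
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.norm() ** 2 for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):                # clip and accumulate
            s.add_(g * scale)

    for p, s in zip(params, summed):                   # add noise, then average
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / len(batch_x)
    optimizer.step()
    optimizer.zero_grad()
```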

NLP Zero-Shot Models: How Doers Get More Done

Ever since GPT-3 came out, many people (us included) have been using it for zero-shot predictions and tweaking prompts to improve the outputs. Wei et al. [3] worked to improve this further and introduced a process called instruction tuning.

Instruction tuning is the process of fine-tuning a language model on a collection of datasets described via natural-language instructions, with the sole goal of improving its zero-shot performance. The authors take a 137B-parameter language model and tune it on more than 60 NLP datasets phrased as instructions, which results in the Fine-tuned LAnguage Net (FLAN). FLAN outperformed the zero-shot 175B GPT-3 model by 7–14%.
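
The key ingredient is rephrasing existing labeled datasets as natural-language instructions so the model learns to follow task descriptions it has never seen. The toy templates and record format below are made up for illustration; they are not FLAN's actual templates.

```python
# Hypothetical instruction templates for a natural language inference example.
TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\n"
    "Does the premise entail the hypothesis? Answer yes, no, or maybe.",
    "Read the two sentences below and decide whether the second follows "
    "from the first.\n1) {premise}\n2) {hypothesis}",
]

def to_instruction_examples(record: dict) -> list:
    """Turn one labeled record into several (instruction, target) pairs,
    one per template, ready for instruction tuning."""
    return [{"input": t.format(**record), "target": record["label"]}
            for t in TEMPLATES]

print(to_instruction_examples({
    "premise": "A dog is running in the park.",
    "hypothesis": "An animal is moving.",
    "label": "yes",
}))
```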

New & Improved Benchmarks on CV Distribution Shift

Before covering fine-tuning in Computer Vision (CV), let us take a breather and discuss some important advancements in benchmarking and measuring data shift in CV.

Distribution shifts can come in different forms, e.g., a change in the label or feature distribution, or a different background, orientation, or context in which a picture is taken. There are many works exploring how to overcome each of these types of shift, and Wiles et al. combine them in a new benchmark [4]. They studied 19 different approaches across different distribution shift scenarios, e.g., unseen data shift, spurious correlation, and low-data drift. Additionally, their framework allows researchers to control how much shift exists in the data. Their experiments show that augmentation and pre-training are usually, though not always, helpful. Check out their framework at https://github.com/deepmind/distribution_shift_framework.

Another benchmark worth a look is the WILDS benchmark, initially released by Koh et al. (https://arxiv.org/abs/2012.07421). It has recently been extended with unlabeled data added to the training dataset [5]. This unlabeled data can come from the training distribution, the test distribution, or external distributions. Their findings show that current approaches trained on the extra data often do not outperform models trained only on the labeled training data. More research is needed to use this unlabeled data effectively and improve model generalization.

One other distribution shift paper I found really interesting proposes a new class of metrics for measuring shift. Currently, metrics such as the Jensen-Shannon divergence, KL divergence, MMD, and Wasserstein distance take the whole image into account. However, it is often the case that we only care about certain parts of the image. For example, given an image of a tiger in a forest and an image of a tiger in a desert, if all we care about is the fact that there is a tiger, the background setting should not affect the distribution measure.

H-Divergence metrics address this problem by introducing an H-entropy term [6]. The H-entropy is a Bayes-optimal loss that depends on a problem-specific action space; it generalizes the Shannon entropy and can be incorporated into existing divergence metrics.

The new class of H-Divergence metrics can be represented in the following way:
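
In our own notation (a sketch of the construction rather than the paper's exact statement): the H-entropy of a distribution is the best achievable Bayes loss over an action space $\mathcal{A}$, and a divergence is obtained by comparing the H-entropy of the mixture of $p$ and $q$ with the H-entropies of $p$ and $q$ themselves:

$$
H_\ell(p) = \inf_{a \in \mathcal{A}} \mathbb{E}_{x \sim p}\big[\ell(x, a)\big],
\qquad
D_\ell(p, q) = H_\ell\!\Big(\tfrac{p+q}{2}\Big) - \tfrac{1}{2}\big(H_\ell(p) + H_\ell(q)\big).
$$

With the log loss and distributions over outcomes as the action space, $H_\ell$ reduces to the Shannon entropy and $D_\ell$ to the Jensen-Shannon divergence; choosing a task-specific loss and action space instead makes the divergence ignore differences (like the background behind the tiger) that do not affect the decision.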

Enhancing CV Fine-Tuning: To Out-of-Distribution and Beyond!

Did you know that if you fine-tune your model, it can underperform on an out-of-distribution (OOD) dataset? This makes sense empirically, but why is it so? As explained by Kumar et al., it is because during fine-tuning the embedding space changes much more for the in-distribution (ID) data than for the OOD data [7]. With linear probing, we avoid this problem by freezing the pre-trained features and not changing the embedding space at all. However, linear probing's performance on the ID dataset cannot match fine-tuning.

How should we tune models to work well on both ID and OOD datasets? Kumar et al. showed that by simply performing linear probing first, followed by fine-tuning, you can get better performance than fine-tuning alone on both your in-distribution and out-of-distribution datasets! They call this approach LP-FT; it changes the embedding space 10–100x less, depending on the dataset, while increasing performance by over 10% on the OOD dataset.
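
A minimal sketch of the two-stage LP-FT recipe in PyTorch is below. The backbone, data, learning rates, and epoch counts are placeholders chosen for illustration, not the authors' exact setup.

```python
import torch
import torch.nn as nn
import torchvision

# Placeholder data: stand-in for your labeled in-distribution training set.
train_loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(32, 3, 224, 224),
                                   torch.randint(0, 10, (32,))),
    batch_size=8)

model = torchvision.models.resnet50(weights="DEFAULT")  # pre-trained features
model.fc = nn.Linear(model.fc.in_features, 10)          # fresh linear head

def train(params, lr, epochs):
    opt = torch.optim.SGD(params, lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

# Stage 1 -- linear probing: freeze the backbone, train only the head.
for p in model.parameters():
    p.requires_grad = False
for p in model.fc.parameters():
    p.requires_grad = True
train(model.fc.parameters(), lr=1e-2, epochs=1)

# Stage 2 -- fine-tuning: unfreeze everything and train with a smaller
# learning rate, starting from the probed head so features are distorted less.
for p in model.parameters():
    p.requires_grad = True
train(model.parameters(), lr=1e-4, epochs=1)
```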

Representation Learning Advancements

Vision Transformer

This summary would not be complete without a mention of the Vision Transformer (ViT), which was published at last year’s ICLR (https://arxiv.org/abs/2010.11929). The idea of vision transformers is to divide an image into patches and pass each patch as a “word” into a transformer encoder (a minimal sketch of this patch-embedding step follows the list below). Many papers at this year’s ICLR enhanced ViT and built on top of it. Here are a few of them:

  • BEiT [8]: Improves pre-training of ViT by masking a block of patches at a time, analogous to n-gram masking in NLP.
  • ViTGAN [9]: Integrates ViT into GANs, introducing novel regularization techniques to prevent instability along with architectural choices that aid convergence.
  • EsViT [10]: Introduces a multi-stage transformer architecture with sparse self-attention to reduce model complexity, plus a new pre-training task called non-contrastive region-matching that works well with this new architecture.
  • On Improving Adversarial Transferability of Vision Transformers [11]: Investigates the effectiveness of adversarial attacks on ViT and introduces a new way to attack ViT effectively using self-ensembles.
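
As referenced above, here is a minimal sketch of the ViT patch-embedding step, i.e., how an image is turned into a sequence of “word”-like tokens for a standard transformer encoder. Sizes and names are illustrative; this is not the original ViT code.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Minimal ViT-style patch embedding: split an image into P x P patches
    and project each patch to an embedding, so patches play the role of
    'words' for a transformer encoder. Illustrative only."""
    def __init__(self, img_size=224, patch=16, in_ch=3, dim=768):
        super().__init__()
        # A strided convolution is equivalent to "cut into patches + linear".
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        n_patches = (img_size // patch) ** 2
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                   # x: (B, 3, 224, 224)
        tokens = self.proj(x).flatten(2).transpose(1, 2)    # (B, 196, 768)
        cls = self.cls_token.expand(x.shape[0], -1, -1)     # (B, 1, 768)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed

tokens = PatchEmbedding()(torch.randn(2, 3, 224, 224))
print(tokens.shape)   # torch.Size([2, 197, 768]), ready for a transformer encoder
```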

Custom Transformers For Time Series and Outlier Detection

At Georgian, we work on many projects involving time series data, so modifications of the transformer architecture for time series are of particular interest to us. Pyraformer is a pyramidal version of the transformer that reduces complexity while working well on long-range time series [12]. It also allows for the extraction of multi-resolution feature embeddings (embeddings that capture the time series at a coarser or more granular level) while keeping the connections between nodes.

Another interesting work on transformers for time series data is the Anomaly Transformer [13]. Xu et al. adapt transformers for time series anomaly detection in an unsupervised setting. The Anomaly Transformer is made up of Anomaly-Attention blocks with fully connected layers in between. Each Anomaly-Attention block captures the series-association (temporal context, period/trend, etc.) and the prior-association (properties of adjacent time points) of each time series sequence.

Learnable Stride for CNN

We are all used to thinking of the stride in a CNN as a hyperparameter. Well, Riad et al. are not satisfied with this status quo and define a whole new type of layer called DiffStride, where strides are learned through backpropagation during training [14].

DiffStride is the first downsampling/pooling layer with learnable strides. It allows for fractional reduction of tensors, building on the prior work of Spectral Pooling (https://arxiv.org/abs/1506.03767) with a differentiable cropping window. The authors showed that DiffStride converges to strides different from those previously thought optimal, and that it outperforms CNNs without learnable stride layers.
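
To make “fractional reduction of tensors” concrete, here is a simplified spectral-pooling-style sketch with a fixed fractional factor. DiffStride itself replaces the hard crop below with a smooth, differentiable cropping window so that the factor (i.e., the stride) can be learned by backpropagation; this is our illustration, not the authors' implementation.

```python
import torch

def spectral_downsample(x: torch.Tensor, factor: float = 1.5) -> torch.Tensor:
    """Downsample feature maps by a possibly fractional factor by cropping
    low frequencies in the Fourier domain (spectral pooling). x: (B, C, H, W)."""
    B, C, H, W = x.shape
    new_h, new_w = int(H / factor), int(W / factor)
    freq = torch.fft.fftshift(torch.fft.fft2(x), dim=(-2, -1))
    top, left = (H - new_h) // 2, (W - new_w) // 2
    cropped = freq[..., top:top + new_h, left:left + new_w]   # keep low freqs
    out = torch.fft.ifft2(torch.fft.ifftshift(cropped, dim=(-2, -1)))
    return out.real * (new_h * new_w) / (H * W)               # renormalize

y = spectral_downsample(torch.randn(2, 8, 32, 32), factor=1.5)
print(y.shape)   # torch.Size([2, 8, 21, 21]), a non-integer "stride" of 1.5
```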

Demystifying Models

Capturing Feature Importance Directions

We’re used to thinking of explainability in terms of scalar feature importances. Masoomi et al. extend work on SHAP by looking at directional feature interactions [15]. This lets us know which features are mutually redundant and which features are the influencers in a group.

Post-hoc Examinations of CV Models

As everyone realizes that we need a better understanding of which groups of samples our models underperform on, along comes a new method called Domino [16]. Domino lets you identify the image samples a model underperforms on, group them together, and describe each group with an understandable description.

Wouldn’t it be nice to have such understandable descriptions for neuron visualizations too? Often, to explain what a neuron is learning, we rely on visualizations of the inputs that cause it to activate. But when there are many such images, it takes a lot of effort to analyze them all and come up with a meaningful description of what the neuron is learning overall.

MILAN (Mutual Information-guided Linguistic Annotations of Neurons) can give you descriptive text for these images [17]. The authors collected 52,000 human descriptions of neuron-activating image sets, generated from 17,000 neurons across seven base models; this dataset is called MILANNOTATIONS. While these descriptions are useful, humans can sometimes be non-descriptive, e.g., not differentiating between stripes and plaid. To encourage specific descriptions while training the descriptor model, they added a term that downweights common descriptions. The resulting descriptor generalizes across architectures, common datasets, and training tasks.

Survival of Models in the “Wild”

Differentially Private Hyperparameter Tuning

We’ve heard of differentially private training, but what about hyperparameter tuning? Is privacy leakage from hyperparameter tuning a valid concern? Well, Papernot and Steinke show that it can be, because tuned hyperparameters are sensitive to outliers in the data. In this paper, they show that by conducting hyperparameter tuning with a random number of trials, we can achieve Rényi differential privacy [18].
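
As we understand the recipe: instead of running a fixed number of tuning trials, you draw the number of trials from a suitable distribution (the paper analyzes truncated negative binomial and Poisson distributions) and return the best run, where each trial is itself trained with a DP mechanism. The toy sketch below uses a geometric draw (a special case of the negative binomial) and placeholder training and scoring functions.

```python
import random

def dp_train_and_score(hparams):
    """Placeholder for a differentially private training run (e.g., DP-SGD).
    Returns (validation_score, hparams); here a dummy score stands in."""
    return random.random(), hparams

def private_hparam_search(sample_hparams, p=0.1, seed=None):
    """Run a *random* number of DP trials and return the best one.
    Randomizing the trial count is what yields the overall Renyi-DP
    guarantee analyzed by Papernot and Steinke."""
    rng = random.Random(seed)
    n_trials = 1
    while rng.random() > p:          # geometric(p): ~1/p trials on average
        n_trials += 1
    runs = [dp_train_and_score(sample_hparams()) for _ in range(n_trials)]
    return max(runs, key=lambda run: run[0])

best_score, best_hparams = private_hparam_search(
    lambda: {"lr": 10 ** random.uniform(-4, -2), "batch_size": 256})
```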

Contrastive Learning — What Can Go Wrong?

Contrastive learning has been big in ML, especially since it lets us avoid relying on expensive human labeling. For example, this year’s ICLR featured work on using contrastive learning to exploit partial labels [19]. But Carlini and Terzis warn us of the dangers of contrastive learning [20]. There are two classical types of attack:

  • Poisoning attack: This can come in many forms. Any tampering with the data (modifying features, flipping labels, adding samples with wrong labels, etc.) is considered poisoning.
  • Backdoor attack: This is usually done by 1) adding an artifact to a few samples, and 2) intentionally changing the labels of those samples to a previously decided target label. With this, any image containing the artifact will be classified as the target label.

And the bad news is that while there has been much work in supervised ML to overcome these attacks, there is not as much for contrastive models. Only around 15 images out of millions are needed to make a poisoning attack 50% successful, and similarly for backdoor attacks [20]. This is very concerning, and work is needed to make contrastive models more resilient to such attacks.
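
To make the backdoor recipe above concrete, here is a toy sketch of how a single poisoned example might be constructed; the trigger pattern, its location, and the labels are all made up for illustration and are not from the paper.

```python
import numpy as np

def add_backdoor(image: np.ndarray, target_label: int,
                 patch_size: int = 4):
    """Stamp a small white square (the 'artifact') into the bottom-right
    corner of an HxWx3 image in [0, 1] and relabel it to the attacker's
    target class. Toy illustration of a backdoor poison, not a real attack."""
    poisoned = image.copy()
    poisoned[-patch_size:, -patch_size:, :] = 1.0     # the trigger artifact
    return poisoned, target_label

clean = np.random.rand(32, 32, 3)
poisoned_img, poisoned_label = add_backdoor(clean, target_label=7)
```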

How Safe Is Federated Learning?

Continuing with the theme of attacks, did you know that you can easily obtain private data in a federated learning setting?

Fowl et al. introduced the “Imprint” module, which can perfectly recover data in a federated learning setting [21]. With a linear layer and a ReLU, they created identifiable “bins” that can recover data from the shared gradients. At a high level, suppose one neuron accumulates gradients for all data points with a measured value below 0.5 and another does so for values below 0.51; by looking at the difference between the two, you can isolate the data points whose value falls between 0.5 and 0.51. People often assume that federated learning protects each user’s data privacy. However, it is not enough, especially when the model is maliciously designed to extract user data. We need user-level differential privacy to help defend against this attack.
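
Here is a toy numerical sketch of the bin idea, heavily simplified from the actual Imprint module: every row of the malicious linear layer measures the same statistic of the input (here the pixel mean) and only the bias threshold differs, so subtracting the aggregated gradients of two adjacent bins isolates, and exactly reconstructs, any input whose statistic falls between the two thresholds. We use “above a threshold” bins via the ReLU and a toy loss so that the gradient bookkeeping is exact; the real attack arranges the rest of the network to achieve a similar effect.

```python
import torch

torch.manual_seed(0)
d, n_bins, batch = 16, 32, 4
x = torch.rand(batch, d)                     # the "private" user data

# Malicious layer: every row is the same measurement vector (the mean),
# only the bias threshold differs -- this is what creates the "bins".
measure = torch.full((d,), 1.0 / d)
thresholds = torch.linspace(0.3, 0.7, n_bins + 1)[:-1]
W = measure.repeat(n_bins, 1).requires_grad_(True)
b = (-thresholds).requires_grad_(True)

# Toy forward/backward pass standing in for a federated-learning update.
h = torch.relu(x @ W.T + b)
h.sum().backward()

# Server-side reconstruction from the aggregated gradients: the gradient of
# bin i sums the inputs whose mean exceeds threshold i, so the difference of
# adjacent bins isolates inputs whose mean lies between the two thresholds.
for i in range(n_bins - 1):
    count = (b.grad[i] - b.grad[i + 1]).item()
    if abs(count - 1.0) < 1e-6:              # exactly one example in this bin
        recovered = W.grad[i] - W.grad[i + 1]
        closest = (recovered - x).abs().sum(dim=1).argmin().item()
        print(f"bin {i}: recovered user example {closest} exactly")
```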

Closing Thoughts

I hope you enjoyed this summary of ICLR 2022! While we covered many topics, there are many other exciting areas, such as meta-learning, reinforcement learning, and applications of ML in different settings, that I have not covered in this summary.

Although ICLR is an academic computing conference, I really appreciate how this year’s workshops and keynotes brought the scope of machine learning beyond computing. There were various discussions on how to apply machine learning to advance science, as well as on the challenges of applying ML in low-resource countries, reminding us to think about the bigger picture of ML.

Papers Mentioned

[1] He, Junxian, et al. “Towards a unified view of parameter-efficient transfer learning.” arXiv preprint arXiv:2110.04366 (2021). https://arxiv.org/abs/2110.04366

[2] Li, Xuechen, et al. “Large language models can be strong differentially private learners.” arXiv preprint arXiv:2110.05679 (2021). https://arxiv.org/abs/2110.05679

[3] Wei, Jason, et al. “Finetuned language models are zero-shot learners.” arXiv preprint arXiv:2109.01652 (2021). https://arxiv.org/abs/2109.01652

[4] Wiles, Olivia, et al. “A fine-grained analysis on distribution shift.” arXiv preprint arXiv:2110.11328 (2021). https://arxiv.org/abs/2110.11328

[5] Sagawa, Shiori, et al. “Extending the WILDS Benchmark for Unsupervised Adaptation.” arXiv preprint arXiv:2112.05090 (2021). https://arxiv.org/abs/2112.05090

[6] Zhao, Shengjia, et al. “Comparing Distributions by Measuring Differences that Affect Decision Making.” International Conference on Learning Representations. 2022. https://openreview.net/forum?id=KB5onONJIAU

[7] Kumar, Ananya, et al. “Fine-tuning can distort pretrained features and underperform out-of-distribution.” arXiv preprint arXiv:2202.10054 (2022). https://arxiv.org/abs/2202.10054

[8] Bao, Hangbo, Li Dong, and Furu Wei. “BEiT: BERT Pre-Training of Image Transformers.” arXiv preprint arXiv:2106.08254 (2021). https://arxiv.org/abs/2106.08254

[9] Lee, Kwonjoon, et al. “ViTGAN: Training GANs with Vision Transformers.” arXiv preprint arXiv:2107.04589 (2021). https://arxiv.org/abs/2107.04589

[10] Li, Chunyuan, et al. “Efficient self-supervised vision transformers for representation learning.” arXiv preprint arXiv:2106.09785 (2021). https://arxiv.org/abs/2106.09785

[11] Naseer, Muzammal, et al. “On improving adversarial transferability of vision transformers.” arXiv preprint arXiv:2106.04169 (2021). https://arxiv.org/abs/2106.04169

[12] Liu, Shizhan, et al. “Pyraformer: Low-complexity pyramidal attention for long-range time series modeling and forecasting.” International Conference on Learning Representations. 2022. https://openreview.net/forum?id=0EXmFzUn5I

[13] Xu, Jiehui, et al. “Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy.” arXiv preprint arXiv:2110.02642 (2021). https://arxiv.org/abs/2110.02642

[14] Riad, Rachid, et al. “Learning strides in convolutional neural networks.” arXiv preprint arXiv:2202.01653 (2022). https://arxiv.org/abs/2202.01653

[15] Masoomi, Aria, et al. “Explanations of Black-Box Models based on Directional Feature Interactions.” International Conference on Learning Representations. 2022. https://openreview.net/forum?id=45Mr7LeKR9

[16] Eyuboglu, Sabri, et al. “Domino: Discovering systematic errors with cross-modal embeddings.” arXiv preprint arXiv:2203.14960 (2022). https://arxiv.org/abs/2203.14960

[17] Hernandez, Evan, et al. “Natural Language Descriptions of Deep Visual Features.” arXiv preprint arXiv:2201.11114 (2022). https://arxiv.org/abs/2201.11114

[18] Papernot, Nicolas, and Thomas Steinke. “Hyperparameter Tuning with Renyi Differential Privacy.” arXiv preprint arXiv:2110.03620 (2021). https://arxiv.org/abs/2110.03620

[19] Wang, Haobo, et al. “PiCO: Contrastive Label Disambiguation for Partial Label Learning.” arXiv preprint arXiv:2201.08984 (2022). https://arxiv.org/abs/2201.08984

[20] Carlini, Nicholas, and Andreas Terzis. “Poisoning and backdooring contrastive learning.” arXiv preprint arXiv:2106.09667 (2021). https://arxiv.org/abs/2106.09667

[21] Fowl, Liam, et al. “Robbing the Fed: Directly Obtaining Private Data in Federated Learning with Modified Models.” arXiv preprint arXiv:2110.13057 (2021). https://arxiv.org/abs/2110.13057
