A few #ICLR2022 highlights
With so many artificial intelligence research conferences on the calendar and so many papers being published, it is easy to miss out on interesting research. The ICLR 2022 conference is already two weeks in the past, and most Twitter chatter has already shifted to ICML acceptance announcements and the upcoming IEEE CVPR madhouse (forgive all the acronyms, it is the nature of the field). Before I forget everything I learned, I figured I might as well crystallize it in this kind of unstructured essay, in which I’ll merely summarize in my own words the papers I perused along the way. Think of these as alternative abstracts, at times with a less formal tone, kind of just summarizing the summaries. (I figured I had better get this out quickly before Google automates this application with their language models; it won’t be long.) They are presented in no particular order. Abstract abstractions.
Diffusion-Based Voice Conversion with Fast Maximum Likelihood Sampling Scheme — Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, Mikhail Sergeevich Kudinov, Jiansheng Wei
Voice conversion is a kind of deep fake, where some arbitrary person could speak a passage and have it translated to the tone and vocal inflections of a target faked individual, e.g. a political figure. Traditionally this type of capability has been limited to individuals of sufficiently high profile that a large corpus of recordings is available for training, e.g. those who have recorded a lot of speeches or performed in entertainment. The methods discussed here are capable of “one-shot” learning, where the voice you are trying to fake only has one or a few utterances recorded for training (i.e. “one-shot many-to-many voice conversion”). Recent prior work exists for one-shot voice conversion using auto-encoders, e.g. the AutoVC model (Qian et al 2019) and several improvements on that architecture since. A challenge for these methods is disentangling the target voice characteristics from the spoken words, which AutoVC supports by introducing an information bottleneck, and extensions have incorporated other features. This paper proposes a new approach to disentanglement by having an encoder predict an “average voice”, such that by translating the target voice samples to an average voice of known characteristics it becomes easier to disentangle the grammatical features from the vocal characteristics before reversion by a separately trained decoder.
That is the plain language description; in practice there are some complex elements that are less translatable, e.g. the encoder maps to an average Mel spectrogram using a Montreal Forced Aligner to map to phonemes. One of the contributions of this paper is a novel stochastic differential equation solver suitable for the diffusion-based probabilistic models used in the decoder (which may also prove useful for other generative applications), which they demonstrate as producing improved likelihoods at a faster rate.
Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution — Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, Percy Liang
This is an important paper benchmarking two common fine-tuning methods. Fine tuning is a very common operation in industry and refers to the practice of adapting a pre-trained model to a new task by retraining one or more layers of previously derived weights on the new training data (for example one could take an image classification model trained on the ImageNet dataset and adapt it to a classification task with categories that weren’t present in ImageNet). The two benchmarked approaches have been well studied: full “fine tuning” (updating all model layers) and linear probing (only updating the last layer). In prior work fine tuning has generally been considered superior due to outperforming linear probing on in-distribution (ID) data, with linear probing preferred for analyzing properties of representations. This paper extends considerations to take into account performance on out-of-distribution (OOD) data, where it is found that linear probing very significantly outperforms fine tuning (on the order of 6% in the performance metric in their benchmarks). The paper proposes a middle ground in the form of a two-step tuning regime: first apply linear probing to initialize the final layer, then follow with a full fine tuning of all the weights (they call this approach LP-FT).
More specifically, traditional full fine tuning adds a randomly initialized head layer onto the pre-trained model. In LP-FT, the linear probing step serves as a second stage of initialization for that random head. That this could be so significant is a result of training dynamics: the random initialization causes changes to the feature extractor and a loss of the OOD properties realized during pre-training. The result is that this two-step LP-FT regime not only outperforms LP for OOD data, it also outperforms FT for ID data. Best of both worlds. And it is simple to perform, so no reason not to.
Note that for traditional fine tuning, the gap between in distribution and out of distribution performance appears to grow with increasing sophistication of representations, suggesting that this LP-FT approach may be even more beneficial as we continue to scale up large language models.
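To make the two-step regime concrete, here is a minimal toy sketch in plain Python. Everything in it is an assumption for illustration: the 2x2 linear "backbone", the synthetic regression task, the learning rates and step counts are mine and are not the paper's models or benchmarks. It only shows the mechanics of probing the head first and then fine tuning all weights, and it uses backbone drift as a crude stand-in for feature distortion.

```python
import random

def forward(W, v, x):
    """Linear 'backbone' W followed by a linear head v; returns (prediction, features)."""
    h = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]
    return sum(v_i * h_i for v_i, h_i in zip(v, h)), h

def mse(W, v, data):
    return sum((forward(W, v, x)[0] - y) ** 2 for x, y in data) / len(data)

def train(W, v, data, lr, steps, update_backbone):
    """Full-batch gradient descent on squared error; W stays frozen unless update_backbone is True."""
    W, v = [row[:] for row in W], v[:]
    for _ in range(steps):
        grad_v = [0.0] * len(v)
        grad_W = [[0.0] * len(row) for row in W]
        for x, y in data:
            y_hat, h = forward(W, v, x)
            err = 2.0 * (y_hat - y) / len(data)
            for i in range(len(v)):
                grad_v[i] += err * h[i]
                for j in range(len(x)):
                    grad_W[i][j] += err * v[i] * x[j]
        v = [v_i - lr * g for v_i, g in zip(v, grad_v)]
        if update_backbone:
            W = [[w - lr * g for w, g in zip(row, g_row)] for row, g_row in zip(W, grad_W)]
    return W, v

def drift(W_a, W_b):
    """Euclidean distance between two backbones, a crude proxy for feature distortion."""
    return sum((a - b) ** 2 for ra, rb in zip(W_a, W_b) for a, b in zip(ra, rb)) ** 0.5

rng = random.Random(0)
inputs = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(32)]
data = [(x, 1.5 * x[0] - 0.5 * x[1]) for x in inputs]  # hypothetical downstream task
W_pretrained = [[1.0, 0.2], [-0.1, 1.0]]                # stands in for pretrained features
v_random = [rng.gauss(0, 1), rng.gauss(0, 1)]           # randomly initialized head

# Step 1 (LP): fit only the head on top of the frozen backbone.
W_lp, v_lp = train(W_pretrained, v_random, data, lr=0.1, steps=500, update_backbone=False)
# Step 2 (FT): start from the probed head and update every weight.
W_lpft, v_lpft = train(W_lp, v_lp, data, lr=0.05, steps=200, update_backbone=True)
# Baseline: full fine tuning straight from the random head.
W_ft, v_ft = train(W_pretrained, v_random, data, lr=0.05, steps=200, update_backbone=True)

print("backbone drift, FT from random head:", drift(W_ft, W_pretrained))
print("backbone drift, LP-FT:", drift(W_lpft, W_pretrained))
```

In this toy setting the backbone moves much less under LP-FT than under fine tuning from a random head, which is the intuition behind the training dynamics argument. It says nothing about real OOD accuracy.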
Representational Continuity for Unsupervised Continual Learning — Divyam Madaan, Jaehong Yoon, Yuanchun Li, Yunxin Liu, Sung Ju Hwang
Continual learning as benchmarked in this paper appears to be focused on video modality for sequential images. Prior continual learning state of the art appeared to focus on a supervised form of continual learning. Authors demonstrate that with an alternate unsupervised form of continual learning, not only are they able to realize the benefits of unsupervised learning e.g. for scaling up to real world big data, but also the performance itself is improved, with better feature representations and less forgetting across sequences. (The feature map visualizations in Figure 4 paint a pretty clear picture of benefit vs prior art.)
The innovation for the unsupervised continual learning approach appears to be associated with combining recent work of
- The SimSiam model of siamese networks, which learns representations with an encoder composed of a shared backbone and a multi-layer perceptron, plus a prediction head trained to minimize negative cosine similarity between views
- The BarlowTwins model, which minimizes the redundancy between embedding components and improves on SimSiam by replacing the “stopgrad” component with a cross-correlation matrix derived from two identical networks fed distorted views of the current batch.
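The two objectives named in the list above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming tiny hand-written embeddings and an off-diagonal weight of 0.005 that I picked for the example. There is no autograd here, so SimSiam's stop-gradient is only noted in a comment, and none of the continual learning machinery from the paper is shown.

```python
import math

def negative_cosine_similarity(p, z):
    """SimSiam-style objective for one pair; in SimSiam, z would be treated as a constant (stop-gradient)."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return -dot / (norm_p * norm_z)

def standardize_columns(batch):
    """Give each embedding dimension zero mean and unit variance across the batch."""
    n, d = len(batch), len(batch[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        column = [row[j] for row in batch]
        mean = sum(column) / n
        std = math.sqrt(sum((c - mean) ** 2 for c in column) / n)
        for i in range(n):
            out[i][j] = (batch[i][j] - mean) / std
    return out

def barlow_twins_loss(z_a, z_b, off_diag_weight=0.005):
    """Push the cross-correlation matrix of two views toward the identity."""
    a, b = standardize_columns(z_a), standardize_columns(z_b)
    n, d = len(a), len(a[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            c_ij = sum(a[k][i] * b[k][j] for k in range(n)) / n
            loss += (1.0 - c_ij) ** 2 if i == j else off_diag_weight * c_ij ** 2
    return loss

view_a = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
view_b = [[row[1], row[0]] for row in view_a]  # same batch with the two dimensions swapped

print(negative_cosine_similarity([1.0, 0.0], [2.0, 0.0]))
print(barlow_twins_loss(view_a, view_a))
print(barlow_twins_loss(view_a, view_b))
```

Identical, decorrelated views give a Barlow Twins loss of zero, while swapping the embedding dimensions is penalized on both the diagonal and off-diagonal terms.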
Weighted Training for Cross-Task Learning — Shuxiao Chen, Koby Crammer, Hangfeng He, Dan Roth, Weijie J Su
Ok, first I am going to try to explain what I think is happening here. TAWT stands for target-aware weighted training. In mainstream big data self-supervised applications, like large NLP models such as BERT, the training signal is often a weak signal from cross-task data. E.g. if you have a big text corpus, some text is magazine articles, some is academic papers, some might be recipes; you get it, it’s all very different types of text, and some may be more relevant than others to different tasks. In this application, you have a subset of the text corpus associated with a specific task; let’s say that target task is coding, for instance. Now there are a few ways you could take advantage of the weak signal from the large text corpus in training for your target task of coding. One is to first train on the large corpus and then fine tune to the target task; another is to jointly train between the large corpus and target task. TAWT comes into play for both cases by weighting the importance of the large corpus samples to your target task based on an estimate of task distance between large corpus samples and target corpus samples, and I think it is specifically the large corpus sample importance that receives weight adjustments during training. Benefits are demonstrated in both paradigms of pretraining/fine-tuning and joint training. It results in improved sample efficiency of the large corpus and improved performance on the resulting target task.
Of course, to take advantage of this in the pretraining/fine-tuning regime you will need to be aware of the target task at the time of pretraining. So presumably you can’t just take an off-the-shelf BERT model and apply this method.
I expect there may be a tradeoff towards generalization properties, e.g. if you are trying to train the next GPT-3 you wouldn’t necessarily want your model to be overly tuned to one downstream task. But if you are primarily interested in a model for use towards code generation, this is less relevant.
Part of this work is associated with a new type of representation-based task distance between domains. Similar tasks, e.g. writing recipes or specifications and writing code, may have a short distance. More diverse tasks like poetry and code you would expect to have a large distance.
Sparse Communication via Mixed Distributions — António Farinhas, Wilker Aziz, Vlad Niculae, Andre Martins
The real abstract is a good summary. Note that these hybrid discrete-continuous variables (what they call “mixed random variables”) are not new; the novelty here is associated with formalizing the theoretical underpinnings, and in the process establishing a new framing of KL divergence to evaluate such variable types. Figure 1 is really helpful for understanding the utility of such a framing: continuous relaxations of mixed variables like Logistic-Normal distributions, when representing a probability vector over logits, fail to capture any density on the faces of the simplex, while their Gaussian-Sparsemax supports assigning probability to the full simplex.
KL divergence is an immensely important information measure throughout ML applications, and any fundamental reframing such as the one presented here could be of wide interest.
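The contrast between the two relaxations can be seen numerically. Below is a minimal sketch, assuming a three-way probability vector and unit-variance Gaussian noise (both my choices): softmax of a Gaussian sample (a Logistic-Normal draw) always lands strictly inside the simplex, while sparsemax of the same kind of sample can land on a face, with some coordinates exactly zero. It does not touch the paper's KL or entropy results.

```python
import math
import random

def softmax(z):
    shift = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def sparsemax(z):
    """Euclidean projection of z onto the probability simplex."""
    z_sorted = sorted(z, reverse=True)
    cumulative, tau = 0.0, 0.0
    for j, value in enumerate(z_sorted, start=1):
        cumulative += value
        if 1.0 + j * value > cumulative:   # coordinate j is still in the support
            tau = (cumulative - 1.0) / j
    return [max(v - tau, 0.0) for v in z]

def gaussian_sample(mean, std, rng):
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]

rng = random.Random(0)
mean = [0.0, 0.0, 0.0]
logistic_normal_draws = [softmax(gaussian_sample(mean, 1.0, rng)) for _ in range(100)]
gaussian_sparsemax_draws = [sparsemax(gaussian_sample(mean, 1.0, rng)) for _ in range(100)]

on_a_face = sum(1 for y in gaussian_sparsemax_draws if 0.0 in y)
print("Gaussian-Sparsemax draws with an exact zero:", on_a_face, "of 100")
print("Logistic-Normal draws with an exact zero:",
      sum(1 for y in logistic_normal_draws if 0.0 in y), "of 100")
```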
Efficiently Modeling Long Sequences with Structured State Spaces — Albert Gu, Karan Goel, Christopher Re
Addresses the question of how to handle long-range dependencies, which traditional sequence learning frameworks (LSTMs, Transformers, etc.) struggle with.
From the real abstract:
“A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) x′(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t), and showed that for appropriate choices of the state matrix A, this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning A with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel.”
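To get a feel for the state space model in that abstract, here is a minimal sketch with a scalar state, so A, B, C and D are plain numbers. The parameter values, the step size and the bilinear discretization are assumptions for illustration. It shows that the discretized SSM can be run either as a recurrence or as a convolution with a precomputed kernel, and that the two agree. The low-rank correction and Cauchy kernel that make S4 efficient are not shown.

```python
def discretize(a, b, dt):
    """Bilinear discretization of x'(t) = a x(t) + b u(t) for a scalar state."""
    denom = 1.0 - 0.5 * dt * a
    return (1.0 + 0.5 * dt * a) / denom, dt * b / denom

def run_recurrence(a_bar, b_bar, c, d, inputs):
    """Step the state one input at a time: x_k = a_bar x_{k-1} + b_bar u_k, y_k = c x_k + d u_k."""
    x, outputs = 0.0, []
    for u in inputs:
        x = a_bar * x + b_bar * u
        outputs.append(c * x + d * u)
    return outputs

def run_convolution(a_bar, b_bar, c, d, inputs):
    """Same outputs from the kernel K_j = c * a_bar**j * b_bar applied as a causal convolution."""
    kernel = [c * (a_bar ** j) * b_bar for j in range(len(inputs))]
    return [
        sum(kernel[j] * inputs[k - j] for j in range(k + 1)) + d * inputs[k]
        for k in range(len(inputs))
    ]

a_bar, b_bar = discretize(a=-0.5, b=1.0, dt=0.1)  # hypothetical parameters
signal = [1.0, 0.0, -1.0, 0.5, 2.0]
y_recurrent = run_recurrence(a_bar, b_bar, c=2.0, d=0.1, inputs=signal)
y_convolved = run_convolution(a_bar, b_bar, c=2.0, d=0.1, inputs=signal)
print(y_recurrent)
print(y_convolved)
```

With a matrix-valued A the kernel involves matrix powers, which is where the prohibitive computation mentioned in the abstract comes from.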
ProtoRes: Proto-Residual Network for Pose Authoring via Learned Inverse Kinematics — Boris N. Oreshkin, Florent Bocquelet, Felix G. Harvey, Bay Raitt, Dominic Laflamme
Pose estimation refers to inferring the location and orientation of human joints in 3D space from a 2D image or video, which can be used to simplify a modeled representation to just a few abstracted line vectors associated with torso and limbs. In many graphics applications these vector skeletons can serve as targets for controlling movement simulations, and then post-processing can flesh them out to a full visualization including flesh, clothes, hair, etc. based on that input. So one potential application could be to infer the pose from a cinematic video and use the vector skeleton as a basis for special effects.
When you see movie production studios demonstrate their special effects operations you usually see actors in special suits with location sensor aids surrounding. This paper’s method is intended for use towards natural images without special visualization support.
- Authors developed some new benchmarks for pose estimation task
- Authors developed a new architecture for pose estimation that “combines residual connections with prototype encoding of a partially specified pose to create a new complete pose from the learned latent space” (which they report outperforms a Transformer-based baseline).
Comparing Distributions by Measuring Differences that Affect Decision Making — Shengjia Zhao, Abhishek Sinha, Yutong He, Aidan Perreault, Jiaming Song, Stefano Ermon
A few excerpts in original authors’ language:
“Integral probability metrics (IPMs, Muller (1997)) and f-divergences (Csiszar, 1964) are widely used discrepancies in machine learning. IPMs, such as the Wasserstein distance, maximum mean discrepancy (MMD) (Rao, 1982; Burbea & Rao, 1984; Gretton et al., 2012), are based on the idea that if two distributions are identical, any function should have the same expectation under both distributions. IPMs are used to define training objectives for generative models (Arjovsky et al., 2017), perform independence tests (Doran et al., 2014), robust optimization (Esfahani & Kuhn, 2018) among many other applications. f-divergences, such as the KL divergence and the Jensen Shannon divergence, are based on the idea that if two distributions are identical, they assign the same likelihood to every point. One can then define a discrepancy based on how different the likelihood ratio is from one. KL divergence underlies some of the most commonly used training objectives for both supervised and unsupervised machine learning algorithms, such as cross entropy loss”
“We propose a third category of divergences called H-divergences that overlaps with but also extends the set of integral probability metrics or the set f-divergences. Intuitively, H-divergence compares two distributions in terms of the optimal loss for a certain decision task. This optimal loss corresponds to a generalized notion of entropy (DeGroot et al., 1962). Instead of measuring the best average code length of any encoding scheme (Shannon entropy), the generalized entropy uses arbitrary loss function (rather than code length) and set of actions (rather than encoding schemes), and is defined as the best expected loss among the set of actions. In particular, given two distribution p and q, we compare the generalized entropy of the mixture distribution (p+q)/2 and the generalized entropy of p and q individually. Intuitively, if p and q are different, it is more difficult to minimize expected loss under the mixture distribution (p + q)/2, and hence the mixture distribution should have higher generalized entropy; if p and q are identical, then the mix”
“scientists and policy makers are often interested not only in if two distributions are different, but how two distributions are different and whether the differences affect decision making. Typical divergence measures (such as KL) or two sample tests only quantify if two distributions are different, while we show that H-divergence is a useful tool for quantifying how distributions are different” … “By choosing suitable loss functions”
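The idea in these excerpts can be tried on a toy decision problem. The sketch below assumes two weather outcomes, two actions and a loss table that I made up. It uses one simple averaging form of the comparison described in the excerpt (generalized entropy of the mixture minus the average of the individual entropies); the paper's definition is more general than this.

```python
def generalized_entropy(p, loss, actions):
    """Best achievable expected loss under distribution p over a finite action set."""
    return min(sum(p_x * loss[x][a] for x, p_x in p.items()) for a in actions)

def h_divergence(p, q, loss, actions):
    """Entropy of the mixture (p + q) / 2 minus the average entropy of p and q."""
    mixture = {x: 0.5 * (p[x] + q[x]) for x in p}
    return generalized_entropy(mixture, loss, actions) - 0.5 * (
        generalized_entropy(p, loss, actions) + generalized_entropy(q, loss, actions)
    )

# Hypothetical losses: carrying an umbrella always costs 1, getting soaked costs 5.
loss = {
    "rain": {"umbrella": 1.0, "no umbrella": 5.0},
    "dry": {"umbrella": 1.0, "no umbrella": 0.0},
}
actions = ["umbrella", "no umbrella"]

wet = {"rain": 0.9, "dry": 0.1}
damp = {"rain": 0.8, "dry": 0.2}
arid = {"rain": 0.1, "dry": 0.9}

print("wet vs arid:", h_divergence(wet, arid, loss, actions))
print("wet vs damp:", h_divergence(wet, damp, loss, actions))
```

The wet and arid climates call for different actions, so the divergence is positive (about 0.25 here). The wet and damp climates differ as distributions but lead to the same decision, so under this loss the divergence is essentially zero.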
Domino: Discovering Systematic Errors with Cross-Modal Embeddings — Sabri Eyuboglu, Maya Varma, Khaled Kamal Saab, Jean-Benoit Delbrouck, Christopher Lee-Messer, Jared Dunnmon, James Zou, Christopher Re
This paper is trying to break the training problem down into different sub-regions of the latent space to help compensate for underperforming segments. The application is for self-supervised multi-modal large transformer models, and the slicing is derived based on categories and subsegments of those categories. The secret sauce here is how they derive those slices, i.e. the “slice discovery methods”, which they assess by a rigorous large-scale quantitative evaluation. They do so by projection to a multi-modal embedding, which is the portion evaluated. In parallel, they take advantage of the built-in language modality to generate natural language descriptions of those slices.
CycleMLP: A MLP-like Architecture for Dense Prediction — Shoufa Chen, Enze Xie, Chongjian GE, Runjian Chen, Ding Liang, Ping Luo
Fully connected layer convolutions in CNNs (for image recognition) allow models to handle images with different resolutions between training and inference. Previous versions either connected convolutions along a channel or along a spatial dimension. The proposed cycle fully-connected layer improves on both methods; the tradeoffs are cleanly detailed in the description of Figure 1, and the benchmarks for performance and complexity scaling are shown in Table 1.
To summarize: the cycle method was the top benchmark performer and matched the lowest complexity scaling. Cycle had a benefit over the channel method in that it has a larger receptive field, so it is capable of learning more spatial context. Cycle also had a benefit over the spatial fully connected method in that it does not have a fixed parameter size.
Open-Set Recognition: A Good Closed-Set Classifier is All You Need — Sagar Vaze, Kai Han, Andrea Vedaldi, Andrew Zisserman
I fell for the clickbait of the phrase “is all you need” in the title.
Open set recognition problems differ from closed set recognition in that the classifier must be capable of accepting any form of input, even categories that were not found in training. In practice, translating a closed set application to open set recognition is simply a matter of adding an additional category in inference representing “this category was not found in training”. I initially thought the key insight of this paper was that they linked the performance of a closed set framing to an open set framing, such that models with the capacity to distinguish whether a category is consistent with any or none of the training labels performed better in the closed set task as well. However, after taking a second look it appears to be the reverse, as should have been apparent from the inclusion of the phrase “is all you need”, which is very appropriate in this context. Apparently they are demonstrating that improving performance of the closed set application automatically translates, in the same model, to improved open set distinction. Which aligns with intuition.
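The “additional category at inference” framing can be sketched as a confidence threshold on top of an ordinary closed set classifier. This is a common baseline and not necessarily the scoring rule used in the paper. The class names, logits and the 0.5 threshold below are hypothetical.

```python
import math

def softmax(logits):
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def open_set_predict(logits, class_names, threshold=0.5):
    """Return the top closed-set class, or 'unknown' when the classifier is not confident enough."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return class_names[best] if probs[best] >= threshold else "unknown"

known_classes = ["cat", "dog", "bird"]
print(open_set_predict([4.0, 0.5, 0.2], known_classes))  # one clearly dominant logit
print(open_set_predict([1.0, 1.1, 0.9], known_classes))  # nearly flat logits
```

Under this framing, a better closed set classifier produces sharper scores on the categories it knows, which is one way to read why closed set gains would carry over to open set detection.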
iLQR-VAE : control-based learning of input-driven dynamics with applications to neural data — Marine Schimel, Ta-Chu Kao, Kristopher T Jensen, Guillaume Hennequin
Studying neural dynamics is limited by the extent of sensor input. We often only have access to a limited frequency range and neural adjacency range. So in order to model dynamics with respect to stimuli and actions, we need to be able to infer dynamics beyond the scope of measurement. As neural activity is to some extent a collective system, as in a fully connected network from a set of neurons (at least fully connected within brain regions, and otherwise at a minimum connected by conduits between regions), we can expect that to some extent the unmeasured scope may be recoverable from measurements within proximity of a sensor (this expectation is valid both for internal measurements and to a lesser extent what can be inferred by actions, e.g. variations in input that led to primate behavior switching events).
Neural dynamics are fundamentally a complex system with non-linear dynamics. Learning such inherent dynamics can be aided by framing as a control problem, and the architecture discussed in this paper (“iLQR-VAE”) extends prior work (LFADS) on VAE latent observation encoding driving an RNN generator by replacing a secondary bidirectional RNN as encoder for inference with an optimization-based recognition model from iLQR (a linear quadratic regulator algorithm), which stabilizes training, prevents posterior collapse, and greatly reduces the number of hyperparameters.
The application of the method could extend to many fields with stochastic nonlinear dynamical systems; here the focus was on applications in neuroscience.
Language modeling via stochastic processes — Rose E Wang, Esin Durmus, Noah Goodman, Tatsunori Hashimoto
Current generations of language models often fail to produce convincing long form text by simply following next word generation arcs. Authors propose to introduce Brownian bridge trajectories, which pin down the starting and ending points of an arc and so provide a structure for inference to stochastically flesh out, which they claim results in more coherent long form trajectories. They refer to the practice as Time Control.
My suggestion for extension is that if we want our NLP capabilities to approach the colors of human speech, it is not enough to bridge a trajectory between statements. Consider the paper “MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling” by Wu et al. They looked at MIDI-encoded performances and sought to vary musical tones and timbres across a variety of axes. The notes could be played from pianissimo to forte, staccato, accented, etc., each in order to more closely resemble the complexity embedded by human performers. Such trajectories could be considered for language generation as well, perhaps eventually enabling the difference between Kernighan & Ritchie versus Kurt Vonnegut.
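For intuition on what a Brownian bridge looks like, here is a minimal one-dimensional sketch. The start and end values, the number of steps and the unit noise scale are assumptions for illustration; in the paper the trajectory lives in a learned latent space, and this snippet does not involve any language model.

```python
import math
import random

def sample_brownian_bridge(start, end, num_steps, rng):
    """Sample a 1-D Brownian bridge at integer times 0..num_steps, pinned at both ends."""
    path = [start]
    for t in range(1, num_steps + 1):
        remaining = num_steps - (t - 1)         # time left until the pinned end point
        mean = path[-1] + (end - path[-1]) / remaining
        variance = (remaining - 1) / remaining  # shrinks to 0 at the final step
        path.append(mean + math.sqrt(variance) * rng.gauss(0.0, 1.0))
    return path

rng = random.Random(0)
latent_plan = sample_brownian_bridge(start=-1.0, end=2.0, num_steps=8, rng=rng)
print([round(z, 2) for z in latent_plan])
```

Each sampled path wanders in the middle but is guaranteed to begin and finish at the chosen points, which is the kind of structure a generated document could be asked to follow.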
Natural Language Descriptions of Deep Visual Features — Evan Hernandez, Sarah Schwettmann, David Bau, Teona Bagashvili, Antonio Torralba, Jacob Andreas
It was surprising to me to see the authors demonstrate just how specific the interpretability potential of even single neurons in a feedforward network can be. It is well known that human neurons may be tied to specific properties, e.g. a neuron that activates for a specific celebrity, but I had assumed that for e.g. the image modality this may have only progressed down to the filter portion of a convolutional architecture (not sure if filter is the right word). Here they are using mutual information between neuron activation strengths and image content to identify correlated image properties, which are then compared to a custom annotation data set they built for this purpose to generate natural language descriptions in bulk for specific neurons. They call this method MILAN, for mutual information-generated language annotated neurons, and they refer to the annotations themselves as MILANNOTATIONS. They say this is the first time that individual neurons have been categorized in bulk in such a fashion; the method appears tied to the image modality though.
Einops: Clear and Reliable Tensor Manipulations with Einstein-like Notation — Alex Rogozhnikov
Tensor manipulations are a common debugging pain point in mainstream DL frameworks, where chains of operations like transpose and reshape are easy to mis-specify through mixups in the order of tensor dimensions at intermediate steps. These types of error may not halt operation and may only be identified by manually tracing through the steps. Further complicating matters, nearly every library (like TensorFlow, PyTorch) has its own specification conventions, and attempts to address this by labeling axes are not widely used or fully supported.
Einops seeks to simplify by having a common interface that abstracts between libraries for tensor reshapings, with a string representation of axes and their order that can be returned as an arbitrary rearranged view instead of requiring complicated derivations between configurations (e.g. “a b c -> b c a”). For a demonstration of how much complexity is abstracted away, consider the code demonstration on OpenAI’s Glow model on page 7. Although it wasn’t included in their demonstration, the library does have JAX support.
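To illustrate what a named-axis pattern buys you, here is a toy stand-in written against plain nested lists. The function `permute_axes` is hypothetical and is not how einops is implemented; it only handles pure axis reordering. With the real library the equivalent call would be `rearrange(x, "a b c -> b c a")` on a tensor.

```python
def shape_of(nested):
    """Shape of a rectangular nested list, e.g. [[1, 2, 3], [4, 5, 6]] -> [2, 3]."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return shape

def permute_axes(nested, pattern):
    """Reorder the axes of a nested list using a pattern such as 'a b c -> b c a'."""
    source, target = (side.split() for side in pattern.split("->"))
    if sorted(source) != sorted(target):
        raise ValueError("both sides of the pattern must name the same axes")
    shape = shape_of(nested)
    if len(shape) != len(source):
        raise ValueError("pattern does not match the number of dimensions")
    size = dict(zip(source, shape))

    def build(prefix):
        if len(prefix) == len(target):
            index = dict(zip(target, prefix))
            item = nested
            for name in source:            # look the element up in the original axis order
                item = item[index[name]]
            return item
        axis = target[len(prefix)]
        return [build(prefix + [i]) for i in range(size[axis])]

    return build([])

matrix = [[1, 2, 3], [4, 5, 6]]
print(permute_axes(matrix, "rows cols -> cols rows"))
print(permute_axes([[[1, 2]], [[3, 4]]], "a b c -> b c a"))
```

Because the axes are named in the pattern, a mismatch between the pattern and the data raises an error right away instead of silently producing a wrongly shaped result.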
A fun takeaway though was from the program chairs’ assessment, offered here verbatim:
The negative reviewers appear fixated on the (true) observation that the paper does not look like a conventional ICLR paper, that it “reads like a technical blog”, and “lacks rigour”.
I believe it is fair and measured to state that these reviews may be considered to exhibit aspects of gatekeeping: requiring more “mathiness” that does not help the paper, or more “rigour” through user studies that are in fact less valuable than the reviewers’ own observations “I could see myself…”, “I tend to buy…”.
This is a paper about design, not about models or algorithms (although the algorithmic work is good). It is about the design of tools that we all use, and about the decisions and thought processes that led to that design. A reviewer decries “many non-rigorous claims”. These are claims about the usability of existing systems, and mostly appear in the discussion and footnotes, as the authors note in rebuttal. Of course, one could have run user studies to back up each claim, but I am just as convinced by the examples shown in the paper. It matters not to me what some users corralled into a user study thought. It matters what I and my colleagues will think, and I am now sure to recommend einops to colleagues. I would not have met it had the paper not been submitted to ICLR, and hence I am certain it should be accepted, so more can see that we care not just about mathiness, but actually enabling progress in our field.
The job of a conference like ICLR is to expose researchers and practitioners in machine learning to ideas and techniques that may advance their research and practice. Programming, and the translation of mathematical ideas to efficient computer code, are fundamental to all of machine learning, and hence programming models are very much suitable for presentation to an ICLR audience.
Frame Averaging for Invariant and Equivariant Network Design — Omri Puny, Matan Atzmon, Edward J. Smith, Ishan Misra, Aditya Grover, Heli Ben-Hamu, Yaron Lipman
I was a little out of my depth on this paper. I had trouble visualizing what was meant by symmetries of invariance or equivariance, and that made most of the rest of the discussion even harder to follow. The easiest part of the paper for me to understand was the abstract, so I will just defer to that summary, which is better than any I could offer.
Chen, S., Crammer, K., He, H., Roth, D., and Su, W. J. Weighted training for cross-task learning. In International Conference on Learning Representations, 2022a. URL https://openreview.net/forum?id=ltM1RMZntpu.
Chen, S., Xie, E., Ge, C., Chen, R., Liang, D., and Luo, P. CycleMLP: A MLP-like architecture for dense prediction. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=NMEceG4v69Y.
Eyuboglu, S., Varma, M., Saab, K. K., Delbrouck, J.-B., Lee-Messer, C., Dunnmon, J., Zou, J., and Re, C. Domino: Discovering systematic errors with cross-modal embeddings. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=FPCMqjI0jXN.
Farinhas, A., Aziz, W., Niculae, V., and Martins, A. Sparse communication via mixed distributions. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=WAid50QschI.
Gu, A., Goel, K., and Re, C. Efficiently modeling long sequences with structured state spaces. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=uYLFoz1vlAC.
Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., and Andreas, J. Natural language descriptions of deep visual features. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=NudBMY-tzDr.
Kumar, A., Raghunathan, A., Jones, R. M., Ma, T., and Liang, P. Fine-tuning can distort pretrained features and underperform out-of-distribution. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=UYneFzXSJWh.
Madaan, D., Yoon, J., Li, Y., Liu, Y., and Hwang, S. J. Representational continuity for unsupervised continual learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=9Hrka5PA7LW.
Oreshkin, B. N., Bocquelet, F., Harvey, F. G., Raitt, B., and Laflamme, D. ProtoRes: Proto-residual network for pose authoring via learned inverse kinematics. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=s03AQxehtd_.
Popov, V., Vovk, I., Gogoryan, V., Sadekova, T., Kudinov, M. S., and Wei, J. Diffusion-based voice conversion with fast maximum likelihood sampling scheme. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=8c50f-DoWAu.
Puny, O., Atzmon, M., Smith, E. J., Misra, I., Grover, A., Ben-Hamu, H., and Lipman, Y. Frame averaging for invariant and equivariant network design. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=zIUyj55nXR.
Rogozhnikov, A. Einops: Clear and reliable tensor manipulations with Einstein-like notation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=oapKSVM2bcj.
Schimel, M., Kao, T.-C., Jensen, K. T., and Hennequin, G. iLQR-VAE : control-based learning of input-driven dynamics with applications to neural data. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=wRODLDHaAiW.
Vaze, S., Han, K., Vedaldi, A., and Zisserman, A. Open-set recognition: A good closed-set classifier is all you need. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=5hLP5JY9S2d.
Wang, R. E., Durmus, E., Goodman, N., and Hashimoto, T. Language modeling via stochastic processes. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=pMQwKL1yctf.
Zhao, S., Sinha, A., He, Y., Perreault, A., Song, J., and Ermon, S. Comparing distributions by measuring differences that affect decision making. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=KB5onONJIAU.