Table of Contents
- Industrial Talks
- Bayesian Deep Learning Tutorial
- Deep Learning Theory
- Optimization and Regularization
- Adversarial Image Synthesis and Robustness
- Attention-based CNN Architectures
- Interpretability and Visualization
- Interesting Talk at Workshop
Disclaimer: The content of these notes does not fully reflect all the trends at NeurIPS 2019. There are topics I am interested in but overlooked here, for example RL, meta-learning, graph representation learning, and causal inference. There are also other great posts below that offer different perspectives on NeurIPS 2019.
Acknowledgement
First, I would like to thank my collaborators at NTHU and Academia Sinica Taiwan: Ting-I Hsieh, Hwann-Tzong Chen and Tyng-Luh Liu. Secondly, I would also like to thank my managers at MediaTek Taiwan, Yu-Lin Chang, Chia-Ping Chen and Shaw-Min Lei, who kindly supported me in attending NeurIPS 2019. Without all this help, I would not have been able to document all the great things I experienced at NeurIPS here!
Quick promotion! We are honored to have our paper “One-Shot Object Detection with Co-Attention and Co-Excitation” accepted at NeurIPS 2019. TL;DR: We present a simple and effective three-step framework (co-attend, squeeze and co-excite, and rank) for the challenging task of one-shot object detection (i.e., given one query image whose class label is never seen during training, detect all instances of the same class in a target image). Go check out our PyTorch code on GitHub!
Sunday (12/8): Industrial Talks & Panels
Interpretability — Now What? by Google
Talk by Been Kim, a senior research scientist at Google Brain and an area chair at NeurIPS 2019. Her work mainly focuses on developing human-centric interpretability tools for deep learning model predictions and on investigating the fragility of saliency-based explanation methods for neural networks.
Sanity Checks for Saliency Maps. NIPS 2018 (spotlight)
- Problem: Some widely used saliency methods for explaining predictions do not actually reflect the evidence for the prediction.
- Proposes two sanity checks for explanation methods, motivated by the idea that “when the prediction changes, the explanation should change.”
- Approach: Change the prediction via randomization tests: 1) model parameter randomization; 2) data label randomization (a rough sketch of the parameter-randomization check appears after this list).
- What we learned: We should be wary of confirmation bias when developing explanation methods. Just because an explanation “makes sense” to humans doesn’t mean it reflects the evidence for the prediction.
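To make the parameter-randomization test concrete, here is a minimal PyTorch sketch (my own illustration, not the authors' code) that compares a plain input-gradient saliency map before and after re-initializing the classifier head of a pretrained ResNet; the image tensor and target class are placeholders.

```python
import torch
import torch.nn as nn
from torchvision import models

def grad_saliency(model, image, target_class):
    """Plain input-gradient saliency: |d logit_c / d input|, max over channels."""
    image = image.clone().requires_grad_(True)
    model(image)[0, target_class].backward()
    return image.grad.abs().max(dim=1)[0]

image = torch.randn(1, 3, 224, 224)      # placeholder for a real preprocessed image
target_class = 207

model = models.resnet18(pretrained=True).eval()
saliency_before = grad_saliency(model, image, target_class)

# Cascading randomization (only the top layer here, for brevity):
# re-initialize the classifier head, then recompute the explanation.
nn.init.normal_(model.fc.weight, std=0.01)
nn.init.zeros_(model.fc.bias)
saliency_after = grad_saliency(model, image, target_class)

# A sound explanation method should now produce a very different map;
# a simple proxy is the correlation between the two saliency maps.
a, b = saliency_before.flatten(), saliency_after.flatten()
corr = ((a - a.mean()) * (b - b.mean())).mean() / (a.std() * b.std() + 1e-8)
print(f"saliency correlation after randomization: {corr.item():.3f}")
```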
TCAV: Testing with Concept Activation Vectors. ICML 2018.
- Problem: How important was a concept (e.g., gender, race) for a prediction in a trained model, even if the concept was not part of the training data? Can we quantitatively measure how important any of these user-chosen concepts are?
- Proposes TCAV, which provides a quantitative importance score for a concept if and only if your network has actually learned it. TCAV does not require changing or retraining your network.
- Approach: Given a set of user-chosen concept images, random images, and the trained network, train a linear classifier to separate activations of concept images from those of random images; the Concept Activation Vector (CAV) is the normal of its decision boundary. Testing with the CAV uses directional derivatives to measure the “conceptual sensitivity” of class k to concept C (see the sketch after this list).
- Limitations: 1) The concept has to be expressed using image examples; 2) the user needs to know which concepts they want to test for. A follow-up work, “Towards Automatic Concept-based Explanations” (NeurIPS 2019), automatically discovers concepts for images. 3) Explanations provided by TCAV are not causal; a follow-up work on causal TCAV, “On Concept-Based Explanations in Deep Neural Networks”, was submitted to ICLR 2020.
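A rough sketch of the CAV step (my paraphrase with placeholder activations, not the official implementation): fit a linear classifier that separates concept activations from random activations, take the normal of its decision boundary as the CAV, and score a class by how often the directional derivative along the CAV is positive.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Placeholder layer-l activations for concept images vs. random images.
concept_acts = rng.normal(size=(100, 512))
random_acts = rng.normal(size=(100, 512))

X = np.vstack([concept_acts, random_acts])
y = np.array([1] * len(concept_acts) + [0] * len(random_acts))

# The CAV is the (normalized) normal vector of the linear decision boundary.
clf = LogisticRegression(max_iter=1000).fit(X, y)
cav = clf.coef_[0] / np.linalg.norm(clf.coef_[0])

def conceptual_sensitivity(logit_fn, act, cav, eps=1e-2):
    """Finite-difference directional derivative of the class-k logit along the CAV."""
    return (logit_fn(act + eps * cav) - logit_fn(act)) / eps

def tcav_score(logit_fn, class_acts, cav):
    """Fraction of class-k examples whose logit increases along the concept direction."""
    sens = np.array([conceptual_sensitivity(logit_fn, a, cav) for a in class_acts])
    return (sens > 0).mean()

# Toy usage: a linear "upper network" stands in for the layers above l.
w_upper = rng.normal(size=512)
logit_fn = lambda a: float(a @ w_upper)
print(tcav_score(logit_fn, rng.normal(size=(50, 512)), cav))
```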
Private Federated Learning by Apple
On-device training on user data using federated learning with differential privacy can be used to improve global models in the cloud. Apple has started to use this technology in iOS 13 for a variety of use cases, including the QuickType keyboard, Found In Apps, and Personalized “Hey Siri”. More details can be found in the paper “Protection Against Reconstruction and Its Applications in Private Federated Learning” and the talk at WWDC 2019.
How the Game Industry is Driving Advances in AI Research by Unity
- Introduces the Unity Machine Learning Agents Toolkit for training and deploying AI/RL agents.
- Introduces AI-based challenges, the Obstacle Tower Challenge and The Animal-AI Olympics. See more details on the official Unity blog.
Monday (12/9): Tutorials
Deep Learning with Bayesian Principles [Video][Slides]
Talk by Emtiyaz Khan, who leads the Approximate Bayesian Inference (ABI) Team at RIKEN AIP in Tokyo; most of his work focuses on Bayesian deep learning. He has two interesting papers, “Practical DL with Bayes” and “Approximate Inference Turns DNNs into GPs”, accepted at NeurIPS 2019. Recently, he has been working on a paper version of this tutorial, which may contain more detail.
TL;DR: Many existing optimization algorithms in deep learning (e.g., SGD, RMSprop, Adam) and exact/approximate inference methods in Bayesian learning (e.g., Laplace approximation, variational inference) can be derived from Bayesian principles (the “Bayesian learning rule”). We can leverage these principles to design better deep learning algorithms for uncertainty estimation, active learning, and life-long learning.
- Problem: Humans learn in a sequential, continual way, a type of life-long learning. We continually interact with the environment, receive small bits of feedback, and keep learning and improving our knowledge about the world. Even when the environment changes (is non-stationary), we can still adapt and adjust. This kind of learning is very different from the kind we see right now: deep learning is really bulk learning. We assume that everything we need to generalize in the world is present in a large amount of (stationary) data, and we then absorb all the knowledge in that data into our network. Emtiyaz thinks that the best available mathematical framework for explaining human learning is Bayesian learning.
- Important concept: Deep learning (DL) is a “local/simple” method (it tries to find one plausible model and therefore scales to large problems), whereas Bayesian learning is a “global/complex” method (it tries to find all possible models, i.e., the posterior, making use of a supportive prior/belief, and therefore does not scale to large problems). To improve DL algorithms, we just need to add a “global” touch to them.
- The intuition behind deriving DL algorithms from the “Bayesian learning rule”: Suppose we have a 2D posterior distribution. In the simplest case, we approximate it with a Gaussian (a red circle) whose mean we estimate while keeping the covariance fixed (i.e., we move the red circle around the 2D posterior). If we estimate only the mean using the “Bayesian learning rule”, we get something like a first-order method (e.g., SGD). If we also estimate the covariance (i.e., a multivariate Gaussian), we get a second-order method (e.g., Newton’s method). If we use a mixture of multivariate Gaussians, we get an ensemble of Newton methods. And when the approximation becomes the exact posterior, the “Bayesian learning rule” reduces to Bayes’ rule. See more in “Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations” (ICML 2019).
- Improving DL algorithms (e.g., RMSprop, Adam) by adding a “Bayesian touch”: They propose Variational Online Gauss-Newton (VOGN), which learns like RMSprop/Adam but yields uncertainty as a by-product (see the sketch after this list)! See more in their “Practical DL with Bayes” (NeurIPS 2019) paper.
- Challenges: The recent paper “Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift” (NeurIPS 2019) appears to contradict these principles: current Bayesian DL methods are not sufficient for estimating good uncertainty under dataset shift, and non-Bayesian ensembles still work best. This tells us that current Bayesian DL methods may not be “global” enough; especially for non-convex problems (DL), local approximations only capture “local uncertainty”. Computing better posterior approximations and better higher-order gradients remain open challenges.
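To make “learns like RMSprop/Adam but with uncertainty as a by-product” concrete, here is a heavily simplified numpy sketch of a VOGN-style update as I understood it from the tutorial (not the authors' implementation; `grad_fn`, the prior precision `delta`, and all hyperparameters are placeholders): the second-moment estimate doubles as the precision of a Gaussian posterior over the weights.

```python
import numpy as np

def vogn_step(mu, s, m, grad_fn, batch, N, lr=1e-3, beta1=0.9, beta2=0.999, delta=1.0):
    """One simplified VOGN-style step (illustrative sketch, not the official algorithm).

    mu: mean of the Gaussian posterior over weights
    s:  RMSprop-like second-moment estimate, reused as posterior precision
    m:  first-moment (momentum) estimate
    grad_fn(w, example) -> per-example gradient of the loss at weights w
    N:  training-set size; delta: prior precision
    """
    delta_tilde = delta / N
    sigma = 1.0 / np.sqrt(N * (s + delta_tilde))
    w = mu + sigma * np.random.randn(*mu.shape)        # sample weights from q(w)

    per_example = np.stack([grad_fn(w, ex) for ex in batch])
    g_hat = per_example.mean(axis=0)
    h_hat = (per_example ** 2).mean(axis=0)            # Gauss-Newton-style curvature proxy

    m = beta1 * m + (1 - beta1) * (g_hat + delta_tilde * mu)
    s = beta2 * s + (1 - beta2) * h_hat
    mu = mu - lr * m / (s + delta_tilde)
    return mu, s, m   # sigma above is the "uncertainty for free"
```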
Tuesday-Thursday (12/10-12/12): Main Conference
Deep Learning Theory
- Uniform Convergence may be Unable to Explain Generalization in Deep Learning (Outstanding New Directions Paper): Existing generalization bounds based on uniform convergence (e.g., Rademacher complexity, PAC-Bayes, covering numbers, compression) may not be able to explain why over-parameterized DNNs generalize well. They found that: 1) existing bounds grow with training-set size, which is empirically not what we observe; 2) using a “hypersphere binary classification task”, they prove that any uniform-convergence-based generalization bound will fail to explain generalization. The high-level idea is that the decision boundary found by SGD on over-parameterized DNNs can have certain complexities that hurt uniform convergence without hurting generalization.
A New Perspective on Understanding Deep Learning: Infinitely Wide Neural Networks & the Neural Tangent Kernel
An infinitely wide (over-parameterized) neural network can be approximated by a linear model with a kernel called the Neural Tangent Kernel (NTK).
For a learning-theory newbie, this direction may be kind of overwhelming, especially the NTK, which only emerged at last year’s NeurIPS. Here I would like to recommend some great posts that cover the NTK preliminaries: 1) Understanding the Neural Tangent Kernel by Rajat; 2) Ultra-Wide Deep Nets and Neural Tangent Kernel (NTK) by Wei Hu and Simon Du.
- On Exact Computation with an Infinitely Wide Neural Network: This paper shows how to exactly compute NTKs for CNNs (CNTKs), which lets us simulate infinitely wide CNNs. They found that: 1) CNTK performance is correlated with that of the corresponding CNNs; 2) techniques that improve CNNs (e.g., global average pooling) also improve CNTKs; 3) theoretically, the CNTK is the infinitely wide version of a CNN. However, there is still a performance gap between CNNs and CNTKs, which suggests that the NTK is only one possible direction for figuring out why over-parameterized DNNs generalize well.
- Neural Tangents: Fast and Easy Infinite Neural Networks in Python: A Python library designed to enable research into infinitely wide neural networks. This software paper actually appeared in the Bayesian Deep Learning Workshop on Friday (12/13); I place it here for better content arrangement. It has recently also been accepted at ICLR 2020. Check out their paper and code (a minimal usage sketch follows below)!
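Here is a minimal usage sketch based on the library's documented `stax` API (the architecture and data are arbitrary placeholders): build a network, obtain its infinite-width NTK, and get closed-form predictions of the corresponding infinitely wide network trained on MSE.

```python
import neural_tangents as nt
from neural_tangents import stax
from jax import random

# An infinite-width fully connected ReLU network; kernel_fn computes its NNGP/NTK kernels.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

key1, key2, key3 = random.split(random.PRNGKey(0), 3)
x_train = random.normal(key1, (20, 8))
y_train = random.normal(key2, (20, 1))
x_test = random.normal(key3, (5, 8))

ntk_train_train = kernel_fn(x_train, x_train, 'ntk')   # exact infinite-width NTK

# Closed-form predictions of the infinitely wide network trained with
# gradient descent on MSE (the NTK regime).
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
y_test_ntk = predict_fn(x_test=x_test, get='ntk')
print(ntk_train_train.shape, y_test_ntk.shape)   # (20, 20) (5, 1)
```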
Optimization and Regularization
- Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks: This paper empirically studies why a large initial learning rate (LR) plus annealing is crucial for generalization. They perform an interesting experiment: train a classifier on a modified CIFAR-10 in which some images carry colored patches that identify the class (i.e., “class signatures”). They found that a small LR quickly memorizes these hard-to-fit class signatures, whereas a large LR first learns easy-to-fit patterns and only starts to memorize hard-to-fit patterns after annealing. The intuition is that the large LR initially weakens representation power through noisy gradients, which prevents overfitting to the “signatures”.
- Time Matters in Regularizing Deep Networks: This paper shows that applying regularization (weight decay or data augmentation) during the initial phase of training matters far more than applying it later (i.e., regularization in deep networks does not work by re-shaping the loss function at convergence).
- Asymmetric Valleys: Beyond Sharp and Flat Local Minima: This paper argues that there exist many asymmetric directions (flat on one side, sharp on the other) around a local minimum, and proves that a solution biased towards the flat side generalizes better. They then show that averaging SGD iterates implicitly induces such biased solutions, which explains why weight averaging generalizes better (a tiny sketch of tail averaging follows below).
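As an illustration of that last point, here is a PyTorch-style sketch of tail weight averaging (in the spirit of SWA; my own illustration, with a placeholder `train_one_epoch` loop): keep a running average of the SGD iterates over the last epochs and evaluate with the averaged weights, which should sit further toward the flat side of an asymmetric valley.

```python
import copy
import torch

def tail_averaged_model(model, optimizer, train_one_epoch, epochs=100, avg_start=75):
    """Run SGD and uniformly average the weights of the last (epochs - avg_start) epochs."""
    avg_state, n_avg = None, 0
    for epoch in range(epochs):
        train_one_epoch(model, optimizer)            # placeholder training loop
        if epoch >= avg_start:
            n_avg += 1
            state = copy.deepcopy(model.state_dict())
            if avg_state is None:
                avg_state = state
            else:
                for k in avg_state:
                    if avg_state[k].is_floating_point():   # skip integer buffers
                        avg_state[k] += (state[k] - avg_state[k]) / n_avg
    avg_model = copy.deepcopy(model)
    avg_model.load_state_dict(avg_state)
    # Note: batch-norm running statistics should be recomputed for avg_model in practice.
    return avg_model
```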
Adversarial Image Synthesis and Robustness
- Image Synthesis with a Single (Robust) Classifier: This paper presents a very interesting idea: using only a single robust classifier, we can perform diverse image synthesis tasks (e.g., generation, super-resolution, inpainting, image-to-image translation, style transfer). The ideas are: 1) optimize the input to increase the target class score (see the sketch after this list); 2) because of adversarial examples, the classifier must be robust enough to be invariant to small input changes.
- On Adversarial Mixup Re-synthesis: This paper extends the idea of mixup: using GANs, they produce novel realistic images from latent codes mixed from two or more input images with different classes or attributes. The learned mixed representations are shown to improve downstream tasks. Another NeurIPS 2019 paper, “On Mixup Training: Improved Calibration and Predictive Uncertainty for Deep Neural Networks”, also leverages mixup, in that case to improve predictive uncertainty.
- Adversarial Examples Are Not Bugs, They Are Features: This paper was already quite famous before being accepted at the conference. They show experimentally that a model trained only on adversarial examples achieves non-trivially good accuracy on the normal test set. This suggests that adversarial examples arise from “non-robust features” naturally present in the data, which help generalization but hurt robustness. See the extensive discussion on the Distill blog. Another paper with similar findings is “A Fourier Perspective on Model Robustness in Computer Vision” (I missed this during the conference and definitely need to check it out later).
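The “optimize the input to increase the class score” step from the first paper, as a hedged PyTorch sketch (the real method relies on an adversarially robust classifier, which I simply assume is loaded as `robust_model`; the step size and noise seed are arbitrary):

```python
import torch

def synthesize(robust_model, target_class, steps=200, step_size=0.1):
    """Gradient ascent on the target-class logit with respect to the input image."""
    robust_model.eval()
    x = torch.rand(1, 3, 224, 224, requires_grad=True)   # seed image (noise)
    for _ in range(steps):
        logit = robust_model(x)[0, target_class]
        grad, = torch.autograd.grad(logit, x)
        with torch.no_grad():
            # Normalized gradient step; a robust model's input gradients are
            # perceptually aligned, so the noise drifts toward class-like structure.
            x += step_size * grad / (grad.norm() + 1e-8)
            x.clamp_(0.0, 1.0)
    return x.detach()
```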
Note: After the conference, I found that the authors of the 1st and 3rd papers are from the Mądry Lab at MIT. They have another paper, “Robustness May Be at Odds with Accuracy” (ICLR 2019), which also looks interesting. In addition, they maintain a nice library, Robustness, for training and evaluating the adversarial robustness of DNNs.
Self Attention + Convolution is Hot!
- Most of these papers share a very similar idea: making the convolutional kernel a function of the input (Conditionally Parameterized Convolutions for Efficient Inference, Neural Similarity Learning), or showing that pure self-attention (Stand-Alone Self-Attention in Vision Models) can outperform CNNs (although they show a combined version is better, since convolution is better at extracting low-level features). See the rough sketch of the input-dependent-kernel idea after this list. By the way, a recently released paper, “On the Relationship between Self-Attention and Convolutional Layers” (ICLR 2020), studies the relationship between self-attention and convolution in depth.
- Cross-channel Communication Networks: A self-attention-augmented version of Squeeze-and-Excitation Networks (SENet). Instead of computing “excitation weights” from the squeeze operation, they replace squeezing with self-attention to encourage information exchange across channels.
- Compositional De-Attention Networks: Motivated by the compositionality of language, they present a more expressive attention composed of tanh ([-1, 1]) and sigmoid ([0, 1]) terms, analogous to “compositional pooling”. Compared to traditional attention, which is usually expressed with [0, 1] weights, the proposed attention can express the “opposite direction” (-1) in addition to relative importance. They achieve better performance than the Transformer.
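A rough sketch of the “kernel as a function of the input” idea from CondConv (my own simplification, not the official implementation): keep a few expert kernels and mix them per example with input-dependent routing weights before a single grouped convolution.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondConv2d(nn.Module):
    """Simplified conditionally parameterized convolution (per-example expert mixing)."""

    def __init__(self, in_ch, out_ch, k=3, num_experts=4):
        super().__init__()
        self.experts = nn.Parameter(0.02 * torch.randn(num_experts, out_ch, in_ch, k, k))
        self.route = nn.Linear(in_ch, num_experts)   # routing from globally pooled features
        self.k = k

    def forward(self, x):
        b, c, h, w = x.shape
        r = torch.sigmoid(self.route(x.mean(dim=(2, 3))))            # (B, E) routing weights
        weight = torch.einsum('be,eoihw->boihw', r, self.experts)    # one kernel per example
        # Per-example convolution via the grouped-convolution trick.
        out = F.conv2d(x.reshape(1, b * c, h, w),
                       weight.reshape(-1, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, -1, h, w)

y = CondConv2d(16, 32)(torch.randn(2, 16, 20, 20))
print(y.shape)   # torch.Size([2, 32, 20, 20])
```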
Interpretability and Visualization
- Full-Gradient Representation for Neural Network Visualization: This paper presents FullGrad, a gradient-based saliency visualization that produces sharper results than commonly used methods. The main idea is that the output of a DNN can be exactly decomposed into two terms, “input-gradients” and “bias-gradients”; together these constitute the “full gradients” used for visualization (a quick numerical check of this decomposition appears after this list). Code is available here.
- Explanations can be manipulated and geometry is to blame: This paper shows that gradient-based saliency explanations can be manipulated arbitrarily by applying input perturbations that keep the DNN’s output approximately constant. They also provide a method to undo the manipulation (i.e., to increase the robustness of saliency maps). This naturally makes me wonder whether improving the robustness of DNNs can also improve the robustness of explanations. It turns out there is a paper, “On the Connection Between Adversarial Robustness and Saliency Map Interpretability” (ICML 2019), related to exactly this thought. I may check it out soon!
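The decomposition claimed by FullGrad is easy to verify numerically for a small ReLU network (a quick sanity check of my own, not the authors' code): for piecewise-linear networks, the output equals the input-gradient term plus the sum of the bias-gradient terms.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small ReLU MLP with biases; the full-gradient decomposition is exact for such nets.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(),
                      nn.Linear(32, 32), nn.ReLU(),
                      nn.Linear(32, 1))

x = torch.randn(1, 10, requires_grad=True)
out = model(x).squeeze()

biases = [m.bias for m in model if isinstance(m, nn.Linear)]
grads = torch.autograd.grad(out, [x] + biases)
input_grad, bias_grads = grads[0], grads[1:]

# f(x) = <x, df/dx> + sum_b <b, df/db>  ("input-gradients" + "bias-gradients")
recon = (x * input_grad).sum() + sum((b * g).sum() for b, g in zip(biases, bias_grads))
print(torch.allclose(out, recon, atol=1e-5))   # expected: True
```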
Friday (12/13): Workshop on Shared Visual Representations in Human & Machine Intelligence
After a series of serious technical topics, I would like to wrap up this post with an entertaining talk by Bill Freeman (a.k.a. William T. Freeman) from MIT: “Feathers, Wings, and the Future of Computer Vision Research”.
- Bill first joked about how computer vision papers were generated in 2013–2019: open to a random page of the “Computer Vision: A Modern Approach” textbook and add “deep” before the subject of that page, or append “GAN” after it.
- Bill then predicted that the key to producing good papers in 2020–2025 will be being able to distinguish wings from feathers: take a concept from the “Vision Science” textbook and add “Architecture for …” in front of it.
- Bill ended the talk by playing “Feather vs. Wings” game with the audience.
Thanks! That’s all for this post! I hope you enjoyed reading it and gained some inspiration, as I did at the NeurIPS conference.