NeurIPS 2019 Notes

Howard Lo
Jan 8, 2020


Table of Contents

  • Industrial Talks
  • Bayesian Deep Learning Tutorial
  • Deep Learning Theory
  • Optimization and Regularization
  • Adversarial Image Synthesis and Robustness
  • Attention-based CNN Architectures
  • Interpretability and Visualization
  • Interesting Talk at Workshop

Disclaimer: These notes do not fully reflect all the trends at NeurIPS 2019. There are topics I am interested in but overlooked here, for example RL, meta-learning, graph representation learning and causal inference. There are also other great posts offering different perspectives on NeurIPS 2019.

Acknowledgement

First, I would like to thank my collaborators at NTHU and Academia Sinica Taiwan, Ting-I Hsieh, Hwann-Tzong Chen and Tyng-Luh Liu. Secondly, I would also like to thank my managers at MediaTek Taiwan, Yu-Lin Chang, Chia-Ping Chen and Shaw-Min Lei, who kindly supported me in attending NeurIPS 2019. Without all their help, I would not have been able to document all the great things I experienced at NeurIPS here!

Quick promotion! We are honored to have our paper “One-Shot Object Detection with Co-Attention and Co-Excitation” accepted at NeurIPS 2019. TL;DR: We present a simple and effective three-step framework (co-attend, squeeze and co-excite, and rank) for the challenging task of one-shot object detection: given one query image whose class label was never seen during training, detect all instances of the same class in a target image. Go check out our PyTorch code on GitHub!

Sunday (12/8): Industrial Talks & Panels

Interpretability — Now What? by Google

Talk by Been Kim, who is a senior research scientist at Google Brain and an area chair at NeurIPS 2019. Her work mainly focuses on developing human-centric interpretability tools for deep learning model predictions and on investigating the fragility of saliency-based explanation methods for neural networks.

Sanity Checks for Saliency Maps. NIPS 2018 (spotlight)

When weights are randomized, the model makes garbage predictions. And when the prediction changes, does the explanation change? No!
  • Problem: Some widely used saliency methods for explaining predictions do not really reflect the evidence for the prediction.
  • They propose two sanity checks for explanation methods, motivated by the observation: “when the prediction changes, the explanation should change.”
  • Approach: Change the prediction via randomization tests: 1) model parameter randomization; 2) data label randomization. A minimal sketch of the parameter-randomization check appears after this list.
  • What we learned: We should be careful of confirmation bias when developing explanation methods. Just because an explanation “makes sense” to humans doesn’t mean it reflects the evidence for the prediction.
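
To make the parameter-randomization check concrete, here is a minimal sketch (my own illustration with a toy CNN and plain input-gradient saliency, not the paper’s code) that progressively re-initializes layers and measures how much the explanation changes:

```python
import copy
import torch
import torch.nn as nn
from scipy.stats import spearmanr

def gradient_saliency(model, x, target):
    """Plain input-gradient saliency: |d(score of target class) / d(input)|."""
    x = x.clone().requires_grad_(True)
    model(x)[0, target].backward()
    return x.grad.abs().sum(dim=1).squeeze(0)  # aggregate over color channels

# A tiny CNN purely for illustration.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

x = torch.randn(1, 3, 32, 32)                  # stand-in for a real image
target = model(x).argmax(dim=1).item()
original_map = gradient_saliency(model, x, target)

# Sanity check: re-initialize layers from the top down. If the explanation
# stays highly rank-correlated with the original even though the model now
# makes garbage predictions, the explanation method fails the check.
randomized = copy.deepcopy(model)
for module in reversed(list(randomized.modules())):
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        module.reset_parameters()
        new_map = gradient_saliency(randomized, x, target)
        rho, _ = spearmanr(original_map.flatten().numpy(), new_map.flatten().numpy())
        print(f"after randomizing {type(module).__name__}: Spearman rho = {rho:.3f}")
```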

TCAV: Testing with Concept Activation Vectors. ICML 2018.

  • Problem: How important was a concept (e.g., gender, race) for a prediction in a trained model… even if the concept was not part of the training? Can we quantitatively measure how important any of these user-chosen concepts are?
  • They propose TCAV, which provides a quantitative importance score for a concept if and only if the network has actually learned it. TCAV does not require changing or retraining the network.
  • Approach: Given a set of user-chosen concept images, random images and a trained network, train a linear classifier to separate the activations of concept images from those of random images. Then test with the CAV by using directional derivatives to measure the “conceptual sensitivity” of class k to concept C. A rough sketch of this recipe appears after this list.
  • Limitations: 1) The concept has to be expressed using image examples; 2) the user needs to know which concepts they want to test for. A follow-up work, “Towards Automatic Concept-based Explanations” (NeurIPS 2019), automatically discovers concepts for images. 3) Explanations provided by TCAV are not causal. A follow-up work on causal TCAV, “On Concept-Based Explanations in Deep Neural Networks”, was submitted to ICLR 2020.
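
As a rough sketch of this recipe (my own simplification, not the official TCAV implementation; `model_head`, which maps the chosen layer’s activations to class logits, is a hypothetical helper):

```python
import numpy as np
import torch
from sklearn.linear_model import SGDClassifier

def compute_cav(concept_acts, random_acts):
    """Fit a linear classifier separating concept activations from random
    activations; its (normalized) normal vector is the Concept Activation Vector."""
    X = np.concatenate([concept_acts, random_acts])
    y = np.concatenate([np.ones(len(concept_acts)), np.zeros(len(random_acts))])
    clf = SGDClassifier(alpha=0.01, max_iter=1000).fit(X, y)
    cav = clf.coef_[0]
    return cav / np.linalg.norm(cav)

def tcav_score(model_head, class_k_acts, cav, class_k):
    """TCAV score: the fraction of class-k examples whose class-k logit
    increases when the layer activation is nudged along the CAV direction
    (i.e., the directional derivative along the CAV is positive)."""
    acts = torch.tensor(class_k_acts, dtype=torch.float32, requires_grad=True)
    model_head(acts)[:, class_k].sum().backward()
    directional_derivatives = acts.grad.numpy() @ cav
    return float((directional_derivatives > 0).mean())
```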

Private Federated Learning by Apple

On-device training on user data using federated learning with differential privacy can be used to improve global models in the cloud. Apple has started to use this technology in iOS 13 for a variety of use cases, including the QuickType keyboard, Found In Apps, and personalized “Hey Siri”. More details can be found in the paper “Protection Against Reconstruction and Its Applications in Private Federated Learning” and the talk at WWDC 2019. A toy sketch of the general recipe follows below.
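
As a toy illustration of the general recipe (clipped, noised client updates averaged by a server), here is a small numpy sketch; it is purely illustrative and not Apple’s actual protocol or parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def privatize_update(update, clip_norm=1.0, noise_std=0.1):
    """Clip the client's update to a fixed L2 norm and add Gaussian noise,
    the two basic ingredients of differentially private aggregation."""
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    return update * scale + rng.normal(0.0, noise_std, size=update.shape)

def federated_round(global_weights, client_updates, lr=1.0):
    """One round of federated averaging: devices send only privatized model
    updates; the server averages them and applies the result to the global model."""
    avg = np.mean([privatize_update(u) for u in client_updates], axis=0)
    return global_weights + lr * avg

# Toy usage: 5 clients each propose an update to a 10-dimensional "model".
weights = np.zeros(10)
client_updates = [rng.normal(size=10) for _ in range(5)]
weights = federated_round(weights, client_updates)
```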

How the Game Industry is Driving Advances in AI Research by Unity

The ML-Agents toolkit (left) and the Obstacle Tower Challenge (right).

Monday (12/9): Tutorials

Deep Learning with Bayesian Principles [Video][Slides]

Talk by Emtiyaz Khan, who leads the Approximate Bayesian Inference (ABI) Team at RIKEN AIP in Tokyo; most of his work focuses on Bayesian deep learning. He has two interesting papers, “Practical DL with Bayes” and “Approximate Inference Turns DNNs into GPs”, accepted at NeurIPS 2019. Recently, he has been working on a paper version of this tutorial, which may contain more details.

TL;DR: Many existing optimization algorithms in deep learning (e.g., SGD, RMSprop, Adam) and exact/approximate inference methods in Bayesian learning (e.g., Laplace approximation, variational inference) can be derived from Bayesian principles (the “Bayesian learning rule”). We can leverage these principles to design better deep learning algorithms for uncertainty estimation, active learning and life-long learning.

The gap between human learning and most machine/deep learning
  • Problem: Humans learn by sequential updates, a type of life-long learning. We continuously interact with the environment, receive small bits of feedback, and keep learning and improving our knowledge about the world. Even when the environment changes (non-stationary), we can still adapt and adjust. This kind of learning is very different from the kind we see right now. Deep learning, for example, is really bulk learning: we assume that everything we need to generalize in the world is present in a large amount of data (stationary), and then we suck all of the knowledge in the data into our network. Emtiyaz thinks that the best available mathematical framework for explaining human learning is Bayesian learning.
  • Important concept: Deep learning (DL) is a “local/simple” method (it tries to find one possible model and scales to large problems), whereas Bayesian learning is a “global/complex” method (it tries to find all possible models, i.e. the posterior, utilizing a supportive prior/belief, but does not scale to large problems). To improve DL algorithms, we just need to add some “global” touch to them.
  • The intuition behind deriving DL algorithms from the “Bayesian learning rule”: Suppose we have a 2D posterior distribution. In the simplest case, we approximate it with a Gaussian (red circle) where we only estimate the mean with a fixed covariance matrix (i.e., we move the red circle around the 2D posterior). If we only estimate the mean using the “Bayesian learning rule”, we get something like a first-order method (e.g., SGD). If we also estimate the covariance (i.e., a multivariate Gaussian), we get a second-order method (e.g., Newton). If we use a mixture of multivariate Gaussians, we get an ensemble of Newton methods. When the approximation becomes the exact posterior distribution, the “Bayesian learning rule” simply becomes Bayes’ rule. See more in “Fast and Simple Natural-Gradient Variational Inference with Mixture of Exponential-family Approximations” (ICML 2019).
  • Improving DL algorithms (e.g., RMSprop, Adam) by adding a “Bayesian touch”: They propose Variational Online Gauss-Newton (VOGN), which learns like RMSprop/Adam but produces uncertainty as a by-product! See more in their “Practical DL with Bayes” (NeurIPS 2019) paper; a toy sketch of this style of update appears after this list.
  • Challenges: The recent paper “Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift” (NeurIPS 2019) contradicts these principles. They found that current Bayesian DL methods are not sufficient for estimating good uncertainty under dataset shift, and that non-Bayesian ensembles still work best. This tells us that current Bayesian DL methods may not be “global” enough! Especially for non-convex problems (DL), a local approximation only captures “local uncertainty”. Computing better posterior approximations and better higher-order gradients remain open challenges.
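
To make the “RMSprop with a Bayesian touch” idea a bit more concrete, here is a toy numpy sketch of a VOGN-style update on a linear-regression loss. It is my own heavy simplification of the rule described in “Practical DL with Bayes” (diagonal Gaussian posterior, weights sampled before each gradient step, squared gradients maintaining the precision), so treat it as an illustration rather than the actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data.
N, D = 100, 5
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w

# Diagonal-Gaussian posterior approximation q(w) = N(mu, diag(1 / s)).
mu = np.zeros(D)   # plays the role of the network weights
s = np.ones(D)     # precision; plays the role of RMSprop's squared-gradient scale
lr, beta, prior_prec = 0.01, 0.9, 1.0

for step in range(5000):
    i = rng.integers(N)
    # The key Bayesian twist: evaluate the gradient at a *sampled* weight,
    # not at the mean itself.
    w = mu + rng.normal(size=D) / np.sqrt(s)
    g = (w @ X[i] - y[i]) * X[i]                    # per-example gradient
    # RMSprop-like update of the precision with squared gradients (plus prior),
    # followed by a preconditioned step on the mean.
    s = beta * s + (1 - beta) * (N * g * g + prior_prec)
    mu = mu - lr * (N * g + prior_prec * mu) / s

print("learned posterior mean:", np.round(mu, 2))
print("true weights:          ", np.round(true_w, 2))
print("posterior std (uncertainty as a by-product):", np.round(1 / np.sqrt(s), 3))
```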

Tuesday-Thursday (12/10-12/12): Main Conference

Deep Learning Theory

  • Uniform Convergence may be Unable to Explain Generalization in Deep Learning (Outstanding New Directions Paper): Existing generalization bounds based on uniform convergence (e.g., Rademacher complexity, PAC-Bayes, covering numbers, compression) may not be able to explain why over-parameterized DNNs generalize well. They found that: 1) existing bounds grow with training set size, which is empirically not what happens; 2) using a “hypersphere binary classification task”, they prove that any uniform-convergence-based generalization bound will fail to explain generalization. The high-level idea is that the decision boundary learned by SGD on over-parameterized DNNs can have certain complexities which hurt uniform convergence without hurting generalization.

A New Perspective of Understanding Deep Learning — Infinitely Wide Neural Network & Neural Tangent Kernel

Exact Computation (top) and Neural Tangents (bottom)

An infinitely wide (over-parameterized) neural network can be approximated by a linear model with a kernel called the Neural Tangent Kernel (NTK).

For a learning-theory newbie, this direction may be kind of overwhelming, especially the NTK, which only emerged at last year’s NeurIPS. Here I would like to recommend some great posts that cover the NTK preliminaries: 1) Understanding the Neural Tangent Kernel by Rajat; 2) Ultra-Wide Deep Nets and Neural Tangent Kernel (NTK) by Wei Hu and Simon Du.

  • On Exact Computation with an Infinitely Wide Neural Network: This paper shows how to exactly compute NTKs for CNNs (CNTKs), letting us simulate infinitely wide CNNs. They found that: 1) CNTK performance is correlated with that of CNNs; 2) techniques that improve CNNs (e.g., global average pooling) also improve CNTKs; 3) theoretically, the CNTK is the infinitely wide version of a CNN. However, there is still a performance gap between CNNs and CNTKs, which means the NTK may be only one of several directions for figuring out why over-parameterized DNNs generalize well.
  • Neural Tangents: Fast and Easy Infinite Neural Networks in Python: A Python library designed to enable research into infinitely wide neural networks. This software paper actually appeared in the Bayesian Deep Learning Workshop on Friday (12/13); I place it here for better content arrangement. It has recently been accepted at ICLR 2020. Check out their paper and code; a minimal usage sketch follows below.
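
For reference, this is roughly what using the library looks like, based on my reading of its README (the exact API may differ between versions):

```python
import jax.random as random
import neural_tangents as nt
from neural_tangents import stax

# An "infinitely wide" fully-connected architecture; kernel_fn computes its
# NNGP and NTK kernels in closed form.
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1),
)

k1, k2, k3 = random.split(random.PRNGKey(0), 3)
x_train = random.normal(k1, (20, 10))      # placeholder data
y_train = random.normal(k2, (20, 1))
x_test = random.normal(k3, (5, 10))

# Closed-form predictions of the infinitely wide network trained to
# convergence with gradient descent on MSE loss.
predict_fn = nt.predict.gradient_descent_mse_ensemble(
    kernel_fn, x_train, y_train, diag_reg=1e-4)
y_test_ntk = predict_fn(x_test=x_test, get='ntk')
```
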
Optimization and Regularization

Towards Explaining Reg. (top); Time Matters and Asymmetric Valleys (bottom)
  • Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks: This paper empirically studies why a large initial learning rate (LR) plus annealing is crucial for generalization. They perform an interesting experiment: train a classifier on a modified CIFAR-10 where some images have colored patches attached that identify the class (i.e., “class signatures”). They found that a small LR quickly memorizes these hard-to-fit class signatures, while a large LR first learns to use easy-to-fit patterns and only starts to memorize the hard-to-fit patterns after annealing. The intuition for why a large LR behaves this way may be the initially weak representation power caused by the noisy gradients of a large LR, which prevents overfitting to the “signatures”.
  • Time Matters in Regularizing Deep Networks: This paper shows that applying regularization (weight decay or data augmentation) early in training is more crucial than applying it later (i.e., regularization in deep networks does not work by re-shaping the loss function at convergence).
  • Asymmetric Valleys: Beyond Sharp and Flat Local Minima: This paper proposes a theory that there exist many asymmetric directions (flat on one side, sharp on the other) at a local minimum. They prove that a solution biased towards the flat side generalizes better. Then they show that averaging SGD iterates (weights) implicitly induces such biased solutions, which explains why iterate averaging generalizes better! A generic iterate-averaging sketch follows below.
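
As a concrete illustration of iterate averaging, here is a generic PyTorch sketch of the SWA-style recipe (my own illustration, not the paper’s code; newer PyTorch versions also ship torch.optim.swa_utils.AveragedModel for this):

```python
import copy
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                       # stand-in for any network
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
averaged = copy.deepcopy(model)                # running average of SGD iterates
n_averaged = 0

for step in range(1000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)   # stand-in for a data loader
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Average the weights (not the gradients) over the later part of training;
    # this is what biases the final solution towards the flat side of the valley.
    if step >= 500:
        n_averaged += 1
        with torch.no_grad():
            for p_avg, p in zip(averaged.parameters(), model.parameters()):
                p_avg.add_((p - p_avg) / n_averaged)
```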

Adversarial Image Synthesis and Robustness

Image Synthesis (left) and Adv. Mixup (right)

Note: After the conference, I found that the authors of the 1st and 3rd papers are from the Mądry Lab at MIT. They have another paper, Robustness May Be at Odds with Accuracy (ICLR 2019), which also seems interesting. In addition, they have a nice library, Robustness, for training and evaluating the adversarial robustness of DNNs.

Self Attention + Convolution is Hot!

Interpretability and Visualization

FullGrad (left) and Explanation Manipulation (right)
  • Full-Gradient Representation for Neural Network Visualization: This paper presents FullGrad, a gradient-based saliency visualization which produces much sharper results than commonly used methods. The main idea is that the output of a DNN can be exactly decomposed into two terms, an “input-gradient” term and “bias-gradient” terms. Together, these terms constitute the “full gradient” and are used for visualization. Code is available here.
  • Explanations can be manipulated and geometry is to blame: This paper shows that gradient-based saliency explanations can be manipulated arbitrarily by applying perturbations to the input that keep the DNN’s output approximately constant. They also provide a method that can undo the manipulation (i.e., increase the robustness of saliency maps). This naturally makes me wonder whether improving the robustness of DNNs can also improve the robustness of explanations. It turns out there is a paper called “On the Connection Between Adversarial Robustness and Saliency Map Interpretability” (ICML 2019) related to this thought. I may check it out soon! A rough sketch of the manipulation idea appears below.
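
Here is a rough sketch of the manipulation idea as I understand it (my own generic implementation, not the paper’s code): optimize a small perturbation so that the saliency map of the perturbed image matches an arbitrary target map while the model’s output stays close to the original. Note that for piecewise-linear (ReLU) networks the saliency has zero gradient with respect to the input almost everywhere, so in practice the attack is computed on a smoothed (e.g., softplus) copy of the network.

```python
import torch
import torch.nn.functional as F

def saliency(model, x):
    """Gradient of the top-class score w.r.t. the input; keep the graph so
    the saliency itself can be differentiated through."""
    score = model(x).max(dim=1).values.sum()
    grad, = torch.autograd.grad(score, x, create_graph=True)
    return grad

def manipulate_explanation(model, x, target_map, steps=200, lr=1e-2, lam=10.0):
    """Find a perturbation delta such that the explanation of x + delta looks
    like `target_map` while the model's output stays approximately constant."""
    x = x.detach()
    delta = torch.zeros_like(x, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    with torch.no_grad():
        original_out = model(x)
    for _ in range(steps):
        x_adv = x + delta
        expl = saliency(model, x_adv)
        # Match the target explanation while penalizing changes to the output.
        loss = F.mse_loss(expl, target_map) + lam * F.mse_loss(model(x_adv), original_out)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (x + delta).detach()
```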

Friday (12/13): Workshop on Shared Visual Representations in Human & Machine Intelligence

After a series of serious technical topics, I would like to end this post with an interesting talk by Bill Freeman (a.k.a. William T. Freeman) from MIT: “Feathers, Wings, and the Future of Computer Vision Research”.

  • Bill first discussed how computer vision papers were generated in 2013–2019: open a random page of the “Computer Vision: A Modern Approach” textbook, add “deep” before the subject of that page, or append “GAN” after it.
  • Bill then predicted that the key to producing good papers in 2020–2025 is being able to distinguish wings from feathers: take a concept from a “Vision Science” textbook and add “Architecture for …” in front of it.
  • Bill ended the talk by playing “Feather vs. Wings” game with the audience.
Some examples of “Feather vs. Wing” game from Bill’s options

Thanks! That’s all for this post! I hope you enjoyed reading it and gained some inspiration, as I did at the NeurIPS conference.


Howard Lo

AI/CV Engineer at MediaTek Inc. / NTHU CS. I also write in-depth paper notes for AI-related papers on my Github: https://github.com/howardyclo/papernotes