# NeurIPS 2018 Summary

This year was my first and hopefully not last time attending NeurIPS, and I felt like sharing a few thoughts and papers. Overall it was an exciting and also very impressive experience as everything was an order of magnitude larger than what you typically see at conferences focused on medical image analysis. I always get a huge motivational boost from conferences, and this was no exception. It also drives me to compile a list of papers that I want to check out when I get back home, and what I usually do with that list is: nothing (or close to, I don’t read nearly as much as I intended). So this time I decided to force myself to read a little more by writing a short (short) summary of papers I liked, papers I might find useful in the future and papers that might be of interest to others in my group. So here it is, my — thoroughly incomplete — NeurIPS 2018 summary! Also, if you want to be sure you’re still surfing the right wave, these topics are apparently (still) hot:

- GANs
- Adversarial & out-of-distribution robustness
- Variational inference & Bayesian stuff in general
- Reinforcement Learning
- Optimization
- Causality

If you have questions, suggestions, feedback etc., do feel free to reach out via Twitter.

## (Official) Best Papers

These papers are probably must-reads, but I only skimmed through most of them (except the Neural ODEs). The first is from 2007 and won the Test Of Time Award this year; the others shared this year’s Best Paper honors.

- “The Tradeoffs of Large Scale Learning” basically established that Stochastic Gradient Descent works well if your dataset is very large.
- “Neural Ordinary Differential Equations” shows how ODE solvers can be used to construct what look to be continuous analogues of residual nets, without the requirement that the solver be differentiable. Going from layer N to N+1 in a residual net can be interpreted as an update step of a discretized ODE, with the input to and output from the hidden layers representing the initial and final states of the dynamics, and the authors propose replacing the residual hidden layers with a black-box ODE solver. They show that the backward pass can also be implemented as a forward pass through a solver for a different “augmented” ODE, allowing the component to be integrated with conventional network training. These neural ODEs have O(1) memory cost as opposed to O(L) for L-layer ResNets and can trade off accuracy for computational efficiency. The authors also derive Continuous Normalizing Flows (CNFs) that seem to be more efficient than regular normalizing flows while being qualitatively comparable or better. A PyTorch implementation by the authors is already available.
- “Optimal Algorithms for Non-Smooth Distributed Optimization in Networks” extends previous work by the same authors to non-smooth functions and introduces an algorithm called *distributed random smoothing*. I imagine distributed optimization will become increasingly important and many real-world objectives cannot be assumed to be smooth, so there you go :)
- “Nearly tight sample complexity bounds for learning mixtures of Gaussians via sample compression schemes” provides tighter bounds for the number of samples that are necessary to estimate a mixture of Gaussian distributions up to a given error.
- “Non-delusional Q-learning and value-iteration” introduces a concept/source of error in Q-learning that the authors term *delusional bias*, as well as a corresponding remedy. In my understanding, delusional bias describes a situation where the update step selects an action that shouldn’t have been possible in combination with all previously selected actions, but nevertheless ended up being selected, because we use function approximation (read neural nets) for the Q-function and the approximation changes after each action.
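
To make the residual-net/ODE analogy from the Neural ODEs summary a bit more concrete, here is a minimal sketch (a toy example, not the authors' code). It assumes made-up dynamics `f(h) = -h`, chosen only because the exact solution is known, and a plain explicit Euler solver. The function names are hypothetical. A single residual update `h + f(h)` is exactly one Euler step of size 1, and taking many smaller steps approaches the continuous solution.

```python
import math


def f(h):
    """Toy dynamics dh/dt = -h (an assumption; a neural ODE would use a network here)."""
    return -h


def residual_block(h):
    """One residual update: h_{n+1} = h_n + f(h_n)."""
    return h + f(h)


def euler_solve(h0, t_end=1.0, steps=1):
    """Explicit Euler; with steps=1 and t_end=1 this equals one residual block."""
    dt = t_end / steps
    h = h0
    for _ in range(steps):
        h = h + dt * f(h)
    return h


h0 = 1.0
exact = h0 * math.exp(-1.0)          # closed-form solution of dh/dt = -h at t = 1
coarse = euler_solve(h0, steps=1)    # same value as residual_block(h0)
fine = euler_solve(h0, steps=1000)   # many small steps approach the ODE solution
print(coarse, fine, exact)
```

The paper itself goes further than this: it treats the solver as a black box and obtains gradients by solving a second, augmented ODE, which is not shown here.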

## Potentially Useful

- “Generalized ELBO with Constrained Optimization, GECO” (Bayesian DL workshop) tries to offer a principled approach to weighting KL and reconstruction losses in VAE training. Longer version on arxiv.
- “Bayesian Layers: A Module for Neural Network Uncertainty” introduces a TensorFlow extension for easy experimentation with Bayesian neural networks and Gaussian processes. It’s designed as “a drop-in replacement for other layers”, so should be very easy to integrate. Making everyone else’s setup look pathetic, the authors state: “As demonstration, we fit a 10-billion parameter ‘Bayesian Transformer’ on 512 TPUv2 cores”.
- “Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels” marries categorical cross entropy (not robust to noisy labels) with mean absolute error (more robust to noise, but performance can be poor compared to CCE) by introducing a generalized loss with a parameter q so that q -> 0 recovers CCE and q -> 1 recovers MAE. Experiments suggest that the correct choice of q can greatly improve convergence and performance. The authors do not offer a principled approach to selecting q, but state that q=0.8 works well on average.
- “DropMax: Adaptive Variational Softmax” is a version of SoftMax that randomly sets exp(…) terms to 0 following a learned (from the layer before the SoftMax input) Bernoulli distribution. The authors show improvement over regular SoftMax on a number of public datasets.
- “DropBlock: A regularization method for convolutional networks” argues that Dropout is often not ideal for convolutional networks, because feature map activations (read pixels) are spatially correlated. The authors propose dropping out contiguous regions instead and show slightly higher scores compared to regular Dropout for ImageNet classification and COCO detection. Probably not worth implementing it yourself, but if it’s available in your framework at some point, why not try.
- “An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution” shows that convolutions are not ideal for problems that don’t exhibit (full) translational invariance (example: given input coordinates, construct an image with only the coordinate pixel active/white/1, but also more advanced image generation), which shouldn’t be surprising. A simple fix is proposed that consists of augmenting the input to a conv layer with feature channels that encode coordinates, which allows models to learn variable degrees of translational invariance/dependence.
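
As a rough illustration of the generalized cross entropy idea above, the following sketch evaluates a loss of the form `(1 - p^q) / q`, where `p` is the probability the model assigns to the true class. This is how I understand the paper's loss; the function name and the example probability are my own assumptions. For very small `q` it approaches the usual cross entropy `-log(p)`, and for `q = 1` it becomes `1 - p`, which matches MAE on one-hot targets up to a constant factor.

```python
import math


def generalized_cross_entropy(p_true, q):
    """L_q loss for the probability assigned to the true class: (1 - p^q) / q."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - p_true ** q) / q


p = 0.3                       # made-up probability of the true class
cce = -math.log(p)            # categorical cross entropy for this sample
mae_like = 1.0 - p            # MAE up to a constant factor for one-hot targets
near_cce = generalized_cross_entropy(p, 1e-6)
at_one = generalized_cross_entropy(p, 1.0)
print(near_cce, cce, at_one, mae_like)
```

Values of `q` in between interpolate between the two behaviours, which is where the trade-off between noise robustness and ease of optimization comes from.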

## Medical & Related

There were two workshops, one on medical imaging (MED-NEURIPS) and one on health-related stuff in general (ML4H). I didn’t go to the second one, but the imaging workshop had decent contributions, although nothing groundbreaking as far as I can tell (except for the one titled “A Case for the Score”, of course :D). Submissions were only 3-page abstracts, so I didn’t bother to summarize them; they’re all linked on the workshop page. Other contributions that might be of interest to those in the medical field:

- “A Probabilistic U-Net for Segmentation of Ambiguous Images” combines a U-Net with a conditional VAE to encode semantically meaningful variations of segmentations, which is especially useful when the input images are ambiguous (e.g. tumor outline). One can sample consistent segmentation hypotheses and also estimate the likelihood of a given (image, segmentation) pair. The authors show that their approach is much better calibrated than competing methods that can also produce multiple segmentations for a single input. An additional benefit is that this works with very low-dimensional latent spaces, so the latent space is easy to explore manually/visually. A TensorFlow implementation by the first author is also available.
- “Direct Uncertainty Prediction for Medical Second Opinions” (Bayesian DL workshop) shows that directly predicting an uncertainty score that represents expert disagreement is better than training a model on the task and then deriving an uncertainty score from the predictions. Not so surprising if I’m honest…
- “MiME: Multilevel Medical Embedding of Electronic Health Records for Predictive Healthcare” tackles the problem that learning from EHRs often requires large amounts of data. The authors leverage the fact that EHRs are intrinsically hierarchical and generate embeddings for each level from low to high (treatment, diagnosis, visit, patient), i.e. diagnosis and treatment embeddings are mapped to diagnosis objects (the paper introduces symbols that are less confusing…), these are then aggregated to form a visit representation and so on. Diagnosis objects are made more informative by auxiliary tasks, predicting diagnosis and treatment. Doesn’t look super complicated, but outperforms baselines especially on small datasets.
- “DifNet: Semantic Segmentation by Diffusion Networks” proposes to decompose a segmentation network into two branches, one for “seed detection”, i.e. giving rough initial predictions (easy), and one to estimate a measure of similarity between pixels (supposedly easy). The first task is also separated into two, predicting separate score maps (probabilities for each class) and importance maps (to reweight score map…). A random walk is applied, starting from the seeds and using the similarity map as transition matrix. The last step is repeated multiple times. The authors achieve marginally higher scores than DeepLabV2 on PASCAL-VOC, but the approach seems overly complicated for a problem that’s essentially solved (i.e. merging high-level semantics with localized information). Only on this list because it’s segmentation…
- “FishNet: A Versatile Backbone for Image, Region, and Pixel Level Prediction” argues that gradient flow is paramount to maximize performance and consequently tries to construct an architecture that maximizes gradient flow by using feature concatenation as opposed to residual addition with a convolution to match the number of channels. The authors show decent performance, slightly outperforming DenseNet and ResNet in terms of performance per parameter on ImageNet classification, but also on MS-COCO segmentation and object detection (as a backbone for Mask R-CNN and FPN). At the end of the day it looks like a U-Net with an additional encoder at the end (resulting in a fish-like shape) with skip connections from the decoder to the second encoder. A PyTorch implementation is available.
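
The random-walk step in the DifNet summary can be sketched generically as follows. This is only an illustration of propagating seed scores with a row-normalized similarity matrix, not the paper's exact formulation. The four-pixel "image", the similarity values and the function names are all made up.

```python
def row_normalize(similarity):
    """Turn a non-negative similarity matrix into a row-stochastic transition matrix."""
    return [[v / sum(row) for v in row] for row in similarity]


def propagate(transition, scores, steps):
    """Repeat the random-walk step: each pixel takes a weighted average of all scores."""
    for _ in range(steps):
        scores = [sum(p * s for p, s in zip(row, scores)) for row in transition]
    return scores


# Four pixels in a row; pixels 0-1 look alike and pixels 2-3 look alike (made-up values).
similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.9],
    [0.0, 0.0, 0.9, 1.0],
]
seeds = [1.0, 0.0, 0.0, 0.0]  # rough initial foreground score: only pixel 0 is a seed

transition = row_normalize(similarity)
diffused = propagate(transition, seeds, steps=3)
print([round(s, 3) for s in diffused])
```

After a few steps the seed's score has spread mostly to the pixel that resembles it, which is the intuition behind letting one branch find seeds and the other decide how far they spread.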

## General Understanding and Overviews

- “Do Deep Generative Models Know What They Don’t Know?” (Bayesian DL workshop) shows that generative models can put (too) high likelihood on out-of-distribution samples and entire distributions, e.g. training on CIFAR-10 results in overall higher likelihood on SVHN than on the CIFAR-10 test set.
- “Recent Advances in Autoencoder-Based Representation Learning” (Bayesian DL workshop) looks like a nice review of generative models.
- “How Does Batch Normalization Help Optimization?” shows that the reason for BatchNorm’s performance is not that it reduces internal covariate shift, but that it smooths the loss surface, which should produce more stable gradients. However, the reason for the latter remains to be discovered. Recommended!
- “Understanding Batch Normalization” shows that BatchNorm permits using larger learning rates, which goes hand in hand with the findings above.
- “Are GANs Created Equal? A Large-Scale Study” serves as a nice overview of existing GANs, but also suggests that differences in performance reported in the individual papers are not due to improved architectures but rather hyperparameter tuning and computational budget. Not quite new but a good read.

## Generative Models & Representation Learning

- “Attentive Neural Processes” (Bayesian DL workshop) extends neural processes with attention, improving predictions close to seen data points at the price of computing cross attention of encodings and query keys. Also submitted to ICLR 2019.
- “Glow: Generative Flow with Invertible 1x1 Convolutions” introduces a new type of flow based on blocks consisting of 1) a channel normalization, 2) invertible 1x1 convolutions and 3) affine coupling. The authors compare their approach to RealNVP and demonstrate slightly (or “significantly”, but there is no mention of any test) better performance (log-likelihood) on CIFAR-10, ImageNet and LSUN. Qualitative examples on CelebA-HQ look convincing.
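
To show why the affine coupling step in a flow like Glow is cheap to invert, here is a minimal sketch of a generic coupling layer in the RealNVP style. It does not reproduce Glow's exact parametrisation, its normalization or its 1x1 convolutions. The `scale_and_shift` function is a hypothetical stand-in for the learned network, and the input is assumed to have even length.

```python
import math


def scale_and_shift(x_a):
    """Stand-in for the learned network that maps one half to (log_scale, shift)."""
    log_s = [0.5 * math.tanh(v) for v in x_a]
    t = [v * v for v in x_a]
    return log_s, t


def coupling_forward(x):
    """y_a = x_a, y_b = x_b * exp(log_s) + t; returns y and log|det J|."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    log_s, t = scale_and_shift(x_a)
    y_b = [b * math.exp(s) + shift for b, s, shift in zip(x_b, log_s, t)]
    return x_a + y_b, sum(log_s)


def coupling_inverse(y):
    """Exact inverse: the first half is unchanged, so log_s and t can be recomputed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    log_s, t = scale_and_shift(y_a)
    x_b = [(b - shift) * math.exp(-s) for b, s, shift in zip(y_b, log_s, t)]
    return y_a + x_b


x = [0.2, -1.0, 0.7, 1.5]
y, log_det = coupling_forward(x)
x_rec = coupling_inverse(y)
print(y, log_det, x_rec)
```

Because the Jacobian is triangular, the log-determinant is just the sum of the log-scales, which is what keeps exact log-likelihood training tractable.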

There was also a ton of stuff on how to better structure your latent space that I haven’t yet read, so summaries may or may not appear later:

- “Learning Disentangled Joint Continuous and Discrete Representations”
- “Learning Latent Subspaces in Variational Autoencoders”
- “Gaussian Process Prior Variational Autoencoders”
- “Learning to Decompose and Disentangle Representations for Video Prediction”
- “Information Constraints on Auto-Encoding Variational Bayes”
- “Variational Memory Encoder-Decoder”

## Cool Stuff

There was a lot going on that is at best remotely related to my work, and it so happened that there was a large overlap between what was least relevant to me from a practical perspective and what I found most exciting/stimulating/interesting.

- The “Machine Learning for Creativity and Design” workshop. I wish I had attended more of their sessions. What I did see was a talk by Allison Parrish who uses word embeddings to understand and create poetry. This is her blog: https://www.decontextualize.com/
- “Learning models for visual 3D localization with implicit mapping” (Bayesian DL workshop) combines GQN with attention to work in much more advanced environments, specifically Minecraft. Looks similar to Attentive Neural Processes. Also submitted to ICLR 2019.
- “DeepProbLog: Neural Probabilistic Logic Programming” combines ProbLog (a probabilistic programming language I’ve never heard of) with deep learning. The authors show some neat examples of arithmetic with MNIST images.
- “Learning to Infer Graphics Programs from Hand-Drawn Images” converts hand-drawn sketches into a description using primitive shapes (circles, rectangles, lines) and then also synthesizes a graphics program that recreates the drawing (and makes use of symmetries and repeating patterns!). Awesome!
- “Learning to Reconstruct Shapes from Unseen Classes” predicts 3D shapes from images. Images are converted to depth maps that are then projected to a 3D model, but also projected onto a sphere, where learned inpainting is performed. This is also projected to a 3D model and a final network joins the two for a final 3D shape. Reconstructions of classes not in the training set look a lot better than those from baseline methods.
- “Recurrent World Models Facilitate Policy Evolution”, otherwise known as just “World Models”, separates learning a representation of an environment from learning a policy to act in the environment. There are three components, 1) a simple VAE that encodes observations (images), 2) a recurrent model that tries to predict future encodings of the VAE from current action, current encoding and a hidden state (I think this is what they call the world model), 3) a simple controller that selects actions based on the current encoding and hidden state. According to the paper this is the first work to solve the CarRacing-v0 environment and the whole thing is really fast to train (few hours on a single GPU according to paper, haven’t tried)!
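
The way the three World Models components interact during a rollout can be sketched roughly like this. Everything here is a toy stand-in under my own assumptions: the "VAE" is a mean, the "recurrent model" is a running average, the "controller" is a threshold, and the environment is hypothetical. Only the data flow between the parts is meant to mirror the description above.

```python
def encode(observation):
    """V: stand-in for the VAE encoder, compressing an observation to a latent z."""
    return sum(observation) / len(observation)


def memory_step(z, action, hidden):
    """M: stand-in for the recurrent model, updating the hidden state."""
    return 0.9 * hidden + 0.1 * (z + action)


def controller(z, hidden):
    """C: stand-in for the small controller, acting on latent and hidden state only."""
    return 1.0 if z + hidden < 0.5 else -1.0


def toy_env_step(observation, action):
    """Hypothetical environment: the action nudges every pixel up or down."""
    return [pixel + 0.1 * action for pixel in observation]


def rollout(observation, steps):
    """Run the encode -> act -> update-memory loop for a fixed number of steps."""
    hidden, actions = 0.0, []
    for _ in range(steps):
        z = encode(observation)                   # 1) compress the current frame
        action = controller(z, hidden)            # 3) act from (z, hidden), never from pixels
        hidden = memory_step(z, action, hidden)   # 2) update the memory of the dynamics
        observation = toy_env_step(observation, action)
        actions.append(action)
    return actions, hidden


actions, hidden = rollout([0.0, 0.0, 0.0, 0.0], steps=5)
print(actions, hidden)
```

The point of the split is that the controller only ever sees the compact pair of latent and hidden state, so it can stay tiny while the heavy lifting happens in the representation and memory models.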