NIPS’17 highlights and trends overview

The emergence of Deep Learning 2.0 and the return of Bayes

A rather long registration line on the opening day of NIPS’17

Just a few years ago, the NIPS conference was a small gathering of a couple hundred people. Fast forward to 2017 and it’s hosting a whopping 8,000 attendees. This dramatic increase in popularity has resulted in a large and diverse program covering a wide range of topics in AI and machine learning. This year, unsurprisingly, the latest developments in deep learning were the dominant theme of the conference.

For those AI researchers, practitioners, or enthusiasts who did not have the opportunity to attend NIPS 2017 and are not too eager to study its ridiculously massive proceedings, this article is intended to be an insightful summary of my observations. Admittedly, it is not comprehensive, as I did not have the opportunity to attend the entire conference program.

1. Deep learning 2.0 is emerging

The next generation of deep learning (DL) methodologies is going beyond learning from just vectorized data, forming symbiotic relationships with Bayesian learning (AI’s other powerhouse of methodologies), and even attempting to tackle reasoning-level AI problems.

DL with non-vector data

  • Deep sets: Many real-world applications of DL (e.g. classification of LiDAR point cloud data) involve data that naturally have a set structure rather than that of a fixed-dimensional vector. As a reminder, in contrast to a vector, the elements of a set are not ordered. The Deep Sets architecture and its underlying theory, proposed by Zaheer et al., enable learning directly with such data using permutation-invariant cost functions. Their extensive experimental results demonstrate the applicability of Deep Sets across supervised, unsupervised, and anomaly detection tasks involving set data.
  • Geometric DL: A lot of recent successes of DL have been with 2D image and audio data that are Euclidean in structure. However, a wide range of applications, e.g. social networks, involve data best represented using non-Euclidean structures such as graphs or manifolds. The existing methodologies do not easily generalize to such data as even the definition of some of the basic operations they employ, e.g. convolution, is elusive. The emergent field of geometric DL aims at extending DL methodologies to tackle this challenge. A half-day tutorial at NIPS’17 was dedicated to this research area. If you’d like an overview of the topic, see this podcast from the excellent This Week in Machine Learning and AI.
  • Hierarchical embeddings: Linear embedding methods have been massively successful in learning latent vector representations of symbolic objects, e.g. Word2Vec for words and Node2Vec for graphs. Nonetheless, their ability to model complex patterns is fundamentally bounded by the dimensionality of their embedding space, which can become prohibitively large. Some classes of symbolic data can be efficiently encoded according to a latent hierarchy. Maximilian Nickel and Douwe Kiela from FAIR proposed an approach to compute hierarchical embeddings in hyperbolic (instead of Euclidean) space. More specifically, they adopted the Poincaré ball model, which is a particular hyperbolic space that lends itself to gradient-based optimization. When applied to data with a hierarchical nature, such as taxonomy and network data, the Poincaré embeddings were shown to outperform Euclidean embeddings, especially in low-dimensional regimes. With Facebook being the largest social network in the world with a ton of data amenable to this hierarchical embedding approach, it will be intriguing to observe the future of this research work at FAIR.
  • CapsNets: Although not strictly aimed at processing non-vector input data, this section would not be complete without including a discussion of Capsule Networks (CapsNets). CapsNets extend the (low-level) flow of information through a neural network from scalars to vectors by replacing neurons with so-called capsules. These vectors are meant to explicitly model the pose information of desired entities in input data. The main motivation of CapsNets is to solve a key problem with the internal data representation of convolutional neural networks (CNNs), which does not account for important spatial hierarchies between simple and complex objects. The algorithm to train these networks was presented at NIPS’17 with some promising results.
Architecture of a simple three layer Capsule Network
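
To make the permutation-invariance idea behind Deep Sets more concrete, here is a minimal sketch of the "transform each element, sum, then read out" pattern. The functions `phi` and `rho` are hypothetical fixed stand-ins for what would be learned networks, so treat this as an illustration of the structure rather than the authors' implementation.

```python
import math

def phi(x):
    # Hypothetical per-element feature map (a learned network in practice).
    return [x, x * x]

def rho(pooled):
    # Hypothetical readout applied to the pooled representation.
    return math.tanh(0.5 * pooled[0] - 0.1 * pooled[1])

def deep_set(elements):
    # Sum-pool the per-element features, then apply the readout.
    pooled = [0.0, 0.0]
    for x in elements:
        pooled = [p + v for p, v in zip(pooled, phi(x))]
    return rho(pooled)

a = deep_set([1.0, 2.0, 3.0])
b = deep_set([3.0, 1.0, 2.0])  # same set, different order
```

Because the only interaction between elements is a sum, `a` and `b` agree regardless of the order in which the elements are presented.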

DL’s symbiotic relationship with Bayesian learning

As mentioned by Yee Whye Teh during his keynote speech at NIPS’17, Bayesian learning and deep learning are complementary ideas. The former views learning as performing inference in a probabilistic setting, enabling an explicit representation of prior knowledge and a unified treatment of uncertainties. On the other hand, the latter considers learning to be an optimization of objective functions parametrized as a neural network, providing a very flexible learning paradigm. Accordingly, by marrying these two schools of thought, one can develop novel AI methodologies with the flexibility of DL and yet capable of systematic incorporation of priors and handling of uncertainties.

Yee Whye Teh’s keynote speech on “Bayesian Deep Learning and Deep Bayesian Learning”

Bayesian DL typically involves defining a joint distribution over parameters of a deep network along with a certain prior (usually a Gaussian) and performing inference (using Markov Chain Monte Carlo or Variational Inference) to estimate the posterior of this joint distribution given a training dataset. On the other hand, deep Bayesian learning approaches leverage deep networks to improve upon the flexibility (e.g. variational auto-encoders) and/or scalability (e.g. distributed Bayesian learning) of purely Bayesian methods. Following the recent success and popularity of Bayesian DL, NIPS’17 included another full-day workshop dedicated to this topic, as with the year before. At Element AI, we had two articles presented in this workshop discussing Bayesian hypernetworks for approximate Bayesian inference in neural networks, as well as using Bayesian techniques to learn the prior distribution over neural network parameters. Aside from the official conference events, there was a meeting organized to discuss Uber’s newly released deep probabilistic programming language (Pyro), yet another indication of an ever-growing interest in this area.
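
In symbols, the Bayesian DL recipe described above can be sketched as follows. The notation is my own assumption for illustration: θ denotes the network parameters, D the training dataset, (x*, y*) a new input and its prediction, and S the number of posterior samples drawn by an approximate inference method.

```latex
p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta)\, p(\theta)}{p(\mathcal{D})},
\qquad
p(y^\ast \mid x^\ast, \mathcal{D})
  = \int p(y^\ast \mid x^\ast, \theta)\, p(\theta \mid \mathcal{D})\, d\theta
  \approx \frac{1}{S} \sum_{s=1}^{S} p(y^\ast \mid x^\ast, \theta_s),
\quad \theta_s \sim p(\theta \mid \mathcal{D}).
```

The integral over θ is what yields a treatment of uncertainty, and it is also what makes exact inference intractable for deep networks, hence the reliance on sampling or variational approximations.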

DL going beyond perception

The majority of DL applications so far can be considered different flavours of perception: taking some raw input data and making sense of it. Nonetheless, true intelligence also involves sophisticated reasoning processes. A couple of NIPS’17 articles presented DL solutions to enable some form of reasoning.

  • Relation networks (RN): In an article that made a splash even before NIPS’17, Santoro et al. proposed a simple and pluggable module called relation networks (RN) that can be used to augment CNNs, LSTMs, and MLPs with the capability to reason about the relations between entities and their properties. Their experimental results for three tasks involving relational reasoning demonstrate the ability of RNs to achieve state-of-the-art and in some cases even super-human levels of performance.
  • MODERN network: Recent studies have shown a tight coupling of language and vision processing in the brain. In particular, language cues that people hear before they see an image can activate parts of the brain involved in visual predictions and speed up their image recognition process. Inspired by these neuroscientific discoveries, de Vries et al. propose a “MODulatEd” version of the popular ResNet architecture (aka MODERN). For a visual QA system, their idea is to use a given question’s text to inform visual feature extraction through a Conditional Batch Normalization (CBN) mechanism. Their experiments show MODERN significantly improving strong baselines on two visual question answering tasks, hence confirming that using linguistic cues to modulate (even the early stages of) visual processing is very useful.
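
The relation network module mentioned above boils down to applying a function g to every pair of objects, summing the results, and passing the sum through a function f. Below is a minimal sketch of that composition; `g` and `f` are hypothetical hand-written stand-ins for the learned networks, and the toy `objects` are made up for illustration.

```python
from itertools import product

def g(o_i, o_j):
    # Hypothetical pairwise relation function (a learned MLP in practice).
    return [abs(a - b) for a, b in zip(o_i, o_j)]

def f(summed):
    # Hypothetical readout over the aggregated relations.
    return sum(summed)

def relation_network(objects):
    # Sum g over all ordered pairs of objects, then apply f.
    total = [0.0] * len(objects[0])
    for o_i, o_j in product(objects, repeat=2):
        total = [t + r for t, r in zip(total, g(o_i, o_j))]
    return f(total)

objects = [[0.0, 1.0], [2.0, 1.0], [2.0, 4.0]]
score = relation_network(objects)
```

Since the pairwise terms are summed, the output does not depend on the order in which the objects are listed, which is part of what makes the module easy to plug into other architectures.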

2. GANs made (more) practical

Generative Adversarial Networks (GANs) are becoming increasingly popular for learning generative DL models, especially for novel image synthesis applications. However, their applicability has been somewhat hindered by issues such as training instability and the so-called “mode collapse” problem, i.e. lack of diversity in generated samples. Following in the footsteps of numerous recent works that either aim at tackling these issues or developing more efficient GAN alternatives, NIPS’17 hosted a number of proposals such as Bayesian GAN, D2GAN, and DGAN.

Bayesian GANs: Inspired by the recent trend of marrying Bayesian learning and DL discussed above, the authors proposed conditional posteriors for both the generator and the discriminator weights, which are then marginalized using the well-known stochastic gradient Hamiltonian Monte Carlo method. As demonstrated for several semi-supervised learning experiments, Bayesian GANs are capable of alleviating the mode collapse problem, although it is not completely absent as the sampling process can still get stuck in sharply peaked modes. Nonetheless, the results are still quite promising as Bayesian GAN is shown to outperform powerful generative methods such as WGAN and even an ensemble of DCGANs!

Dual Discriminator Generative Adversarial Nets (D2GAN) and Dualing GANs (DGANs): Both of these articles explore deploying the notion of duality to make GANs more robust. The former (D2GAN) aims at alleviating the mode collapse problem, whereas the latter (DGAN) seeks to improve the stability of GAN training. The D2GAN’s idea is to have two (dual) discriminators and one generator. One discriminator yields high scores for samples from the data distribution whereas the other rewards data from the generator, and the generator’s goal is to fool them both! One particular feature of D2GAN is its scalability to large real-world datasets such as ImageNet. On the other hand, the main idea of the DGAN is to replace the minimization problem used to train the discriminator with its dual maximization problem, which is much easier to solve. For linear discriminators, the experimental results show training to be stable and guaranteed to converge monotonically. For non-linear discriminators, however, there are no such guarantees, although training oscillations (if present) are less severe than those seen with standard GANs.

3. Machine learning (ML) security as an emerging research field

As ML-enabled applications become more and more ubiquitous, their security and robustness become more important. NIPS’17 included a full-day workshop on “machine deception,” which was partially dedicated to adversarial attacks, their underlying theory and possible defence mechanisms. The following were some of the most interesting presentations.

A famous example of adversarial attacks: a classifier fooled to mis-categorize a panda as a gibbon
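
The panda example is commonly associated with the fast gradient sign method (FGSM), which is not one of the NIPS’17 contributions discussed here. As background on what a "small global perturbation" looks like, here is a minimal sketch of FGSM against a toy logistic-regression classifier; the weights, input, and step size are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    # Probability of class 1 under a logistic-regression model.
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    # For logistic regression, the gradient of the cross-entropy loss
    # with respect to the input is (p - y) * w.
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    # Move every feature by eps in the direction that increases the loss.
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w, b = [2.0, -3.0, 1.0], 0.0   # toy weights (hypothetical)
x, y = [0.2, -0.1, 0.1], 1     # input that is correctly assigned to class 1
x_adv = fgsm(w, b, x, y, eps=0.15)
```

Every feature moves by at most `eps`, yet the predicted probability of the true class drops from above 0.5 for `x` to below 0.5 for `x_adv`.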

Adversarial attacks

Two novel approaches were proposed, both of which deviated in their attack form from the commonly used methods:

  • Natural adversarial attacks: This approach aimed at generating “natural” adversarial examples using malicious, yet semantically meaningful, perturbations. The resultant natural adversarial examples were shown to be applicable to text-domain models and, perhaps more importantly, to be human legible as shown in the table below. In particular, in contrast to the existing textual adversarial generation methods that use domain-specific rules or heuristics and typically require manual intervention, the proposed approach is fully automatic.
Textual “Adversaries” that find dropped verbs in English-To-German translation. The original sentence (s) and its adversary (s’) are shown in the left column along with their German translations in the right column
  • Adversarial patches: The authors proposed a method for producing adversarial patches using local and typically large (thus human perceptible) perturbations (see their demo below). Accordingly, existing defences, which focus mainly on resisting small global perturbations, fail to protect against this attack. The main result of this work was to show that given a large enough patch of local perturbations, any ML model can be fooled into misclassifying a sample as a desired class (targeted attack) regardless of the other items in the scene (universal attack), as well as the location and/or orientation of the patch (robust attack). Theoretically, this is possible because an adversarial patch can always move a sample into the error region of a model unless that model is perfect and has no error region.
Adversarial patches “toaster demo”

I should also mention an interesting study on adversarial attacks on DL interpretation methods. The authors show that common techniques such as saliency maps and DeepLIFT, as well as more recent methods such as influence functions, are all highly vulnerable to adversarial attacks, i.e. two perceptually indistinguishable input samples with identical predictions can be assigned very different interpretations. They further provide some theoretical insights as to the underlying reasons for these vulnerabilities and recommend taking caution when using these interpretations especially for making critical judgements, e.g. in social or biomedical application settings. Unfortunately, their study does not include some of the popular interpretation methods such as LIME (and its recent generalization).

Adversarial defences

The methodologies in this category provided novel formulations of adversarial attacks along with a few new proposals to defend against them:

  • Minimax-optimal defence: The authors formulated the interaction between adversarial attacker and defender as a zero-sum leader-follower game, where the attacker tries to maximize the risk of misclassification by perturbing input samples under certain constraints, while the defender/classifier tries to adjust its parameters to minimize the same risk given the perturbed input. Using a novel optimization-based solution to this game, they described how to develop best worst-case defence methods. Intriguingly, they also used their formulation to demonstrate how adversarial and information attacks are actually two sides of the same coin and can thus be tackled through similar approaches.
  • Adversarially Augmented Adversarial Training (A3T): The key idea behind the A3T is to achieve robustness to adversarial perturbations by enforcing the latent space learnt by a deep network to be invariant to added noise. This is accomplished by attaching a discriminator network to a hidden layer of a given deep network and jointly training the two. The goal of this discriminator is to successfully identify real samples. Their preliminary experiments show the A3T outperforming standard adversarial training in terms of protecting against attacks.
  • Thermometer Encoding: It has been hypothesized that the high linearity of deep networks is (at least partially) responsible for their vulnerability to adversarial attacks (although there are other underlying issues, such as deep networks’ inability to actually learn high-level abstract concepts as shown by experimental evidence in a recent study of CNNs). The key idea proposed by the authors is to place the defence mechanism only at the input layer. It uses an extremely nonlinear transformation, i.e. thermometer encoding, to reduce networks’ linearity. The thermometer encoding is similar to the well-known one-hot encoding but it also preserves the pairwise ordering of input data. The experiments show thermometer encoding successfully breaking networks’ linearity and improving upon the vanilla adversarial training.
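
To illustrate the difference between one-hot and thermometer encoding mentioned above, here is a minimal sketch for a single input value in [0, 1]. The number of levels, the quantization rule, and the direction in which the ones are filled are assumptions made for illustration and may differ from the authors' exact convention.

```python
def quantize(value, levels):
    # Map a value in [0, 1] to a bucket index in 0..levels-1.
    return min(int(value * levels), levels - 1)

def one_hot(index, levels):
    # Only the position of the bucket itself is set.
    return [1 if k == index else 0 for k in range(levels)]

def thermometer(index, levels):
    # Every position up to and including the bucket index is set.
    return [1 if k <= index else 0 for k in range(levels)]

levels = 4
pixel = 0.6
idx = quantize(pixel, levels)
print(one_hot(idx, levels))      # [0, 0, 1, 0]
print(thermometer(idx, levels))  # [1, 1, 1, 0]
```

With the thermometer code, two values that fall into nearby buckets differ in only a few positions, so the ordering of the original values is still visible in the encoding, whereas any two distinct one-hot codes are equally far apart.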

4. Other noteworthy presentations

Meta-learning as the projected future of RL

Reinforcement learning going meta: In spite of the impressive recent achievements of Reinforcement Learning (RL) such as defeating master players at sophisticated games of Go and Poker, progress in RL has been fundamentally hindered by its reliance on human ingenuity to invent new learning algorithms. This might all be about to change thanks to interest in learning techniques that allow for machines to discover and/or invent their very own approach to learn (aka meta-learning). Pieter Abbeel, one of the pioneers of deep RL, gave a fascinating keynote speech on DL for robotics at NIPS’17, discussing meta RL among other things.

Empirical study of batch size impact on training of neural networks: The authors presented an interesting investigation of large batch size training regimes and the related so-called “generalization gap” problem. They show that there is no such gap present inherently, i.e. it’s possible to attain good generalization with a large batch size through adjustments to the learning rate and the batch normalization process. Most interestingly, their experimental results demonstrate the advantage of continuing to train in an apparent “overfitting regime,” contrary to the common wisdom.

The advantage of continuing to train in an apparent “overfitting regime”

Fader networks: Borrowing ideas from both adversarial and auto-encoder networks, the authors proposed a novel hybrid encoder-decoder network architecture. The encoder maps images to latent representations whereas the discriminator attempts to predict the desired attribute(s) given latent representations. On the other hand, the decoder uses latent representations and the given value(s) of desired attribute(s) to reconstruct the images. This clever setup enables learning of a latent space where salient information of images and values of attributes are disentangled. In addition, it allows Fader nets to be applied to non-image domains, such as speech and text, where the data generation process could be non-differentiable.

Hybrid architecture of the FaderNet

Dual path networks (DPN): The authors first show the complementary nature of ResNet and DenseNet: the former enables feature re-use, while the latter enables the exploration of new features. In other words, this is the common exploitation vs. exploration tradeoff. Next, they propose the novel DPN architecture to leverage the benefits of both aforementioned networks. Their experimental results for the task of image classification using large-scale datasets demonstrate DPN consistently outperforming DenseNet, ResNet and even the latest ResNeXt.

(Self) Attention is all you need: Attention mechanisms have been one of the most exciting recent developments in the DL community. They are typically applied along with a recurrent neural network to enable a model to focus on specific parts of input relevant to the output it generates at each timestep. Well-known examples include tasks such as language translation and image captioning. Interestingly, the Transformer network architecture demonstrates the feasibility of using solely attention mechanisms to perform language translation tasks. In fact, the Transformer is shown to outperform both recurrent and convolutional models while being less computationally expensive to train.
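
For readers who have not seen the core operation, here is a minimal sketch of scaled dot-product self-attention, i.e. softmax(QKᵀ/√d_k)V, written with plain Python lists. It deliberately leaves out the learned projections, multiple heads, and positional encodings of the full Transformer, and the toy `tokens` are made up for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        # Each output is a weighted average of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Self-attention: queries, keys and values all come from the same sequence.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(tokens, tokens, tokens)
```

Each position attends to every other position in a single step, with no recurrence involved.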

Deep feed-forward networks using the SELU: Most success stories of DL models have had architectures based on recurrent and/or convolutional networks, while feed-forward networks (FNNs) were either missing completely or remained shallow. Using the “scaled exponential linear units” (SELUs) shown below as activation functions, the authors prove it is possible to achieve the self-normalizing property required to robustly train deep FNNs. In other words, instead of normalizing the output of the activation function (e.g. using batch normalization), the SELU activation function itself outputs normalized values. They refer to their novel architecture as self-normalizing neural networks (SNNs). Being deep, the SNNs are hence claimed to be capable of learning high-level abstract representations in input data. Extensive experiments using 121 machine learning tasks from the UCI repository show SNNs to significantly outperform all other variants of FNNs. It is worth mentioning that the λ and α below are two fixed constants (approximately 1.0507 and 1.6733) that the authors derive analytically from the desired zero-mean, unit-variance fixed point; they are not hyperparameters.

Scaled exponential linear unit (SELU) activation function
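
As a quick sanity check of the self-normalizing claim, here is a minimal sketch of the SELU function, using the commonly quoted values of its two constants, applied to standard-normal inputs. The sample size and random seed are arbitrary choices for illustration.

```python
import math
import random

LAMBDA = 1.0507009873554805  # scale constant
ALPHA = 1.6732632423543772   # constant for the negative branch

def selu(x):
    # lambda * x for x > 0, lambda * alpha * (exp(x) - 1) otherwise.
    return LAMBDA * x if x > 0 else LAMBDA * ALPHA * (math.exp(x) - 1.0)

random.seed(0)
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

For zero-mean, unit-variance inputs, the empirical `mean` and `var` of the activations should come out close to 0 and 1 respectively, which is the fixed point the constants are chosen for.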

Machine learning “alchemy”: Last but certainly not least, Ali Rahimi ignited fierce debate with his presentation for the test-of-time award. He decried our lack of theoretical understanding of modern (mainly DL-based) machine learning methodologies, labelled them as “Alchemy,” and triggered a very interesting conversation that is definitely worth checking out.

Ali Rahimi giving a speech during the test-of-time award presentation

And that was it! I hope this article has provided you with a glimpse of NIPS’17 highlights and emerging trends. As mentioned earlier, considering the wealth of topics covered at NIPS’17 this article is by no means comprehensive. Please feel free to use the comments section below and let me know if something significant is missing. I’ll do my best to update the article accordingly.

Finally, since you’re reading this article I assume you probably missed NIPS’17 and didn’t visit Element AI’s booth to learn more about us. If you are an aspiring data scientist, AI/DL expert and/or hacker and find the notion of developing an AI platform to democratize AI appealing then you have to check us out … we are hiring! :)