Is Conditional Logic the New Deep Learning Hotness?

Carlos E. Perez
Intuition Machine
Dec 28, 2016
Credit: https://unsplash.com/search/fork?photo=8yqds_91OLw

One clear trend is the pervasive use of conditional logic in state-of-the-art Deep Learning (DL) architectures. An even bigger trend is that DL architectures are beginning to look more like conventional computers (see: DeepMind’s Differentiable Neural Computer). In fact, you can follow the progress of DL research by watching how classical computational constructs are retrofitted into this new architecture.

Anyone who has been exposed to programming (or, alternatively, flowcharting) knows that a program is composed of three kinds of things: computation, conditional logic, and iteration (or recursion). In the beginning there was just the neuron: it had a computational unit (a sum of products) and a conditional unit (an activation function). Layers were then added so that neurons could be composed in many different ways, and this begat the Deep Learning revolution (this is the overly simplified genesis story of DL).
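To make that picture concrete, here is a minimal sketch of the original building block, written in plain NumPy with illustrative weights (not any particular framework's implementation):

```python
import numpy as np

def neuron(x, w, b):
    """A single neuron: a computational unit (sum of products)
    followed by a conditional unit (the activation function)."""
    z = np.dot(w, x) + b          # computation: weighted sum of inputs plus bias
    return np.maximum(z, 0.0)     # condition: a ReLU lets z through only if z > 0

# Example with three inputs and made-up weights
x = np.array([0.5, -1.2, 2.0])
w = np.array([0.3, 0.8, -0.1])
print(neuron(x, w, b=0.1))
```

Everything that follows in this article is, in one way or another, a more elaborate version of that conditional unit.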

Folks then said, let’s add loops into the network. That begat the RNN. The RNN was quite chaotic to train, so a solution that included memory (a buffer) led to the more usable LSTM. The computational unit was further enhanced with a more powerful matching engine, the convolution, and this begat the Convolutional Network (ConvNet). The ConvNet also gained a more general activation function: the pooling layer. The workhorses of classical DL are ConvNets and LSTMs.

There is plenty of research on improving the computational unit as well as on introducing memory. However, I have found the developments in conditional logic quite fascinating and illuminating. This is surprising, because a conditional unit seems so simple that it couldn’t possibly be very interesting. In this article, I will show you why the neglected conditional logic unit is the hottest thing since the “Residual network”.

Let’s start then with the Residual network. The Residual network surprised the DL community by besting the 2015 ImageNet benchmark with a record-breaking 152 layers, eight times deeper than anything seen before. There have been several research papers analyzing the behavior of the Residual network. The latest comes from Schmidhuber et al., “Highway and Residual Networks learn Unrolled Iterative Estimation”. Their conclusions are important:

A group of successive layers iteratively refine their estimates of the same features instead of computing an entirely new representation. According to the new view, successive layers (within a stage) cooperate to compute a single level of representation. Therefore, the first layer already computes a rough estimate of that representation, which is then iteratively refined by the successive layers. Unlike layers in a conventional neural network, which each compute a new representation, these layers therefore preserve feature identity.
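Read as code, this “iterative refinement with preserved feature identity” view looks roughly like the following. This is a minimal NumPy sketch under my own simplifying assumptions (fully connected layers, random weights), not the paper’s implementation:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_block(x, W1, W2):
    """y = x + F(x): the block only computes a correction F(x),
    so the identity of the incoming features is preserved."""
    f = relu(np.dot(W2, relu(np.dot(W1, x))))   # the learned refinement F(x)
    return x + f                                # the skip connection adds it back

# Stacking blocks of this form iteratively refines one representation
d = 4
rng = np.random.default_rng(0)
x = rng.normal(size=d)
for _ in range(3):                              # three successive refinement steps
    x = residual_block(x, rng.normal(size=(d, d)) * 0.1,
                          rng.normal(size=(d, d)) * 0.1)
```

Each block nudges the same features rather than computing a brand-new representation from scratch.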

Their finding confirms other research on the subject. In addition, though, they discovered the following:

We found non-gated identity skip-connections to perform significantly worse, and offered a possible explanation: If the task requires dynamically replacing individual features, then the use of gating is beneficial.
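A hedged sketch of what such a gated skip connection looks like, in the style of a highway network (the weight names and dimensions here are illustrative, not taken from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_skip(x, Wh, Wg):
    """Highway-style gated skip: the gate g decides, per feature,
    whether to keep the old value or replace it with the new estimate."""
    h = np.tanh(np.dot(Wh, x))        # candidate new features H(x)
    g = sigmoid(np.dot(Wg, x))        # transform gate, each entry in (0, 1)
    return g * h + (1.0 - g) * x      # conditional blend of new and old features

d = 4
rng = np.random.default_rng(1)
x = rng.normal(size=d)
y = gated_skip(x, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

The gate is exactly the kind of learned conditional logic this article is about: it dynamically decides which features get replaced and which pass through untouched.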

Since 2015, Residual networks have been all the rage. Systems that use Residual connections (a.k.a. skip, shortcut, passthrough or identity-parameterization connections) have bested the state-of-the-art on all too many occasions. Residual connections have become a mandatory feature for any state-of-the-art architecture. By throwing conditional logic into the concoction, we arrive at an even more potent potion.

Microsoft has some very interesting work that is in fact related to this, called Conditional Networks (“Decision Forests, Convolutional Networks and the Models in-Between”):

We present a systematic analysis of how to fuse conditional computation with representation learning and achieve a continuum of hybrid models with different ratios of accuracy vs. efficiency. We call this new family of hybrid models conditional networks. Conditional networks can be thought of as: i) decision trees augmented with data transformation operators, or ii) CNNs, with block-diagonal sparse weight matrices, and explicit data routing functions.

Our automatically-optimized conditional architecture (green circle) is ∼5 times faster and ∼6 times smaller than NiN, with same accuracy.

Here’s the wild-looking architecture:

Credit: “Decision Forests, Convolutional Networks and the Models in-Between”

With some very impressive results:

Credit: “Decision Forests, Convolutional Networks and the Models in-Between”

The Microsoft solution uses conditional logic as network routing parameters. It is similar to a gated residual, but one that does not skip layers.
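The routing idea can be caricatured in a few lines. This is a soft, differentiable stand-in that I made up for illustration; the paper’s decision trees and block-diagonal sparse weight matrices are considerably more elaborate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def routed_layer(x, wg, W_left, W_right):
    """Soft conditional routing: a learned gate decides how much of the
    input flows through each branch, rather than skipping layers."""
    p = sigmoid(np.dot(wg, x))                     # scalar routing probability
    left = np.maximum(np.dot(W_left, x), 0.0)      # branch favored when p is near 1
    right = np.maximum(np.dot(W_right, x), 0.0)    # branch favored when p is near 0
    return p * left + (1.0 - p) * right            # hard routing would pick just one

d = 4
rng = np.random.default_rng(2)
x = rng.normal(size=d)
y = routed_layer(x, rng.normal(size=d),
                 rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

At inference time, a hard (one-branch) decision is what buys the speed and size savings reported above; the soft blend is what keeps the whole thing trainable with gradients.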

Activation functions and pooling layers have limitations. They are fixed and are not modified during training. Furthermore, they are not information preserving: they always destroy information between layers. Residuals and conditional networks, in contrast, are selective in how information traverses the layers. This kind of routing capability is not entirely new; we’ve seen it in LSTMs. What is new is that the routing happens not within a single LSTM node but across layers.

Whenever we talk about conditional logic we are faced with the problem of how to handle discrete values. DeepMind has a very interesting ICLR 2017 paper on what they call the Concrete distribution (“The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables”). The distribution provides a way to backpropagate through a discrete, rather than the typical continuous, random variable. This can lead to further refinements in how we train for conditional logic.
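A minimal sketch of drawing such a relaxed sample, using the standard Gumbel-softmax recipe (this is my own illustration of the idea, not DeepMind’s code; the temperature controls how close to one-hot the result is):

```python
import numpy as np

def concrete_sample(logits, temperature, rng):
    """Sample from a Concrete (Gumbel-softmax) relaxation of a categorical
    variable: nearly discrete, yet differentiable in the logits."""
    g = rng.gumbel(size=logits.shape)              # Gumbel(0, 1) noise
    z = (logits + g) / temperature                 # perturb, then sharpen
    e = np.exp(z - z.max())                        # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(3)
logits = np.array([1.0, 0.2, -0.5])                # unnormalized branch scores
print(concrete_sample(logits, temperature=0.5, rng=rng))   # close to one-hot
print(concrete_sample(logits, temperature=5.0, rng=rng))   # much softer choice
```

Because the sample is a smooth function of the logits, gradients can flow through a “which branch do I take?” decision during training.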

Nevertheless, when we look at the success of this new kind of architecture, we are left perplexed: it is difficult to imagine how SGD can reach convergence. Conditional logic implies that entire sections of the network become inactive in certain contexts. I suspect that certain constraints exist that guide the effective use of conditional logic. We’ll explore this in more detail later.


Further Reading

“Making Neural Programming Architectures Generalize via Recursion” — https://openreview.net/pdf?id=BkbY4psgg

Recursion divides the problem into smaller pieces and drastically reduces the domain of each neural network component, making it tractable to prove guarantees about the overall system’s behavior. Our experience suggests that in order for neural architectures to robustly learn program semantics, it is necessary to incorporate a concept like recursion.
