Joint work with Fangda Du, Bert Travacca, Armin Askari, Alicia Tsai, UC Berkeley.
August 28, 2019
In the world of machine learning, neural networks and associated deep learning models are quickly becoming dominant, with very significant amounts of work being published every day, often demonstrating very good empirical results. At this point, it is fair to say that our theoretical understanding of such models is very limited, notably when it comes to issues such as robustness, architecture learning, and why such “over-parameterized” models work. This blog is part of a wider research corpus focusing on the theoretical structure of neural networks, which aims at laying foundations for a “science of deep learning”.
Prediction rules in deep learning are based on a forward, recursive computation through several layers. Implicit rules go well beyond this, by relying on the solution of an implicit (or “fixed-point”) equation that has to be solved numerically in order to make the prediction. Specifically, for a given input vector u, the predicted vector y is of the form
y = Cx + Du, where x = ϕ(Ax+Bu).
Here, A, B, C, D are matrices containing the model weights; ϕ is a given (nonlinear) activation function, such as the ReLU; finally, the so-called “state” x is an n-vector that stores all the hidden features of the model. The state is not expressed explicitly; rather, it is implicitly defined via the “fixed-point” (or equilibrium) equation x = ϕ(Ax+Bu).
At first glance, the above type of model seems very specific. Perhaps surprisingly, it includes as special cases most known neural network architectures, including standard feedforward networks, CNNs, RNNs, and many more. We can specify such architectures with a proper definition of the activation function ϕ, and by imposing adequate linear structure on the model matrices. For example, constraining matrix A to be strictly upper block-diagonal corresponds to the class of feedforward networks. Further specifying structure in each block, such as equal elements along diagonals, allows one to encode convolutional layers.
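As a small illustration of this point (with made-up sizes and weights, not tied to any particular network in this post), here is how a two-layer ReLU feedforward network can be rewritten in the implicit form x = ϕ(Ax+Bu), with the blocks of A sitting strictly above the diagonal:

```python
# Sketch: a 2-layer ReLU feedforward net written as an implicit model
# x = relu(A x + B u). All sizes and weights below are illustrative.
# The state is ordered as (layer-2 features, layer-1 features), which
# places the only nonzero block of A strictly above the diagonal.
import numpy as np

rng = np.random.default_rng(0)
n1, n2, p = 4, 3, 5                  # layer widths and input dimension
W1 = rng.standard_normal((n1, p))    # input -> layer 1
W2 = rng.standard_normal((n2, n1))   # layer 1 -> layer 2

# Implicit-model matrices for the state x = (h2, h1).
A = np.block([[np.zeros((n2, n2)), W2],
              [np.zeros((n1, n2)), np.zeros((n1, n1))]])
B = np.vstack([np.zeros((n2, p)), W1])

relu = lambda z: np.maximum(z, 0.0)
u = rng.standard_normal(p)

# Because the nonzero block of A is strictly above the diagonal, the
# fixed-point iteration reaches the exact solution in two steps
# (one per layer): this is precisely the usual forward pass.
x = np.zeros(n1 + n2)
for _ in range(2):
    x = relu(A @ x + B @ u)

# Compare with the standard forward computation.
h1 = relu(W1 @ u)
h2 = relu(W2 @ h1)
assert np.allclose(x, np.concatenate([h2, h1]))
```

The final assertion checks that the implicit model and the explicit forward pass agree, which is the sense in which feedforward networks are a special case.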
The picture on the left illustrates the structure of the model matrix A for a five-layer image classification network known as LeNet-5. The matrix has a strictly upper block-diagonal structure, with the size of each block corresponding to the dimensions of each of the 5 layers. Convolutional layers appear as “diagonal stripes”, reflecting the fact that those parts of the matrix are constant along diagonals.
Implicit models have a lot more expressive power than standard networks, as measured by the number of parameters for a given dimension of the hidden features. The promise of implicit rules is: what if we could let the architecture of the network emerge from data, as opposed to being dictated by a human? Indeed, the matrix above shows a wide area of zeros, waiting to be filled up…
Recent work on implicit rules has demonstrated their potential. Kolter and collaborators [1,5] showcased the success of their implicit framework, termed “Deep Equilibrium Models”, for the task of sequence modeling. Chen et al. [2] used implicit methods to construct a related class of models, known as neural ordinary differential equations. Zhang and co-authors [6] have demonstrated the usefulness of a kind of implicit rule in the context of RNNs, while [7] demonstrates how the approach helps solve some longstanding issues in training RNNs. Building on earlier work [4], the paper [3] provides some theoretical and algorithmic foundations for implicit learning.
One of the thorny issues in implicit rules is well-posedness and numerical tractability: how can we guarantee that there exists a unique solution x, and if so, how can we compute it efficiently? In standard networks the issue does not arise, since one can always express the hidden state variable in explicit form via recursive elimination, that is, a forward pass through the layers. As seen in [3], there are simple conditions on the matrix A that guarantee both well-posedness and tractability, for example that the largest row sum of the absolute values of the elements of A is strictly less than 1, in which case the recursion
x(t+1) = ϕ(Ax(t)+Bu), t=0,1,2,…
converges quickly to the unique solution. The largest row-sum constraint tends to encourage sparsity of A, which in turn brings about many benefits: speedup at test time, architecture simplification, and reduced memory requirements.
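A minimal sketch of this fixed-point recursion, with randomly generated matrices scaled so that the largest absolute row sum of A stays below 1 (all sizes and values here are illustrative):

```python
# Sketch of the fixed-point iteration x(t+1) = phi(A x(t) + B u).
# Scaling A so its largest absolute row sum is below 1 makes the map
# a contraction for the (1-Lipschitz) ReLU, guaranteeing convergence
# to the unique solution. Sizes and data are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 4
A = rng.standard_normal((n, n))
A *= 0.9 / np.abs(A).sum(axis=1).max()   # largest abs row sum -> 0.9 < 1
B = rng.standard_normal((n, p))
u = rng.standard_normal(p)
relu = lambda z: np.maximum(z, 0.0)

x = np.zeros(n)
for t in range(500):
    x_new = relu(A @ x + B @ u)
    if np.max(np.abs(x_new - x)) < 1e-10:
        break                             # successive iterates agree
    x = x_new

# Residual of the equilibrium equation at the computed state.
residual = np.max(np.abs(x - relu(A @ x + B @ u)))
```

With a contraction factor of 0.9, the iterates converge geometrically, so the residual drops below the tolerance after a few hundred cheap matrix-vector products.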
The training problem for implicit learning can be addressed via the standard unconstrained optimization methods popular in the deep learning community, such as stochastic gradient descent (SGD). However, computing gradients through a fixed-point equation is challenging. In addition, SGD does not in itself guarantee well-posedness of the prediction rule; properly handling the corresponding constraint requires constrained optimization, for example block-coordinate descent (BCD) methods, which tend to converge very fast [3].
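One standard way to obtain gradients through a fixed-point equation is the implicit function theorem: differentiating x = ϕ(Ax+Bu) at the solution yields a linear system for the sensitivities. The sketch below assumes a ReLU activation and a simple squared-norm loss; all matrices are illustrative, and this is not meant to reproduce the exact algorithms of the papers cited above.

```python
# Sketch: gradient of a loss through the fixed-point equation via the
# implicit function theorem. With x = relu(A x + B u) and D the diagonal
# matrix of ReLU derivatives at the solution, differentiating gives
#   dx/du = (I - D A)^{-1} D B,
# so for a loss l(x):  grad_u l = B^T D (I - D A)^{-T} grad_x l.
# Matrices are illustrative; A is scaled to ensure well-posedness.
import numpy as np

rng = np.random.default_rng(2)
n, p = 5, 3
A = rng.standard_normal((n, n))
A *= 0.5 / np.abs(A).sum(axis=1).max()   # contraction -> unique solution
B = rng.standard_normal((n, p))
u = rng.standard_normal(p)
relu = lambda z: np.maximum(z, 0.0)

def solve_state(u, iters=200):
    """Solve x = relu(A x + B u) by fixed-point iteration."""
    x = np.zeros(n)
    for _ in range(iters):
        x = relu(A @ x + B @ u)
    return x

x = solve_state(u)
D = np.diag((A @ x + B @ u > 0).astype(float))  # ReLU derivative at solution
grad_x = x.copy()                               # loss = 0.5 ||x||^2
grad_u = B.T @ D @ np.linalg.solve((np.eye(n) - D @ A).T, grad_x)

# Finite-difference check of the implicit gradient.
eps = 1e-6
fd = np.zeros(p)
for j in range(p):
    du = np.zeros(p); du[j] = eps
    fd[j] = (0.5 * np.sum(solve_state(u + du)**2)
             - 0.5 * np.sum(solve_state(u - du)**2)) / (2 * eps)
```

The point of the sketch is that a single linear solve at the equilibrium replaces backpropagation through the (arbitrarily many) iterations used to find it.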
A nice aspect of BCD methods is their ability to handle interesting constraints or penalties. In the implicit rule above, a constraint on the input matrix of the form

∑_j ‖B_j‖ ≤ κ,

where B_j denotes the j-th column of B and κ is a small positive hyper-parameter, will encourage B to be “column-sparse”, that is, entire columns of B are zero; in turn, the resulting model will select the important inputs and discard the others, effectively accomplishing feature selection via deep learning.
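As an illustration of how such a sum-of-column-norms penalty can be handled algorithmically, here is a group soft-thresholding (proximal) step on the columns of B; the threshold and the matrix are made up for the example, and this is only one of several ways to enforce column sparsity:

```python
# Sketch: proximal (group soft-thresholding) step associated with a
# sum-of-column-norms penalty on B. Columns whose Euclidean norm falls
# below the threshold t are set exactly to zero; the rest are shrunk.
# The matrix and threshold are illustrative.
import numpy as np

def prox_column_norms(B, t):
    """Shrink each column's Euclidean norm by t; zero out small columns."""
    norms = np.linalg.norm(B, axis=0)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return B * scale

rng = np.random.default_rng(3)
B = rng.standard_normal((4, 6))
B[:, [1, 4]] *= 0.05              # two columns with small norms
B_sparse = prox_column_norms(B, t=0.5)
```

Applied repeatedly within a BCD loop, this step zeroes out entire columns of B, which is exactly the feature-selection effect described above.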
We generated a synthetic data set of 400 points, using a given implicit model with 20 hidden features, 50 inputs and 100 outputs, and with a column-sparse matrix B. Even with a training model based on an incorrect guess of 10 hidden features, the BCD algorithm recovers the correct sparsity pattern of the generative model: the vector of column norms of the learned matrix (left column) matches that of the generative model (right column).
Implicit models are new, and more work is needed to assess their true potential. They can be thought of as “neural nets on steroids”, in that they allow for a much larger number of parameters for a given dimension of the hidden features. There are many other potential benefits of implicit models, including some exciting developments towards robustness, interpretability, and architecture learning. More on this later!
[1] Bai, S., Kolter, J. Z., and Koltun, V. (2019). Deep equilibrium models. Preprint.
[2] Chen, T. Q., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. In NeurIPS 2018, pages 6571–6583.
[3] El Ghaoui, L., Gu, F., Travacca, B., and Askari, A. (2019). Implicit deep learning. Preprint arXiv:1908.06315.
[4] Gu, F., Askari, A., and El Ghaoui, L. (2018). Fenchel lifted networks: A Lagrange relaxation of neural network training. Preprint arXiv:1811.08039.
[5] Kolter, J. Z. (2019). Deep equilibrium models: one (implicit) layer is all you need. Presentation at the Simons Institute, August 2019.
[6] Zhang, Z., Kag, A., Sullivan, A., and Saligrama, V. (2019). Equilibrated recurrent neural network: Neuronal time-delayed self-feedback improves accuracy and stability. Preprint arXiv:1903.00755.
[7] Kag, A., Zhang, Z., and Saligrama, V. (2019). RNNs evolving in equilibrium: A solution to the vanishing and exploding gradients. Preprint arXiv:1908.08574.