PyTorch, Dynamic Computational Graphs and Modular Deep Learning


Deep Learning frameworks such as Theano, Caffe, TensorFlow, Torch, MXNet and CNTK are the workhorses of Deep Learning work. These frameworks, together with GPUs (predominantly Nvidia's), are what enabled the rapid growth of Deep Learning. It was refreshing to hear Nando de Freitas acknowledge their work at the recently concluded NIPS 2016 conference. Infrastructure does not get the recognition it deserves in the academic community. Yet programmers toil on, continually tweaking and improving their frameworks.

Yesterday, a new framework was revealed by Facebook and a number of partners (Twitter, NVIDIA, SalesForce, ParisTech, CMU, Digital Reasoning, INRIA and ENS). PyTorch came out of stealth development. PyTorch is an improvement over the popular Torch framework (Torch was a favorite at DeepMind until TensorFlow came along). The obvious change is support for Python in place of the less widely used Lua language. Almost all of the more popular frameworks use Python, so it is a relief that Torch has finally joined the club.

There are many improvements in the new PyTorch framework, but the most notable change is the adoption of a Dynamic Computational Graph. Some lesser known frameworks already have this capability (e.g. Chainer and DyNet); in fact, PyTorch borrowed a lot of ideas from Chainer. This capability is also referred to as “Define by Run”, as opposed to the more conventional “Define and Run”.


Basically, DL frameworks maintain a computational graph that defines the order in which computations must be performed. For people new to DL frameworks, it does seem unnatural that one finds two “interpreters” in the framework. One interpreter is the host language (e.g. Python) and the second one is the computational graph.

So what you typically have in these frameworks is a language that sets up the computational graph and an execution mechanism that is different from the host language. This strange setup is motivated primarily by efficiency and optimization. A computational graph can be optimized and run in parallel on the target GPU. However, this cumbersome setup has made it difficult for researchers to try out more novel approaches.
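To make the two "interpreters" concrete, here is a minimal pure-Python sketch of the Define and Run style. It is only an illustration: `Node`, `placeholder`, `add`, `mul` and `run` are hypothetical names invented for this example, not the API of any real framework.

```python
# Toy "Define and Run": the graph is built first, and a separate
# executor evaluates it later. All names here are hypothetical.

class Node:
    def __init__(self, op, inputs=(), name=None):
        self.op = op          # "input", "add" or "mul"
        self.inputs = inputs  # upstream Node objects
        self.name = name      # only used by "input" nodes


def placeholder(name):
    return Node("input", name=name)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node, feed):
    """The second 'interpreter': walks the graph and computes values."""
    if node.op == "input":
        return feed[node.name]
    left, right = (run(n, feed) for n in node.inputs)
    return left + right if node.op == "add" else left * right


# Phase 1: define. Nothing is computed here, only graph nodes are created.
x = placeholder("x")
y = placeholder("y")
z = add(mul(x, y), y)  # z = x * y + y

# Phase 2: run the same fixed graph with different data.
result_a = run(z, {"x": 2, "y": 3})
result_b = run(z, {"x": 5, "y": 1})
```

Because the whole graph exists before any data arrives, a real framework can analyze and optimize it up front. The price is that Python's own `if` and `for` statements never see the values flowing through `z`.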

One analogy to make is that it’s like Fortran. Fortran, despite its age, is still used in a lot of computationally intensive problems. Fortran, however, traditionally relied on static allocation of memory. This has its pros and cons, but the main benefit is that the compiler can optimize computation. So static computational graphs are kind of like Fortran. Dynamic computational graphs are like dynamic memory, that is, memory that is allocated on the heap. This is valuable for situations where you cannot determine beforehand how much memory is required. Similarly, dynamic computational graphs are valuable for situations where you cannot determine the computation in advance. One clear example of this is recursive computation whose structure depends on variable data.
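As a rough sketch of the Define by Run idea, the toy code below records the graph while ordinary Python recursion walks over nested data, so the graph's shape follows the input. The `Value` class and the `combine` function are hypothetical names made up for this illustration. They are not PyTorch's API, and the naive `backward` is only meant for tiny examples.

```python
class Value:
    """Scalar that remembers how it was computed (hypothetical toy class)."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self.parents = parents          # Values this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other),
                     (other.data, self.data))

    def backward(self, upstream=1.0):
        # Naive chain rule: accumulate the gradient along every path.
        self.grad += upstream
        for parent, local in zip(self.parents, self.local_grads):
            parent.backward(upstream * local)


def combine(tree, w):
    """Recursively reduce a nested list; the graph mirrors the data."""
    if isinstance(tree, Value):
        return tree
    total = combine(tree[0], w)
    for child in tree[1:]:
        total = total + w * combine(child, w)
    return total


w = Value(2.0)
a, b, c = Value(1.0), Value(3.0), Value(4.0)

flat = combine([a, b, c], w)      # a + w*b + w*c
nested = combine([a, [b, c]], w)  # a + w*(b + w*c)

nested.backward()                 # gradients for the nested graph only
```

The same `combine` function produces two different graphs for the two inputs, and the recursion depth is never declared in advance. This is the situation that a static graph has to work around.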

In the space of NLP where language can come in various expression lengths, dynamic computational graphs are essential. One can just imagine how a grammar is parsed to realize the need for a stack and therefore dynamic memory and thus dynamic computation. Speaking of a stack, there are new DL architectures that make use of a stack!

Now you can always shoehorn this into a static computational graph, but it’s as inconvenient as programming a parser without using a stack. The folks at Google, makers of TensorFlow, have a paper out that shoehorns dynamic capabilities into TensorFlow (see: ). You don’t always need this kind of flexibility, but if you are in an exploratory environment, any additional convenience will help speed up the process. Another feature of PyTorch is that it works just like Python. There is no split-brain experience of a separate execution engine running the computation. Because of this, it’s much easier to debug and much easier to create unique extensions.

A Reddit user writes:

The bonuses aren’t just in NLP, but all over ML. For image recognition, for example, dynamic toolkits enable you to easily process images of different sizes. Perhaps you want to run NNs over graphs, or implement non-trivial inference algorithms as in the Neural CRF.

With this development, it would not be unreasonable to expect that Deep Learning architectures will traverse the same evolutionary path as traditional computation, that is, from monolithic stand-alone programs to more modular programs. Introducing dynamic computational graphs is like introducing the concept of a procedure when all one previously had was “goto” statements. It is exactly the concept of a procedure that lets us write our programs in a composable manner. One can of course argue that DL architectures have no need for a stack; however, one only needs to look at recent research on HyperNetworks and Stretcher networks. There are networks in research where context switching, like that of a stack, appears to be effective.

There is another related concept, called Modular Deep Learning. I predicted that in 2017:

In the old days we had monolithic DL systems with single analytic objective functions. In the new world, I expect to see systems with two or more networks cooperating or competing to arrive at an optimal solution that likely will not be in analytic form.

This is an even richer kind of modularity. What we are seeing is something akin to information encapsulation (i.e. the feature found in Object Oriented Programming). What you see in GANs are cooperating networks that are actually encapsulated away from each other. We’ve also seen this kind of research from Maluuba (recently acquired by Microsoft). The pace of development is absolutely astonishing: just into the new year (it’s only January right now), we are already building a new kind of infrastructure to support even more advanced forms of Deep Learning.


Update: Informative Reddit discussion: