DeepLearn 2017 review

Hossein Ghodrati Noushahr
14 min read · Aug 7, 2017

--

Three weeks ago, I attended DeepLearn 2017: The International Summit and Summer School on Deep Learning in Bilbao, Spain, organized by the University of Deusto. My interest in this event was sparked by the mix of speakers from academia and industry. To name a few: Microsoft Research, Dartmouth College, Facebook AI Research, University of Toronto, IBM, Columbia University, CMU, Salesforce, NVIDIA, NYU. I hoped to gain some insights into production use cases of deep learning while also having the chance to go deeper into some more theoretical topics. In this post, I summarize my very subjective highlights of the summit.

Bilbao, Spain

Unsupervised Learning

A recurring topic across all sessions was unsupervised learning. Much of the success of deep learning is based on large data sets (e.g. ImageNet) where a target label is present for each observation. This labelling, however, requires manual effort and is hence expensive. But there are also other things to consider. Li Deng, Chief AI Officer at Citadel and former Chief Scientist of AI at Microsoft, pointed out that expert disagreement is another big challenge of supervised learning, next to cost. The manual labelling process often results in noisy labels, as the human annotators, ranging from domain experts to Mechanical Turk workers, often disagree on the label for a given observation in a data set.

A lot of expectation and hope rests on unsupervised learning. Yann LeCun previously made a cake analogy that was picked up by Li Deng in his keynote. Pure reinforcement learning represents the cherry on top of the cake. Everybody would like to have it, but it is not very substantial, carrying only a few bits of information for some samples: with reinforcement learning, a scalar reward is predicted only every few action sequences. Supervised learning can be seen as the icing of the cake. Each sample can carry anything between 10 and 10,000 bits of information when predicting human-labelled data. In stark contrast, unsupervised learning is able to extract millions of bits of information from each sample by predicting any part of the input. One example is the famous word2vec model, which predicts a center word given its context words, leveraging large unlabelled text corpora. This makes LeCun compare unsupervised learning to the cake itself.

Source: Yann LeCun, NIPS 2016 keynote

Some of the traditional approaches to unsupervised learning are hierarchical clustering, k-means clustering, or maximum entropy. More recently, generative models are being applied more and more for unsupervised learning. These can be distinguished into three broad categories: energy-based models, latent variable models, and adversarial learning models. Generative Adversarial Networks (GAN) are a very hot topic; it is quite hard to keep pace with the sheer number of publications introducing (somehow) new variations of GANs. In a nutshell, a GAN consists of two networks that constantly try to trick/beat each other. In the vision domain, a generator network creates an image. Multiple generated images are then presented together with some real images to a discriminator network. The discriminator network has to tell which of these images are actually real images and which are ‘fake’ images generated by the other network. This ‘game’ is then repeated for a very long time. By the end, the generator network is able to produce images like the one shown below:

Source: Ian Goodfellow, NIPS 2016 Tutorial: Generative Adversarial Networks

However, this comes with some caveats. Training a GAN is very tricky and computationally expensive. But if the problem domain (e.g. vision) is suitable and labelled data is scarce, GANs should definitely be considered as an alternative to supervised learning.
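To make the ‘game’ a bit more concrete, here is a minimal sketch of a GAN training loop in PyTorch. It is my own toy example, not code from the summit: the architectures are deliberately tiny, and `data_loader` is assumed to yield batches of real images.

```python
import torch
import torch.nn as nn

# Toy generator and discriminator for flattened 28x28 images (illustrative sizes).
G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())

bce = nn.BCELoss()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)

for real in data_loader:                       # data_loader is assumed to exist
    real = real.view(real.size(0), -1)
    ones = torch.ones(real.size(0), 1)
    zeros = torch.zeros(real.size(0), 1)
    fake = G(torch.randn(real.size(0), 100))   # generator creates 'fake' images

    # Discriminator step: real images should be scored as real, fakes as fake.
    d_loss = bce(D(real), ones) + bce(D(fake.detach()), zeros)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: try to fool the discriminator into scoring fakes as real.
    g_loss = bce(D(fake), ones)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
```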

Behavior Learning

George Cybenko from Dartmouth College held a course about recent advances in machine learning techniques for behavior modeling and learning. There is a wide range of possible applications for behavior learning in areas such as commerce, finance, healthcare, security, etc. One could be interested in classifying the intent of a specific behavior (good vs. bad). Another use case is anomaly detection: can we identify abnormal network traffic from a specific hardware client? The holy grail for many people is prediction: can we predict future actions based on past observed behavior? Not thinking at all about stock markets ;-)

Cybenko provided us with a framework, which I will briefly describe in the following. The taxonomy of behavioral models is:

  • 0-th order: the atomic set of observed actions
  • 1st order: atomic actions, the associated frequencies and probabilities, and the context
  • 2nd order: 1st order conditioned on 1st order
  • Adversarial: rational, non-stationary, adaptive (games)

We illustrate this with a simple example. Frank is a worker and we are interested in his work behavior. Frank can either work, W (we observe him at work), or stay at home, N (not at work). We could now do several things: predict whether Frank will be at work tomorrow, classify whether Frank acts strangely, etc. Frank's past work history is a sequence of discrete actions:

WWWWWNWWWNNNWWWWWWWNNNW

Frank's 0-th order model consists of the atomic actions: work (W) or not at work (N). The 1st order model is summary statistics: Frank was at work 83% of the past 100 days. One could now do statistical hypothesis testing and ask questions like: is Frank spending significantly more time not at work in the past 10 days compared to the previous 100 days? The 2nd order model becomes even more interesting: is it more probable that Frank does not work for 2 subsequent days than for an isolated day?

Source: George Cybenko, DeepLearn 2017
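This taxonomy can be made concrete with a few lines of Python. The snippet below is just my own illustration of the 0-th, 1st and 2nd order view of Frank's short toy history; the probabilities it computes will not match the 83% quoted above, which refers to a 100-day history.

```python
from collections import Counter

history = "WWWWWNWWWNNNWWWWWWWNNNW"

# 0-th order: the atomic set of observed actions.
actions = set(history)                                    # {'W', 'N'}

# 1st order: frequencies / probabilities of each atomic action.
p_work = history.count("W") / len(history)

# 2nd order: probability of an action conditioned on the previous action.
pairs = Counter(zip(history, history[1:]))
from_n = sum(count for (prev, _), count in pairs.items() if prev == "N")
p_n_after_n = pairs[("N", "N")] / from_n                   # P(not at work | not at work)

print(actions, round(p_work, 2), round(p_n_after_n, 2))
```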

A 2nd order model with more states can capture even more detail. If Frank has not been at work for one day (N1), he will not be at work the next day either with 90% probability. Once he is in the state N+ (more than 1 day not at work), the probability that he won't be at work the next day goes down to 30%.

Source: George Cybenko, DeepLearn 2017

A more common way to model Frank's behavior is with Hidden Markov Models, which have non-observable hidden states. Let's assume that Frank has 3 hidden states: healthy (H), sick (S), and personal (P). Frank would take a personal day to play some golf, but we don't actually observe that. We only have access to the two observable states work (W) and not at work (N). A 2nd order model of Frank with these three hidden states could look as follows:

Source: George Cybenko, DeepLearn 2017

These 2nd order models can be modeled and trained explicitly with Hidden Markov Models and the Expectation Maximization algorithm. But it wouldn't be a deep learning summit if a neural network hadn't been applied to this problem. Recurrent neural networks (RNN) can do the same, but without explicitly specifying the possible hidden states. I won't explain the nuts and bolts of an RNN here, but refer interested readers to Wikipedia. Coming back to Frank's 2nd order model, it would be implemented via an RNN as follows:

Source: George Cybenko, DeepLearn 2017

Each action (at work / not at work) from the sequence of actions is fed into the network as an input. The network has a hidden layer that captures interactions with past actions. At each discrete time step, a prediction or classification is made via the output layer.
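As a rough sketch of what such a network could look like (my own minimal PyTorch example, not Cybenko's implementation), the model below reads Frank's one-hot encoded actions and predicts the next action at every time step:

```python
import torch
import torch.nn as nn

class BehaviorRNN(nn.Module):
    def __init__(self, n_actions=2, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(n_actions, hidden, batch_first=True)  # hidden layer over time
        self.out = nn.Linear(hidden, n_actions)                 # predicts W vs. N

    def forward(self, x):              # x: (batch, time, n_actions), one-hot actions
        h, _ = self.rnn(x)             # hidden state at every time step
        return self.out(h)             # logits for the next action at every step

# Inputs are steps 0..T-2, targets are the actions at steps 1..T-1.
history = "WWWWWNWWWNNNWWWWWWWNNNW"
idx = torch.tensor([0 if a == "W" else 1 for a in history])
x = torch.eye(2)[idx[:-1]].unsqueeze(0)                          # (1, T-1, 2)
logits = BehaviorRNN()(x)
loss = nn.CrossEntropyLoss()(logits.squeeze(0), idx[1:])
```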

RNNs are difficult to train. Two variations of the plain ‘vanilla’ RNN helped to make them more suitable for sequence modelling tasks: Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. These two variations implement gating mechanisms that help the network access signals from many time steps back. Again, I won't go too much into detail here, but refer to Wikipedia and an excellent blog post by Christopher Olah explaining LSTMs.

To summarize, modeling and learning behaviors with RNNs is a promising approach and should be investigated further. Related to that, a recent MIT Technology Review article sheds some light on recent advances in the area of Computational Psychiatry, where machine learning is applied to better understand traits and patterns of people with mental illness.

Natural Language Processing

Natural language processing (NLP) was covered in various keynotes and courses throughout the summit. Richard Socher (formerly Stanford, now with Salesforce) presented a model that solves multiple tasks in parallel. The interesting part here is that the tasks are on different levels: both on the token and the sentence level. The tasks on the individual token level are part-of-speech tagging and chunking. On the sentence level, the tasks consist of natural language inference (understanding entailment vs. contradiction) and measuring semantic relatedness. The following figure illustrates the architecture of the model:

Source: Hashimoto et al. (2017), A Joint Many-Task Model: Growing a Neural Network for Multiple NLP Tasks.
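The actual joint many-task model is considerably more elaborate (stacked layers, shortcut connections, successive regularization), but the core idea of one shared encoder feeding both a token-level head and a sentence-level head can be sketched roughly as below; all names and dimensions are made up for illustration:

```python
import torch
import torch.nn as nn

class ManyTaskSketch(nn.Module):
    def __init__(self, vocab=10000, emb=100, hidden=100, n_pos=45):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)   # shared across tasks
        self.pos_head = nn.Linear(hidden, n_pos)                # token-level task
        self.rel_head = nn.Linear(2 * hidden, 1)                # sentence-level task

    def encode(self, tokens):                                   # tokens: (batch, time)
        states, _ = self.encoder(self.embed(tokens))
        return states                                           # (batch, time, hidden)

    def pos_tags(self, tokens):
        return self.pos_head(self.encode(tokens))               # a label per token

    def relatedness(self, tokens_a, tokens_b):
        a = self.encode(tokens_a).mean(dim=1)                   # crude sentence vectors
        b = self.encode(tokens_b).mean(dim=1)
        return self.rel_head(torch.cat([a, b], dim=-1))         # a score per sentence pair
```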

Marc’Aurelio Ranzato (Facebook AI Research) held a foundational course about deep learning applied to vision, speech and text processing. His course was more hands-on, with basic examples in PyTorch. Unfortunately, the examples were in my opinion too ‘foundational’ and I would have wished for more advanced implementations. However, I found the third session of his course really valuable. It was about modeling sequences, and he presented some examples and tied them to different scenarios, which are best shown on a 2x2 matrix:

Source: Marc’Aurelio Ranzato, DeepLearn 2017

For tasks like text classification or language modelling, the input is sequential, but the output is fixed. A single prediction needs to be made, e.g. does a given news story fall into the category ‘Economy’ or ‘Markets’? Image captioning, in contrast, takes a dense representation of an image and outputs a sequence of tokens describing the image. It becomes a bit more complicated when both the input and the output are sequential. In these situations, models usually have an encoder and a decoder. The encoder processes the input sequence and returns a dense representation of the input. This is then passed to the decoder, which generates an output sequence from this dense representation. Sequence-to-sequence models were notoriously hard to train, but more recently, attention mechanisms have helped to improve their performance. With attention, the decoder not only takes the dense input representation into account, but also the input sequence itself. At each output step, the decoder pays attention to particular input tokens. This can be implemented in multiple ways. One way of doing it is to place a softmax on top of the input sequence that returns an attention distribution. The decoder now ‘knows’ which words to focus on. A very good introduction to the attention mechanism is available on distill.pub.
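A bare-bones version of that softmax-based attention step, stripped of everything else, could look as follows (tensor shapes are purely illustrative):

```python
import torch
import torch.nn.functional as F

encoder_states = torch.randn(7, 64)       # one 64-dim state per input token (toy values)
decoder_state = torch.randn(64)           # current decoder hidden state

scores = encoder_states @ decoder_state   # one relevance score per input token
attention = F.softmax(scores, dim=0)      # attention distribution over the input
context = attention @ encoder_states      # weighted summary the decoder attends to
```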

Framework Wars

Some years ago, the most dominant deep learning frameworks were Theano and Torch. The landscape has changed quite a lot since then. Every major tech company has developed its own framework or contributes significantly to an open-source one. To name a few: TensorFlow (Google), CNTK (Microsoft), PyTorch (Facebook), MXNet (Amazon), NNabla (Sony). I initially started with Theano, but since January this year I have switched completely over to PyTorch. I also tried TensorFlow and CNTK, but designing, implementing, training and debugging deep neural networks with PyTorch is pure joy. Frameworks like TensorFlow, Theano or CNTK have a static computation graph. PyTorch, in contrast, is imperative and models are created at runtime. You can do fancy stuff, like creating recursive neural networks (yes, recursive, not recurrent).
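Here is a small, contrived example of what ‘models are created at runtime’ means in practice: because the graph is built as the Python code executes, ordinary control flow can depend on the data itself.

```python
import torch
import torch.nn as nn

class DynamicDepthNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(10, 10)

    def forward(self, x):
        # Apply the layer a data-dependent number of times (capped at 10 steps);
        # with a static graph, this kind of loop needs special constructs.
        steps = 0
        while x.norm() > 1.0 and steps < 10:
            x = torch.tanh(self.layer(x))
            steps += 1
        return x

out = DynamicDepthNet()(torch.randn(10) * 5)
```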

It was very interesting to hear which frameworks other people and organizations are moving to. Richard Socher and his research lab at Salesforce also switched to PyTorch. Most attendees of the summit I spoke to work with TensorFlow + Keras. Hardly anyone trains models on a local machine; most of the people I spoke to use cloud providers that offer GPU virtual machine instances. In my opinion, it is not clear which framework will be the dominant one a few years down the road. Abstraction APIs like Keras make it somewhat irrelevant which tensor framework you are using ‘under the hood’. Keras already supports Theano and TensorFlow, CNTK support was just added, and MXNet is also working on Keras bindings. With that being said, if you don't do any super fancy stuff and just want to prototype and iterate quickly, a combination of Keras + (TensorFlow | CNTK | Theano | …) should work just fine. For more exotic models, I recommend working with PyTorch for the reasons I mentioned above.

Bias Induction

While bias is rather undesirable in traditional statistical methods, it is seen as something useful in machine learning. A lot of the success of deep learning is based on the sheer amount of labelled training data and the excessive computational power of modern graphics processing units (GPUs). More often than not, however, one lacks sufficient labelled training data. Michael C. Mozer from the University of Colorado, Boulder held a very interesting course about incorporating domain bias into neural networks to overcome this issue. There are four broad means of imposing a domain bias.

1. Data augmentation

In the computer vision domain, we know that there are specific invariances of image representations. Objects in images are invariant to spatial transformations, and a symmetry exists between an image and its mirror. This can be exploited to augment the training data with additional samples. We could select smaller random patches from the original image and label them with the same training label.

Applying horizontal reflections to the training data doubles its size in one simple step.

Source: ImageNet

Shifting the RGB channel values by a constant increases the training data size by simulating different lighting conditions.

Source: ImageNet
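In torchvision, such augmentations take only a few lines of code. The snippet below is a rough equivalent of what is described above, not the exact ImageNet recipe; I use brightness jitter as a simple stand-in for the RGB channel shift.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomCrop(224),              # random patches from the original image
    transforms.RandomHorizontalFlip(),       # horizontal reflection with probability 0.5
    transforms.ColorJitter(brightness=0.4),  # crude simulation of lighting conditions
    transforms.ToTensor(),
])

# augmented = augment(pil_image)             # pil_image: a PIL image from the training set
```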

2. Loss function

A crucial element of a neural network is the loss function. It quantifies how much it ‘hurts’ to make an error. It must also be differentiable so that the gradient can be propagated back through the network to adjust the weights. Common loss functions for image processing tasks are the pixel-wise mean squared error (MSE) or mean absolute error (MAE). Let us take the task of image reconstruction: given a blurry input, we need to reconstruct the original image at a higher resolution. The pixel-wise MSE, however, is too simple to capture the complex underlying structure of images, nor does it capture any characteristics of human visual perception. Improving loss functions for a given domain is an active research area, and Mozer and his co-authors recently presented a new loss function that is better suited to image processing tasks than MSE or MAE: the multiscale structural-similarity score (MS-SSIM). This loss iteratively downsamples an image and quantifies at each scale the intensity, contrast and structural mismatch between two images. The final score is then aggregated over all scales. It seems to work. The following image shows four images and their reconstructions with a variational autoencoder with 128 hidden units. The only difference is in the loss function.

Source: Snell et al. (2016), Learning to Generate Images With Perceptual Similarity Metrics
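The exact MS-SSIM formulation uses local windows and per-scale weights (see Snell et al.), but a heavily simplified sketch of the idea, comparing two image batches at several scales and turning the similarity into a loss, could look like this:

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # Global (whole-batch) luminance and contrast/structure terms; the real
    # SSIM computes these over local windows.
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    luminance = (2 * mx * my + c1) / (mx ** 2 + my ** 2 + c1)
    contrast_structure = (2 * cov + c2) / (x.var() + y.var() + c2)
    return luminance * contrast_structure

def ms_ssim_loss(x, y, scales=3):
    # Average the similarity over several downsampled versions of the images
    # (x, y: batches of shape (N, C, H, W)), then turn it into a loss.
    total = 0.0
    for _ in range(scales):
        total = total + ssim(x, y)
        x, y = F.avg_pool2d(x, 2), F.avg_pool2d(y, 2)
    return 1 - total / scales
```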

3. Representation

Deep neural networks are attributed the capability to automatically learn multiple levels of representation from raw input signals. Hence, the argument goes, no manual feature engineering is required. It might however be beneficial to leverage domain-specific representations. Again, the popular word2vec method was mentioned. In this case, the bias from the task of language modelling can be transferred to other NLP tasks by using the resulting word2vec embeddings.
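In practice, transferring this bias often amounts to initialising an embedding layer with pretrained vectors instead of random values. A minimal sketch in PyTorch, where `pretrained_vectors` is assumed to be a float tensor loaded from a word2vec (or similar) file:

```python
import torch
import torch.nn as nn

# pretrained_vectors: assumed (vocab_size, 300) tensor of word2vec embeddings
embedding = nn.Embedding.from_pretrained(pretrained_vectors, freeze=False)

token_ids = torch.tensor([[12, 7, 431]])   # hypothetical token indices
vectors = embedding(token_ids)             # (1, 3, 300), fed into the task-specific model
```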

4. Architectural constraints

One example of architectural constraints as a form of domain bias is ‘what’ vs. ‘where’ cortical processing. A neural network was used to locate and classify a given shape on a 5x5 receptive field. The baseline model is a neural network with output nodes for the localization and the classification of the shape, where all hidden layer nodes are connected to all output layer nodes. Drawing an analogy to the processing pathways in the human brain, the authors of the paper split the hidden layer into two parts and connected one part only to the localization output nodes and the other part only to the classification output nodes.

Source: Rueckl et al. (1989), Why are “what” and “where” processed by separate cortical visual systems? A computational investigation.
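The constraint itself is easy to express in code: split the hidden layer and wire each half to only one of the two output heads. The sketch below is my own toy version with illustrative sizes, not the original 1989 setup.

```python
import torch
import torch.nn as nn

class WhatWhereNet(nn.Module):
    def __init__(self, n_inputs=25, hidden=18, n_locations=9, n_shapes=9):
        super().__init__()
        self.hidden = nn.Linear(n_inputs, hidden)
        self.where_head = nn.Linear(hidden // 2, n_locations)        # fed by one half
        self.what_head = nn.Linear(hidden - hidden // 2, n_shapes)   # fed by the other

    def forward(self, x):                     # x: flattened 5x5 receptive field
        h = torch.sigmoid(self.hidden(x))
        h_where, h_what = h.split([self.where_head.in_features,
                                   self.what_head.in_features], dim=-1)
        return self.where_head(h_where), self.what_head(h_what)
```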

The authors reported improvements in some cases, but to be honest I found the results quite counterintuitive. Empirical findings around multitask learning suggest that the joint training of multiple related tasks leads to improved generalization. I refer interested readers to a paper by Rich Caruana: Multitask Learning.

In my opinion, siamese networks are a better example of architectural constraints. A siamese network contains, as the name suggests, two identical sub-networks. These sub-networks share the same weights, and each parameter update is effectively applied to both. Siamese networks are often used to measure similarities between two sentences. A very basic siamese network could consist of two RNN sub-networks that each return the last hidden activations for a sentence. This output can be seen as a continuous vector representation of the sentence. The similarity of the two sentences is then measured with the cosine of their vector representations.
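A minimal sketch of such a siamese setup in PyTorch: the same recurrent encoder (a GRU here, as one possible choice of recurrent unit) processes both sentences, so weight sharing holds by construction, and the similarity is the cosine between the two final hidden states. Dimensions are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SiameseEncoder(nn.Module):
    def __init__(self, vocab=10000, emb=100, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.rnn = nn.GRU(emb, hidden, batch_first=True)

    def encode(self, tokens):                  # tokens: (batch, time) word indices
        _, h = self.rnn(self.embed(tokens))
        return h.squeeze(0)                    # last hidden state as sentence vector

    def forward(self, sent_a, sent_b):
        # The same weights process both sentences; similarity is the cosine.
        return F.cosine_similarity(self.encode(sent_a), self.encode(sent_b))
```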

Hawkes Process Memory Unit

Mozer then went on to present some of his ongoing research. His broader research question is whether the cognitive architecture of human memory can act as an inspiration for the design of neural networks with regard to memory. In experiments with students and their ability to retain knowledge over a period of time, it was observed that the temporal distribution of study has a direct influence on the amount of forgetting. Spaced and regular study yields more robust and durable learning than massed study. These peculiarities of temporal distributions apply not only to studying and retaining knowledge, but also to a variety of human activities, such as purchase patterns, restaurant reservations, online gaming, social media engagement, or email communication. Intensities of specific activities (in the initial example: retaining knowledge) can be modelled with Hawkes processes. The intensity of a current activity depends on the past event history, decays over time, and is self-excitatory. Practical applications of Hawkes processes include the prediction of crimes, earthquakes or financial transactions.
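For readers unfamiliar with Hawkes processes, here is a tiny sketch of the intensity of a univariate Hawkes process with an exponential kernel: a baseline rate plus a self-exciting contribution from every past event that decays over time. The parameter values are purely illustrative.

```python
import math

def hawkes_intensity(t, past_events, mu=0.1, alpha=0.8, beta=1.0):
    # Baseline rate mu plus a decaying bump of size alpha for each past event.
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in past_events if ti < t)

# Each study session (or purchase, login, email, ...) raises the intensity,
# which then decays until the next event arrives.
print(hawkes_intensity(5.0, past_events=[1.0, 2.5, 4.8]))
```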

Mozer proposed a so-called Hawkes Process Memory (HPM) unit that can be used within neural networks. Unlike LSTMs, HPMs do not have an output or forget gate, but only an input gate. They hold a history of past events, and the memory persistence depends on this history.

Source: Michael C. Mozer, DeepLearn 2017

Mozer evaluated this new unit on multiple datasets, and HPMs are on par with LSTMs. While in Mozer's own words these results are not a breakthrough, I really enjoyed following his presentation. Taking a step back and thinking about research in general, almost all of us can agree that research is inefficient and risky. We follow avenues that most of the time do not end in ‘success’. Often, it is a process of tinkering around and following ideas by trial and error. The discovery of penicillin was not planned, but rather a lucky accident. And so it was very interesting to see how Mozer picked up a rough idea, implemented it step by step, analyzed intermediate results, adapted it, and, last but not least, did not give up. And while you might say it is not a breakthrough, I think coming up with a unit for recurrent neural networks that achieves results similar to LSTMs is quite an achievement.
