Notes from Deep Learning Summit 2015 London — Day 1

Last week I had the opportunity to attend the 3rd Deep Learning Summit, this time in London after the previous editions in San Francisco and Boston.

DL Summit, organised by RE.WORK, brings together people with different backgrounds, from industry professionals to academia, in 2 fast-paced days packed with 20-minute talks and nice networking breaks.

Here are my notes from day 1. If you were attending or speaking and I got something wrong, please let me know!

Videos of the talks are being posted on YouTube. You can find the follow-up post on the second and final day here.

After a welcome from Alison Lowndes of NVIDIA, the morning started with Alex Graves talking about Neural Turing Machines (NTMs, Paper and code). Alex is one of the most important researchers in Recurrent Neural Networks (RNNs) and is part of Google DeepMind. The idea behind NTMs is to learn programs instead of patterns. One of the difficulties has been encoding program operations so that they are differentiable, which is what makes NTMs trainable by gradient descent. They have already been able to train NTMs to perform basic algorithms like copy, loop, and sort. They are now looking at solving NP-hard problems such as the Travelling Salesman problem, and initial results look promising.

Neural Turing Machine at school, learning to sort
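To give a feel for what "differentiable program operations" means, here is a minimal sketch of a soft memory read based on content similarity, in the spirit of the content-based addressing described in the NTM paper. Instead of reading one memory slot (a hard, non-differentiable choice), every slot is read a little, weighted by a softmax over cosine similarities. The function names, the toy memory and the `sharpness` value are illustrative assumptions, not the authors' code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def soft_read(memory, key, sharpness=5.0):
    """Return (weights, read_vector) for a soft read of `memory` by `key`.

    Every operation here is smooth, so gradients could flow through it.
    `sharpness` is an assumed scaling factor that makes the focus tighter.
    """
    scores = [sharpness * cosine(row, key) for row in memory]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    width = len(memory[0])
    read = [sum(w * row[j] for w, row in zip(weights, memory)) for j in range(width)]
    return weights, read

# Toy memory with three slots; the key is closest to the first slot.
memory = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
weights, read = soft_read(memory, key=[1.0, 0.1])
```

The weights always sum to one, and most of the mass lands on the slot most similar to the key, so the read behaves like a blurry pointer lookup.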

Koray Kavukcuoglu, also from Google DeepMind, followed with a talk on End to End Learning of Agents. Koray is also one of the creators of the Torch framework, heavily used both at Google DeepMind and Facebook AI Research. The topic of his talk was general AI, where the same system can operate across a wide range of tasks and learn automatically from raw inputs. He presented the famous Deep Q Network (DQN) algorithm, which learnt to play Atari 2600 games better than humans and ended up on the cover of Nature (Paper). DQNs are a combination of deep learning (for end-to-end training, from raw pixels to action values) and reinforcement learning. The latter, in a sentence, is an agent (e.g. a player) that learns by acting in an environment (e.g. a game) with the goal of maximising a reward (e.g. the score). Koray then presented Gorila (Paper), a distributed architecture for training DQNs in which many actors take actions in parallel, achieving better results in 41 out of 49 games compared to the original local architecture. More on Gorila can also be found in David Silver's talk at ICLR 2015 (Slides Video1 Video2). One of the unsolved issues of DQNs is long-term strategy (e.g. finding a key that will open a door later in the game). They are also working on transfer learning among games. DeepMind is hiring; you can send your application at

Gorila, for when you need large scale reinforcement learning
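The reinforcement learning loop described above (agent, environment, reward) can be sketched in a few lines. Note that this is plain tabular Q-learning on a made-up five-state corridor, not DQN: DQN replaces the table with a deep network fed by raw pixels. The environment, the hyperparameters and all names are assumptions chosen for illustration.

```python
import random

N_STATES = 5        # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)    # 0 = move left, 1 = move right

def step(state, action):
    """Toy environment: reward 1.0 only when the goal state is reached."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)          # explore
            else:
                best = max(q[state])                  # exploit, random tie-break
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best future value.
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][action] += alpha * (target - q[state][action])
            state = nxt
    return q

q_table = train()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After training, the greedy policy moves right in every non-terminal state, which is the shortest route to the reward. The long-term strategy issue mentioned in the talk shows up when the reward is so far away that random exploration almost never reaches it.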

The format then changed with a fireside chat between Ben Medlock, Co-Founder & CTO of Swiftkey, and Martin Bryan of The Next Web. Swiftkey is a smart keyboard that replaces built-in smartphone keyboards. Current algorithms are focused on corrections and suggestions; the goal going forward is to predict user intents rather than words. The first version powered by deep learning is almost ready. DL can help in analysing difficult languages like Chinese and Finnish, and can leverage more information on context (location, app, time, etc.) compared to traditional Natural Language Processing. DL can also be useful for longer-term analysis, with RNNs also taking into account previous sentences for better prediction.

Next up, Alison Lowndes from NVIDIA talked about Deep Learning Impact on Modern Life. She gave a general overview of deep learning, including the 3 drivers behind the Neural Networks renaissance (more data, better models and powerful GPUs). Alison gave some nice recent examples like Giraffe (Paper and code), a chess engine that, through self-play, learnt to play at International Master level in just 72 hours.

Giraffe, becoming a chess pro in 72 hours

Sander Dieleman, then a PhD student at Ghent University, now at Google DeepMind, talked about how he won the Kaggle competition on plankton classification, together with other Ghent PhDs. As a model, they used a CNN based on OxfordNet (Paper), the CNN that won the ImageNet Challenge 2014. One of the hard things about the challenge was that there were only 30,000 examples for 121 classes, so they did aggressive data augmentation to avoid overfitting (rotation, translation, rescaling, flipping, etc.). Sander wrote a very good blog post on the solution.

A nice way to augment plankton dataset
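As a rough illustration of the augmentation idea, the sketch below generates the eight rotated and flipped variants of an image stored as a list of rows. It only covers 90-degree rotations and horizontal flips; the winning solution also used translation and rescaling, which are left out here. The helper names and the tiny 2x2 "image" are assumptions for the example.

```python
def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def flip_horizontal(image):
    """Mirror a 2D grid left to right."""
    return [row[::-1] for row in image]

def augment(image):
    """Return the 8 variants obtained from 4 rotations, each with and without a flip."""
    variants = []
    current = [list(row) for row in image]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rotate90(current)
    return variants

tiny = [[1, 2],
        [3, 4]]
variants = augment(tiny)
```

Plankton has no canonical "up" direction, so every variant is a plausible new training example carrying the same label as the original.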

Jeffrey de Fauw, another PhD at Ghent University and Data Scientist, presented his solution to the Kaggle Diabetic Retinopathy competition. The goal of the competition was to identify signs of diabetic retinopathy in eye images (diabetic retinopathy being the leading cause of blindness in the working-age population of the developed world). Again, it was a small dataset (35k labelled with left+right eye), with a skewed distribution and noisy inputs. He shared some of the lessons he learned:

  • Start with a small network size to iterate more quickly
  • Don’t mess too much with filters
  • Oversample for smaller classes and augment data (brightness, etc.)

Jeffrey wrote a very good blog post on the solution too.

Challenges from real data, with unbalanced classes and camera noise
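The oversampling lesson can be sketched as follows: rare classes are resampled with replacement until every class has as many examples as the largest one. This is a generic sketch under assumed names and made-up class counts, not the code from the competition; in practice the duplicated images would also be augmented so the network does not see exact copies.

```python
import random
from collections import Counter, defaultdict

def oversample(samples, labels, seed=0):
    """Duplicate examples of rare classes until all classes are equally frequent."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for label, items in by_class.items():
        # Draw the missing examples at random, with replacement.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for sample in items + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Hypothetical skewed dataset: 8 healthy, 3 mild, 1 severe.
labels = [0] * 8 + [1] * 3 + [2] * 1
samples = [f"img_{i}" for i in range(len(labels))]
balanced_samples, balanced_labels = oversample(samples, labels)
counts = Counter(balanced_labels)
```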

Andrew Simpson, Research Fellow at University of Surrey, talked about Perpetual Learning Machines (PLMs). PLMs are a new type of deep neural network that learns on the fly. Andrew said that current DNNs have some flaws: in particular, they need to be trained before being used and are then frozen forever in a single state of knowledge. RNNs with LSTM have the same issue, since they can use memory for predictions but not for further training. PLMs are made of two DNNs, one that classifies the image (storage DNN) and another that generates new ones (recall DNN). They are paired using Perpetual Stochastic Gradient Descent: at each iteration a random class is chosen and used as input to the recall DNN to synthesise the respective training image. The recalled training image is then used with the random class to train both networks for a single step of backprop SGD. Via “new experience” SGD steps, new classes can then be added on the fly, without needing to train a new DNN from scratch. More in these papers: Paper1 Paper2

PSGD training the 2 coupled Deep Neural Networks

Next up was Matthew Zeiler, Founder & CEO of Clarifai, talking about their API that can classify pictures over 10k concepts (a concept being either an object, an adjective or an action). The API can be used for video understanding too. He gave a pretty impressive demo of video understanding, with concepts highlighted along the video timeline, making the video easily searchable; it should also be very attractive for anyone doing video editing! They have a strong focus on performance (a 3.5-minute video being processed in 6 seconds), leveraging AWS GPUs and a proprietary toolkit optimised for speed and memory. They now support concepts in 21 languages and have made a significant effort on localisation. They will further expand in healthcare to support medical analysis made with field sensors (e.g. pictures of ears, mouth and nose). Clarifai is hiring

Matthew Zeiler showing video annotation in Chinese (image courtesy of Courtney Corley)

Next up was Max Welling, Professor of Computer Science at University of Amsterdam and founder of Scyfer BV, a DL startup focused on healthcare. He first presented the difficulties in applying machine learning in healthcare, namely the curse of dimensionality (a TB of data per person but “few” patients) and the curse of privacy (data locked up in each hospital, with no holistic view). As possible solutions, he presented:

  • Generative models to augment datasets
  • Exploit symmetries in data
  • Remove known bias (e.g. some hospitals may treat diseases at a different stage)
  • Use Bayesian approaches to reduce overfitting

He further elaborated, showing some of his recent work

  • Bayesian Dark Knowledge (Paper, Hugo Larochelle notes), in which the goal is to learn a single NN that performs like an ensemble of NNs, reducing the storage of weights and having calibrated probabilities as output
  • Dropout as Variational Bayesian Inference (Paper, Hugo Larochelle notes), with a new algorithm to learn the dropout rate, useful to avoid overfitting
  • An as yet unpublished paper on Domain Invariance (Deep Generative Models for Invariant Representations by Louizos et al., 2015) in which the NN can create a latent representation of the input data purging selected information (e.g. photo illumination), useful to remove biases

Pictures can be nicely clustered after purging illumination information

Then came the last talk of the morning, with Lior Wolf, Faculty Member of Tel-Aviv University, talking about Image Annotation using Deep Learning and Fisher Vectors (Paper Pdf). He started by saying he was approaching NLP as a computer vision guy, further evidence of how DL is becoming more and more cross-domain. Lior then talked about 3 tasks:

  • Image Annotation (assign a description from a given list to an image)
  • Image Search (find image given description)
  • Description Synthesis (create a new description for an unseen image)

For Image Annotation and Search, they started by converting images to vectors with CNNs and words to vectors with Word2Vec. Most of the research went into how to combine word vectors into sentence vectors, and they came up with a model based on Fisher vectors. Once they had sentence vectors, they used Canonical Correlation Analysis (CCA) to project image representations and sentence representations into the same space, so that images and sentences could be matched by finding nearest neighbours. For Description Synthesis, RNNs were used with inputs from the CNN->CCA pipeline. One of the open issues is that the system itself decides what to describe; more research is needed to direct attention and influence which part of the image is described.

A nice description automatically generated
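The matching step for annotation and search can be sketched as a nearest-neighbour lookup by cosine similarity. Two simplifications are assumed here and should not be read as the paper's method: sentence vectors are a plain average of word vectors (a stand-in for the Fisher vector model), and all vectors are taken to be already projected into the shared space (the CCA step is skipped). The three-dimensional vectors and file names are invented for the example.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Average a list of vectors component by component."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def nearest(query, candidates):
    """Return the key of the candidate vector closest to `query`."""
    return max(candidates, key=lambda name: cosine(query, candidates[name]))

# Made-up word and image embeddings, assumed to live in one shared space.
word_vectors = {
    "cat": [0.9, 0.1, 0.0], "sofa": [0.6, 0.3, 0.1],
    "car": [0.0, 0.2, 0.9], "street": [0.1, 0.3, 0.8],
    "on": [0.3, 0.3, 0.3], "a": [0.3, 0.3, 0.3],
}
image_vectors = {
    "photo_cat.jpg": [0.8, 0.2, 0.05],
    "photo_car.jpg": [0.05, 0.25, 0.85],
}

def sentence_vector(sentence):
    return mean_vector([word_vectors[w] for w in sentence.split()])

sentences = ["cat on a sofa", "car on a street"]
sentence_vectors = {s: sentence_vector(s) for s in sentences}

# Image Annotation: pick the best description for an image.
annotation = nearest(image_vectors["photo_cat.jpg"], sentence_vectors)
# Image Search: pick the best image for a description.
search_hit = nearest(sentence_vectors["car on a street"], image_vectors)
```

Because both directions use the same shared space, annotation and search are the same lookup with the roles of query and candidates swapped.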

Back from a very good lunch, the afternoon started with 2 talks on semantic segmentation, which means recognising and delineating objects in an image. It is a useful task for road scene understanding (self-driving cars), robots grasping objects, and healthcare (segmenting tumors, dental cavities,…), among others.

Sven Behnke, Head of Computer Science Department at University of Bonn, covered 2 algorithms, Neural Abstraction Pyramid and Semantic RGB-D Perception. Neural Abstraction Pyramid (NAP) is his historical work (around ‘98, Paper). NAP is a NN that also includes lateral connections, working closer to how the human visual system does. It was successfully applied to image denoising and face localisation. His recent work is on Semantic RGB-D Perception, which uses DNNs whose input comes from a Kinect-like sensor and includes information on distance. With distance information, they are able to calculate the height of each pixel and scale the input accordingly, resulting in much better segmentation and semantic interpretation (Paper Pdf). They also achieved good results by applying a depth mask to the original object and adding the colorised depth image as an input to the CNN (Paper Pdf).

Taking advantage of distance information for better semantic segmentation

Bernardino Romera Paredes, Postdoctoral Research Assistant at University of Oxford, followed with a new algorithm for semantic segmentation (Paper), which uses a fully convolutional network coupled with Conditional Random Fields formulated as Recurrent Neural Networks, trained end to end. They achieved the best accuracy results, but at the moment the algorithm is not fast enough for real-time usage. They have a nice online demo at

Cats are the official animal of DL practitioners

Next up was Miriam Redi, Research Scientist at Yahoo Labs, with a talk on The Subjective Eye of Machine Vision. The goal of her research is to find the invisible in pictures, features such as sentiment, society, aesthetics, creativity, and culture. She presented four different projects:

  • Computational portrait aesthetics (Paper). Using picture features and photographers' annotations, they were able to predict portrait beauty, finding that picture features like contrast and sharpness have a high correlation with perceived beauty, while gender, age and race are uncorrelated
  • Help discovery of beautiful, unpopular content pictures (Paper). The goal was to help discover beautiful but unnoticed content in Flickr. They first created a large annotated aesthetic dataset through crowdsourcing, then built a model that was able to find new beautiful pictures
  • Predict sentiment across cultures (Paper). Here they created a dataset with sentiment annotations across 12 languages. Interestingly, they tried transfer learning and found that it works well across Latin languages (a classifier trained with French annotations is good at predicting Italian sentiment), while predicting Chinese sentiment starting from Latin datasets doesn’t work well
  • Predict creativity using Vine videos (Paper). Again, starting with crowd-annotated videos and video features, they found that their definition of creativity can be modelled when taking into account both aesthetic features and novelty features

Overall, even though most of the work was done with manual feature encoding and almost no deep learning, there were interesting questions to explore and the results were pretty engaging.

The definition of creativity in Yahoo Labs project

Cees Snoek, Director at QUVA, followed with Video Understanding: What to Expect Today and Tomorrow?. Cees talked about video labelling and said that Qualcomm is building the Zeroth platform, which enables pre-trained deep learning models to do object recognition on-device on your mobile (if it is equipped with a Qualcomm Snapdragon chip). The technology can potentially be extended to all kinds of IoT devices. In the second part of the talk, he presented a method for action recognition in videos. The initial idea is to filter the frames using tubelets (Paper Pdf) that include only the area around the moving subject. This greatly reduces the search space and increases classification speed. Then, by using object recognition (Paper Pdf) and calculating the distance between objects and actions using Word2Vec vectors, they can predict actions without a labelled action dataset (Paper Pdf).

Tubelets generated from a sequence of frames
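The last step, predicting actions from recognised objects without labelled action data, might be sketched as follows. Each candidate action is scored by the similarity between its word vector and the word vectors of the detected objects, weighted by how confident the object recogniser is. The scoring rule, the two-dimensional vectors and the detection scores are all assumptions made up for this example; the paper's actual formulation may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def action_scores(object_probs, object_vecs, action_vecs):
    """Score each action as the detection-weighted similarity to the objects seen."""
    return {
        action: sum(p * cosine(object_vecs[obj], vec) for obj, p in object_probs.items())
        for action, vec in action_vecs.items()
    }

# Hypothetical object recogniser output for one tubelet.
object_probs = {"horse": 0.7, "person": 0.2, "guitar": 0.1}

# Toy stand-ins for Word2Vec vectors of object and action names.
object_vecs = {"horse": [0.9, 0.1], "person": [0.5, 0.5], "guitar": [0.1, 0.9]}
action_vecs = {"riding": [0.8, 0.2], "strumming": [0.2, 0.8]}

scores = action_scores(object_probs, object_vecs, action_vecs)
predicted = max(scores, key=scores.get)
```

No video labelled with "riding" is needed: the link between horse and riding comes entirely from the word vectors, which are learnt from text.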

A panel followed on What Does the Future Hold for Deep Learning? with guests Tony Robinson, Founder & CTO of Speechmatics, and Daniel Hulme, CEO of Satalia, with John Henderson, Principal at White Star Capital, acting as moderator. Tony Robinson is a pioneer in speech recognition using NNs, going back to the ‘90s; he moved to other algorithms during the AI winter and finally went back to where he started. Daniel Hulme is focusing on hard problems (e.g. vehicle routing) where the emphasis is on action rather than prediction, using symbolic AI (while he defined DL as sub-symbolic AI). Asked "what is AI?", Daniel answered "Goal Directed Adaptive Behaviour", while Tony said "What computers can’t do now" :) Regarding the future, Tony said he can only predict the increase in computing power and hopes there will be less hype around DL, to avoid another winter. Daniel, of course, sees a symbolic AI renaissance by 2020. Regarding spoken dialogue, they see much to be done to address ambiguity, and more work needed on delayed reward. They were also asked about the AI threat to humanity, which was quickly dismissed. Nevertheless, they formulated the problem with 2 scenarios, one of intelligent robots (Terminator-style), the other of stupid robots solving problems in a stupid way (think about eradicating cancer: the easiest way for a robot to solve the problem would be to exterminate humans…). Given our ability to predict consequences, the latter seems to me much more dangerous.

The last talk of the day was by Sébastien Bratières, Speech Evangelist at dawin gmbh & PhD Researcher at Uni. of Cambridge, on Deep Learning for Speech Recognition. Sébastien gave an overview of how DL changed the speech recognition pipeline. In a nutshell, speech recognition is made of an acoustic model (AM) that predicts a word / phoneme sequence from raw audio and a language model (LM) that selects a word based on the previous ones. In the last 5–10 years, AMs have gone from Gaussian Mixture Models + Hidden Markov Models to Deep Neural Networks, while LMs have gone from N-grams to RNNs. The pipeline has been simplified but there are still legacy models (GMM+HMM is still used to prepare input for the DNN); the future goal is to train end to end using only DL. Still, there are a lot of “invariant” issues that are important for user experience and not solved even by DL, such as adaptation (different accents, background noise, etc. that were not present in the training set). Looking at the future, Sébastien said that humans do not learn speech recognition through transcribed speech, so there could be room for unsupervised models (zero-resource methods).

A simplified speech recognition pipeline
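To illustrate the language model half of the pipeline, here is a minimal sketch of the older N-gram approach (a bigram model, i.e. N=2): it counts which word follows which in a corpus, then picks among candidate words the one most likely to follow the previous word. The candidates would come from the acoustic model in a real system; here they are hard-coded, and the tiny corpus and all names are assumptions for the example.

```python
from collections import Counter, defaultdict

def train_bigram(sentences):
    """Count how often each word follows each previous word."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = ["<s>"] + sentence.split()   # <s> marks the sentence start
        for prev, word in zip(words, words[1:]):
            counts[prev][word] += 1
    return counts

def prob(counts, prev, word):
    """Maximum-likelihood estimate of P(word | prev), without smoothing."""
    total = sum(counts[prev].values())
    return counts[prev][word] / total if total else 0.0

def pick(counts, prev, candidates):
    """Choose the candidate word most likely to follow `prev`."""
    return max(candidates, key=lambda w: prob(counts, prev, w))

corpus = [
    "recognise speech",
    "recognise speech today",
    "wreck a nice beach",
    "a nice day",
]
lm = train_bigram(corpus)

# Two acoustically similar candidates; the LM breaks the tie using context.
choice = pick(lm, "recognise", ["speech", "beach"])
```

An RNN language model plays the same role but conditions on the whole preceding sequence rather than on a fixed window of one or two words.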

That was all for day 1 of the summit! Overall very interesting and diverse. The day continued with a dinner for a group of participants, with more talks and the opportunity to have more fun and meet new interesting people!