A Year in Computer Vision — Part 2 of 4
— Part Two: Segmentation, Super-res/Colourisation/Style Transfer, Action Recognition
The following piece is taken from a recent publication compiled by our research team relating to the field of Computer Vision. Parts one and two are available through our website presently, with the remaining parts (three and four) to be released in the near future.
The full publication will be available for free on our website in the coming weeks. Parts 1–2 are available now via: www.themtank.org
We would encourage readers to view the piece through our own website, as we include embedded content and easy navigational functions to make the report as dynamic as possible. Our website generates no revenue for the team and simply aims to make the materials as engaging and intuitive for readers as possible. Any feedback on the presentation there is wholeheartedly welcomed by us!
Please follow, share and support our work through whatever your preferred channels are (and clap to your heart’s content!). Feel free to contact the editors with any questions or to see about potentially contributing to future works: email@example.com
Central to Computer Vision is the process of Segmentation, which divides whole images into pixel groupings that can then be labelled and classified. Semantic Segmentation goes further by trying to semantically understand the role of each pixel in the image, e.g. is it a cat, a car or some other class? Instance Segmentation takes this further still by segmenting different instances of a class, e.g. labelling three different dogs with three different colours. It is one of a barrage of Computer Vision applications currently employed in autonomous driving technology suites.
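To make the difference between semantic and instance segmentation concrete, here is a minimal sketch in plain Python. It assumes a toy label map with a hypothetical class id of 1 for "dog", and it uses connected-component labelling as a stand-in for a real instance segmentation model. That stand-in only works when instances do not touch, which is exactly the case learned models are needed for.

```python
from collections import deque

# Toy semantic label map: 0 = background, 1 = "dog" (hypothetical class ids).
# Every dog pixel carries the same label, so individual dogs are not separated.
SEMANTIC = [
    [1, 1, 0, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
]


def label_instances(semantic, target_class):
    """Split one semantic class into instances via 4-connected components."""
    rows, cols = len(semantic), len(semantic[0])
    instances = [[0] * cols for _ in range(rows)]  # 0 = not part of any instance
    next_id = 0
    for r in range(rows):
        for c in range(cols):
            if semantic[r][c] != target_class or instances[r][c]:
                continue
            # Found an unlabelled pixel of the class: flood-fill a new instance.
            next_id += 1
            instances[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and semantic[ny][nx] == target_class
                            and not instances[ny][nx]):
                        instances[ny][nx] = next_id
                        queue.append((ny, nx))
    return instances, next_id


instance_map, num_dogs = label_instances(SEMANTIC, target_class=1)
print(num_dogs)
for row in instance_map:
    print(row)
```

The semantic map answers "which pixels are dog?", while the instance map additionally answers "which dog?" by giving each region its own id.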
Perhaps some of the best improvements in the area of segmentation come courtesy of FAIR, who continue to build upon their DeepMask work from 2015. DeepMask generates rough ‘masks’ over objects as an initial form of segmentation. In 2016, FAIR introduced SharpMask, which refines the ‘masks’ provided by DeepMask, correcting the loss of detail and improving semantic segmentation. In addition to this, MultiPathNet identifies the objects delineated by each mask.
“To capture general object shape, you have to have a high-level understanding of what you are looking at (DeepMask), but to accurately place the boundaries you need to look back at lower-level features all the way down to the pixels (SharpMask).” — Piotr Dollar, 2016.
Figure 6: Demonstration of FAIR techniques in action
Video Propagation Networks attempt to create a simple model to propagate accurate object masks, assigned at the first frame, through the entire video sequence along with some additional information.
In 2016, researchers worked on finding alternative network configurations to tackle the aforementioned issues of scale and localisation. DeepLab is one such example, achieving encouraging results for semantic image segmentation tasks. Khoreva et al. (2016) build on DeepLab’s earlier work (circa 2015) and propose a weakly supervised training method which achieves results comparable to fully supervised networks.
Computer Vision further refined the approach of sharing useful information within a network through the use of end-to-end networks, which reduce the computational requirements of running multiple, omni-directional subtasks for classification. Two key papers using this approach are:
- 100 Layers Tiramisu is a fully-convolutional DenseNet which connects every layer to every other layer in a feed-forward fashion. It also achieves SOTA on multiple benchmark datasets with fewer parameters and less training/processing.
- Fully Convolutional Instance-aware Semantic Segmentation performs instance mask prediction and classification jointly (two subtasks).
The approach won the COCO segmentation challenge for MSRA with 37.3% AP, a 9.1% absolute jump over MSRAVC’s 2015 COCO challenge entry.
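The dense connectivity behind the Tiramisu entry above can be illustrated with a small sketch. This is not the paper's architecture: `toy_layer` is a hypothetical stand-in for a convolutional layer, each "channel" is reduced to a single number, and the sizes are arbitrary. The only point it shows is that each layer receives the concatenation of everything produced before it.

```python
def toy_layer(inputs, growth_rate):
    """Hypothetical stand-in for a convolutional layer.

    Each 'channel' is a single number here. The layer emits `growth_rate`
    new channels computed from ALL of its input channels.
    """
    mean = sum(inputs) / len(inputs)
    return [mean + k for k in range(growth_rate)]


def dense_block(x, num_layers, growth_rate):
    """Run `num_layers` densely connected layers over input channels `x`."""
    features = list(x)   # every channel produced so far
    input_widths = []    # how many channels each layer received
    for _ in range(num_layers):
        input_widths.append(len(features))
        new_channels = toy_layer(features, growth_rate)
        features.extend(new_channels)  # concatenate, do not overwrite
    return features, input_widths


features, input_widths = dense_block([1.0, 2.0], num_layers=3, growth_rate=4)
print(len(features), input_widths)
```

Because earlier outputs are reused rather than recomputed, each layer only needs to add a small number of new channels, which is one intuition for why such networks can get by with fewer parameters.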
While ENet, a DNN architecture for real-time semantic segmentation, is not of this category, it does demonstrate the commercial merits of reducing computation costs and giving greater access to mobile devices.
Our work aims to relate as many of these advancements as possible back to tangible public applications. With this in mind, the following are some of the most interesting healthcare applications of segmentation in 2016:
- A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images 
- 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study 
- Semi-supervised Learning using Denoising Autoencoders for Brain Lesion Detection and Segmentation 
- 3D Ultrasound image segmentation: A Survey 
- A Fully Convolutional Neural Network based Structured Prediction Approach Towards the Retinal Vessel Segmentation 
- 3-D Convolutional Neural Networks for Glioblastoma Segmentation 
One of our favourite quasi-medical segmentation applications is FusionNet - a deep fully residual convolutional neural network for image segmentation in connectomics  benchmarked against SOTA electron microscopy (EM) segmentation methods.
Super-resolution, Style Transfer & Colourisation
Not all research in Computer Vision serves to extend the pseudo-cognitive abilities of machines, and often the fabled malleability of neural networks, as well as other ML techniques, lends itself to a variety of other novel applications that spill into the public space. Last year’s advancements in Super-resolution, Style Transfer & Colourisation occupied that space for us.
Super-resolution refers to the process of estimating a high resolution image from a low resolution counterpart, and also the prediction of image features at different magnifications, something which the human brain can do almost effortlessly. Originally, super-resolution was performed by simple techniques like bicubic interpolation and nearest neighbours. In terms of commercial applications, the desire to overcome low-resolution constraints stemming from source quality, and to realise ‘CSI Miami’ style image enhancement, has driven research in the field. Here are some of the year’s advances and their potential impact:
- Neural Enhance  is the brainchild of Alex J. Champandard and combines approaches from four different research papers to achieve its Super-resolution method.
- Real-Time Video Super Resolution was also attempted in 2016 in two notable instances: Caballero et al. and Shi et al.
- RAISR: Rapid and Accurate Image Super-Resolution from Google avoids the costly memory and speed requirements of neural network approaches by training filters with low-resolution and high-resolution image pairs. RAISR, as a learning-based framework, is two orders of magnitude faster than competing algorithms and has minimal memory requirements when compared with neural network-based approaches. Hence super-resolution is extendable to personal devices. Google also published a research blog post on RAISR.
Figure 7: Super-resolution SRGAN example
The use of Generative Adversarial Networks (GANs) represents the current SOTA for Super-resolution:
- SRGAN  provides photo-realistic textures from heavily downsampled images on public benchmarks, using a discriminator network trained to differentiate between super-resolved and original photo-realistic images.
Qualitatively, SRGAN performs best. SRResNet scores best on the peak signal-to-noise ratio (PSNR) metric, but SRGAN recovers the finer texture details and achieves the best Mean Opinion Score (MOS). “To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.” All previous approaches fail to recover the finer texture details at large upscaling factors.
- Amortised MAP Inference for Image Super-resolution proposes a method for calculation of Maximum a Posteriori (MAP) inference using a Convolutional Neural Network. Their research presents three approaches for optimisation, of which the GAN-based approach performs markedly better on real image data at present.
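The PSNR metric and the nearest-neighbour baseline mentioned in this section can be sketched in a few lines of plain Python. The tiny 4×4 "image", the crude `downscale` helper and the 8-bit `max_value` are assumptions made for illustration only; none of this is taken from the papers above.

```python
import math


def downscale(image, factor):
    """Keep every `factor`-th pixel: a crude low-resolution counterpart."""
    return [row[::factor] for row in image[::factor]]


def upscale_nearest(image, factor):
    """Nearest-neighbour upscaling: repeat each pixel `factor` times per axis."""
    out = []
    for row in image:
        wide = [p for p in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means closer to the reference."""
    count = sum(len(row) for row in reference)
    squared_error = sum(
        (a - b) ** 2
        for ref_row, est_row in zip(reference, estimate)
        for a, b in zip(ref_row, est_row)
    )
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


high_res = [
    [0, 10, 20, 30],
    [10, 20, 30, 40],
    [20, 30, 40, 50],
    [30, 40, 50, 60],
]
low_res = downscale(high_res, 2)
restored = upscale_nearest(low_res, 2)
print(round(psnr(high_res, restored), 2))
```

PSNR is a purely pixel-wise measure, which helps explain the observation above: a method can score well on PSNR while still looking less convincing to human raters than one with sharper, more plausible texture.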
Figure 8: Style Transfer from Nikulin & Novak
Undoubtedly, Style Transfer epitomises a novel use of neural networks that has seeped into the public domain, specifically through last year’s Facebook integrations and companies like Prisma and Artomatix. Style transfer is an older technique, but it was adapted to neural networks in 2015 with the publication of A Neural Algorithm of Artistic Style. Since then, the concept has been expanded upon by Nikulin and Novak and also applied to video, as is the common progression within Computer Vision.
Figure 9: Further examples of Style Transfer
Style transfer as a topic is fairly intuitive once visualised: take an image and imagine it with the stylistic features of a different image, for example in the style of a famous painting or artist. This year Facebook released Caffe2Go, their deep learning system which integrates into mobile devices. Google also released some interesting work which sought to blend multiple styles to generate entirely unique image styles: see their research blog and full paper.
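One way to make "the stylistic features of a different image" concrete is the Gram matrix of feature maps used as a style representation in A Neural Algorithm of Artistic Style. The sketch below is a simplification: the feature values are made up rather than taken from a CNN, and the normalisation in `style_loss` is our own assumption, not the paper's exact constant.

```python
def gram_matrix(features):
    """Channel-by-channel correlations of a feature map.

    `features` is a list of channels, each a flat list of activations over
    spatial positions. Entry (i, j) is the inner product of channels i and j.
    """
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) for ch_j in features]
        for ch_i in features
    ]


def style_loss(features_a, features_b):
    """Mean squared difference between two Gram matrices (simplified)."""
    gram_a, gram_b = gram_matrix(features_a), gram_matrix(features_b)
    channels = len(gram_a)
    total = sum(
        (gram_a[i][j] - gram_b[i][j]) ** 2
        for i in range(channels)
        for j in range(channels)
    )
    return total / (channels * channels)


# Made-up activations: 2 channels over 3 spatial positions.
original = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Same activations with the spatial positions shuffled identically per channel.
shuffled = [[3.0, 1.0, 2.0], [6.0, 4.0, 5.0]]
different = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

print(gram_matrix(original))
print(style_loss(original, shuffled), style_loss(original, different))
```

The shuffled example gives a loss of zero because the Gram matrix discards where things are in the image and keeps only which features occur together, which is a reasonable working notion of "style" as opposed to "content".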
Besides mobile integrations, style transfer has applications in the creation of game assets. Members of our team recently saw a presentation by the Founder and CTO of Artomatix, Eric Risser, who discussed the technique’s novel application to content generation in games (texture mutation, etc.), which could dramatically reduce the work of a conventional texture artist.
Colourisation is the process of changing monochrome images to new full-colour versions. Originally this was done manually by people who painstakingly selected colours to represent specific pixels in each image. In 2016, it became possible to automate this process while maintaining the appearance of realism indicative of the human-centric colourisation process. While humans may not accurately represent the true colours of a given scene, their real world knowledge allows the application of colours in a way which is consistent with the image and another person viewing said image.
The process of colourisation is interesting in that the network assigns the most likely colouring for images based on its understanding of object location, textures and environment, e.g. it learns that skin is pinkish and the sky is blueish.
Three of the most influential works of the year, in our opinion, are as follows:
- Zhang et al. produced a method that was able to successfully fool humans on 32% of their trials. Their methodology is comparable to a “colourisation Turing test.” 
- Larsson et al.  fully automate their image colourisation system using Deep Learning for Histogram estimation.
- Finally, Iizuka, Simo-Serra and Ishikawa demonstrate a colourisation model also based upon CNNs. The work outperformed the existing SOTA, and we [the team] feel that it is also qualitatively the best, appearing the most realistic. Figure 10 provides comparisons; note, however, that the image is taken from Iizuka et al.
Figure 10: Comparison of Colourisation Research
“Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNN.”
In a test to see how natural their colourisation was, users were given a random image from their models and were asked, “does this image look natural to you?”
Their approach was judged natural 92.6% of the time, the baseline roughly 70% of the time, and the ground truth (the actual colour photos) 97.7% of the time.
The task of action recognition refers both to the classification of an action within a given video frame and, more recently, to algorithms which can predict the likely outcomes of interactions given only a few frames before the action takes place. In this respect we see recent research attempt to embed context into algorithmic decisions, similar to other areas of Computer Vision. Some key papers in this space are:
- Long-term Temporal Convolutions for Action Recognition  leverages the spatio-temporal structure of human actions, i.e. the particular movement and duration, to correctly recognise actions using a CNN variant. To overcome the sub-optimal temporal modelling of longer term actions by CNNs, the authors propose a neural network with long-term temporal convolutions (LTC-CNN) to improve the accuracy of action recognition. Put simply, the LTCs can look at larger parts of the video to recognise actions. Their approach uses and extends 3D CNNs ‘to enable action representation at a fuller temporal scale’.
“We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).”
- Spatiotemporal Residual Networks for Video Action Recognition applies a variation of the two stream CNN to the task of action recognition, which combines techniques from both traditional CNN approaches and recently popularised Residual Networks (ResNets). The two stream approach takes its inspiration from a neuroscientific hypothesis on the functioning of the visual cortex, i.e. separate pathways recognise object shape/colour and movement. The authors bring in the classification benefits of ResNets by injecting residual connections between the two CNN streams.
“Each stream initially performs video recognition on its own and for final classification, softmax scores are combined by late fusion. To date, this approach is the most effective approach of applying deep learning to action recognition, especially with limited training data. In our work we directly convert image ConvNets into 3D architectures and show greatly improved performance over the two-stream baseline.” The authors report 94% on UCF101 and 70.6% on HMDB51. Feichtenhofer et al. improved on traditional improved dense trajectory (iDT) methods and generated better results through use of both techniques.
- Anticipating Visual Representations from Unlabeled Video is an interesting paper, although not strictly action classification. The program predicts the action which is likely to take place given a sequence of video frames up to one second before an action. The approach uses visual representations rather than pixel-by-pixel classification, which means that the program can operate without labelled data, by taking advantage of the feature learning properties of deep neural networks.
“The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions”.
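The late fusion of two-stream softmax scores, quoted in the Spatiotemporal Residual Networks entry above, can be sketched as follows. The class names, the logits and the equal `motion_weight` are hypothetical values chosen for illustration; the paper's actual networks and weighting are not reproduced here.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(appearance_logits, motion_logits, motion_weight=0.5):
    """Combine per-stream softmax scores and return (fused scores, best class index)."""
    appearance = softmax(appearance_logits)
    motion = softmax(motion_logits)
    fused = [
        (1.0 - motion_weight) * a + motion_weight * m
        for a, m in zip(appearance, motion)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return fused, best


CLASSES = ["walk", "run", "jump"]          # hypothetical action labels
appearance_logits = [2.0, 1.0, 0.1]        # stream looking at single frames
motion_logits = [0.5, 3.0, 0.2]            # stream looking at optical flow

fused_scores, predicted = late_fusion(appearance_logits, motion_logits)
print(CLASSES[predicted], [round(s, 3) for s in fused_scores])
```

In this toy case the appearance stream alone would favour "walk", but the motion stream's confidence tips the fused prediction to "run", which is the kind of complementary evidence the two-stream design relies on.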
The organisers of the Thumos Action Recognition Challenge  released a paper describing the general approaches for Action Recognition from the last number of years. The paper also provides a rundown of the Challenges from 2013–2015, future directions for the challenge and ideas on how to give computers a more holistic understanding of video through Action Recognition. We hope that the Thumos Action Recognition Challenge returns in 2017 after its (seemingly) unexpected hiatus.
Follow our profile on Medium for the next instalment, Part 3 of 4: Towards a 3D Understanding of the World.
Please feel free to place all feedback and suggestions in the comments section and we’ll revert as soon as we can. Alternatively, you can contact us directly through: firstname.lastname@example.org
The full piece is available at: www.themtank.org/a-year-in-computer-vision
The M Tank
References in order of appearance
 Pinheiro, Collobert and Dollar. 2015. Learning to Segment Object Candidates. [Online] arXiv: 1506.06204. Available: arXiv:1506.06204v2
 Pinheiro et al. 2016. Learning to Refine Object Segments. [Online] arXiv: 1603.08695. Available: arXiv:1603.08695v2
 Zagoruyko, S. 2016. A MultiPath Network for Object Detection. [Online] arXiv: 1604.02135v2. Available: arXiv:1604.02135v2
 Dollar, P. 2016. Learning to Segment. [Blog] FAIR. Available: https://research.fb.com/learning-to-segment/
 Dollar, P. 2016. Segmenting and refining images with SharpMask. [Online] Facebook Code. Available: https://code.facebook.com/posts/561187904071636/segmenting-and-refining-images-with-sharpmask/
 Jampani et al. 2016. Video Propagation Networks. [Online] arXiv: 1612.05478. Available: arXiv:1612.05478v2
 Chen et al., 2016. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. [Online] arXiv: 1606.00915. Available: arXiv:1606.00915v1
 Khoreva et al. 2016. Simple Does It: Weakly Supervised Instance and Semantic Segmentation. [Online] arXiv: 1603.07485v2. Available: arXiv:1603.07485v2
 Jégou et al. 2016. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. [Online] arXiv: 1611.09326v2. Available: arXiv:1611.09326v2
 Li et al. 2016. Fully Convolutional Instance-aware Semantic Segmentation. [Online] arXiv: 1611.07709v1. Available: arXiv:1611.07709v1
 Paszke et al. 2016. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. [Online] arXiv: 1606.02147v1. Available: arXiv:1606.02147v1
 Vázquez et al. 2016. A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images. [Online] arXiv: 1612.00799. Available: arXiv:1612.00799v1
 Dolz et al. 2016. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. [Online] arXiv: 1612.03925. Available: arXiv:1612.03925v1
 Alex et al. 2017. Semi-supervised Learning using Denoising Autoencoders for Brain Lesion Detection and Segmentation. [Online] arXiv: 1611.08664. Available: arXiv:1611.08664v4
 Mozaffari and Lee. 2016. 3D Ultrasound image segmentation: A Survey. [Online] arXiv: 1611.09811. Available: arXiv:1611.09811v1
 Dasgupta and Singh. 2016. A Fully Convolutional Neural Network based Structured Prediction Approach Towards the Retinal Vessel Segmentation. [Online] arXiv: 1611.02064. Available: arXiv:1611.02064v2
 Yi et al. 2016. 3-D Convolutional Neural Networks for Glioblastoma Segmentation. [Online] arXiv: 1611.04534. Available: arXiv:1611.04534v1
 Quan et al. 2016. FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics. [Online] arXiv: 1612.05360. Available: arXiv:1612.05360v2
 Connectomics refers to the mapping of all connections within an organism’s nervous system, i.e. neurons and their connections.
 Champandard, A.J. 2017. Neural Enhance (latest commit 30/11/2016). [Online] Github. Available: https://github.com/alexjc/neural-enhance [Accessed: 11/02/2017]
 Caballero et al. 2016. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. [Online] arXiv: 1611.05250. Available: arXiv:1611.05250v1
 Shi et al. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. [Online] arXiv: 1609.05158. Available: arXiv:1609.05158v2
 Romano et al. 2016. RAISR: Rapid and Accurate Image Super Resolution. [Online] arXiv: 1606.01299. Available: arXiv:1606.01299v3
 Milanfar, P. 2016. Enhance! RAISR Sharp Images with Machine Learning. [Blog] Google Research Blog. Available: https://research.googleblog.com/2016/11/enhance-raisr-sharp-images-with-machine.html [Accessed: 20/03/2017].
 Ledig et al. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. [Online] arXiv: 1609.04802. Available: arXiv:1609.04802v3
 Sønderby et al. 2016. Amortised MAP Inference for Image Super-resolution. [Online] arXiv: 1610.04490. Available: arXiv:1610.04490v1
 Prisma. 2017. [Website] Prisma. Available: https://prisma-ai.com/ [Accessed: 01/04/2017].
 Artomatix. 2017. [Website] Artomatix. Available: https://services.artomatix.com/ [Accessed: 01/04/2017].
 Gatys et al. 2015. A Neural Algorithm of Artistic Style. [Online] arXiv: 1508.06576. Available: arXiv:1508.06576v2
 Nikulin & Novak. 2016. Exploring the Neural Algorithm of Artistic Style. [Online] arXiv: 1602.07188. Available: arXiv:1602.07188v2
 Ruder et al. 2016. Artistic style transfer for videos. [Online] arXiv: 1604.08610. Available: arXiv:1604.08610v2
 Jia and Vajda. 2016. Delivering real-time AI in the palm of your hand. [Online] Facebook Code. Available: https://code.facebook.com/posts/196146247499076/delivering-real-time-ai-in-the-palm-of-your-hand/ [Accessed: 20/01/2017].
 Dumoulin et al. 2016. Supercharging Style Transfer. [Online] Google Research Blog. Available: https://research.googleblog.com/2016/10/supercharging-style-transfer.html [Accessed: 20/01/2017].
 Dumoulin et al. 2017. A Learned Representation For Artistic Style. [Online] arXiv: 1610.07629. Available: arXiv:1610.07629v5
 Zhang et al. 2016. Colorful Image Colorization. [Online] arXiv: 1603.08511. Available: arXiv:1603.08511v5
 Larsson et al. 2016. Learning Representations for Automatic Colorization. [Online] arXiv: 1603.06668. Available: arXiv:1603.06668v2
 Iizuka, Simo-Serra and Ishikawa. 2016. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. [Online] ACM Transactions on Graphics (Proc. of SIGGRAPH), 35(4):110. Available: http://hi.cs.waseda.ac.jp/~iizuka/projects/colorization/en/
 Varol et al. 2016. Long-term Temporal Convolutions for Action Recognition. [Online] arXiv: 1604.04494. Available: arXiv:1604.04494v1
 Feichtenhofer et al. 2016. Spatiotemporal Residual Networks for Video Action Recognition. [Online] arXiv: 1611.02155. Available: arXiv:1611.02155v1
 Vondrick et al. 2016. Anticipating Visual Representations from Unlabeled Video. [Online] arXiv: 1504.08023. Available: arXiv:1504.08023v2
 Conner-Simons, A., Gordon, R. 2016. Teaching machines to predict the future. [Online] MIT NEWS. Available: https://news.mit.edu/2016/teaching-machines-to-predict-the-future-0621 [Accessed: 03/02/2017].
 Idrees et al. 2016. The THUMOS Challenge on Action Recognition for Videos “in the Wild”. [Online] arXiv: 1604.06182. Available: arXiv:1604.06182v1