A Year in Computer Vision — Part 2 of 4

— Part Two: Segmentation, Super-res/Colourisation/Style Transfer, Action Recognition

The following piece is taken from a recent publication compiled by our research team relating to the field of Computer Vision. Parts one and two are currently available through our website, with the remaining parts (three and four) to be released in the near future.

The full publication will be available for free on our website in the coming weeks; Parts 1–2 are available now via: www.themtank.org

We would encourage readers to view the piece through our own website, as we include embedded content and easy navigational functions to make the report as dynamic as possible. Our website generates no revenue for the team and simply aims to make the materials as engaging and intuitive for readers as possible. Any feedback on the presentation there is wholeheartedly welcomed by us!

Please follow, share and support our work through whatever your preferred channels are (and clap to your heart’s content!). Feel free to contact the editors with any questions or to see about potentially contributing to future works: info@themtank.com


Central to Computer Vision is the process of Segmentation, which divides whole images into pixel groupings which can then be labelled and classified. Moreover, Semantic Segmentation goes further by trying to semantically understand the role of each pixel in the image e.g. is it a cat, car or some other type of class? Instance Segmentation takes this even further by segmenting different instances of classes e.g. labelling three different dogs with three different colours. It is one of a barrage of Computer Vision applications currently employed in autonomous driving technology suites.
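As a toy illustration of the distinction (with entirely made-up labels, not the output of any real model), a semantic mask assigns a class to every pixel, while an instance mask additionally separates individual objects of the same class:

```python
import numpy as np

# Toy 4x4 "image": a semantic mask labels every pixel with a class id.
# Hypothetical class ids: 0 = background, 1 = dog.
semantic_mask = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
])

# An instance mask gives the two disconnected dog regions distinct ids
# (1 and 2) - the "three dogs, three colours" idea from the text.
instance_mask = np.array([
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 2],
    [0, 0, 0, 2],
])

n_classes = len(np.unique(semantic_mask))        # background + dog = 2
n_instances = len(np.unique(instance_mask)) - 1  # ignore background = 2
```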

Perhaps some of the best improvements in the area of segmentation come courtesy of FAIR, who continue to build upon their DeepMask work from 2015 [46]. DeepMask generates rough ‘masks’ over objects as an initial form of segmentation. In 2016, FAIR introduced SharpMask [47], which refines the ‘masks’ provided by DeepMask, correcting the loss of detail and improving semantic segmentation. In addition to this, MultiPathNet [48] identifies the objects delineated by each mask.

“To capture general object shape, you have to have a high-level understanding of what you are looking at (DeepMask), but to accurately place the boundaries you need to look back at lower-level features all the way down to the pixels (SharpMask).” — Piotr Dollar, 2016. [49]
Figure 6: Demonstration of FAIR techniques in action
Note: The above pictures demonstrate the segmentation techniques employed by FAIR. These include the application of DeepMask, SharpMask and MultiPathNet techniques which are applied in that order. This process allows accurate segmentation and classification in a variety of scenes. Source: Dollar (2016)[50]

Video Propagation Networks[51] attempt to create a simple model to propagate accurate object masks, assigned at first frame, through the entire video sequence along with some additional information.

In 2016, researchers worked on finding alternative network configurations to tackle the aforementioned issues of scale and localisation. DeepLab [52] is one such example, which achieves encouraging results for semantic image segmentation tasks. Khoreva et al. (2016) [53] build on DeepLab’s earlier work (circa 2015) and propose a weakly supervised training method which achieves results comparable to fully supervised networks.

Computer Vision further refined the approach of networks sharing useful information through the use of end-to-end networks, which reduce the computational requirements of running multiple separate subtasks for classification. Two key papers using this approach are:

  • 100 Layers Tiramisu [54] is a fully-convolutional DenseNet which connects every layer to every other layer in a feed-forward fashion. It achieves SOTA on multiple benchmark datasets with fewer parameters and less training/processing.
  • Fully Convolutional Instance-aware Semantic Segmentation [55] performs instance mask prediction and classification jointly (two subtasks). It won the COCO segmentation challenge for MSRA with 37.3% AP, a 9.1% absolute improvement over the MSRAVC entry in the 2015 COCO challenge.
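The dense connectivity behind the Tiramisu architecture can be sketched in a few lines. The following is a toy 1-D stand-in (random placeholder weights, not the paper’s actual fully-convolutional layers) showing how each layer receives the concatenation of all earlier feature maps:

```python
import numpy as np

def dense_block(x, num_layers, growth):
    """Minimal sketch of DenseNet-style connectivity: each layer sees the
    concatenation of ALL previous feature maps (1-D vectors for brevity)
    and contributes `growth` new features. Weights are random placeholders."""
    rng = np.random.default_rng(0)
    features = [x]
    for _ in range(num_layers):
        inp = np.concatenate(features)           # feed-forward link to every earlier layer
        w = rng.standard_normal((growth, inp.size))
        features.append(np.maximum(w @ inp, 0))  # linear map + ReLU stand-in
    return np.concatenate(features)

out = dense_block(np.ones(8), num_layers=3, growth=4)
# input 8 features + 3 layers x 4 new features each = 20 features out
```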

While ENet [56], a DNN architecture for real-time semantic segmentation, is not of this category, it does demonstrate the commercial merits of reducing computation costs and giving greater access to mobile devices.

Our work aims to relate as many of these advancements as possible back to tangible public applications. With this in mind, the following covers some of the most interesting healthcare applications of segmentation in 2016:

One of our favourite quasi-medical segmentation applications is FusionNet [63], a deep fully residual convolutional neural network for image segmentation in connectomics [64], benchmarked against SOTA electron microscopy (EM) segmentation methods.

Semantic Segmentation applied to street views from a car

Super-resolution, Style Transfer & Colourisation

Not all research in Computer Vision serves to extend the pseudo-cognitive abilities of machines; often the fabled malleability of neural networks, as well as other ML techniques, lends itself to a variety of other novel applications that spill into the public space. Last year’s advancements in Super-resolution, Style Transfer and Colourisation occupied that space for us.

Super-resolution refers to the process of estimating a high-resolution image from a low-resolution counterpart, and also the prediction of image features at different magnifications, something which the human brain can do almost effortlessly. Originally super-resolution was performed by simple techniques like bicubic interpolation and nearest neighbours. In terms of commercial applications, the desire to overcome low-resolution constraints stemming from source quality, and to realise ‘CSI Miami’-style image enhancement, has driven research in the field. Here are some of the year’s advances and their potential impact:
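For a sense of the baseline the field starts from, here is the nearest-neighbour upscaling mentioned above in a few lines of numpy; the detail that learned methods recover is measured against simple schemes like this:

```python
import numpy as np

def nearest_neighbour_upscale(img, factor):
    """Classical super-resolution baseline: each low-res pixel is simply
    repeated `factor` times along both axes. No new detail is invented,
    which is exactly the limitation learned methods try to overcome."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

low_res = np.array([[0, 255],
                    [255, 0]], dtype=np.uint8)
high_res = nearest_neighbour_upscale(low_res, 2)  # 2x2 -> 4x4
```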

  • Neural Enhance [65] is the brainchild of Alex J. Champandard and combines approaches from four different research papers to achieve its Super-resolution method.

Real-time video super-resolution was also attempted in 2016 in two notable instances [66], [67].

  • RAISR: Rapid and Accurate Image Super-Resolution [68] from Google avoids the costly memory and speed requirements of neural network approaches by training filters with low-resolution and high-resolution image pairs. RAISR, as a learning-based framework, is two orders of magnitude faster than competing algorithms and has minimal memory requirements compared with neural network-based approaches, making super-resolution extendable to personal devices. A research blog post is also available [69].
Figure 7: Super-resolution SRGAN example
Note: From left to right: bicubic interpolation (the objective worst performer for focus), deep residual network (SRResNet) optimised for MSE, deep residual generative adversarial network (SRGAN) optimised for a loss more sensitive to human perception, and the original high-resolution (HR) image. Corresponding peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) values are shown in brackets. [4× upscaling] The reader may wish to zoom in on the middle two images (SRResNet and SRGAN) to see the difference between image smoothness and more realistic fine details.
Source: Ledig et al. (2017) [70]

The use of Generative Adversarial Networks (GANs) represents the current SOTA for super-resolution:

  • SRGAN [71] provides photo-realistic textures from heavily downsampled images on public benchmarks, using a discriminator network trained to differentiate between super-resolved and original photo-realistic images.

Qualitatively, SRGAN performs best; SRResNet scores best on the peak signal-to-noise ratio (PSNR) metric, but SRGAN recovers the finer texture details and achieves the best Mean Opinion Score (MOS). “To our knowledge, it is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.” [72] All previous approaches fail to recover the finer texture details at large upscaling factors.
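The loss behind these results can be sketched roughly as follows. Ledig et al. combine a content loss computed on feature maps with a small adversarial term (weighted 10⁻³ in the paper); the inputs below are placeholders rather than real VGG features or discriminator outputs:

```python
import numpy as np

def srgan_perceptual_loss(sr_feats, hr_feats, disc_prob_sr):
    """Sketch of an SRGAN-style perceptual loss: MSE between feature maps
    of the super-resolved and original images (standing in for VGG
    features), plus a small adversarial term that rewards fooling the
    discriminator. The 1e-3 weighting follows the paper."""
    content_loss = np.mean((sr_feats - hr_feats) ** 2)
    adversarial_loss = -np.log(disc_prob_sr)  # high "real" probability -> low loss
    return content_loss + 1e-3 * adversarial_loss

# Placeholder features and a 50/50 discriminator verdict:
loss = srgan_perceptual_loss(np.zeros(10), np.ones(10), disc_prob_sr=0.5)
```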

  • Amortised MAP Inference for Image Super-resolution [73] proposes a method for calculating Maximum a Posteriori (MAP) inference using a Convolutional Neural Network. Their research presents three approaches for optimisation; however, GANs perform markedly better on real image data for all of them at present.
Figure 8: Style Transfer from Nikulin & Novak
Note: Transferring different styles to a photo of a cat (original top left).
Source: Nikulin & Novak (2016)

Undoubtedly, Style Transfer epitomises a novel use of neural networks that has ebbed into the public domain, specifically through last year’s Facebook integrations and companies like Prisma [74] and Artomatix [75]. Style transfer is an older technique, but it was converted to neural networks in 2015 with the publication of A Neural Algorithm of Artistic Style [76]. Since then, the concept of style transfer was expanded upon by Nikulin and Novak [77] and also applied to video [78], as is the common progression within Computer Vision.

Figure 9: Further examples of Style Transfer
Note: The top row (left to right) shows the artistic styles which are transposed onto the original images displayed in the first column (Woman, Golden Gate Bridge and Meadow Environment). Using conditional instance normalisation, a single style transfer network can capture 32 styles simultaneously, five of which are displayed here. The full suite of images is available in the source paper’s appendix. This work will feature in the International Conference on Learning Representations (ICLR) 2017.
Source: Dumoulin et al. (2017, p. 2) [79]

Style transfer as a topic is fairly intuitive once visualised: take an image and imagine it with the stylistic features of a different image, for example in the style of a famous painting or artist. This year Facebook released Caffe2Go [80], their deep learning system which integrates into mobile devices. Google also released some interesting work which sought to blend multiple styles to generate entirely unique image styles: research blog [81] and full paper [82].
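The core mechanic of the original neural approach is easy to sketch: style is represented by correlations between feature channels (Gram matrices), and the stylised image is optimised so its Gram matrices match those of the style image. Below is a minimal numpy illustration of that representation, using random arrays in place of real CNN activations:

```python
import numpy as np

def gram_matrix(feature_maps):
    """Style in the Gatys et al. formulation is captured by correlations
    between feature channels: flatten each of C feature maps to a vector
    and take their inner products, giving a CxC Gram matrix that discards
    spatial layout but keeps texture statistics."""
    c, h, w = feature_maps.shape
    f = feature_maps.reshape(c, h * w)
    return f @ f.T / (h * w)

def style_loss(feats_a, feats_b):
    """Style transfer minimises this mismatch between Gram matrices."""
    return np.mean((gram_matrix(feats_a) - gram_matrix(feats_b)) ** 2)

# Hypothetical 3-channel activations standing in for CNN features:
style_feats = np.random.default_rng(1).standard_normal((3, 8, 8))
g = gram_matrix(style_feats)
```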

Besides mobile integrations, style transfer has applications in the creation of game assets. Members of our team recently saw a presentation by the Founder and CTO of Artomatix, Eric Risser, who discussed the technique’s novel application for content generation in games (texture mutation, etc.), which dramatically reduces the work of a conventional texture artist.

Colourisation is the process of changing monochrome images to new full-colour versions. Originally this was done manually by people who painstakingly selected colours to represent specific pixels in each image. In 2016, it became possible to automate this process while maintaining the appearance of realism indicative of the human-centric colourisation process. While humans may not accurately represent the true colours of a given scene, their real world knowledge allows the application of colours in a way which is consistent with the image and another person viewing said image.

The process of colourisation is interesting in that the network assigns the most likely colouring for images based on its understanding of object location, textures and environment, e.g. it learns that skin is pinkish and the sky is blueish.
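One common formulation (e.g. Zhang et al.’s) treats this as classification over quantised colour bins. As a toy sketch with made-up bins and probabilities, the final colour can be read off the predicted distribution either as an expectation or as the most likely bin:

```python
import numpy as np

# Hypothetical quantised colour-bin centres (ab channels, as in the
# classification formulation of colourisation) and a made-up predicted
# distribution for one pixel - not the output of any real network.
colour_bins = np.array([[-60.0, 80.0],   # blueish
                        [ 40.0, 50.0],   # pinkish
                        [  0.0,  0.0]])  # grey
probs = np.array([0.1, 0.8, 0.1])        # softmax output for a "skin-like" pixel

# Expectation over bins gives a smooth final colour; taking the argmax
# instead gives more vivid but noisier colours.
expected_ab = probs @ colour_bins
most_likely = colour_bins[np.argmax(probs)]
```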

Three of the most influential works of the year, in our opinion, are as follows:
  • Zhang et al. produced a method that was able to successfully fool humans on 32% of their trials. Their methodology is comparable to a “colourisation Turing test.” [83]
  • Larsson et al. [84] fully automate their image colourisation system using Deep Learning for Histogram estimation.
  • Finally, Iizuka, Simo-Serra and Ishikawa [85] demonstrate a colourisation model also based on CNNs. The work outperformed the existing SOTA, and we [the team] feel it is qualitatively the best as well, appearing to be the most realistic. Figure 10 provides comparisons; note that the image is taken from Iizuka et al.
Figure 10: Comparison of Colourisation Research
Note: From top to bottom — column one contains the original monochrome image input, which is subsequently colourised through various techniques. The remaining columns display the results generated by other prominent colourisation research in 2016. Viewed from left to right, these are Larsson et al. [84] 2016 (column two), Zhang et al. [83] 2016 (column three), and Iizuka, Simo-Serra and Ishikawa [85] 2016, also referred to as “ours” by the authors (column four). The quality difference in colourisation is most evident in row three (from the top), which depicts a group of young boys. We believe Iizuka et al.’s work to be qualitatively superior (column four). Source: Iizuka et al. 2016 [86]

“Furthermore, our architecture can process images of any resolution, unlike most existing approaches based on CNN.” [86]

In a test of how natural their colourisation appeared, users were shown a random image from their models and asked, “does this image look natural to you?”

Their approach achieved 92.6%, the baseline roughly 70%, and the ground truth (the actual colour photos) was considered natural 97.7% of the time.

Action Recognition

The task of action recognition refers both to the classification of an action within a given video frame and, more recently, to algorithms which can predict the likely outcomes of interactions given only a few frames before the action takes place. In this respect, we see recent research attempt to embed context into algorithmic decisions, similar to other areas of Computer Vision. Some key papers in this space are:

  • Long-term Temporal Convolutions for Action Recognition [87] leverages the spatio-temporal structure of human actions, i.e. the particular movement and duration, to correctly recognise actions using a CNN variant. To overcome the sub-optimal temporal modelling of longer term actions by CNNs, the authors propose a neural network with long-term temporal convolutions (LTC-CNN) to improve the accuracy of action recognition. Put simply, the LTCs can look at larger parts of the video to recognise actions. Their approach uses and extends 3D CNNs ‘to enable action representation at a fuller temporal scale’.

“We report state-of-the-art results on two challenging benchmarks for human action recognition UCF101 (92.7%) and HMDB51 (67.2%).”
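The intuition of a longer temporal receptive field can be sketched in one dimension. The toy convolution below (a simple moving average over per-frame feature values, not the paper’s actual 3D architecture) shows how each output depends on a window of consecutive frames; LTC widens that window so longer actions fit inside it:

```python
import numpy as np

def temporal_convolution(frame_feats, kernel):
    """Slide a kernel over per-frame feature values so each output sees
    len(kernel) consecutive frames. Long-term temporal convolutions extend
    this window (in full 3D convolutions) to model longer actions."""
    t = len(kernel)
    return np.array([frame_feats[i:i + t] @ kernel
                     for i in range(len(frame_feats) - t + 1)])

frames = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])   # toy "action" signal
out = temporal_convolution(frames, np.ones(3) / 3)  # 3-frame average
```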

  • Spatiotemporal Residual Networks for Video Action Recognition [88] apply a variation of two stream CNN to the task of action recognition, which combines techniques from both traditional CNN approaches and recently popularised Residual Networks (ResNets). The two stream approach takes its inspiration from a neuroscientific hypothesis on the functioning of the visual cortex, i.e. separate pathways recognise object shape/colour and movement. The authors combine the classification benefits of ResNets by injecting residual connections between the two CNN streams.

“Each stream initially performs video recognition on its own and for final classification, softmax scores are combined by late fusion. To date, this approach is the most effective approach of applying deep learning to action recognition, especially with limited training data. In our work we directly convert image ConvNets into 3D architectures and show greatly improved performance over the two-stream baseline.” — 94% on UCF101 and 70.6% on HMDB51. Feichtenhofer et al. also made improvements over traditional improved dense trajectory (iDT) methods and generated better results through use of both techniques.
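The late-fusion step itself is simple enough to sketch. With illustrative logits standing in for the two streams’ outputs (the class count and values below are made up), the softmax scores are averaged to produce the final prediction:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a logit vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def late_fusion(spatial_logits, temporal_logits):
    """Late fusion as described for two-stream models: each stream
    classifies the video on its own, and their softmax scores are
    averaged to give the final prediction."""
    return (softmax(spatial_logits) + softmax(temporal_logits)) / 2

# The appearance stream hesitates between classes 0 and 1; the motion
# stream is confident about class 1, so the fused result picks class 1.
fused = late_fusion(np.array([2.0, 2.0, 0.0]),
                    np.array([0.0, 3.0, 0.0]))
predicted_class = int(np.argmax(fused))
```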

  • Anticipating Visual Representations from Unlabeled Video [89] is an interesting paper, although not strictly action classification. The program predicts the action likely to take place given a sequence of video frames, up to one second before the action occurs. The approach uses visual representations rather than pixel-by-pixel classification, which means the program can operate without labelled data by taking advantage of the feature-learning properties of deep neural networks [90].

“The key idea behind our approach is that we can train deep networks to predict the visual representation of images in the future. Visual representations are a promising prediction target because they encode images at a higher semantic level than pixels yet are automatic to compute. We then apply recognition algorithms on our predicted representation to anticipate objects and actions”.

The organisers of the THUMOS Action Recognition Challenge [91] released a paper describing the general approaches to action recognition over the last number of years. The paper also provides a rundown of the challenges from 2013–2015, future directions for the challenge, and ideas on how to give computers a more holistic understanding of video through action recognition. We hope that the THUMOS Action Recognition Challenge returns in 2017 after its (seemingly) unexpected hiatus.

Follow our profile on medium for the next instalment — Part 3 of 4: Towards a 3D Understanding of the World.
Please feel free to place all feedback and suggestions in the comments section and we’ll revert as soon as we can. Alternatively, you can contact us directly through: info@themtank.com

The full piece is available at: www.themtank.org/a-year-in-computer-vision

Many thanks,

The M Tank

References in order of appearance

[46] Pinheiro, Collobert and Dollar. 2015. Learning to Segment Object Candidates. [Online] arXiv: 1506.06204. Available: arXiv:1506.06204v2

[47] Pinheiro et al. 2016. Learning to Refine Object Segments. [Online] arXiv: 1603.08695. Available: arXiv:1603.08695v2

[48] Zagoruyko, S. 2016. A MultiPath Network for Object Detection. [Online] arXiv: 1604.02135v2. Available: arXiv:1604.02135v2

[49] Dollar, P. 2016. Learning to Segment. [Blog] FAIR. Available: https://research.fb.com/learning-to-segment/

[50] Dollar, P. 2016. Segmenting and refining images with SharpMask. [Online] Facebook Code. Available: https://code.facebook.com/posts/561187904071636/segmenting-and-refining-images-with-sharpmask/

[51] Jampani et al. 2016. Video Propagation Networks. [Online] arXiv: 1612.05478. Available: arXiv:1612.05478v2

[52] Chen et al., 2016. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. [Online] arXiv: 1606.00915. Available: arXiv:1606.00915v1

[53] Khoreva et al. 2016. Simple Does It: Weakly Supervised Instance and Semantic Segmentation. [Online] arXiv: 1603.07485v2. Available: arXiv:1603.07485v2

[54] Jégou et al. 2016. The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation. [Online] arXiv: 1611.09326v2. Available: arXiv:1611.09326v2

[55] Li et al. 2016. Fully Convolutional Instance-aware Semantic Segmentation. [Online] arXiv: 1611.07709v1. Available: arXiv:1611.07709v1

[56] Paszke et al. 2016. ENet: A Deep Neural Network Architecture for Real-Time Semantic Segmentation. [Online] arXiv: 1606.02147v1. Available: arXiv:1606.02147v1

[57] Vázquez et al. 2016. A Benchmark for Endoluminal Scene Segmentation of Colonoscopy Images. [Online] arXiv: 1612.00799. Available: arXiv:1612.00799v1

[58] Dolz et al. 2016. 3D fully convolutional networks for subcortical segmentation in MRI: A large-scale study. [Online] arXiv: 1612.03925. Available: arXiv:1612.03925v1

[59] Alex et al. 2017. Semi-supervised Learning using Denoising Autoencoders for Brain Lesion Detection and Segmentation. [Online] arXiv: 1611.08664. Available: arXiv:1611.08664v4

[60] Mozaffari and Lee. 2016. 3D Ultrasound image segmentation: A Survey. [Online] arXiv: 1611.09811. Available: arXiv:1611.09811v1

[61] Dasgupta and Singh. 2016. A Fully Convolutional Neural Network based Structured Prediction Approach Towards the Retinal Vessel Segmentation. [Online] arXiv: 1611.02064. Available: arXiv:1611.02064v2

[62] Yi et al. 2016. 3-D Convolutional Neural Networks for Glioblastoma Segmentation. [Online] arXiv: 1611.04534. Available: arXiv:1611.04534v1

[63] Quan et al. 2016. FusionNet: A deep fully residual convolutional neural network for image segmentation in connectomics. [Online] arXiv: 1612.05360. Available: arXiv:1612.05360v2

[64] Connectomics refers to the mapping of all connections within an organism’s nervous system, i.e. neurons and their connections.

[65] Champandard, A.J. 2017. Neural Enhance (latest commit 30/11/2016). [Online] Github. Available: https://github.com/alexjc/neural-enhance [Accessed: 11/02/2017]

[66] Caballero et al. 2016. Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation. [Online] arXiv: 1611.05250. Available: arXiv:1611.05250v1

[67] Shi et al. 2016. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. [Online] arXiv: 1609.05158. Available: arXiv:1609.05158v2

[68] Romano et al. 2016. RAISR: Rapid and Accurate Image Super Resolution. [Online] arXiv: 1606.01299. Available: arXiv:1606.01299v3

[69] Milanfar, P. 2016. Enhance! RAISR Sharp Images with Machine Learning. [Blog] Google Research Blog. Available: https://research.googleblog.com/2016/11/enhance-raisr-sharp-images-with-machine.html [Accessed: 20/03/2017].

[70] ibid

[71] Ledig et al. 2017. Photo-Realistic Single Image Super-Resolution Using a Generative Adversarial Network. [Online] arXiv: 1609.04802. Available: arXiv:1609.04802v3

[72] ibid

[73] Sønderby et al. 2016. Amortised MAP Inference for Image Super-resolution. [Online] arXiv: 1610.04490. Available: arXiv:1610.04490v1

[74] Prisma. 2017. [Website] Prisma. Available: https://prisma-ai.com/ [Accessed: 01/04/2017].

[75] Artomatix. 2017. [Website] Artomatix. Available: https://services.artomatix.com/ [Accessed: 01/04/2017].

[76] Gatys et al. 2015. A Neural Algorithm of Artistic Style. [Online] arXiv: 1508.06576. Available: arXiv:1508.06576v2

[77] Nikulin & Novak. 2016. Exploring the Neural Algorithm of Artistic Style. [Online] arXiv: 1602.07188. Available: arXiv:1602.07188v2

[78] Ruder et al. 2016. Artistic style transfer for videos. [Online] arXiv: 1604.08610. Available: arXiv:1604.08610v2

[79] ibid

[80] Jia and Vajda. 2016. Delivering real-time AI in the palm of your hand. [Online] Facebook Code. Available: https://code.facebook.com/posts/196146247499076/delivering-real-time-ai-in-the-palm-of-your-hand/ [Accessed: 20/01/2017].

[81] Dumoulin et al. 2016. Supercharging Style Transfer. [Online] Google Research Blog. Available: https://research.googleblog.com/2016/10/supercharging-style-transfer.html [Accessed: 20/01/2017].

[82] Dumoulin et al. 2017. A Learned Representation For Artistic Style. [Online] arXiv: 1610.07629. Available: arXiv:1610.07629v5

[83] Zhang et al. 2016. Colorful Image Colorization. [Online] arXiv: 1603.08511. Available: arXiv:1603.08511v5

[84] Larsson et al. 2016. Learning Representations for Automatic Colorization. [Online] arXiv: 1603.06668. Available: arXiv:1603.06668v2

[85] Iizuka, Simo-Serra and Ishikawa. 2016. Let there be Color!: Joint End-to-end Learning of Global and Local Image Priors for Automatic Image Colorization with Simultaneous Classification. [Online] ACM Transactions on Graphics (Proc. of SIGGRAPH), 35(4):110. Available: http://hi.cs.waseda.ac.jp/~iizuka/projects/colorization/en/

[86] ibid

[87] Varol et al. 2016. Long-term Temporal Convolutions for Action Recognition. [Online] arXiv: 1604.04494. Available: arXiv:1604.04494v1

[88] Feichtenhofer et al. 2016. Spatiotemporal Residual Networks for Video Action Recognition. [Online] arXiv: 1611.02155. Available: arXiv:1611.02155v1

[89] Vondrick et al. 2016. Anticipating Visual Representations from Unlabeled Video. [Online] arXiv: 1504.08023. Available: arXiv:1504.08023v2

[90] Conner-Simons, A., Gordon, R. 2016. Teaching machines to predict the future. [Online] MIT NEWS. Available: https://news.mit.edu/2016/teaching-machines-to-predict-the-future-0621 [Accessed: 03/02/2017].

[91] Idrees et al. 2016. The THUMOS Challenge on Action Recognition for Videos “in the Wild”. [Online] arXiv: 1604.06182. Available: arXiv:1604.06182v1