A Year in Computer Vision — Part 3 of 4

— Part Three: Toward a 3D understanding of the world

The following piece is taken from a recent publication compiled by our research team relating to the field of Computer Vision. Parts one, two and three are available through our website presently, with the remaining part four to be released next Monday.

The full publication will be available for free on our website in the coming weeks. Parts 1–3 combined are available now via: www.themtank.org

We would encourage readers to view the piece through our own website, as we include embedded content and easy navigational functions to make the report as dynamic as possible. Our website generates no revenue for the team and simply aims to make the materials as engaging and intuitive for readers as possible. Any feedback on the presentation there is wholeheartedly welcomed by us!

Please follow, share and support our work through whatever your preferred channels are (and clap to your heart’s content!). Feel free to contact the editors with any questions or to see about potentially contributing to future works: info@themtank.com


“A key goal of Computer Vision is to recover the underlying 3D structure from 2D observations of the world.” - Rezende et al. (2016, p. 1) [92]

In Computer Vision, the classification of scenes, objects and activities, along with the output of bounding boxes and image segmentation is, as we have seen, the focus of much new research. In essence, these approaches apply computation to gain an ‘understanding’ of the 2D space of an image. However, detractors note that a 3D understanding is imperative for systems to successfully interpret, and navigate, the real world.

For instance, a network may locate a cat in an image, colour all of its pixels and classify it as a cat. But does the network fully understand where the cat in the image is, in the context of the cat’s environment?

One could argue that the computer learns very little about the 3D world from the above tasks. By contrast, humans understand the world in 3D even when examining 2D pictures: they perceive perspective, occlusion, depth, how objects in a scene are related, and so on. Imparting these 3D representations and their associated knowledge to artificial systems represents one of the next great frontiers of Computer Vision. A major reason for thinking this is that, generally:

“the 2D projection of a scene is a complex function of the attributes and positions of the camera, lights and objects that make up the scene. If endowed with 3D understanding, agents can abstract away from this complexity to form stable, disentangled representations, e.g., recognizing that a chair is a chair whether seen from above or from the side, under different lighting conditions, or under partial occlusion.” [93]

However, 3D understanding has traditionally faced several impediments. The first concerns the problem of both ‘self and normal occlusion’ along with the numerous 3D shapes which fit a given 2D representation. Understanding problems are further compounded by the inability to map different images of the same structures to the same 3D space, and in the handling of the multi-modality of these representations [94]. Finally, ground-truth 3D datasets were traditionally quite expensive and difficult to obtain which, when coupled with divergent approaches for representing 3D structures, may have led to training limitations.
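The ambiguity mentioned above, where many 3D configurations fit one 2D image, can be illustrated with a minimal sketch of a pinhole camera. This is our own illustration rather than anything from the cited paper: it assumes an idealised camera at the origin looking along the +z axis, and the function name `project_point` and the focal length are hypothetical.

```python
def project_point(point_3d, focal_length=1.0):
    """Project a 3D point (x, y, z) in camera coordinates onto the image plane.

    Assumes an ideal pinhole camera at the origin looking along +z.
    """
    x, y, z = point_3d
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    return (focal_length * x / z, focal_length * y / z)


# Two different 3D points that lie on the same viewing ray...
near_point = (1.0, 0.5, 2.0)
far_point = (2.0, 1.0, 4.0)

# ...project to the same 2D location, so depth is lost in the image.
near_pixel = project_point(near_point)
far_pixel = project_point(far_point)
```

Both points land on the same image coordinates, which is why a single 2D observation cannot, on its own, pin down the 3D structure that produced it.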

We feel that the work being conducted in this space is important to be mindful of, from the embryonic, albeit titillating, early theoretical applications for future AGI systems and robotics to the immersive, captivating applications in augmented, virtual and mixed reality which will affect our societies in the near future. We cautiously predict exponential growth in this area of Computer Vision, as a result of lucrative commercial applications, which means that soon computers may start reasoning about the world rather than just about pixels.

3D Objects

This first section is a tad scattered, acting as a catch-all for computation applied to objects represented with 3D data, inference of 3D object shape from 2D images and Pose Estimation, i.e. determining the transformation of an object’s 3D pose from 2D images [95]. The process of reconstruction also creeps in ahead of the following section which deals with it explicitly. However, with these points in mind, we present the work which excited our team the most in this general area:

  • OctNet: Learning Deep 3D Representations at High Resolutions [96] continues the recent development of convolutional networks which operate on 3D data, or Voxels (which are like 3D pixels), using 3D convolutions. OctNet is ‘a novel 3D representation which makes deep learning with high-resolution inputs tractable’. The authors test OctNet representations by ‘analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.’ The paper’s central contribution is its exploitation of sparsity in 3D input data which then enables much more efficient use of memory and computation.
  • ObjectNet3D: A Large Scale Database for 3D Object Recognition [97] contributes a database for 3D object recognition, presenting 2D images and 3D shapes for 100 object categories. ‘Objects in the images in our database [taken from ImageNet] are aligned with the 3D shapes [taken from the ShapeNet repository], and the alignment provides both accurate 3D pose annotation and the closest 3D shape annotation for each 2D object.’ Baseline experiments are provided on region proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval.
  • 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction [98] creates a reconstruction of an object ‘in the form of a 3D occupancy grid using single or multiple images of object instance from arbitrary viewpoints.’ Mappings from images of objects to 3D shapes are learned using primarily synthetic data, and the network can train and test without requiring ‘any image annotations or object class labels’. The network comprises a 2D-CNN, a 3D Convolutional LSTM (an architecture newly created for the purpose) and a 3D Deconvolutional Neural Network. How these different components interact and are trained together end-to-end is a perfect illustration of the layering possible with Neural Networks.

Figure 11: Example of 3D-R2N2 functionality

Note: Images taken from eBay (left) and an overview of the functionality of 3D-R2N2 (right). Source: Choy et al. (2016, p. 3) [99]

Note from source: (a) Some sample images of the objects we [the authors] wish to reconstruct. Notice that views are separated by a large baseline and objects’ appearance shows little texture and/or are non-lambertian. (b) An overview of our proposed 3D-R2N2: The network takes a sequence of images (or just one image) from arbitrary (uncalibrated) viewpoints as input (in this example, 3 views of the armchair) and generates voxelized 3D reconstruction as an output. The reconstruction is incrementally refined as the network sees more views of the object.

3D-R2N2 generates ‘rendered images and voxelized models’ using ShapeNet models and facilitates 3D object reconstruction where structure from motion (SfM) and simultaneous localisation and mapping (SLAM) approaches typically fail:

“Our extensive experimental analysis shows that our reconstruction framework i) outperforms the state-of-the-art methods for single view reconstruction, and ii) enables the 3D reconstruction of objects in situations when traditional SFM/SLAM methods fail.”
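Both OctNet and 3D-R2N2 above work with voxel grids, so a small sketch may help show why sparsity in 3D data matters. The code below is not OctNet's octree structure. It is a simplified illustration of ours, with hypothetical function names, that voxelises points sampled on a sphere's surface and compares the number of occupied cells with the size of the dense grid.

```python
import math


def voxelise(points, resolution):
    """Map points in the unit cube [0, 1)^3 to a set of occupied voxel indices."""
    occupied = set()
    for x, y, z in points:
        index = (int(x * resolution), int(y * resolution), int(z * resolution))
        occupied.add(index)
    return occupied


def sphere_surface_points(samples=2000, radius=0.4, centre=0.5):
    """Sample roughly even points on a sphere surface using a golden-angle spiral."""
    points = []
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    for i in range(samples):
        z = 1.0 - 2.0 * (i + 0.5) / samples  # runs from just below +1 to just above -1
        ring_radius = math.sqrt(1.0 - z * z)
        theta = golden_angle * i
        points.append((
            centre + radius * ring_radius * math.cos(theta),
            centre + radius * ring_radius * math.sin(theta),
            centre + radius * z,
        ))
    return points


resolution = 32
occupied = voxelise(sphere_surface_points(), resolution)
dense_cells = resolution ** 3

# Fraction of the dense grid that actually contains surface samples.
occupancy_fraction = len(occupied) / dense_cells
```

Storing only the occupied indices (here a Python set) touches a small fraction of the cells a dense array would allocate. Representations that exploit sparsity rely on the same observation, although with far more careful data structures.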

  • 3D Shape Induction from 2D Views of Multiple Objects [100] uses “Projective Generative Adversarial Networks” (PrGANs), which train a deep generative model allowing accurate representation of 3D shapes, with the discriminator only being shown 2D images. The projection module captures the 3D representations and converts them to 2D images before passing to the discriminator. Through iterative training cycles the generator improves projections by improving the 3D voxel shapes it generates.

Figure 12: PrGAN architecture segment

Note from source: The PrGAN architecture for generating 2D images of shapes. A 3D voxel representation (32³) and viewpoint are independently generated from the input z (201-d vector). The projection module renders the voxel shape from a given viewpoint (θ, φ) to create an image. The discriminator consists of 2D convolutional and pooling layers and aims to classify if the input image is generated or real.
Source: Gadelha et al. (2016, p. 3) [101]

In this way the inference ability is learned through an unsupervised environment:

“The addition of a projection module allows us to infer the underlying 3D shape distribution without using any 3D, viewpoint information, or annotation during the learning phase.”

Additionally, the internal representation of the shapes can be interpolated, meaning discrete commonalities in voxel shapes allow transformations from object to object, e.g. from car to aeroplane.
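The idea of a projection module can be sketched in a few lines. This is a simplified illustration under our own assumptions and not the authors' implementation: the viewpoint is fixed to one grid axis (no rotation by θ, φ), the voxel grid is a nested Python list, and each pixel's intensity is taken to be 1 - exp(-s), where s is the summed occupancy along the line of sight. The function name `project_voxels` is hypothetical.

```python
import math


def project_voxels(voxels):
    """Project a D x H x W occupancy grid to an H x W image along the depth axis.

    Each pixel accumulates occupancy along its line of sight. The
    1 - exp(-sum) mapping keeps values in [0, 1) and varies smoothly
    with the voxel values.
    """
    depth = len(voxels)
    height = len(voxels[0])
    width = len(voxels[0][0])
    image = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            along_ray = sum(voxels[d][row][col] for d in range(depth))
            image[row][col] = 1.0 - math.exp(-along_ray)
    return image


# A 4x4x4 grid that is empty except for a solid 2x2x2 block in the middle.
size = 4
voxels = [[[1.0 if all(1 <= v <= 2 for v in (d, r, c)) else 0.0
            for c in range(size)]
           for r in range(size)]
          for d in range(size)]

silhouette = project_voxels(voxels)
```

The resulting image is a soft silhouette of the block. A discriminator that only ever sees such 2D images can still push a generator towards plausible 3D shapes, because the image depends on every voxel along each ray.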

  • Unsupervised Learning of 3D Structure from Images [102] presents a completely unsupervised, generative model which demonstrates ‘the feasibility of learning to infer 3D representations of the world’ for the first time. In a nutshell the DeepMind team present a model which “learns strong deep generative models of 3D structures, and recovers these structures from 3D and 2D images via probabilistic inference”, meaning that inputs can be both 3D and 2D.

DeepMind’s strong generative model runs on both volumetric and mesh-based representations. The use of mesh-based representations with OpenGL allows more knowledge to be built in, e.g. how light affects the scene and the materials used. “Using a 3D mesh-based representation and training with a fully-fledged black-box renderer in the loop enables learning of the interactions between an object’s colours, materials and textures, positions of lights, and of other objects.” [103]

The models are of high quality, capture uncertainty and are amenable to probabilistic inference, allowing for applications in 3D generation and simulation. The team achieve the first quantitative benchmark for 3D density modelling on 3D MNIST and ShapeNet. This approach demonstrates that models may be trained end-to-end unsupervised on 2D images, requiring no ground-truth 3D labels.

Human Pose Estimation and Keypoint Detection

Human Pose Estimation attempts to find the orientation and configuration of human body parts. 2D Human Pose Estimation, or Keypoint Detection, generally refers to localising body parts of humans, e.g. finding the 2D location of the knees, eyes, feet, etc.

However, 3D Pose Estimation takes this even further by finding the orientation of the body parts in 3D space, after which an optional step of shape estimation/modelling can be performed. There has been a tremendous amount of improvement across these sub-domains in the last few years.
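For the 2D case, many keypoint detectors (though not necessarily every method discussed here) predict one heatmap per body part and read each keypoint off as the location of the heatmap's peak. The sketch below shows only that final step. The heatmap values and part names are made up for illustration, and `heatmap_to_keypoint` is a hypothetical helper.

```python
def heatmap_to_keypoint(heatmap):
    """Return (row, col, score) of the highest-scoring cell in a 2D heatmap."""
    best = (0, 0, heatmap[0][0])
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > best[2]:
                best = (r, c, score)
    return best


# Hypothetical per-part heatmaps for a tiny 3x4 image.
heatmaps = {
    "left_knee": [[0.0, 0.1, 0.0, 0.0],
                  [0.1, 0.9, 0.2, 0.0],
                  [0.0, 0.2, 0.1, 0.0]],
    "right_knee": [[0.0, 0.0, 0.0, 0.1],
                   [0.0, 0.0, 0.3, 0.8],
                   [0.0, 0.0, 0.1, 0.2]],
}

# One (row, col, score) triple per body part.
keypoints = {part: heatmap_to_keypoint(h) for part, h in heatmaps.items()}
```

With several people in the image, a single peak per heatmap is no longer enough, which is part of what makes the multi-person setting harder.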

In terms of competitive evaluation, “the COCO 2016 Keypoint Challenge involves simultaneously detecting people and localizing their keypoints” [104]. The European Conference on Computer Vision (ECCV) [105] provides more extensive literature on these subjects; however, we would like to highlight:

  • Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields [106].

This method set SOTA performance on the inaugural MSCOCO 2016 keypoints challenge with 60% average precision (AP) and won the best demo award at ECCV. A video demonstration is available here: Video [107]

  • Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image [108]. This method first predicts 2D body joint locations and then uses another model called SMPL to create the 3D body shape mesh, which allows it to understand 3D aspects working from 2D pose estimation. The 3D mesh is capable of capturing both pose and shape, versus previous methods which could only find 2D human pose. The authors provide an excellent video analysis of their work here: Video [109]

“We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D” [110].


Reconstruction

As mentioned, a previous section presented some examples of reconstruction but with a general focus on objects, specifically their shape and pose. While some of this is technically reconstruction, the field itself comprises many different types of reconstruction, e.g. scene reconstruction, multi-view and single view reconstruction, structure from motion (SfM), SLAM, etc. Furthermore, some reconstruction approaches leverage additional (and multiple) sensors and equipment, such as Event or RGB-D cameras, and can often layer multiple techniques to drive progress.

The result? Whole scenes can be reconstructed non-rigidly and change spatio-temporally, e.g. a high-fidelity reconstruction of yourself, and your movements, updated in real-time.

As identified previously, issues persist around the mapping of 2D images to 3D space. The following papers present a plethora of approaches to create high-fidelity, real-time reconstructions:

  • Fusion4D: Real-time Performance Capture of Challenging Scenes [111] veers towards the domain of Computer Graphics, but the interplay between Computer Vision and Graphics cannot be overstated. The authors’ approach uses RGB-D and Segmentation as inputs to form a real-time, multi-view reconstruction which is output as voxels.

Figure 13: Fusion4D examples from real-time feed

Note from source: “We present a new method for real-time high quality 4D (i.e. spatio-temporally coherent) performance capture, allowing for incremental non-rigid reconstruction from noisy input from multiple RGBD cameras. Our system demonstrates unprecedented reconstructions of challenging non-rigid sequences, at real-time rates, including robust handling of large frame-to-frame motions and topology changes.”

Source: Dou et al. (2016, p. 1) [112]

Fusion4D creates real-time, high fidelity voxel representations which have impressive applications in virtual reality, augmented reality and telepresence. This work from Microsoft will likely revolutionise motion capture, possibly for live sports. An example of the technology in real-time use is available here: Video [113]

For an astounding example of telepresence/holoportation by Microsoft, see here: Video [114]

  • Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera [115] won best paper at the European Conference on Computer Vision (ECCV) in 2016. The authors propose a novel algorithm capable of tracking 6-DoF motion and producing various reconstructions in real-time using a single Event Camera.

Figure 14: Examples of the Real-Time 3D Reconstruction

Note from source: Demonstrations in various settings of the different aspects of our joint estimation algorithm. (a) visualisation of the input event stream; (b) estimated gradient keyframes; (c) reconstructed intensity keyframes with super resolution and high dynamic range properties; (d) estimated depth maps; (e) semi-dense 3D point clouds. Source: Kim et al. (2016, p. 12) [116]

The Event camera is gaining favour with researchers in Computer Vision due to its reduced latency, lower power consumption and higher dynamic range when compared to traditional cameras. Instead of a sequence of frames outputted by a regular camera, the event camera outputs “a stream of asynchronous spikes, each with pixel location, sign and precise timing, indicating when individual pixels record a threshold log intensity change.” [117]
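To make the quoted description concrete, the sketch below models an event as a small record (pixel location, sign, timestamp) and emits events wherever the log intensity changes by at least a threshold. It is a toy simulation of ours: a real event camera fires asynchronously per pixel rather than by comparing two frames, and the `Event` class, the `events_between` function and the threshold value are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Event:
    x: int
    y: int
    polarity: int      # +1 if the pixel got brighter, -1 if it got darker
    timestamp: float


def events_between(prev_frame, next_frame, timestamp, threshold=0.2):
    """Emit an event for each pixel whose log intensity changed by >= threshold.

    Frames are 2D lists of strictly positive intensities. Comparing two
    frames is only a stand-in for the sensor's per-pixel behaviour.
    """
    events = []
    for y, (prev_row, next_row) in enumerate(zip(prev_frame, next_frame)):
        for x, (before, after) in enumerate(zip(prev_row, next_row)):
            change = math.log(after) - math.log(before)
            if abs(change) >= threshold:
                events.append(Event(x, y, 1 if change > 0 else -1, timestamp))
    return events


prev_frame = [[1.0, 1.0, 1.0],
              [1.0, 1.0, 1.0]]
next_frame = [[1.0, 2.0, 1.0],
              [0.5, 1.0, 1.05]]

# Only the pixels with a large enough log-intensity change produce events.
events = events_between(prev_frame, next_frame, timestamp=0.001)
```

Pixels that do not change produce no output at all, which gives some intuition for the reduced latency and power consumption mentioned above.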

For an explanation of event camera functionality, real-time 3D reconstruction and 6-DoF tracking, see the paper’s accompanying video here: Video [118]

This approach is incredibly impressive when one considers the real-time image rendering and depth estimation involved using a single view-point:

“We propose a method which can perform real-time 3D reconstruction from a single hand-held event camera with no additional sensing, and works in unstructured scenes of which it has no prior knowledge.”

  • Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue [119] proposes an unsupervised method for training a deep CNN for single view depth prediction with results comparable to SOTA using supervised methods. Traditional deep CNN approaches for single view depth prediction require large amounts of manually labelled data; unsupervised methods again demonstrate their value by removing this necessity. The authors achieve this “by training the network in a manner analogous to an autoencoder”, using a stereo rig.
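The autoencoder analogy can be sketched on a single image row. The assumptions here are ours and simplify the real method considerably: a rectified stereo pair, integer disparities, and a pixel at column x in the left image appearing at column x - d in the right image. A predicted disparity is used to rebuild the left row from the right row, and the reconstruction error plays the role of the training signal. Depth then follows from the standard stereo relation depth = focal length × baseline / disparity. All function names are hypothetical.

```python
def reconstruct_left_row(right_row, disparities):
    """Rebuild a left-image row by sampling the right row at x - disparity."""
    width = len(right_row)
    reconstructed = []
    for x, d in enumerate(disparities):
        source = min(max(x - d, 0), width - 1)  # clamp at the image border
        reconstructed.append(right_row[source])
    return reconstructed


def photometric_loss(left_row, reconstructed_row):
    """Mean squared difference between the real and reconstructed rows."""
    squared = [(a - b) ** 2 for a, b in zip(left_row, reconstructed_row)]
    return sum(squared) / len(squared)


def depth_from_disparity(disparity, focal_length, baseline):
    """Standard rectified-stereo relation: depth = f * B / d."""
    return focal_length * baseline / disparity


# Toy rectified pair: the left row is the right row shifted by 2 pixels.
right_row = [10, 20, 30, 40, 50, 60]
left_row = [10, 10, 10, 20, 30, 40]

# A correct disparity guess reconstructs the left row; a wrong one does not.
good_loss = photometric_loss(left_row, reconstruct_left_row(right_row, [2] * 6))
bad_loss = photometric_loss(left_row, reconstruct_left_row(right_row, [0] * 6))
```

No depth labels appear anywhere in the loss. The only supervision is how well one view explains the other, which is what removes the need for manually labelled data.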

Other uncategorised 3D

  • IM2CAD [120] describes the process of transferring an ‘image to CAD model’, CAD meaning computer-aided design, which is a prominent method used to create 3D scenes for architectural depictions, engineering, product design and many other fields.

“Given a single photo of a room and a large database of furniture CAD models, our goal is to reconstruct a scene that is as similar as possible to the scene depicted in the photograph, and composed of objects drawn from the database.”

The authors present an automatic system which ‘iteratively optimizes object placements and scales’ to best match input from real images. The rendered scenes are validated against the original images using metrics trained with deep CNNs.

Figure 15: Example of IM2CAD rendering bedroom scene

Note: Left: input image. Right: Automatically created CAD model from input.
Note from source: The reconstruction results. In each example the left image is the real input image and the right image is the rendered 3D CAD model produced by IM2CAD.
Source: Izadinia et al. (2016, p. 10) [121]

Why care about IM2CAD?
The issue tackled by the authors is one of the first meaningful advancements on the techniques demonstrated by Lawrence Roberts in 1963, which allowed inference of a 3D scene from a photo using a known-object database, albeit in the very simple case of line drawings.

“While Robert’s method was visionary, more than a half century of subsequent research in Computer Vision has still not yet led to practical extensions of his approach that work reliably on realistic images and scenes.”

The authors introduce a variant of the problem, aiming to reconstruct a high fidelity scene from a photo using ‘objects taken from a database of 3D object models’ for reconstruction.

The process behind IM2CAD is quite involved and includes:

  • A Fully Convolutional Network that is trained end-to-end to find Geometric Features for Room Geometry Estimation.
  • Faster R-CNN for Object Detection.
  • After finding the objects within the image, CAD Model Alignment is completed to find the closest models within the ShapeNet repository for the detected objects. For example, the type of chair, given shape and approximate 3D pose. Each 3D model is rendered to 32 viewpoints which are then compared with the bounding box generated in object detection using deep features [122].
  • Object Placement in the Scene.
  • Finally, Scene Optimization further refines the placement of the objects by optimizing the visual similarity between the camera views of the rendered scene and input image.
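The CAD Model Alignment step in the list above can be pictured as a nearest-neighbour search over rendered views. The sketch below uses tiny hand-written vectors in place of real deep features and uses cosine similarity as the comparison. Both are assumptions made for illustration, not details confirmed by the paper, and the model names are invented.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def best_render(detection_feature, render_features):
    """Return the (model, viewpoint) key whose feature best matches the detection."""
    return max(
        render_features,
        key=lambda key: cosine_similarity(detection_feature, render_features[key]),
    )


# Hypothetical features: keys are (model name, viewpoint index).
render_features = {
    ("office_chair", 0): [0.9, 0.1, 0.0],
    ("office_chair", 1): [0.7, 0.7, 0.0],
    ("armchair", 0): [0.1, 0.9, 0.2],
    ("armchair", 1): [0.0, 0.5, 0.9],
}

# Feature of the detected object's bounding box (also made up).
detection_feature = [0.1, 0.6, 0.8]

model, viewpoint = best_render(detection_feature, render_features)
```

Because each key carries both a model and a viewpoint, the best match yields a coarse shape choice and an approximate pose in one lookup, which later stages can then refine.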

Again in this domain, ShapeNet proves invaluable:

“First, we leverage ShapeNet, which contains millions of 3D models of objects, including thousands of different chairs, tables, and other household items. This dataset is a game changer for 3D scene understanding research, and was key to enabling our work.”

  • Learning Motion Patterns in Videos [123] proposes to solve the issue of determining object motion independent of camera movement using synthetic video sequences to teach the networks. “The core of our approach is a fully convolutional network, which is learnt entirely from synthetic video sequences, and their ground-truth optical flow and motion segmentation.” The authors test their approach on the new moving object segmentation dataset called DAVIS [124], as well as the Berkeley motion segmentation dataset, and achieve SOTA on both.
  • Deep Image Homography Estimation [125] comes from the Magic Leap team, a secretive US startup working in Computer Vision and Mixed Reality. The authors reclassify the task of homography estimation as ‘a learning problem’ and present two deep CNN architectures which form “HomographyNet: a regression network which directly estimates the real-valued homography parameters, and a classification network which produces a distribution over quantized homographies.”

The term homography comes from projective geometry and refers to a type of transformation that maps one plane to another. ‘Estimating a 2D homography from a pair of images is a fundamental task in computer vision, and an essential part of monocular SLAM systems’.

The authors also provide a method for producing a “seemingly infinite dataset” from existing datasets of real images such as MS-COCO, which offsets some of the data requirements of deeper networks. They manage to create “a nearly unlimited number of labeled training examples by applying random projective transformations to a large image dataset”.
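Two pieces of this can be sketched without any image data. The first function maps a point through a 3×3 homography using homogeneous coordinates. The second perturbs the four corners of a patch at random and keeps the offsets as a regression label. The corner-offset labelling, the patch size and the shift range are assumptions chosen for illustration. Solving for the full homography from the four correspondences and warping the image are deliberately left out, and the function names are hypothetical.

```python
import random


def apply_homography(h, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xh / w, yh / w)


def random_corner_offsets(corners, max_shift, rng):
    """Perturb four patch corners; the offsets serve as a regression label."""
    offsets = [(rng.uniform(-max_shift, max_shift),
                rng.uniform(-max_shift, max_shift)) for _ in corners]
    moved = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(corners, offsets)]
    return moved, offsets


# A translation by (5, -2) is a simple special case of a homography.
translation = [[1.0, 0.0, 5.0],
               [0.0, 1.0, -2.0],
               [0.0, 0.0, 1.0]]
moved_point = apply_homography(translation, (10.0, 10.0))

# One synthetic label from a 128x128 patch (sizes are illustrative).
rng = random.Random(0)
patch_corners = [(0, 0), (127, 0), (127, 127), (0, 127)]
moved_corners, label = random_corner_offsets(patch_corners, max_shift=32, rng=rng)
```

Every call with a fresh random state yields a new labelled example from the same source image, which is how a fixed set of real photographs can stand in for a much larger training set.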

  • gvnn: Neural Network Library for Geometric Computer Vision [126] introduces a new neural network library for Torch, a popular computing framework for machine learning. Gvnn aims to ‘bridge the gap between classic geometric computer vision and deep learning’. The gvnn library allows developers to add geometric capabilities to their existing networks and training methods.

“In this work, we build upon the 2D transformation layers originally proposed in the spatial transformer networks and provide various novel extensions that perform geometric transformations which are often used in geometric computer vision.”

“This opens up applications in learning invariance to 3D geometric transformation for place recognition, end-to-end visual odometry, depth estimation and unsupervised learning through warping with a parametric transformation for image reconstruction error.”
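The phrase "warping with a parametric transformation" can be unpacked with a small geometric sketch. It does not use gvnn or reproduce its API. It is plain Python with hypothetical function names and made-up camera intrinsics, and it assumes a pinhole camera and a motion limited to a rotation about the vertical axis plus a translation. A pixel with known depth is back-projected to 3D, moved rigidly, and projected again.

```python
import math


def backproject(pixel, depth, focal_length, principal_point):
    """Lift a pixel with known depth to a 3D point in camera coordinates."""
    u, v = pixel
    cx, cy = principal_point
    return ((u - cx) * depth / focal_length,
            (v - cy) * depth / focal_length,
            depth)


def rigid_transform(point, yaw, translation):
    """Rotate about the camera's y axis by yaw (radians), then translate."""
    x, y, z = point
    tx, ty, tz = translation
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * x + s * z + tx, y + ty, -s * x + c * z + tz)


def project(point, focal_length, principal_point):
    """Pinhole projection of a 3D point back to pixel coordinates."""
    x, y, z = point
    cx, cy = principal_point
    return (focal_length * x / z + cx, focal_length * y / z + cy)


def warp_pixel(pixel, depth, yaw, translation,
               focal_length=500.0, principal_point=(320.0, 240.0)):
    """Where does this pixel land after the given rigid motion?"""
    point = backproject(pixel, depth, focal_length, principal_point)
    moved = rigid_transform(point, yaw, translation)
    return project(moved, focal_length, principal_point)


# No motion leaves the pixel where it was (up to rounding).
unchanged = warp_pixel((100.0, 50.0), depth=2.0, yaw=0.0,
                       translation=(0.0, 0.0, 0.0))

# A sideways shift of 0.1 units moves the image centre horizontally.
shifted = warp_pixel((320.0, 240.0), depth=2.0, yaw=0.0,
                     translation=(0.1, 0.0, 0.0))
```

Each step here is a smooth function of depth and motion. That smoothness is the property that allows operations of this kind to sit inside a network as layers, with an image reconstruction error driving the learning.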

3D summation and SLAM

Throughout this section we cut a swath across the field of 3D understanding, focusing primarily on the areas of Pose Estimation, Reconstruction, Depth Estimation and Homography. But there is considerably more superb work which will go unmentioned by us, constrained as we are by volume. And so, we hope to have provided the reader with a valuable starting point, though by no means an exhaustive one.

A large portion of the highlighted work may be classified under Geometric Vision, which generally deals with measuring real-world quantities like distances, shapes, areas and volumes directly from images. Our heuristic is that recognition-based tasks focus more on higher level semantic information than typically concerns applications in Geometric Vision. However, we often find that many of these different areas of 3D understanding are inextricably linked.

One of the largest Geometric problems is that of simultaneous localisation and mapping (SLAM), with researchers considering whether SLAM will be among the next problems tackled by Deep Learning. Skeptics of the so-called ‘universality’ of deep learning, of which there are many, point to the importance and functionality of SLAM as an algorithm:

“Visual SLAM algorithms are able to simultaneously build 3D maps of the world while tracking the location and orientation of the camera.” [127] The geometric estimation portion of the SLAM approach is not currently suited to deep learning approaches and end-to-end learning remains unlikely. SLAM represents one of the most important algorithms in robotics and was designed with large input from the Computer Vision field. The technique has found its home in applications like Google Maps, autonomous vehicles, AR devices like Google Tango [128] and even the Mars Rover.

That being said, Tomasz Malisiewicz delivers the anecdotal aggregate opinion of some prominent researchers on the issue, who agree “that semantics are necessary to build bigger and better SLAM systems.” [129] This potentially shows promise for future applications of Deep Learning in the SLAM domain.

We reached out to Mark Cummins, co-founder of Plink and Pointy, who provided us with his thoughts on the issue. Mark completed his PhD on SLAM techniques at Oxford University:

“The core geometric estimation part of SLAM is pretty well solved by the current approaches, but the high-level semantics and the lower-level system components can all benefit from deep learning. In particular:
“Deep learning can greatly improve the quality of map semantics, i.e. going beyond poses or point clouds to a full understanding of the different kind of objects or regions in the map. This is much more powerful for many applications, and can also help with general robustness (for example through better handling of dynamic objects and environmental changes).
“At a lower level, many components can likely be improved via deep learning. Obvious candidates are place recognition / loop closure detection / relocalization, better point descriptors for sparse SLAM methods, etc.
“Overall the structure of SLAM solvers probably remains the same, but the components improve. It is possible to imagine doing something radically new with deep learning, like throwing away the geometry entirely and have a more recognition-based navigation system. But for systems where the goal is a precise geometric map, deep learning in SLAM is likely more about improving components than doing something completely new.”

In summation, we believe that SLAM is not likely to be completely replaced by Deep Learning. However, it is entirely likely that the two approaches may become complements to each other going forward. If you wish to learn more about SLAM, and its current SOTA, we wholeheartedly recommend Tomasz Malisiewicz’s blog for that task: The Future of Real-Time SLAM and Deep Learning vs SLAM [130]

Follow our profile on Medium for the next instalment, Part 4 of 4: ConvNet Architectures, Datasets, Ungroupable Extras.
Please feel free to place all feedback and suggestions in the comments section and we’ll revert as soon as we can. Alternatively, you can contact us directly through: info@themtank.com

The full piece is available at: www.themtank.org/a-year-in-computer-vision

Many thanks,

The M Tank

References in order of appearance

[92] Rezende et al. 2016. Unsupervised Learning of 3D Structure from Images. [Online] arXiv: 1607.00662. Available: arXiv:1607.00662v1

[93] ibid

[94] ibid

[95] Pose Estimation can refer to either just an object’s orientation, or both orientation and position in 3D space.

[96] Riegler et al. 2016. OctNet: Learning Deep 3D Representations at High Resolutions. [Online] arXiv: 1611.05009. Available: arXiv:1611.05009v3

[97] Xiang et al. 2016. ObjectNet3D: A Large Scale Database for 3D Object Recognition. [Online] Computer Vision and Geometry Lab, Stanford University (cvgl.stanford.edu). Available from: http://cvgl.stanford.edu/projects/objectnet3d/

[98] Choy et al. 2016. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. [Online] arXiv: 1604.00449. Available: arXiv:1604.00449v1

[99] ibid

[100] Gadelha et al. 2016. 3D Shape Induction from 2D Views of Multiple Objects. [Online] arXiv: 1612.05872. Available: arXiv:1612.05872v1

[101] ibid

[102] Rezende et al. 2016. Unsupervised Learning of 3D Structure from Images. [Online] arXiv: 1607.00662. Available: arXiv:1607.00662v1

[103] Colyer, A. 2017. Unsupervised learning of 3D structure from images. [Blog] the morning paper. Available: https://blog.acolyer.org/2017/01/05/unsupervised-learning-of-3d-structure-from-images/ [Accessed: 04/03/2017].

[104] COCO. 2016. Welcome to the COCO 2016 Keypoint Challenge! [Online] Common Objects in Context (mscoco.org). Available: http://mscoco.org/dataset/#keypoints-challenge2016 [Accessed: 27/01/2017].

[105] ECCV. 2016. Webpage. [Online] European Conference on Computer Vision (www.eccv2016.org). Available: http://www.eccv2016.org/main-conference/ [Accessed: 26/01/2017].

[106] Cao et al. 2016. Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields. [Online] arXiv: 1611.08050. Available: arXiv:1611.08050v1

[107] Zhe Cao. 2016. Realtime Multi-Person 2D Human Pose Estimation using Part Affinity Fields, CVPR 2017 Oral. [Online] YouTube.com. Available: https://www.youtube.com/watch?v=pW6nZXeWlGM [Accessed: 04/03/2017].

[108] Bogo et al. 2016. Keep it SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image. [Online] arXiv: 1607.08128. Available: arXiv:1607.08128v1

[109] Michael Black. 2016. SMPLify: 3D Human Pose and Shape from a Single Image (ECCV 2016). [Online] YouTube.com. Available: https://www.youtube.com/watch?v=eUnZ2rjxGaE [Accessed: 04/03/2017].

[110] ibid

[111] Dou et al. 2016. Fusion4D: Real-time Performance Capture of Challenging Scenes. [Online] SamehKhamis.com. Available: http://www.samehkhamis.com/dou-siggraph2016.pdf

[112] ibid

[113] Microsoft Research. 2016. Fusion4D: Real-time Performance Capture of Challenging Scenes. [Online] YouTube.com. Available: https://www.youtube.com/watch?v=2dkcJ1YhYw4&feature=youtu.be [Accessed: 04/03/2017].

[114] I3D Past Projects. 2016. holoportation: virtual 3D teleportation in real-time (Microsoft Research). [Online] YouTube.com. Available: https://www.youtube.com/watch?v=7d59O6cfaM0&feature=youtu.be [Accessed: 03/03/2017].

[115] Kim et al. 2016. Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera. [Online] Department of Computing, Imperial College London (www.doc.ic.ac.uk). Available: https://www.doc.ic.ac.uk/~ajd/Publications/kim_etal_eccv2016.pdf

[116] ibid

[117] Kim et al. 2014. Simultaneous Mosaicing and Tracking with an Event Camera. [Online] Department of Computing, Imperial College London (www.doc.ic.ac.uk). Available: https://www.doc.ic.ac.uk/~ajd/Publications/kim_etal_bmvc2014.pdf

[118] Hanme Kim. 2017. Real-Time 3D Reconstruction and 6-DoF Tracking with an Event Camera. [Online] YouTube.com. Available: https://www.youtube.com/watch?v=yHLyhdMSw7w [Accessed: 03/03/2017].

[119] Garg et al. 2016. Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue. [Online] arXiv: 1603.04992. Available: arXiv:1603.04992v2

[120] Izadinia et al. 2016. IM2CAD. [Online] arXiv: 1608.05137. Available: arXiv:1608.05137v1

[121] ibid

[122] Yet more neural network spillover

[123] Tokmakov et al. 2016. Learning Motion Patterns in Videos. [Online] arXiv: 1612.07217. Available: arXiv:1612.07217v1

[124] DAVIS. 2017. DAVIS: Densely Annotated Video Segmentation. [Website] DAVIS Challenge. Available: http://davischallenge.org/ [Accessed: 27/03/2017].

[125] DeTone et al. 2016. Deep Image Homography Estimation. [Online] arXiv: 1606.03798. Available: arXiv:1606.03798v1

[126] Handa et al. 2016. gvnn: Neural Network Library for Geometric Computer Vision. [Online] arXiv: 1607.07405. Available: arXiv:1607.07405v3

[127] Malisiewicz. 2016. The Future of Real-Time SLAM and Deep Learning vs SLAM. [Blog] Tombone’s Computer Vision Blog. Available: http://www.computervisionblog.com/2016/01/why-slam-matters-future-of-real-time.html [Accessed: 01/03/2017].

[128] Google. 2017. Tango. [Website] get.google.com. Available: https://get.google.com/tango/ [Accessed: 23/03/2017].

[129] ibid

[130] Malisiewicz. 2016. The Future of Real-Time SLAM and Deep Learning vs SLAM. [Blog] Tombone’s Computer Vision Blog. Available: http://www.computervisionblog.com/2016/01/why-slam-matters-future-of-real-time.html [Accessed: 01/03/2017].