Computer Vision Part 2: An Overview

Ilias Mansouri
9 min read · Feb 26, 2019


In Part 1, we made a juicy case as to why Computer Vision (CV) has reached a level of maturity where it can be exploited across different industries. The mission of this post, which I chose to accept, is to give you a broad overview of the different "families" within the CV discipline and their respective uses. Although some of those categories have quite daunting names, we will discover that, thanks to technological advancements, a lot of this vast CV landscape is already ubiquitous in our everyday world, while other techniques are within reach of mass adoption. One final thing to note: the CV methods are discussed in roughly decreasing order of maturity.

Feature Detection & Matching

As the name implies, feature detection is the detection of regions or points of interest in an image. Once those features are detected, they can be matched against another set of features. As we will see later on, feature detection and matching play an important role in the early steps of many applications' algorithms.

Application 1:

Snapchat's selfie filters detect regions of interest on your face. Afterwards, a model of an object (the filter) is aligned with the selfie's pixels, which is basically the feature matching part. By interpolating nearby key-points, the algorithm can correct errors it made when mapping the filter onto your face. Finally, your personal filter shape is adjusted and creates a mesh (a 3D model that can shift and scale with your face). This technique is called Active Shape Modelling and was developed in 1995.

Application 2:

Image alignment is another popular use case, in which we have multiple images and want to merge them into one big panoramic picture.

Here, we can appreciate the non-trivial nature of feature selection. The selected features should offer the following:

  • geometric invariance: robustness to translations, rotations and changes in scale
  • photometric invariance: robustness to changes in brightness and exposure
  • distinctiveness

Those 3 requirements are also found in the first application. Again, once those feature points are found, we can match the multiple images and stitch them into one picture.
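To make this concrete, here is a minimal sketch of the detection and matching step using OpenCV's ORB detector; "left.jpg" and "right.jpg" are placeholder filenames for two overlapping photos.

```python
# A minimal sketch of feature detection and matching with OpenCV's ORB.
import cv2

img1 = cv2.imread("left.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("right.jpg", cv2.IMREAD_GRAYSCALE)

# Detect keypoints and compute binary descriptors for both images.
orb = cv2.ORB_create(nfeatures=1000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Brute-force matching with Hamming distance (appropriate for ORB's binary
# descriptors); cross-checking keeps only mutual best matches.
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)

# Visualize the 50 strongest correspondences.
vis = cv2.drawMatches(img1, kp1, img2, kp2, matches[:50], None)
cv2.imwrite("matches.jpg", vis)
```

ORB is just one choice of detector/descriptor; the same two-step pattern (detect, then match) holds for SIFT, SURF and friends.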

Recognition/Detection

This is perhaps the category that leans most heavily on Deep Learning and its recent advancements. It can be broadly split into the following list, by decreasing maturity:

  • face detection/recognition
  • image classification
  • object detection
  • context and scene understanding

Many challenges are associated with recognition and detection in images, including:

  • viewpoint variation
  • scale variation
  • intra-class variation
  • image deformation
  • image occlusion
  • illumination
  • background clutter

Let us assume that we all know what face detection/recognition is and immediately jump to image classification.

Image Classification

Image classification solves the following problem: given a set of images that are all labeled with a single category, how can we predict these categories for an unseen set of images with high accuracy? This is not an easy task, and despite the aforementioned challenges, image classification is widely used. All kinds of online stores use this technique to automatically categorize their products, while Airbnb has an algorithm which classifies the listing photos of its rental offerings. In public locations, such as markets or airports, congestion can be detected and prevented.
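As a taste of how little code a basic classifier takes nowadays, here is a hedged sketch using a ResNet-18 pretrained on ImageNet from torchvision; "product.jpg" is a placeholder for any photo you want to categorize.

```python
# A minimal sketch of image classification with a pretrained ResNet-18.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet18(pretrained=True).eval()

# Standard ImageNet preprocessing: resize, crop, convert, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

img = preprocess(Image.open("product.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    logits = model(img)
print("predicted ImageNet class index:", logits.argmax(1).item())
```

In practice you would fine-tune such a network on your own categories (products, rooms, crowd levels) rather than use the raw ImageNet classes.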

Although image classification has a proven record in terms of performance, sometimes applications need a finer granularity in terms of detection within an image.

Object Detection

Object detection addresses the need to identify multiple objects within the same image. Furthermore, those multiple objects can be of different categories and are typically identified using bounding boxes.


Object detection can be found in autonomous vehicles to detect pedestrians, traffic signs and other vehicles, but it is also present in the manufacturing industry to detect products and defects. Furthermore, it plays a big role in security by detecting anomalies or suspicious packages, and can also be used for tracking purposes. Tracking itself has many applications within the sports industry and in healthcare, where computers can help patients rehabilitate or follow a cell's activity.
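A minimal sketch of off-the-shelf object detection, using torchvision's pretrained Faster R-CNN; "street.jpg" is a placeholder filename and 0.8 an arbitrary confidence threshold.

```python
# A sketch of object detection: the model returns, per image, bounding
# boxes, class labels and confidence scores.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import fasterrcnn_resnet50_fpn

model = fasterrcnn_resnet50_fpn(pretrained=True).eval()

img = transforms.ToTensor()(Image.open("street.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([img])[0]

for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
    if score > 0.8:  # keep only confident detections
        print(label.item(), round(score.item(), 2), box.tolist())
```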

Context and scene understanding

Image classification tries to categorize an image, while an object detector attempts to detect objects within an image, but both methods share a common lack of understanding about the image as a whole. Put differently, neither algorithm takes context into account, which can be a severe limitation in human-computer interaction systems. Below is a simple example:

The circled objects have the same shape and look roughly identical, no?

We hoomans understand that in the left image the couple doesn't have a mini-car for dessert and a pedestrian as a drink, while in the right one there is no bottle waiting for the dish to pass by so it can cross safely… Right? Unfortunately (or, some would say, perhaps for the better) computers still don't have that kind of intelligence to truly understand a scene. Training such models requires really big amounts of data, but improvements are being made, mainly by companies where image search plays an important role, such as Facebook with Rosetta and Google with Attend, Infer, Repeat.

Segmentation

With image segmentation, we can have a much more granular detection of objects in our images compared to the object detector's bounding boxes. With segmentation, each and every pixel is classified.

In the upper half of the image, we see what an image classifier and an object detector, respectively, would predict. Below, we find the outputs of semantic and instance segmentation. Semantic segmentation assigns each pixel to a class but does not distinguish multiple occurrences within the same class (multiple sheep, in this case), whereas instance segmentation makes this differentiation and identifies unique occurrences within a category.
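As an illustration, here is a sketch of instance segmentation with torchvision's pretrained Mask R-CNN, which outputs a per-pixel mask for every detected instance ("sheep.jpg" is a placeholder filename).

```python
# A sketch of instance segmentation: each detected instance gets its own
# per-pixel mask, so two sheep end up in two separate masks.
import torch
from PIL import Image
from torchvision import transforms
from torchvision.models.detection import maskrcnn_resnet50_fpn

model = maskrcnn_resnet50_fpn(pretrained=True).eval()

img = transforms.ToTensor()(Image.open("sheep.jpg").convert("RGB"))
with torch.no_grad():
    pred = model([img])[0]

# Each mask is a soft per-pixel score; threshold it to get binary masks.
masks = (pred["masks"] > 0.5).squeeze(1)
print("instances found:", masks.shape[0])
```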

Image segmentation is increasingly used in:

  • detection of tumors & pathologies
  • pedestrian & brake light detection in autonomous vehicles
  • satellite image recognition
  • astronomy
  • manufacturing

Feature-based alignment

As explained earlier, extracting features from images is one of the first steps, and the next stage in many vision algorithms is to match these features across different images. An important component of this matching is to verify whether the set of matching features is geometrically consistent, e.g. whether the feature displacements can be described by a simple 2D or 3D geometric transformation. The computed motions can then be used in other applications such as image stitching, selfie filters, augmented reality, etc.
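The standard tool for this geometric verification is RANSAC. A sketch, reusing the kp1, kp2 and matches variables from the ORB example earlier:

```python
# Fit a 2D homography to the matched keypoints with RANSAC and keep only
# the geometrically consistent matches (the inliers).
import cv2
import numpy as np

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# RANSAC rejects matches that do not agree with a single 2D transformation;
# 5.0 is the reprojection-error threshold in pixels.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
inliers = [m for m, ok in zip(matches, inlier_mask.ravel()) if ok]
print(f"{len(inliers)} of {len(matches)} matches are geometrically consistent")
```

The recovered homography H is exactly the "computed motion" that a panorama stitcher would then use to warp one image onto the other.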

We will now discuss a particular application of feature-based alignment which is increasingly popular.

Pose Estimation

As the name suggests, Pose Estimation tries to estimate an object’s 3D pose from a set of 2D point projections.

In the movie industry, this is already being used for character animation, which can happen in real time. Again, for autonomous vehicles, this can be used to detect the alertness of a driver. Also, in healthcare, we can detect postural issues such as scoliosis, and in farming it is used to detect and prevent disease outbreaks.
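At its core, pose estimation from known 2D-3D correspondences is the Perspective-n-Point problem, which OpenCV solves directly. A sketch with made-up point coordinates and an assumed camera matrix:

```python
# Given a few known 3D points on an object and their 2D projections in the
# image, recover the object's rotation and translation with solvePnP.
import cv2
import numpy as np

object_points = np.array([  # 3D model points, e.g. corners of a box (in cm)
    [0, 0, 0], [10, 0, 0], [10, 10, 0], [0, 10, 0],
    [0, 0, 10], [10, 0, 10]], dtype=np.float64)
image_points = np.array([   # their detected 2D locations (in pixels)
    [320, 240], [420, 245], [415, 345], [315, 340],
    [330, 150], [430, 155]], dtype=np.float64)

fx = fy = 800.0             # assumed focal length in pixels
camera_matrix = np.array([[fx, 0, 320], [0, fy, 240], [0, 0, 1]])

ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, None)
print("rotation (Rodrigues vector):", rvec.ravel())
print("translation:", tvec.ravel())
```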

Structure from motion

What if we could recreate a 3D model from a video source? This problem is known as structure from motion.

Factorization

This method, introduced in 1992 as a novel technique to recreate a 3D model from video, detects key features, locks onto those features as the motion happens (creating a stream of feature tracks), and from this motion stream recreates a 3D model. Below we see a simple example:

The features are the dots on our rotating ball: we lock onto the dots and analyse their flow, and from this we can recreate a model.
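The classic formulation here is Tomasi-Kanade factorization. A toy numpy sketch of the core linear-algebra step; the measurement matrix W would come from a real feature tracker, and the final metric upgrade is omitted:

```python
# Stack the tracked 2D coordinates of F frames into a 2F x P measurement
# matrix, center it, and recover (up to an affine ambiguity) camera motion
# and 3D structure from a rank-3 SVD.
import numpy as np

F, P = 12, 50                      # frames and tracked points (made-up sizes)
W = np.random.rand(2 * F, P)       # stand-in for real tracked (x, y) tracks

W_centered = W - W.mean(axis=1, keepdims=True)  # remove per-frame centroid

U, S, Vt = np.linalg.svd(W_centered, full_matrices=False)
motion = U[:, :3] * np.sqrt(S[:3])            # 2F x 3 camera/motion matrix
structure = np.sqrt(S[:3])[:, None] * Vt[:3]  # 3 x P recovered 3D points
print("recovered structure shape:", structure.shape)
```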

What if you want to reconstruct something bigger, let's say the Great Wall of China? Apparently, it is quite hard to move it around. In this case, you fly over or along it, capture a video stream, and reconstruct a 3D model from that. Indeed, in lots of Augmented Reality applications, factorization is used to virtualize real-life objects, from museum galleries and Google Maps to internet photos and even YouTube.

Dense motion estimation

This is arguably the oldest commonly used technique, but also perhaps the least known. It finds its applications in video compression, stabilisation and video summarization.

If we take the above sequence of images, we can intuitively understand that some parts within the image stay the same along the sequence. As such, information from one frame can be reused across multiple frames, thus reducing the video size. Furthermore, if noise or other artefacts are present, we can average or borrow information from adjacent frames and denoise our video. The list of methodologies for dense motion estimation is vast, and while the concept is easy to understand, many techniques require quite a bit of knowledge across a broad array of topics. Interest here is driven partly by the humongous amount of video being consumed, which according to Cisco will make up 80% of total internet traffic.
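For a feel of what "dense" means here, a sketch of per-pixel motion estimation with OpenCV's Farneback optical flow, using two placeholder frames:

```python
# Dense optical flow: a per-pixel (dx, dy) displacement between two
# consecutive frames.
import cv2

prev = cv2.imread("frame0.jpg", cv2.IMREAD_GRAYSCALE)
curr = cv2.imread("frame1.jpg", cv2.IMREAD_GRAYSCALE)

flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)

# flow[y, x] holds the estimated motion of pixel (x, y); regions with
# near-zero flow are the ones a video codec can cheaply reuse across frames.
print("mean motion magnitude:", cv2.norm(flow) / flow.size)
```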

Computational photography

In a sense, everything discussed in this chapter can be seen as computational photography, but here we will discuss old concepts which recently attained new levels of photographic performance thanks to Deep Learning techniques such as CNNs and GANs. If you don't know what CNNs or GANs are, that's totally ok; we will see this in another chapter, so until then, consider this an appetizer of what's to come.

Super Resolution

Super-resolution is the creation of images with higher spatial resolution and less noise than regular camera images. Before deep learning, such high-resolution composites were produced by aligning and combining several input images. Another popular method was to upscale the image and interpolate the pixel values. Then researchers at Twitter came up with a GAN model named Super Resolution GAN (SRGAN), presented as the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.
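For reference, the classical baseline that SRGAN-style models are trained to beat is plain interpolation. A minimal sketch of 4× bicubic upscaling ("low_res.jpg" is a placeholder filename):

```python
# The pre-deep-learning baseline: upscale by interpolating pixel values.
import cv2

lr = cv2.imread("low_res.jpg")
h, w = lr.shape[:2]
sr = cv2.resize(lr, (4 * w, 4 * h), interpolation=cv2.INTER_CUBIC)
cv2.imwrite("upscaled_4x.jpg", sr)
```

Bicubic output looks smooth but blurry; the whole point of SRGAN was to hallucinate plausible high-frequency detail that interpolation cannot recover.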

Super Resolution was so hot that even the Google Brain team came up with their own model:

Left: the model's input. Middle: the model's prediction. Right: ground truth.

Super Resolution now finds its way into satellite image processing, healthcare, microscopy and astronomy.

Colorization

This one is quite self-explanatory and is also getting increasingly robust. How does it work? In short, the semantics (context) of the scene and its surface textures provide ample cues for many colour regions in each image. With this information, it is possible to build a colour classifier at the pixel level and produce a plausible colorization that could potentially fool a human observer.
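A hedged sketch of that setup: work in Lab colour space, keep the lightness channel as input, and let a per-pixel model predict the two colour channels. predict_ab below is a hypothetical stand-in for a trained classifier, and "old_photo.jpg" a placeholder filename.

```python
# Colorization scaffold in Lab space: L (lightness) is the input, and a
# model predicts the (a, b) colour channels per pixel.
import cv2
import numpy as np

gray = cv2.imread("old_photo.jpg", cv2.IMREAD_GRAYSCALE)
L = gray.astype(np.float32)                 # lightness channel as input

def predict_ab(L_channel):
    # Hypothetical model: a real colorizer predicts plausible (a, b) values
    # per pixel from context; here we just return neutral (grey) colours.
    return np.full(L_channel.shape + (2,), 128, dtype=np.float32)

lab = np.dstack([L, predict_ab(L)]).astype(np.uint8)
bgr = cv2.cvtColor(lab, cv2.COLOR_LAB2BGR)  # convert prediction back to RGB
cv2.imwrite("colorized.jpg", bgr)
```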

Texture analysis & synthesis

Traditional approaches to texture analysis and synthesis entailed trying to match the spectrum of the source image while generating shaped noise. This was not sufficient on its own, and other (complex) techniques were applied, with average results at best. Then Deep Learning came to the rescue again with this notable research. Once more, GANs (they are a big deal indeed) are the answer. But with GANs, the question is more important than the answer: "How do we get an algorithm that determines precisely whether an image is real or artificially constructed?" If this question can be boiled down to equations and served to a GAN, it produces strikingly realistic results.
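To anchor that question, here is a minimal sketch of a discriminator: a small network whose only job is to score how real an image looks. The layer stack is an illustrative PatchGAN-style choice, not the architecture from the cited research.

```python
# A toy GAN discriminator: convolutions downsample the image and output one
# "realness" logit per patch, which the GAN training loop turns into a loss.
import torch
import torch.nn as nn

discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Conv2d(128, 1, kernel_size=4, stride=1, padding=1),  # per-patch score
)

fake_batch = torch.randn(8, 3, 64, 64)   # stand-in for generator output
realness = discriminator(fake_batch)     # one logit per image patch
print(realness.shape)                    # torch.Size([8, 1, 15, 15])
```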

Another popular GAN, called Pix2Pix, can translate one image into another, giving human creativity some powerful new tools.

Stereo correspondence & Rendering

This is the process of taking two or more images and estimating a 3D model of the scene, which happens by finding matching pixels across the images and converting their 2D positions into 3D depths. Here too, Deep Learning methods are now available.
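Before Deep Learning, the workhorse was block matching on a rectified image pair. A sketch with OpenCV's semi-global matcher ("left.png" and "right.png" are placeholder rectified images):

```python
# Classical stereo correspondence: compute a disparity map from a rectified
# left/right pair; depth is inversely proportional to disparity.
import cv2

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Semi-global block matching; numDisparities must be a multiple of 16.
stereo = cv2.StereoSGBM_create(minDisparity=0, numDisparities=64, blockSize=7)
disparity = stereo.compute(left, right).astype("float32") / 16.0  # fixed-point output

# Normalize to 8-bit for visualization.
disp_vis = cv2.normalize(disparity, None, 0, 255, cv2.NORM_MINMAX).astype("uint8")
cv2.imwrite("disparity.png", disp_vis)
```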

That's pretty much it, for now. Again, the goal was to give a broad overview of the capabilities and applications of computer vision. Furthermore, I am completely aware that this list is not exhaustive at all and that some of you would like to dig deeper into the how. As such, we are preparing a set of chapters dedicated to some of the aforementioned algorithms, in which we will dig deeper into the Deep Learning techniques Overture is most passionate about. In the next post, we will broach, in an intuitive way, the core elements which lay the foundations of modern computer vision algorithms.
