The Ancient Secrets of Computer Vision 1 by Joseph Redmon - Summary
Low, Mid and High Level Computer Vision
Joseph Redmon released a series of 20 lectures on Computer Vision in September 2018. As he is an expert in the field, I took a lot of notes while going through his lectures. I am tidying these notes for my own future reference, and I am also posting them on Medium in case they are useful to others.
I highly recommend watching the lectures on Joseph’s YouTube channel here.
His first lecture is an introduction in which he details the differences, methods and applications of low, mid and high level Computer Vision (CV).
Low level Computer Vision techniques operate on pixel values within images and video frames. These are generally used in the first stages of a CV pipeline as the foundations upon which higher level techniques are implemented.
We resize images every day when we pinch to zoom on our phones or resize browser windows. Resizing algorithms therefore have to be fast, and speed is often traded off against image quality.
Nearest neighbour sampling, for example, is fast but tends to look pixelated, whereas 2×SaI scaling is much smoother but slower. Below is an image scaled with nearest-neighbour scaling (left) and 2×SaI scaling (right).
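To make the trade-off concrete, here is a minimal sketch of nearest-neighbour resizing in plain Python. The function name `resize_nearest` and the list-of-lists image layout are my own assumptions for illustration, not something taken from the lecture.

```python
def resize_nearest(image, new_h, new_w):
    """Resize a 2D grid of pixel values with nearest-neighbour sampling."""
    old_h, old_w = len(image), len(image[0])
    out = []
    for y in range(new_h):
        # Map the output row back to a source row (floor of the scaled index).
        src_y = min(old_h - 1, y * old_h // new_h)
        row = []
        for x in range(new_w):
            src_x = min(old_w - 1, x * old_w // new_w)
            # Copy the source pixel unchanged: no blending, hence the blocky look.
            row.append(image[src_y][src_x])
        out.append(row)
    return out


small = [[1, 2],
         [3, 4]]
big = resize_nearest(small, 4, 4)
# big == [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Each output pixel is a straight copy of one input pixel, which is why this is fast and also why enlarged images look pixelated.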
Converting colour images to grayscale is more complex than you might think, because our eyes are not equally sensitive to red, green and blue.
This process is of course low-level as it is performed on a pixel level and is very common as a foundational step in a CV pipeline. Edge detection for example is much faster when applied over a grayscale image and usually at least as accurate.
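A common way to account for this is a weighted sum of the three channels. The sketch below uses the Rec. 601 luma weights (0.299, 0.587, 0.114). That choice is my assumption: other weightings exist and the lecture may use a different one.

```python
# Rec. 601 luma weights (assumed here): green counts most, blue least.
WEIGHTS = (0.299, 0.587, 0.114)


def to_grayscale(rgb_image):
    """Convert a grid of (r, g, b) tuples (0-255) to a grid of gray values."""
    wr, wg, wb = WEIGHTS
    return [
        [round(wr * r + wg * g + wb * b) for (r, g, b) in row]
        for row in rgb_image
    ]


# Pure red, pure green and pure blue end up with different gray levels.
row = to_grayscale([[(255, 0, 0), (0, 255, 0), (0, 0, 255)]])[0]
```

With these weights, pure green comes out brightest and pure blue darkest, which is closer to how we perceive those colours than a plain average would be.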
Images can be manipulated in a post-processing stage, for example to alter exposure. This is commonly done to fix photographic flaws in tools such as Photoshop.
Image saturation is often increased in property listing photos, presumably because estate agents have found that houses with more saturated photos sell better.
Hue is often tweaked on TVs and monitors because display hardware varies so much. The same image on two monitors could look very different, so manufacturers adjust the hue to ensure images look natural.
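Exposure, saturation and hue adjustments can all be expressed per pixel in the HSV colour space. Here is a rough sketch using Python's standard `colorsys` module. The function `adjust_pixel` and its parameter names are hypothetical, and treating exposure as a simple multiplier on the HSV value channel is a simplification.

```python
import colorsys


def adjust_pixel(rgb, exposure=1.0, saturation=1.0, hue_shift=0.0):
    """Adjust one (r, g, b) pixel (0-255) via HSV.

    exposure and saturation are multipliers; hue_shift is a fraction of
    a full turn around the colour wheel (0.5 means 180 degrees).
    """
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    h = (h + hue_shift) % 1.0          # rotate the hue
    s = min(1.0, s * saturation)       # scale colourfulness, clamp to 1
    v = min(1.0, v * exposure)         # scale brightness, clamp to 1
    r, g, b = colorsys.hsv_to_rgb(h, s, v)
    return tuple(round(c * 255) for c in (r, g, b))


dull = adjust_pixel((200, 100, 100), saturation=0.0)   # fully desaturated
bright = adjust_pixel((100, 100, 100), exposure=2.0)   # twice as bright
```

Applying `adjust_pixel` to every pixel of an image gives a crude version of the sliders found in photo editing tools.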
Edges within images are very commonly extracted for use in higher level processing. Feature detection, such as line detection, becomes significantly simpler after edge detection because less relevant information in the original image has been filtered out.
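One classic way to extract edges is the Sobel operator, which estimates the horizontal and vertical intensity gradient at each pixel. The sketch below is a minimal pure-Python version that assumes a grayscale image stored as a list of lists. I am using Sobel as an illustration; it is not necessarily the detector covered in the lecture.

```python
import math

# Sobel kernels for horizontal (KX) and vertical (KY) intensity changes.
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(gray):
    """Return the gradient magnitude per pixel (border pixels stay 0)."""
    h, w = len(gray), len(gray[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0
            for j in range(3):
                for i in range(3):
                    value = gray[y + j - 1][x + i - 1]
                    gx += KX[j][i] * value
                    gy += KY[j][i] * value
            out[y][x] = math.hypot(gx, gy)
    return out


# A dark region (0) next to a brighter region (10): one vertical edge.
step = [[0, 0, 0, 10, 10] for _ in range(4)]
edges = sobel_magnitude(step)
```

In `edges`, pixels in the flat dark region have magnitude 0 while pixels next to the boundary have a large magnitude, so thresholding the result leaves only the edge.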
Like edge detection, oriented gradients improve the performance of object detection by filtering out non-essential information. You can see on the images below that there is a person in the image even though most of the detail has been removed.
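The core idea behind oriented gradients is to summarise which edge directions dominate a region. Below is a simplified sketch that builds one orientation histogram for a whole grayscale patch. A full Histogram of Oriented Gradients descriptor also splits the image into cells and normalises over blocks, which I leave out here, and the 9-bin layout is an assumed, commonly used choice.

```python
import math


def orientation_histogram(gray, bins=9):
    """Histogram of unsigned gradient orientations, weighted by magnitude."""
    h, w = len(gray), len(gray[0])
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Central differences approximate the gradient.
            gx = gray[y][x + 1] - gray[y][x - 1]
            gy = gray[y + 1][x] - gray[y - 1][x]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue
            # Fold the direction into [0, 180) so opposite directions match.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            index = min(bins - 1, int(angle / bin_width))
            hist[index] += magnitude
    return hist


# Intensity only changes from left to right, so every gradient is horizontal.
patch = [[0, 0, 0, 10, 10] for _ in range(5)]
hist = orientation_histogram(patch)
```

For this patch all of the gradient energy lands in the first bin, which is the kind of compact summary that survives even when most of the image detail is thrown away.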
Colour Segmentation speeds up image processing as it simplifies images into more meaningful and easier to analyse representations. Typically segmentation is used to locate objects or boundaries.
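As a very small illustration of colour segmentation, the sketch below assigns every pixel to the closest colour in a fixed palette. The palette and the function name `segment_by_colour` are hypothetical. Practical systems typically learn the reference colours from the image, for example with a clustering method such as k-means, rather than hard-coding them.

```python
def segment_by_colour(rgb_image, palette):
    """Label each (r, g, b) pixel with the index of its nearest palette colour."""

    def nearest(pixel):
        # Squared Euclidean distance in RGB space is enough for ranking.
        return min(
            range(len(palette)),
            key=lambda k: sum((p - q) ** 2 for p, q in zip(pixel, palette[k])),
        )

    return [[nearest(pixel) for pixel in row] for row in rgb_image]


palette = [(255, 0, 0), (0, 0, 255)]        # 0 = "red-ish", 1 = "blue-ish"
image = [[(250, 10, 5), (3, 2, 240)],
         [(200, 30, 30), (20, 20, 180)]]
segments = segment_by_colour(image, palette)
# segments == [[0, 1], [0, 1]]
```

The output has one small integer per pixel instead of three colour values, which is the "simpler representation" that later stages can analyse more cheaply.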
These low-level Computer Vision techniques are used in photo manipulation, like Photoshop or Instagram filters, and feature extraction which can in turn be used for Machine Learning.
Mid level Computer Vision techniques tie images with other images and the real world. These often build upon low level processing to complete tasks or prepare for high level techniques.
One clear example of combining images with other images is panorama stitching. Additionally, we know some real world information, such as how a phone is usually rotated while the photos are taken, so we can warp and combine multiple images together.
Multi-view stereo is another way of combining images that relies on our knowledge of the outside world. Because we have two eyes, we see in 3D and perceive depth. By comparing two images taken from slightly different positions, we can see which parts of the scene shift a lot and which shift less, and use that to judge depth: nearby objects shift more than distant ones.
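For two identical, horizontally aligned (rectified) cameras, the relationship between how far a point shifts between the two images (its disparity) and its depth is depth = focal length × baseline / disparity. Here is a small sketch of that formula. The camera numbers are made up for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth in metres of a point seen by a rectified stereo camera pair.

    disparity_px:    horizontal shift of the point between the two images
    focal_length_px: focal length expressed in pixels
    baseline_m:      distance between the two camera centres in metres
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera: 700 px focal length, cameras 10 cm apart.
near = depth_from_disparity(40, focal_length_px=700, baseline_m=0.1)
far = depth_from_disparity(5, focal_length_px=700, baseline_m=0.1)
```

The point that shifts by 40 pixels comes out much closer than the one that shifts by 5 pixels, matching the intuition that parts of the image that move a lot are near the camera.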
Structured Light Scan:
Using patterned light emitters and receivers, we can construct 3D models of real world objects from the way the received patterns are distorted by the objects' surfaces. This is once again a combination of real world knowledge and multiple images to get the desired result.
As with structured light scans, light emitters can be used alongside cameras to analyse the world. The difference here is that range finding aims to judge the distance between the camera and an object rather than building a 3D model. This is particularly useful in self-driving cars, for example. Emitters are attached to the car and the camera can judge distances from the time it takes the light to reflect back into the camera. Laser light is used, so the technique is commonly called LaDAR or LiDAR (Laser Detection And Ranging and Light Detection And Ranging respectively).
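The time-of-flight calculation behind this kind of range finding is short: the light travels to the object and back, so the distance is half the round-trip time multiplied by the speed of light. A minimal sketch, with a function name of my own choosing:

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact value in a vacuum


def distance_from_round_trip(seconds):
    """Distance in metres to an object given the light's round-trip time."""
    # Divide by two because the pulse travels there and back.
    return SPEED_OF_LIGHT_M_PER_S * seconds / 2


# A pulse that returns after 200 nanoseconds hit something about 30 m away.
distance = distance_from_round_trip(200e-9)
```

The tiny time scales involved (nanoseconds for tens of metres) are why these sensors need very precise timing hardware.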
In a similar fashion to creating multi-view stereo images, differences between images can be used for optical flow. Instead of using two images from slightly different positions, frames in a video are used. By comparing which parts of an image have the biggest differences in frames of a video, we can construct flow. This is extremely useful for object tracking (and therefore image tagging) as objects move between frames.
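One simple way to estimate motion between two frames is block matching: take a small block from the first frame and search nearby positions in the second frame for the best match. The sketch below does this for a single block using the sum of absolute differences. It is a stand-in to illustrate the idea, not the specific optical flow method from the lecture, and the names are my own.

```python
def block_motion(prev, curr, top, left, size, search=2):
    """Estimate (dy, dx) for the block at (top, left) in prev by searching curr."""
    h, w = len(curr), len(curr[0])

    def sad(dy, dx):
        # Sum of absolute differences between the block and a shifted candidate.
        return sum(
            abs(prev[top + j][left + i] - curr[top + dy + j][left + dx + i])
            for j in range(size)
            for i in range(size)
        )

    # Only consider shifts that keep the candidate block inside the frame.
    candidates = [
        (dy, dx)
        for dy in range(-search, search + 1)
        for dx in range(-search, search + 1)
        if 0 <= top + dy and top + dy + size <= h
        and 0 <= left + dx and left + dx + size <= w
    ]
    return min(candidates, key=lambda shift: sad(*shift))


def frame_with_block(top, left):
    """Build a 6x6 dark frame containing one bright 2x2 block."""
    frame = [[0] * 6 for _ in range(6)]
    for j in range(2):
        for i in range(2):
            frame[top + j][left + i] = 9
    return frame


prev_frame = frame_with_block(2, 1)
curr_frame = frame_with_block(2, 3)   # the block moved two pixels right
motion = block_motion(prev_frame, curr_frame, top=2, left=1, size=2)
# motion == (0, 2)
```

Repeating this for every block in the frame gives a coarse flow field, which a tracker can then use to follow objects from frame to frame.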
The final mid-level Computer Vision technique Joseph Redmon covered in his lecture was time-lapse creation. This seems like a relatively simple process, in which many frames over time are combined, but is more complex than I assumed. Inconsistencies and variation such as lighting differences, snowfall on one day or objects like cars stopping in front of the camera need to get smoothed out to create a fluid time-lapse.
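One simple way to smooth out short-lived disturbances is to take the per-pixel median over a window of neighbouring frames. This is my own illustrative choice and not necessarily the approach described in the lecture. The sketch assumes grayscale frames of identical size.

```python
from statistics import median


def temporal_median(frames):
    """Per-pixel median across a list of equally sized grayscale frames."""
    h, w = len(frames[0]), len(frames[0][0])
    return [
        [median(frame[y][x] for frame in frames) for x in range(w)]
        for y in range(h)
    ]


# Three tiny 1x2 frames; in the second one a bright object (200)
# briefly covers the right-hand pixel.
frames = [
    [[10, 10]],
    [[10, 200]],
    [[12, 10]],
]
smoothed = temporal_median(frames)
# smoothed == [[10, 10]]
```

Because the median ignores outliers, the briefly visible bright object disappears from the result while the stable background is kept.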
Some of these mid-level Computer Vision techniques tie images together into a final state. Panorama stitching, time-lapse creation and video stabilisation for example are used for no other reason than to create their output.
Optical Flow however is often a preliminary step to assist object tracking or content-aware resizing as important parts of video frames can be detected.
Computer Vision techniques that are considered high-level bring semantics into the process. Extracting meaning from images is much more complicated but relies heavily on pipelines of low and mid level Computer Vision techniques.
Grouping images into categories is known as Image Classification. The CV pipeline is given an image and assigns it to a bucket that depends on the task. In the case of an emotion detector, the buckets would represent different emotions, and each image is tagged with the emotion that the pipeline predicts is shown in that image. Another use of this is recognising which object an image contains, as shown in the image below.
Extending Image Classification, Image Tagging doesn’t just return what is in the image but also where it thinks it is. This is an important distinction and very difficult to do in a reasonable timescale. Detecting that a person is in front of a self-driving car 3 seconds after they step out is of course not fast enough. Joseph Redmon is well known in this field as he developed YOLO (You Only Look Once), an extremely fast image tagging (object detection) algorithm which you can watch in action below.
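The last step of a classifier is usually small: the model produces one score per bucket and the image is assigned to the bucket with the highest score. The sketch below shows only that step, turning scores into probabilities with a softmax. The scores and emotion labels are made up, and producing good scores is of course the hard part that the rest of the pipeline is for.

```python
import math


def classify(scores, labels):
    """Pick the most likely label from raw per-class scores via softmax."""
    top = max(scores)
    # Subtract the maximum before exponentiating for numerical stability.
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs[best]


# Hypothetical scores from an emotion detector for a single image.
labels = ["sad", "happy", "angry"]
label, confidence = classify([0.2, 2.5, -1.0], labels)
# label == "happy"
```

The returned probability gives a rough confidence for the chosen bucket, which is useful when a wrong answer is costly.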
Similar to Image Classification is Semantic Segmentation. Building upon Low Level Segmentation, this is essentially classification at a pixel level. It is clear how Optical Flow and Range Finding are used to classify the segments in the image below for example.
Similar to Semantic Segmentation, Instance Segmentation classifies pixels. The difference between the two is that Instance Segmentation can recognise multiple instances of the same kind of object. For example, as illustrated below, Semantic Segmentation classifies chairs, table, building, etc. In the Instance Segmentation example, however, each chair is highlighted separately. This is useful in self-driving cars, for example, as it can distinguish between multiple vehicles in one image.
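To illustrate the difference in output, the sketch below starts from a semantic mask for one class (1 means "this pixel is a chair") and splits it into separate instances with connected-component labelling. This is only a toy illustration: objects that touch would be merged, and real instance segmentation models do not work this way.

```python
from collections import deque


def label_instances(mask):
    """Give each 4-connected blob of truthy pixels its own instance id."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or labels[y][x]:
                continue
            # Found an unlabelled pixel: flood-fill its whole blob.
            count += 1
            labels[y][x] = count
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not labels[ny][nx]:
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Semantic mask: both blobs are just "chair".
chair_mask = [[1, 1, 0, 0, 1],
              [0, 0, 0, 0, 1]]
instances, n_chairs = label_instances(chair_mask)
# instances == [[1, 1, 0, 0, 2], [0, 0, 0, 0, 2]], n_chairs == 2
```

The semantic mask only says where chairs are, while the instance labels also say which chair each pixel belongs to.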
If you keep up with the latest tech and research, you will have seen many uses of high-level Computer Vision techniques. We covered autonomous vehicles above but didn’t mention robots. Assistants for the elderly, for example, require vision to detect falls, retrieve objects or answer complex questions. Answering “Do I need to do laundry?” needs Image Tagging at the very least, so future smart homes will increasingly rely on Computer Vision. This question of course contains additional challenges, such as the semantic meaning of what is “clean”.
Other uses include game playing. Deep Blue beat the human world chess champion in a match in 1997 and since then computers have mastered games like Dota 2. This requires extremely fast image processing to win in real time.
Image retrieval, super resolution, medical imaging and shops like Amazon Go all use high level Computer Vision techniques so this is a booming area of research.
High accuracy is important for many of these uses. Humans have high image processing accuracy while driving, yet many people die on the roads each year, so these systems need to be even more accurate. Similarly, a medical imaging app could cause panic if it returned too many false positives (detecting cancer for example) so research in this area is critical.
From resizing your browser window and Instagram filters to panorama stitching, 3D model creation and real-time Image Tagging for safer vehicles and homes, we all use Computer Vision every day. As our technology gets smarter, we increasingly need to improve these CV pipelines with more sophisticated techniques.
In his next lectures, Joseph Redmon covers some of these methods in more detail and explains how Machine Learning, Deep Learning and NLP can improve our pipelines.
The next lecture, which I will summarise next, covers Human Vision.