Computer Vision for Busy Developers
This article is part of a series introducing developers to Computer Vision. Check out other articles in this series.
Understanding scale space is another major building block of Computer Vision. Scale space enables Scale Invariant feature detectors, which allow us to find the same feature across images of the same objects at different scales.
Scale-dependent feature detectors (such as the Harris Corner Detector) produce a list of features, each composed of x and y values corresponding to its location in the image. If we want to work with scale invariant features, we need feature points that are defined not only by x and y values but also by a third sigma value (often represented by the Greek letter σ). While the x and y values tell us where the feature is within the coordinate space of the image, the sigma value tells us where the feature is within the scale space of the image. In other words, features defined by x, y and σ allow us to properly find the same feature in images taken at different scales.
Subsampling and Image Pyramids
Because scale information is not inherently available in most images, we need to build up the scale space ourselves. To begin with, we’re going to literally rescale the image to build up our scale space. We can rescale our images using a process called downsampling or subsampling. Subsampling is a process where we take a portion of the original pixels to create a new image that is smaller.
In the scope of our discussion, we subsample every other pixel in both the X and Y directions: we skip every other row of pixels and every other column of pixels. This effectively removes 75% of the pixels, resulting in an image that is half as wide and half as tall as the original. We can repeat this process multiple times to generate various scales.
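As a rough sketch, this is what that subsampling step looks like in numpy (the tiny 8x8 array is a stand-in for a real image, and the `subsample` helper name is my own):

```python
import numpy as np

def subsample(image):
    """Keep every other row and every other column: 75% of the pixels are dropped."""
    return image[::2, ::2]

# Build an Image Pyramid by repeated subsampling (a tiny 8x8 array stands in
# for a real image here).
image = np.arange(64, dtype=float).reshape(8, 8)
pyramid = [image]
while min(pyramid[-1].shape) > 1:
    pyramid.append(subsample(pyramid[-1]))

print([level.shape for level in pyramid])  # [(8, 8), (4, 4), (2, 2), (1, 1)]
```

Each level is half as wide and half as tall as the one before it, which is exactly the pyramid shape described above.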
This method of downsampling an image over and over again is often referred to as an Image Pyramid. If we were to stack all of the images one on top of the other, it would end up looking a lot like a pyramid. (I’ve also seen some research referencing this as a Resolution Pyramid, but I personally find this term confusing, especially as we get into other pyramid types.)
It’s starting to feel like we’re making progress towards creating the scale space we need for scale invariant features. This method is very simple, but there’s a catch. The issue with subsampling like this is that it generates strange artifacts on the images called aliasing. These artifacts interfere with the process of extracting features from the images. They are easily seen on diagonal lines, where they appear as “jaggies” along edges.
Unfortunately, the scale space we create using subsampling has a major drawback: aliasing artifacts. Fortunately, we’ve already discussed a technique that fixes this issue. If we apply a Gaussian Blur filter to the images before we subsample, the resulting images won’t have the aliasing artifacts we saw earlier.
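A minimal numpy sketch of one Gaussian Pyramid step, blur first and then subsample. The separable blur helper and its 3-sigma kernel radius are my own simplifications (a real implementation would also pick its edge-padding strategy more carefully):

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur: filter columns, then rows, with a normalized
    1D kernel (zero padding at the edges)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(np.convolve, 0, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 1, blurred, kernel, mode="same")

def gaussian_pyramid_step(image, sigma=1.0):
    """One level of a Gaussian Pyramid: blur to suppress aliasing, then subsample."""
    return gaussian_blur(image, sigma)[::2, ::2]

image = np.random.default_rng(0).random((16, 16))
level1 = gaussian_pyramid_step(image)
print(level1.shape)  # (8, 8)
```

The blur spreads each pixel's data to its neighbors before the subsample drops rows and columns, which is what suppresses the jaggies.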
When we stack up all of these images, we have a Gaussian Pyramid.
SIDE NOTE: Game developers may already be drawing some parallels between image pyramids and mipmap textures generated by the GPU. Mipmaps are image pyramids, but they may not necessarily be using a Gaussian Filter. Mipmaps can use other filters or techniques in their anti-aliasing.
This is another really important concept when it comes to understanding scale space. As a developer, you are already accustomed to the idea of resolution when it comes to images. We often think about it as the number of pixels along the width and height of the image. What if we were to think of resolution as the amount of data along the width and height of the image? How does that change our idea of resolution? With this in mind, we can infer that when we subsample, we remove data by removing pixels and that data is no longer in the image — we are effectively removing 75% of the data in the image by removing 75% of the pixels.
The Gaussian Blur smooths out the image data and spreads it to the neighboring pixels. We then subsample from this filtered image. Because the Gaussian function spread the data of each pixel to its neighbors, when we remove a pixel, we aren’t removing ALL of the data of that pixel. Some of the data of the pixel we are removing stays behind incorporated into the neighbors that are not being removed. Therefore, we are removing less than 75% of the data while we are still removing 75% of the pixels.
Gaussian Scale Space
Let’s take this idea one step further. We’ve already established that removing 75% of the blurred pixels removes less than 75% of the data. The following illustration shows that the Gaussian Blurred image contains a very similar amount of data to the one that has been subsampled.
When we look at the differences between the original image and the blurred image, we see a greater loss of data than between the blurred image and the subsampled image. This tells us that applying the Gaussian Blur also reduces the amount of data in the image. If we define resolution as the amount of data in the image, we can conclude that applying a Gaussian Blur effectively reduces the resolution of the image. This idea is so important because it gives us much finer control over creating our scale space than the basic Image Pyramid.
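We can sketch that comparison in numpy. Mean squared difference is my own stand-in metric for "data lost" here, and the upscale-by-pixel-repetition trick just lets us compare the subsampled image against the blurred one pixel for pixel:

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with a normalized 1D kernel (zero padding at edges)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(np.convolve, 0, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 1, blurred, kernel, mode="same")

rng = np.random.default_rng(1)
image = rng.random((64, 64))

blurred = gaussian_blur(image, 2.0)
subsampled = blurred[::2, ::2]
# Rebuild the subsampled image at full size by repeating each pixel 2x2,
# so we can compare it against the blurred image pixel for pixel.
rebuilt = np.repeat(np.repeat(subsampled, 2, axis=0), 2, axis=1)

loss_from_blur = np.mean((image - blurred) ** 2)
loss_from_subsample = np.mean((blurred - rebuilt) ** 2)
print(loss_from_blur > loss_from_subsample)  # True: the blur removes more data
```

On noisy input, the blur step discards far more per-pixel detail than the subsample that follows it, which matches the illustration's point.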
Recall that we discussed the Gaussian Blur when we first introduced convolutions. We briefly discussed the sigma variable when creating Gaussian kernels. Without going into all of the math, the important thing to note here is that the sigma value drives the “strength” of the blur applied to the image. (In my research, I found a handy Gaussian Kernel Calculator that you can explore here.) If the sigma value drives the strength of the blur, and the amount of blur effectively reduces the resolution, we can conclude that we can use the sigma value to reduce the resolution of the image. The larger the sigma, the greater the reduction in resolution. When we convolve our image repeatedly with a Gaussian kernel, the result is referred to as the Gaussian Scale Space. To visualize it, imagine stacking the blurred images with increasing sigma values. We now have a way to detect features in scale space: we can identify features along the x, y and σ (sigma) directions.
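Stacking blurred copies can be sketched as a 3D array indexed by (σ, y, x). The step-edge test image and the geometric progression of sigma values are my own choices for illustration:

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with a normalized 1D kernel (zero padding at edges)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(np.convolve, 0, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 1, blurred, kernel, mode="same")

# A vertical step edge stands in for a real image; the sigma values are an
# arbitrary geometric progression chosen for illustration.
image = np.zeros((64, 64))
image[:, 32:] = 1.0
sigmas = [1.0, 1.6, 2.56, 4.1]
scale_space = np.stack([gaussian_blur(image, s) for s in sigmas])
print(scale_space.shape)  # (4, 64, 64): the axes are (sigma, y, x)

# The edge gets progressively softer: the steepest pixel-to-pixel jump along
# the middle row shrinks as sigma grows.
sharpness = [np.abs(np.diff(level[32])).max() for level in scale_space]
```

A feature point in this space carries three coordinates, (x, y, σ): where it is in the image and where it lives along the blur axis.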
SIDENOTE: I know this might come across as confusing to developers who are used to working in 3D. Why don’t we use Z for our third coordinate direction? The answer is that Z refers to a coordinate position, while σ refers to the scale of a feature. Once we can place a feature in 3D space, we could use 4 values to identify the feature: x, y, z and σ.
SIDENOTE: There are other methods of reducing resolution beyond the Gaussian Blur, such as Anisotropic Scaling. For the purposes of this series, the Gaussian method is simple and leads to other topics we will discuss later.
Optimizing Scale Space
There is one last topic we should cover before wrapping up Scale Space. The huge benefit of working with Image Pyramids is that with every step change in scale, the image has 75% fewer pixels than the previous step. As we apply other computer vision algorithms to our images to extract or describe features, we have 75% fewer pixels to worry about at EACH step in our scale. This means our algorithms will process higher scales faster and will require less memory. This is a significant performance optimization that is especially interesting for real-time applications. It turns out that we have a middle ground between this performance optimization and the finer control the Gaussian Scale Space gives us.
To achieve this, we’re going to combine ideas from the Gaussian Pyramid and the Gaussian Scale Space we discussed earlier. The basic idea is that we start with a Gaussian Pyramid and reapply Gaussian Blurs in between the subsampling steps. The group of scales between one subsampling step and the next is referred to as an octave. The specific sigma used, the number of times the Gaussian Blur is applied and the number of octaves are all variables that we can tweak based on our specific use-case.
A popular configuration I’ve seen often in my research are the values provided by David Lowe (inventor of the SIFT Feature Detector and Descriptor we will discuss later) which is a sigma of 1.6 and 4 steps for each octave.
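A simplified sketch of that construction, using those values as defaults. This is not Lowe's exact SIFT pipeline (which adds extra images per octave, an initial upsampling step, and other refinements); `build_pyramid` is a hypothetical helper, though the incremental sigmas do use the fact that blurring by a then b equals blurring by sqrt(a² + b²):

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with a normalized 1D kernel (zero padding at edges)."""
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-(x ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    blurred = np.apply_along_axis(np.convolve, 0, image, kernel, mode="same")
    return np.apply_along_axis(np.convolve, 1, blurred, kernel, mode="same")

def build_pyramid(image, sigma=1.6, steps=4, octaves=3):
    """Gaussian Pyramid with extra blurs between subsampling steps.

    Each octave holds `steps` progressively blurred images; subsampling the
    last one seeds the next octave. We treat each octave's first image as
    already blurred to the base sigma, so each later image only needs the
    incremental blur that takes it to its target sigma.
    """
    k = 2 ** (1.0 / steps)  # scale multiplier between steps within an octave
    pyramid = []
    current = image
    for _ in range(octaves):
        octave = [current]
        total = sigma
        for i in range(1, steps):
            target = sigma * k ** i
            incremental = np.sqrt(target ** 2 - total ** 2)
            octave.append(gaussian_blur(octave[-1], incremental))
            total = target
        pyramid.append(octave)
        current = octave[-1][::2, ::2]  # subsample: start of the next octave
    return pyramid

image = np.random.default_rng(0).random((64, 64))
pyramid = build_pyramid(image)
print([octave[0].shape for octave in pyramid])  # [(64, 64), (32, 32), (16, 16)]
```

Within an octave, all images share one resolution, so per-pixel work stays constant; across octaves, the pixel count drops by 75%, which is where the performance win comes from.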
When working with a scale space that is broken up into octaves, we gain some performance benefits by reducing the number of pixels we need to process while also keeping the benefits of having more control over the increments of scale.
As we continue exploring features and their descriptors, it’s important we have a good grasp on scale space. Much like topics we learned about earlier, Scale Space will come up again as we journey deeper into Computer Vision.
In this article, we discussed several approaches to building Scale Space as a step towards creating Scale Invariant feature detectors. The Image Pyramid is the simplest method: it rescales images using downsampling/subsampling, effectively removing 75% of the pixels with each step in scale. The Gaussian Pyramid, which resembles the Mipmap textures game developers are already familiar with, improves on the Image Pyramid by removing the aliasing artifacts common to plain subsampling. Conveniently, repeated Gaussian convolutions also offer a way to reduce the effective resolution of images between pyramid steps, allowing us greater precision when it comes to scale.
Sources and More Info
- Gaussian Kernel Calculator
- Scale Invariance through Gaussian Scale Space — Blog Post by Jan Sellner
- Digital and Medical Image Processing (Chapter 9) — Book by Twan Maintz
- Pyramids and Scale Space — Lecture by Robert Collins
- Image Sub Sampling — Video by UdaCity
- The SIFT Detector and Descriptor — Lecture from the University of Washington
- Scale Invariant Feature Transform (SIFT) — Video Lecture by Dr. Mubarak Shah
- Detectors and Descriptors — Lecture by Fei-Fei Li
- Distinctive Image Features from Scale-Invariant Keypoints — Academic Research by David Lowe