Simple algorithm to remove moving objects from pictures
The hidden power of the median filter
A common pre-processing step in image processing (or signal processing) is noise reduction. Probably the two most common filters used to remove noise are the median and the mean. Both act as low-pass filters, since they are especially good at removing high frequencies from signals. In images, this kind of high-frequency noise is uncommon with modern digital cameras, but quite common in volumetric images (such as CT and MRI scans). It is informally called salt-and-pepper noise. In this article, we will explain the median filter and show how it can be adapted for a new use: removing moving objects from a stack of images.
Median Filter Basics
The main idea of the median filter is to slide an adjacency window over the image. This window goes over the entire image, so each pixel of the image is in the central position exactly once. The window size can vary, and different sizes suit different applications. For this article, let's consider an 8-neighborhood window (to be more precise, it's an adjacency window with a maximum radius of √2, since the center of each diagonal pixel is about 1.41 pixels away from the center of the window). Only pixels inside the image domain are considered for the window, so the window centered on the very first pixel covers only 4 pixels: the pixel itself and its 3 neighbors. In the illustration, pixels with a white background or outside the image domain are not considered in this iteration, the darkest pixel is the center of the window, and pixels with a gray background are adjacent to the center.
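To make the window definition concrete, here is a minimal sketch that builds the window from a maximum radius and clips it to the image domain. The helper names `window_offsets` and `window_pixels` are made up for this illustration and are not taken from the article's repository.

```python
import math


def window_offsets(max_radius):
    """Return the (dy, dx) offsets whose distance from the center is <= max_radius."""
    r = int(math.floor(max_radius))
    limit = max_radius ** 2 + 1e-9  # small tolerance for floating-point rounding
    return [
        (dy, dx)
        for dy in range(-r, r + 1)
        for dx in range(-r, r + 1)
        if dy * dy + dx * dx <= limit
    ]


def window_pixels(y, x, height, width, offsets):
    """Coordinates covered by the window centered at (y, x), inside the image only."""
    return [
        (y + dy, x + dx)
        for dy, dx in offsets
        if 0 <= y + dy < height and 0 <= x + dx < width
    ]


offsets = window_offsets(math.sqrt(2))
print(len(offsets))                             # 9: the center plus its 8 neighbors
print(len(window_pixels(0, 0, 5, 5, offsets)))  # 4: a corner pixel plus 3 neighbors
print(len(window_pixels(2, 2, 5, 5, offsets)))  # 9: an interior pixel
```

With a maximum radius of 1 instead of √2, the same function yields the 4-neighborhood (the center plus the pixels directly above, below, left, and right).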
Consider the 5x5 8-bit grayscale image above, with the value of each pixel within [0, 255]. The image also has some noise: a few pixels with value 0 or 255 randomly scattered across it. They are considered noise because the majority of the pixels have value 128, so 0 and 255 are outliers in this sample of numbers. The algorithm starts by allocating a new matrix with the same size as the original image, to be used as the output image. For each pixel of the image, it computes the median of the values in the adjacency window and places the result in the same position of the output image. At some point during the execution, the adjacency window will be centered on the first noisy pixel, as in the picture below:
We can see that the neighbors of the central pixel are: [128, 128, 128, 128, 0, 255, 128, 128]. Including the central pixel itself (which has value 0), we have an adjacency array of 9 values. Since we are interested in the median, we need to sort this array and take the central value. In this case, the sorted array is:
[0, 0, 128, 128, 128, 128, 128, 128, 255]
We put the median of this array (128) in its corresponding position of the output matrix (row 1, column 1, in this case). Note that sorting the array pushes the outliers to its ends, so taking the median effectively replaces the noise with a more common value from that neighborhood.
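The steps above can be sketched in a few lines of Python with NumPy. This is a minimal, unoptimized sketch rather than the code from the repository linked at the end of the article. Since the original figure is not reproduced here, the 5x5 image below is a hypothetical one, with the noise placed so that the window centered at row 1, column 1 contains the values from the worked example.

```python
import numpy as np


def median_filter(image):
    """Median filter with an 8-neighborhood window, clipped at the image borders."""
    height, width = image.shape
    output = np.empty_like(image)  # new matrix with the same size as the input
    for y in range(height):
        for x in range(width):
            # Slicing keeps only the pixels that fall inside the image domain.
            window = image[max(y - 1, 0):y + 2, max(x - 1, 0):x + 2]
            # With an even number of values (corners and edges), np.median
            # averages the two middle values; the result is cast back to uint8.
            output[y, x] = np.median(window)
    return output


# Hypothetical image: mostly 128, with three noisy pixels.
image = np.full((5, 5), 128, dtype=np.uint8)
image[1, 1] = 0
image[1, 2] = 0
image[2, 0] = 255

window = image[0:3, 0:3]                 # window centered at row 1, column 1
print(np.sort(window.ravel()).tolist())  # [0, 0, 128, 128, 128, 128, 128, 128, 255]
print(int(np.median(window)))            # 128

filtered = median_filter(image)
print(filtered)                          # every pixel is 128
```

The explicit double loop mirrors the description in the text. For real images, a vectorized or library implementation would be considerably faster.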
Extrapolating the idea behind the median filter
Moving cars in a street, people walking in front of a landmark, or any other moving objects are "outliers" compared to all the immovable objects in a scene, provided that you have enough data to tell what is an outlier and what is not. So, in order to remove moving objects, you need to capture several images of precisely the same region of space (e.g. with a camera on a tripod). Place the camera and set a timer to take images, say every 10–20 seconds. The right interval between images depends on the speed of the moving objects. After acquiring the images, we can apply the same idea behind the median filter. However, instead of sliding a window over one image, we iterate over all the images at the same time. The adjacency array is formed with the values at a given (x, y) position in every image. This means that the size of the adjacency array equals the number of pictures you have captured. For example, in the image below, the adjacency array in the first iteration (first pixel) will contain all the values painted in gray.
After building the adjacency array, we compute the median value and place it in the same position of the output image, just like in the regular median filter algorithm.
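With NumPy, this adapted filter reduces to a median along the axis that indexes the images. The sketch below is an assumption about how one might write it, not the code from the repository linked at the end of the article. The function name `median_stack` and the synthetic frames are made up for the example, and loading real photos from disk is left out.

```python
import numpy as np


def median_stack(images):
    """Per-pixel median across a list of equally sized images."""
    stack = np.stack(images, axis=0)  # shape: (n_images, height, width[, channels])
    # With an even number of images, np.median averages the two middle values.
    return np.median(stack, axis=0).astype(stack.dtype)


# Synthetic scene: a static background of value 100 and a bright "moving object"
# (value 255) that occupies a different pixel in each of the 5 frames.
background = np.full((4, 5), 100, dtype=np.uint8)
frames = []
for i in range(5):
    frame = background.copy()
    frame[1, i] = 255  # the object moves one pixel to the right per frame
    frames.append(frame)

result = median_stack(frames)
print(np.array_equal(result, background))  # True: the moving object is gone
```

Each pixel is covered by the object in only one of the five frames, so the median at that position is still the background value. An object that covers a pixel in more than half of the frames would become the median instead.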
A few real life examples to illustrate the expected result:
In the set of images above, each image was taken 20 seconds apart and a tripod was used. After executing the adapted median filter algorithm, the results should look like this:
Note that in the "I Amsterdam" result image, there are two people on top of the letters s and t. They remained in a similar position during the image acquisition and were not removed since they became part of the "middle value" for that region of space.
So now you may be wondering what happens if you don't use a tripod. Well, if the images represent different regions of the space, the algorithm will produce a shaky result, almost like a painting, which can also be quite interesting.
If you want to avoid using a tripod during acquisition, you can add a pre-processing step usually referred to as image registration, which finds the parameters of an affine transformation (scale, translation, rotation, etc.) that aligns the images to the same region of space. This would solve the shakiness of the result. However, these are much more complex techniques that are out of the scope of this article.
The code used to generate the images for this article is available here: https://github.com/nmoya/median-stack-filter
Feel free to share your generated images in the comments!
GIFs may not work on mobile devices. I can't figure out why.