“The World is Your Green Screen” — what I’ve learnt from reading the paper
In this article, I will be sharing my takeaways from reading Background Matting: The World is Your Green Screen by Sengupta et al.
Visual effects in movies make use of a green screen to superimpose computer-generated landscapes into the background. That’s green-screen matting.
Here we will be looking at natural image matting. It’s the same thing but we do away with the green screen. Think Zoom virtual backgrounds, where you get to change your background to hide your messy room but without having to purchase a green screen.
Without a green screen, the task becomes much harder. Green-screen matting relies on the contrast between the green screen and the foreground to tell the two apart. Without that contrasting backdrop, parts of the background may have a similar colour to the foreground.
Alright, let’s dive right into natural image matting.
Problem Formulation: The Image Compositing Equation
Given our image, C, how can we separate it pixel-wise into the foreground, F, and the background, B? The compositing equation treats every pixel of C as a blend of F and B, weighted by α. With α and F, we can then change our background by substituting B with a new image. But what’s the role of α?
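For reference, this is how the compositing equation is usually written for a single pixel. The symbols C, F, B and α are the ones used in this article. The typesetting is my own.

```latex
C = \alpha F + (1 - \alpha) B, \qquad \alpha \in [0, 1]
```

When α = 1 the pixel is pure foreground, when α = 0 it is pure background, and anything in between is a mix of the two.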
Not a segmentation task
In segmentation, we make a hard decision for each pixel: if the pixel is part of the object, we assign it a value of 1, and if it is not, we assign it a value of 0. With image matting, however, the assignment is not binary. The person segmentation in the above image shows how this binary classification creates blocky segmentation masks.
α is continuous
To illustrate this, let’s take as an example an image of this cartoon woman.
And we zoom into that red box.
This is what we get. But in reality, the cameras we have today cannot resolve detail as fine as a single strand of hair. So the photo that our camera captures may instead look like this.
Obviously, our real-life photos are not nearly this bad. But this is just for illustration purposes. Now we compare the same pixel area in the low-resolution capture to the high-resolution image.
We notice that the top-left background pixel (white) has been mixed with the foreground pixels around it. It is in these areas that α takes a value between 0 and 1.
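Here is a small sketch of that mixing effect. It assumes that a low-resolution pixel is simply the average of the high-resolution pixels it covers, which is a simplification of what a real camera does. The 4x4 mask and the function name `downsample_coverage` are made up for illustration.

```python
import numpy as np

def downsample_coverage(mask, factor):
    """Average a binary foreground mask over factor x factor blocks."""
    h, w = mask.shape
    # crop so that both sides divide evenly by the factor
    h2, w2 = h - h % factor, w - w % factor
    blocks = mask[:h2, :w2].reshape(h2 // factor, factor, w2 // factor, factor)
    # the mean of each block is the fraction of it covered by foreground
    return blocks.mean(axis=(1, 3))

# hypothetical 4x4 high-resolution mask: 1 = hair (foreground), 0 = background
high_res = np.array([
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

alpha = downsample_coverage(high_res, 2)
print(alpha)
# [[0.75 1.  ]
#  [0.   0.75]]
```

The top-left low-resolution pixel covers three foreground pixels and one background pixel, so its α comes out as 0.75 rather than 0 or 1.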
Background Matting: The World is Your Green Screen (Sengupta et al.) proposes a deep learning model that takes an image of a subject against a ‘natural’ background (C), as well as an image of the background without the subject (B), and predicts the foreground (F) and alpha matte (α).
Referring back to the compositing equation in Figure 1, we can then substitute in our predicted foreground (F*), our predicted alpha matte (α*) and a new target background in place of B to get a composite of our subject in front of that new background.
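As a minimal sketch, that substitution could look like this in NumPy. The function name `composite` and the toy arrays are my own, and I am assuming α is stored as a float in [0, 1] with one value per pixel.

```python
import numpy as np

def composite(alpha, foreground, background):
    """Blend foreground over background: alpha * F + (1 - alpha) * B."""
    alpha = np.asarray(alpha, dtype=np.float32)
    if alpha.ndim == 2:
        alpha = alpha[..., None]  # add a channel axis so it broadcasts over colours
    fg = np.asarray(foreground, dtype=np.float32)
    bg = np.asarray(background, dtype=np.float32)
    return alpha * fg + (1.0 - alpha) * bg

# toy 1x3 "image": fully foreground, half-and-half, fully background
alpha_pred = np.array([[1.0, 0.5, 0.0]])
f_pred = np.full((1, 3, 3), 200.0)   # stand-in for the predicted foreground
new_bg = np.full((1, 3, 3), 100.0)   # stand-in for the new target background
out = composite(alpha_pred, f_pred, new_bg)
print(out[0, :, 0])  # [200. 150. 100.]
```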
Naturally, our compositing equation will require the background in both photos to be aligned. This means that the photographer must keep absolutely still between both photos. (Impossible.) Luckily, we have something called homography to account for minimal hand movement.
Simply put, a homography is a 3×3 transformation that maps points in one image to the corresponding points in another image of the same scene. It is estimated from key points matched between the two images, and it allows us to warp the first image to match the perspective of the second one.
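To make this more concrete, here is a sketch of estimating a homography from point pairs with the direct linear transform, using only NumPy. It assumes the matching key points have already been found, and it is not necessarily how the paper's code does it. In practice you would more likely reach for OpenCV's `cv2.findHomography` and `cv2.warpPerspective`. The function names and the sample points below are made up.

```python
import numpy as np

def estimate_homography(src, dst):
    """Direct linear transform: find the 3x3 H such that dst ~ H @ src."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        # each point pair contributes two linear constraints on the entries of H
        rows.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        rows.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    # the solution is the right singular vector with the smallest singular value
    _, _, vt = np.linalg.svd(np.asarray(rows, dtype=float))
    h = vt[-1].reshape(3, 3)
    return h / h[2, 2]  # fix the arbitrary scale

def apply_homography(h, pts):
    """Map 2D points through H using homogeneous coordinates."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ h.T
    return mapped[:, :2] / mapped[:, 2:3]

# hypothetical "hand movement" between the two shots
h_true = np.array([[1.0, 0.02, 3.0], [-0.01, 1.0, -2.0], [1e-4, 0.0, 1.0]])
src_pts = np.array([[0, 0], [100, 0], [100, 80], [0, 80], [50, 40]], dtype=float)
dst_pts = apply_homography(h_true, src_pts)

h_est = estimate_homography(src_pts, dst_pts)
realigned = apply_homography(h_est, src_pts)
```

With real photos the matches are noisy, so a robust estimator (for example RANSAC) is normally used on top of this.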
But how do we get the ground truth F and α in the first place? We return to the simpler case of green-screen matting and use current technologies/algorithms like difference matting to extract our ground truth F and α from profile pictures taken of people in front of a green screen.
The Adobe Matting dataset provides 455 pairs of ground truth F and α. Sengupta et al., however, only used a subset of 280 pairs that came from images of non-transparent objects. They then superimposed each foreground onto a number of images from MS COCO using the compositing equation (Figure 1).
Disclaimer: I do not have access to the actual dataset. I created the above diagram using the output images from Background Matting.
Note that they’ll end up with the four things they need for a training pass:
- A synthetic composite (of a foreground from the Adobe Matting dataset onto a background from MS COCO)
- Its background without the subject (from MS COCO)
- The ground truth F
- The ground truth α
But wait! Domain Gap!
Have you noticed a problem? We are training on synthetic composites (an extracted foreground pasted onto a new background) but we want our model to run inference on real composites (a camera shot). There’s a domain gap! Essentially, a model that is trained on one kind of data tends to do poorly when asked to infer on another. For example, a CNN classifier trained on only one breed of cat will likely have a hard time classifying cats of other breeds.
More specifically, synthetic composites are different in these ways:
Sengupta proposes 3 main things to close this domain gap.
- Data augmentation
- ContextSwitchingBlock — new architecture
- Adversarial loss
As we have previously established, the backgrounds in our synthetic composite will look different from those of our ‘real’ camera shots. Sengupta proposes some preprocessing to the backgrounds in our synthetic composites.
“In particular, we generated each B′ by randomly applying either a small gamma correction γ ∼ N(1, 0.12) to B or adding gaussian noise η ∼ N(µ ∈ [−7, 7], σ ∈ [2, 6]) around the foreground region.” (Section 3.1)
We demarcate this area “around the foreground region” with the help of a segmentation mask. This looks something like this in code:
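import cv2
import numpy as np

# segmentation mask of the foreground, created with DeepLab (read as a single channel)
m = cv2.imread("0001_masksDL.png", cv2.IMREAD_GRAYSCALE)
# dilate the mask so that it extends beyond the actual foreground
kernel = np.ones((5, 5), np.uint8)
m1 = cv2.dilate(m, kernel, iterations=100)
# the band around the foreground: dilated mask minus the original mask
area_to_transform = cv2.subtract(m1, m)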
Where this mask is non-zero, we apply the gamma correction or add the gaussian noise to our composite image.
Note, however, that we do not apply the same transformations to the background image without the subject. They are only applied to the image with the subject.
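Here is a rough sketch of what "randomly applying either" could look like for an image and a region mask. Several things are my own assumptions: that 0.12 is the standard deviation of the gamma, that µ and σ are drawn uniformly from their ranges, that the two options are equally likely, and the small clamp on gamma. The function name `perturb_region` is also made up.

```python
import numpy as np

def perturb_region(image, region, rng):
    """Apply gamma correction OR additive Gaussian noise where region is True."""
    img = image.astype(np.float32)
    if rng.random() < 0.5:
        # gamma ~ N(1, 0.12); treating 0.12 as the standard deviation is an assumption
        gamma = max(rng.normal(1.0, 0.12), 0.1)  # clamp is a hypothetical safety guard
        changed = 255.0 * (img / 255.0) ** gamma
    else:
        # noise mean in [-7, 7] and std in [2, 6]; uniform sampling is an assumption
        mu = rng.uniform(-7.0, 7.0)
        sigma = rng.uniform(2.0, 6.0)
        changed = img + rng.normal(mu, sigma, size=img.shape)
    # keep the original pixels outside the region
    out = np.where(region[..., None], changed, img)
    return np.clip(out, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
image = np.full((4, 4, 3), 128, dtype=np.uint8)   # toy grey image
region = np.zeros((4, 4), dtype=bool)
region[1:3, 1:3] = True                           # toy "around the foreground" band
result = perturb_region(image, region, rng)
```

Pixels outside the region are returned untouched, so only the band around the subject is changed.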
Sengupta leverages existing segmentation networks by additionally feeding the model a ‘soft’ segmentation of the image. The word ‘soft’ comes from applying some morphological transformations (erode, then dilate) and a gaussian blur to the output of a segmentation network (of your choosing).
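# 'soften' the segmentation mask m: erode, then dilate, then blur
kernel = np.ones((5, 5), np.uint8)
soft = cv2.erode(m, kernel, iterations=5)
soft = cv2.dilate(soft, kernel, iterations=10)
# 5x5 gaussian blur with sigma 5 turns the hard mask edge into a gradual falloff
soft = cv2.GaussianBlur(soft, (5, 5), 5)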
The ContextSwitchingBlock basically takes a combination of the images with and without the subject as well as this soft segmentation, the idea being that the model will learn to select better cues when assigning α values to pixels. For example, around the perimeter of the subject, the model may rely more on the segmentation and pump up the weights linked to the segmentation mask.
Well… of course this is overly simplified. More accurately, it looks like this:
Briefly speaking, the prior encoders each produce a 256-channel activation map of their component (soft segmentation or background), and each selector combines that map with that of the image. The outputs of both selectors are then combined again with the image and fed into ResBlocks before getting decoded.
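To get a feel for the data flow, here is a shape-level sketch in NumPy. It is not the paper's implementation: the weights are random, nothing is trained, and treating "combine" as a channel-wise concatenation followed by a 1x1 convolution, as well as the 64-channel selector width, are assumptions I made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(x, out_channels):
    """1x1 convolution with random weights, followed by ReLU. x has shape (C, H, W)."""
    w = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.maximum(np.einsum("oc,chw->ohw", w, x), 0.0)

def selector(image_feat, prior_feat, out_channels=64):
    """Combine image features with one prior's features (concat + 1x1 conv)."""
    return conv1x1(np.concatenate([image_feat, prior_feat], axis=0), out_channels)

h, w = 8, 8
# stand-ins for the 256-channel activation maps coming out of the encoders
image_feat = rng.standard_normal((256, h, w))
background_feat = rng.standard_normal((256, h, w))
segmentation_feat = rng.standard_normal((256, h, w))

sel_bg = selector(image_feat, background_feat)
sel_seg = selector(image_feat, segmentation_feat)

# combine both selector outputs with the image features again
combined = conv1x1(np.concatenate([image_feat, sel_bg, sel_seg], axis=0), 256)
print(sel_bg.shape, sel_seg.shape, combined.shape)  # (64, 8, 8) (64, 8, 8) (256, 8, 8)
```

In the real network, `combined` is what would go on to the ResBlocks and the decoder.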
(I crossed out the component at the bottom as it is used specifically for video input. Yes, the model works with videos too! But because I have scoped this article to only still images, I took it out of the picture.)
Sengupta also adds a discriminator that will judge if the composite coming out from our model looks like a real composite.
What’s interesting is that the authors did not add this discriminator to the first model trained above. Doing so produced less accurate alpha mattes, which they attribute to the trained weights not being able to change significantly under a discriminator.
Instead, the model we trained above (referred to as G_Adobe in the paper) takes on a teacher role for a duplicate model with randomly initialised weights (referred to as G_Real in the paper), which is the student. This means that G_Real gets a loss term from both the discriminator and G_Adobe.
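A simplified sketch of how the student's two loss terms could be put together is shown below. The least-squares form of the adversarial term, the plain L1 teacher term, the weight `lam` and the function name `student_loss` are all my assumptions. The paper's actual loss has more terms than this.

```python
import numpy as np

def student_loss(alpha_s, fg_s, alpha_t, fg_t, d_score, lam=0.05):
    """Adversarial term from the discriminator plus imitation of the teacher."""
    # least-squares GAN form (an assumption): push D's score on the composite towards 1
    adversarial = np.mean((d_score - 1.0) ** 2)
    # L1 distance between the student's and the teacher's predictions
    teacher = np.mean(np.abs(alpha_s - alpha_t)) + np.mean(np.abs(fg_s - fg_t))
    return adversarial + lam * teacher

# toy values: the student copies the teacher exactly and fully fools the discriminator
alpha_teacher = np.full((2, 2), 0.5)
fg_teacher = np.zeros((2, 2, 3))
perfect = student_loss(alpha_teacher, fg_teacher, alpha_teacher, fg_teacher, np.array([1.0]))
print(perfect)  # 0.0
```

The further the student drifts from the teacher, or the less convincing its composites look to the discriminator, the larger this value gets.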
We got acquainted with the natural image matting problem and learned about Sengupta’s deep learning approach. Sengupta’s method is the first of its kind to use deep learning in a natural image matting problem to predict both the foreground, F, and the alpha matte, α. This is as opposed to models such as AlphaGAN, which only predict α.
Looking ahead, there is still room for improvement in natural image matting solutions. Ideally, we wouldn’t want to have to take a separate background photo without the subject.
Hopefully, what I’ve shared today will give you an easier time reading the paper for yourself or perhaps will even spark the next innovation in natural image matting!