“The World is Your Green Screen” — what I’ve learnt from reading the paper

gordonlim
Apr 14, 2020 · 8 min read

In this article, I will be sharing my takeaways from reading Background Matting: The World is Your Green Screen by Sengupta et al.

Background Matting: The World is Your Green Screen (Sengupta et al.)

Visual effects in movies make use of a green screen to superimpose computer-generated landscapes into the background. That’s green-screen matting.

Use of green screen in Avengers. (Source: FameFocus)

Here we will be looking at natural image matting. It’s the same thing but we do away with the green screen. Think Zoom virtual backgrounds, where you get to change your background to hide your messy room but without having to purchase a green screen.

Zoom virtual background. (Source: https://www.zoomvirtualbackgrounds.com/)

Without a green screen, the task becomes much harder. For one, without a green screen providing a contrasting backdrop, parts of the background may have colours similar to the foreground, and green-screen matting relies on exactly that contrast to tell foreground and background apart.

Alright, let's dive right into natural image matting.

Problem Formulation: The Image Compositing Equation

Figure 1: the image compositing equation, C = αF + (1 − α)B (Source: Alon Gamliel)

Given our image, C, how can we separate it into the foreground, F, and background, B, pixel-wise. With α and F, we can then change our background by substituting B with a new image. But what’s the role of α?

Not a segmentation task

Source: Reproducing Deep Image Matting by Laurent H.

In segmentation, we classify each pixel as either belonging to the object or not: a pixel gets a value of 1 if it is part of the object and 0 if it is not. With image matting, however, the assignment is not binary. The person segmentation in the image above shows how binary classification creates blocky segmentation masks.
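
To see the difference in numbers, compare a made-up one-dimensional strip of pixels crossing a wisp of hair:

import numpy as np

segmentation = np.array([1, 1, 1, 0, 0])            # hard labels: object or not
alpha_matte = np.array([1.0, 0.9, 0.4, 0.1, 0.0])   # matting: fractional coverage per pixel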

α is continuous

To illustrate this, let's take as an example this image of a cartoon woman.

[Image: a cartoon woman, with a red box marking a small region of the picture]

And we zoom into that red box.

[Image: the contents of the red box, shown at high resolution]

This is what we get. In reality, though, our cameras cannot resolve details as fine as a single strand of hair, so the photo the camera captures may instead look like this.

[Image: the same region as a camera might capture it, at a much lower resolution]

Obviously, our real-life photos are not nearly this bad. But this is just for illustration purposes. Now we compare the same pixel area in the low-resolution capture to the high-resolution image.

[Image: the low-resolution capture side by side with the high-resolution version of the same pixel area]

We notice that the top-left background pixel (white) has been mixed with the neighbouring foreground pixels. It is in these areas that α takes a value between 0 and 1.
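
As a quick worked example (with made-up numbers): suppose a strand of dark hair with value 0.2 covers 70% of that pixel, so α = 0.7, while the white background has value 1.0. The compositing equation in Figure 1 then says the camera records 0.7 × 0.2 + 0.3 × 1.0 = 0.44, a grey that belongs fully to neither the foreground nor the background.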

Background Matting


Background Matting: The World is Your Green Screen (Sengupta et al.) proposes a deep learning model that takes an image of a subject against a 'natural' background (C) as well as an image of the background without the subject (B) and predicts the foreground (F) and the alpha matte (α).

Referring back to the compositing equation in Figure 1, we can then substitute our predicted foreground (F*), our predicted alpha matte (α*) and a new target background to get a composite of our subject in front of that new background.

Wobbly Hands

Naturally, our compositing equation requires the background in the two photos to be aligned, which means the photographer would have to keep absolutely still between the shots. (Impossible.) Luckily, we have something called homography to account for small hand movements.

Aligning a picture of one book to another using homography. (Source: Learn OpenCV)

Simply put, we detect key points in one image, match them to the corresponding key points in the other, and estimate a homography from those matches. This transformation lets us warp the first image so that it matches the perspective of the second.
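
As a rough sketch of that alignment step with OpenCV (my own illustration, not the authors' code; the file names are hypothetical):

import cv2
import numpy as np

img = cv2.imread("with_subject.png")       # photo with the subject
bg = cv2.imread("background_only.png")     # photo of the background alone

# detect and describe key points in both images
orb = cv2.ORB_create(5000)
kp1, des1 = orb.detectAndCompute(bg, None)
kp2, des2 = orb.detectAndCompute(img, None)

# match descriptors and keep the best matches
matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
matches = sorted(matches, key=lambda m: m.distance)[:200]

src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)

# estimate the homography and warp the background so it lines up with the photo
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
aligned_bg = cv2.warpPerspective(bg, H, (img.shape[1], img.shape[0]))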

Ground Truths

But how do we get the ground truth F and α in the first place? We return to the simpler case of green-screen matting and use existing techniques, such as difference matting, to extract ground truth F and α from photos of people taken in front of a green screen.
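
Purely for intuition, a toy version of difference matting might look like the sketch below. Real extraction pipelines are far more careful about colour spill and soft edges, so treat this as an illustration only (the file names and the threshold are made up):

import cv2
import numpy as np

subject = cv2.imread("person_on_green.png").astype(np.float32)
screen = cv2.imread("green_screen_only.png").astype(np.float32)

# per-pixel colour difference between the shot and the bare green screen
diff = np.linalg.norm(subject - screen, axis=2)

# squash the difference into a crude alpha in [0, 1]
alpha = np.clip(diff / 80.0, 0.0, 1.0)

# take the photographed colours as a rough foreground estimate
F = subject / 255.0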

The Adobe Matting dataset provides 455 ground-truth F and α pairs. Sengupta, however, used only a subset of 280 pairs that come from images of non-transparent objects. Each of these foregrounds was then superimposed onto a set of backgrounds from MS COCO using the compositing equation (Figure 1).

[Diagram: a foreground from the Adobe Matting dataset composited onto an MS COCO background]

Disclaimer: I do not have access to the actual dataset. I created the above diagram using the output images from Background Matting.

Note that they’ll end up with the four things they need for a training pass:

  1. A synthetic composite (of a foreground from the Adobe Matting dataset onto a background from MS COCO)
  2. Its background without the subject (from MS COCO)
  3. The ground truth F
  4. The ground truth α
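
To make this concrete, one such training sample could be assembled roughly as follows; the file names and the resizing step are mine, not the paper's:

import cv2
import numpy as np

F = cv2.imread("adobe_fg_0001.png").astype(np.float32) / 255.0       # ground truth foreground
alpha = cv2.imread("adobe_alpha_0001.png", cv2.IMREAD_GRAYSCALE)
alpha = (alpha.astype(np.float32) / 255.0)[..., None]                 # ground truth alpha, one channel
bg = cv2.imread("coco_000123.jpg").astype(np.float32) / 255.0         # a background from MS COCO
bg = cv2.resize(bg, (F.shape[1], F.shape[0]))                         # match the foreground size

# the compositing equation from Figure 1 gives the synthetic composite (item 1);
# bg, F and alpha are items 2, 3 and 4
composite = alpha * F + (1.0 - alpha) * bg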

But wait! Domain Gap!

Have you noticed a problem? We are training on synthetic composites (extracted foregrounds pasted onto new backgrounds), but we want our model to run inference on real composites (actual camera shots). There's a domain gap! It is generally bad for a model to be trained on one kind of data and then asked to infer on another, much like a CNN classifier trained on only one breed of cat will struggle to classify other breeds.

More specifically, synthetic composites differ from 'real' camera shots in a number of ways.

3 things

Sengupta proposes 3 main things to close this domain gap.

  1. Data augmentation
  2. ContextSwitchingBlock — new architecture
  3. Adversarial loss

Perturb B

As we have previously established, the backgrounds in our synthetic composite will look different from those of our ‘real’ camera shots. Sengupta proposes some preprocessing to the backgrounds in our synthetic composites.

In particular, we generated each B′ by randomly applying either a small gamma correction γ ∼ N(1, 0.12) to B or adding gaussian noise η ∼ N(µ ∈ [−7, 7], σ ∈ [2, 6]) around the foreground region. (Section 3.1)

We demarcate this area "around the foreground region" with the help of a segmentation mask. In code, that looks something like this:

# segmentation mask of the foreground, created with DeepLab
import cv2
import numpy as np
import matplotlib.pyplot as plt

m = cv2.imread("0001_masksDL.png")
plt.imshow(m)

# dilate the mask so it extends beyond the actual foreground
kernel = np.ones((5, 5), np.uint8)
m1 = cv2.dilate(m, kernel, iterations=100)
plt.imshow(m1)

# the ring between the dilated and the original mask is the area "around the foreground"
area_to_transform = m1 - m
plt.imshow(area_to_transform)

Where this mask equals 1, we add the Gaussian noise (the gamma correction, when it is the chosen perturbation, is applied to the whole background).

Note that, per the quoted passage, these perturbations produce B′, the background image that is fed to the network, while the composite itself is still built with the unperturbed B. The pair therefore no longer matches exactly, much like a real capture, where the separately photographed background differs slightly from the true background behind the subject.
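
A rough sketch of that perturbation (my own code, not the authors'; the file name is hypothetical, I treat the paper's noise values as 8-bit intensities and 0.12 as a standard deviation, and σ is sampled uniformly):

# the background used for the composite (hypothetical file name), resized to the mask
B = cv2.imread("coco_000123.jpg").astype(np.float32) / 255.0
B = cv2.resize(B, (m.shape[1], m.shape[0]))
ring = area_to_transform[..., 0] > 0          # the region "around the foreground"

if np.random.rand() < 0.5:
    # small gamma correction applied to the whole background
    gamma = np.random.normal(1.0, 0.12)
    B_prime = np.power(np.clip(B, 0.0, 1.0), gamma)
else:
    # Gaussian noise added only around the foreground region
    mu = np.random.uniform(-7, 7) / 255.0
    sigma = np.random.uniform(2, 6) / 255.0
    noise = np.random.normal(mu, sigma, B.shape).astype(np.float32)
    B_prime = B.copy()
    B_prime[ring] = np.clip(B[ring] + noise[ring], 0.0, 1.0)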

ContextSwitchingBlock

Sengupta leverages existing segmentation networks by additionally feeding the model a 'soft' segmentation of the image. The word 'soft' comes from applying some morphological transformations (erode, then dilate) and a Gaussian blur to the output of a segmentation network of your choosing.

# soften the DeepLab mask: erode, then dilate, then blur
kernel = np.ones((5, 5), np.uint8)
soft = cv2.erode(m, kernel, iterations=5)
soft = cv2.dilate(soft, kernel, iterations=10)
soft = cv2.GaussianBlur(soft, (5, 5), 5)
plt.imshow(soft)

The ContextSwitchingBlock takes a combination of the image with the subject, the image without the subject and this soft segmentation, the idea being that the model learns to select the most useful cue when assigning α values to each pixel. For example, around the perimeter of the subject, the model may lean more on the segmentation and turn up the weights linked to the segmentation mask.

[Diagram: a simplified illustration of the Context Switching Block idea]

Well… of course this is overly simplified. More accurately, it looks like this:

[Diagram: the Context Switching Block architecture from the paper, with the video-only component crossed out]

Briefly speaking, each prior encoder produces a 256-channel activation map of its component (the soft segmentation or the background), and a selector combines that map with the image's activation map. The outputs of the two selectors are then combined again with the image features and fed into ResBlocks before being decoded.

(I crossed out the component at the bottom as it is used specifically for video input. Yes, the model works with videos too! But because I have scoped this article to only still images, I took it out of the picture.)
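
To pin the idea down, here is a heavily simplified PyTorch sketch of a context-switching style block. This is my own toy version, not the authors' implementation; the layer choices and channel sizes are only illustrative.

import torch
import torch.nn as nn

class Selector(nn.Module):
    """Combines image features with one prior's features into a single map."""
    def __init__(self, channels=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )
    def forward(self, img_feat, prior_feat):
        return self.conv(torch.cat([img_feat, prior_feat], dim=1))

class ContextSwitchingBlockSketch(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # one encoder each for the image, the background and the soft segmentation
        self.enc_img = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.enc_bg = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.enc_seg = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.sel_bg = Selector(channels)
        self.sel_seg = Selector(channels)
        # combine the image features with both selector outputs
        self.combine = nn.Conv2d(3 * channels, channels, kernel_size=1)
    def forward(self, image, background, soft_seg):
        f_img = self.enc_img(image)
        f_bg = self.enc_bg(background)
        f_seg = self.enc_seg(soft_seg)
        c_bg = self.sel_bg(f_img, f_bg)       # image + background cues
        c_seg = self.sel_seg(f_img, f_seg)    # image + segmentation cues
        # the combined features would then go through ResBlocks and a decoder
        return self.combine(torch.cat([f_img, c_bg, c_seg], dim=1))

Passing an image, a background and a one-channel soft segmentation of the same size through this block yields a single 256-channel feature map for the rest of the network to consume.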

Adversarial Loss

Sengupta also adds a discriminator that will judge if the composite coming out from our model looks like a real composite.

What's interesting is that this discriminator is not attached to the first model trained above: doing so produced less accurate alpha mattes, which Sengupta attributes to the already-trained weights not being able to change significantly under a discriminator.

Instead, the model trained above (referred to as GAdobe in the paper) takes on a teacher role for a duplicate model with randomly initialised weights (referred to as GReal), which plays the student. This means that GReal gets a loss term from both the discriminator and GAdobe.
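
Loosely, and glossing over the exact terms and weights the paper uses, the student's objective could be sketched like this (all names and tensors below are toy stand-ins of mine):

import torch
import torch.nn.functional as Fnn

# toy stand-ins: student (GReal) and teacher (GAdobe) predictions on the same real input
alpha_s, fg_s = torch.rand(1, 1, 64, 64), torch.rand(1, 3, 64, 64)   # from GReal
alpha_t, fg_t = torch.rand(1, 1, 64, 64), torch.rand(1, 3, 64, 64)   # from GAdobe
d_score = torch.rand(1)   # discriminator's "realness" score for the student's composite

# teacher term: stay close to what GAdobe predicts
teacher_loss = Fnn.l1_loss(alpha_s, alpha_t) + Fnn.l1_loss(fg_s, fg_t)

# adversarial term: the student's composite should look real to the discriminator
adv_loss = -torch.log(d_score + 1e-8).mean()

loss = teacher_loss + adv_loss   # the paper balances these terms; the weighting is omitted here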


Wrapping up

We got acquainted with the natural image matting problem and learned about Sengupta's deep learning approach. Sengupta's method is the first of its kind to use deep learning on the natural image matting problem to predict both the foreground F and the alpha matte α, as opposed to models such as alphaGAN, which only predict α.

Looking ahead, there is still room for improvement in natural image matting solutions. Ideally, we wouldn’t want to have to take a separate background photo without the subject.

Hopefully, what I’ve shared today will give you an easier time reading the paper for yourself or perhaps will even spark the next innovation in natural image matting!
