Image blending with Mask R-CNN and OpenCV

HiuKim Yuen
SoftMind Engineering and Research
Jul 21, 2018


We were recently consulted by a potential client about an interesting image blending project. In short, the idea is to allow end users to take photos of themselves and blend them into historical photos.

Normally, this would be a pretty straightforward task if the users’ photos were taken in a controlled environment. For example, with a green screen background, we could easily extract all the pixels belonging to the people and then merge them into the target photos with some kind of blending algorithm. However, the challenge of this project is that the photos are taken by end users, possibly in any environment with any background. Immediately, this becomes a much more challenging problem.

Image Segmentation with Mask R-CNN

After some brainstorming, we ended up trying out Mask R-CNN¹, a deep learning image segmentation technique, to extract people from the images. There are many well-written libraries with pre-trained models that we can use directly, and we are using this one:

With a couple of small modifications to the provided sample script, we can easily capture the person. The next step is to extract the relevant pixels and merge them into the target historic photos.
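As a rough sketch of that extraction step, suppose the segmentation library returns one boolean mask per detected instance, stacked into an array of shape (height, width, instances), plus one class id per instance. That output layout, the `PERSON_CLASS_ID` value, and the helper name `person_mask` are assumptions made for illustration and may not match the library you use. Under those assumptions, all detected people can be merged into a single black-and-white image:

```python
import numpy as np

# Assumed class id for "person". Check the class list of the model you use.
PERSON_CLASS_ID = 1

def person_mask(masks, class_ids, person_id=PERSON_CLASS_ID):
    """Merge all person instances into one uint8 image (255 = person, 0 = background)."""
    masks = np.asarray(masks, dtype=bool)        # shape: (height, width, instances)
    class_ids = np.asarray(class_ids)            # shape: (instances,)
    is_person = class_ids == person_id           # which instances are people
    # A pixel is white if any person instance covers it.
    combined = masks[:, :, is_person].any(axis=2)
    return combined.astype(np.uint8) * 255

# Tiny hypothetical detection result: instance 0 is a person, instance 1 is not.
demo_masks = np.zeros((4, 4, 2), dtype=bool)
demo_masks[0, 0, 0] = True
demo_masks[3, 3, 1] = True
demo = person_mask(demo_masks, [PERSON_CLASS_ID, 3])
```

If no person is detected, the function simply returns an all-black image, which is a convenient signal to skip the blending step.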

Image blending with OpenCV

OpenCV is a very mature library with many out-of-the-box image processing algorithms. For this project, we use the SeamlessClone API². To use it, we first need to define a mask that covers the source image. In other words, we need to turn the result of Mask R-CNN into a mask. Technically, a mask is an image of black and white pixels: white pixels mark the regions to merge into the target, while black pixels are ignored.

The image below is our first attempt.

There are two major problems with the above result. First, the merged person looks too “ghosty”, as if the edges are too soft. Second, the color tone obviously doesn’t match.

Regarding the first problem, if we look at the image segmentation result above, we can see that the segmented regions don’t really cover the whole person. One obviously problematic region is the hair, which is partly cut off. Also, although the body region is segmented almost perfectly, the blended result isn’t as good as expected because the algorithm blends around the periphery of the mask, so a tight mask causes the body edges to fade out too quickly, which makes the person look “ghosty”. To fix this, we dilate the mask using the OpenCV dilation API³.

With mask dilation, the person looks much more solid. Next, we need to tune the color. Again, there are many color tuning algorithms out there, which we will not go into in this article. For this specific historic black-and-white image, it turns out that simply converting our source image to grayscale does the trick. I’m sure there is a lot of room for improvement along the pipeline, but for now, this result looks acceptable to us.

More Results

To test more images, we grabbed some random images from the Internet (more specifically, we wanted to try images with groups of people). Below are two examples.


This was a very interesting project for us to try out Mask R-CNN, which is really impressive. The image segmentation result is unbelievably good. The same pipeline could potentially be used in other image processing projects, not just for extracting people.

One concern we have, though, is the processing time. The above demo was built and run on my MacBook. The image segmentation part alone takes more than 10 seconds to run. (People say it could probably be done in around a second on a desktop computer with a single GPU, but we haven’t tried.) Our ultimate goal is to run this on mobile devices, so that’s probably the next thing we will dive into!