How I Implemented My Own Augmented Reality “Beauty Mode”

Vincent Dedun · Published in The Startup · Sep 12, 2020
Comparison of original and smoothed skin

I was looking for a personal project to keep me busy during lockdown and decided to have a crack at implementing a Beauty Mode similar to Snapchat's and TikTok's, that is, a skin smoothing filter that runs in real time on video. It was more challenging than I anticipated; here's a breakdown of how I did it.

How do professionals do it?

'Improving' skin in Photoshop is common knowledge nowadays and it's quite easy to find resources on the topic. There are surprisingly many different techniques to make skin look better, and most share a common foundation: frequency separation.

Original image

What we are trying to achieve is to remove the various blemishes from the base picture above, the small variations in shape or colour on the skin, while keeping the overall look similar. Separating the small variations from the large ones can be achieved by splitting the picture into a high-frequency signal, which contains the blemishes, and a low-frequency signal, the smooth skin.

To obtain the so-called high pass, we first apply a Gaussian blur with a radius just large enough to make the undesired small imperfections disappear (e.g. 10 pixels).

Original image with Gaussian blur

Then, we subtract the blurred image from the original one and add a 0.5 grey (128 at 8-bit colour depth); that's our high pass. In Gimp, the grain extract layer mode does exactly that.

High pass
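In GLSL, the grain extract step is a one-liner. Here's a minimal sketch, assuming original and blurred hold the source colour and its Gaussian-blurred version:

vec4 high_pass(vec4 original, vec4 blurred)
{
    // Difference of the two images, re-centred around 0.5 grey
    return clamp(original - blurred + vec4(0.5), 0.0, 1.0);
}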

For the next steps, we want to attenuate the details from the high pass on the original image through some layering mode. If we simply overlay the high pass on the original image, we get sharper skin with even more visible imperfections, the opposite of what we want.

High pass overlaid on original image

To get the desired result, we simply invert the high-pass colours: now we get nice smooth skin!

Inverted high pass overlaid on original image

Although… we also smoothed parts of the image without skin.

Here’s a GLSL snippet to implement the overlay layering mode:

vec4 overlay(vec4 a, vec4 b)
{
    // Overlay blend: a is the base layer and drives the threshold, b is the blend layer
    vec4 x = vec4(2.0) * a * b;                                          // 'multiply' branch, base <= 0.5
    vec4 y = vec4(1.0) - vec4(2.0) * (vec4(1.0) - a) * (vec4(1.0) - b);  // 'screen' branch, base > 0.5
    vec4 result;
    result.r = mix(x.r, y.r, float(a.r > 0.5));
    result.g = mix(x.g, y.g, float(a.g > 0.5));
    result.b = mix(x.b, y.b, float(a.b > 0.5));
    result.a = mix(x.a, y.a, float(a.a > 0.5));
    return result;
}
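For reference, combining the two steps might look like this, with original as the base layer (the first argument drives the threshold in the function above) and high_pass the extracted detail layer:

// Invert the high pass, then overlay it on the original image
vec4 smoothed = overlay(original, vec4(1.0) - high_pass);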

How do we only smooth the skin?

For Photoshop work, an artist would then use brush tools to create a mask, applying the smooth skin only where desired and keeping sharp details elsewhere: eyes, hair, mouth, etc. In our filter, we can't do that easily.

Luckily, there is a type of blur that can preserve sharp edges such as eyes and hair: it's called Surface Blur in Photoshop, Selective Blur in Gimp and the Bilateral Filter in the literature. Applying it to our original image by itself already gives a result close to what we want, out of the box:

Bilateral filter on original image

Let’s use the same frequency separation technique as above and simply swap the Gaussian blur with the Bilateral Filter:

Frequency separation with bilateral filter

Pretty good! We can mix the raw bilateral-filtered image with the result of the frequency separation process to balance skin smoothness against detail preservation. In the image above, I bring 10% of the former back.
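In shader terms, that balancing step is a single mix. A sketch, assuming bilateral holds the raw bilateral-filtered colour and smoothed the frequency separation output:

// Bring 10% of the raw bilateral image back for extra smoothness
vec4 balanced = mix(smoothed, bilateral, 0.1);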

Meh… can we do better?

Of course we can! Despite all our efforts, while the skin looks pretty good, we've lost quite a few details in other areas that we'd like to keep. Remember, in Photoshop an artist would create a mask to apply the effect selectively.

Let’s try to generate a mask of the skin using some heuristics, based on the colour of the skin.

CIELAB colour space

For that we’re going to use the CIELAB colour space. It’s designed to take into account how humans perceive colours and has convenient properties for what we’re trying to do.

Consider the three axes of the colour space, normalised to the range [0, 1]. The L axis defines luminance, darker towards 0, lighter towards 1. The a axis represents the amount of red (1) or green (0), and the b axis the amount of yellow (1) or blue (0).

Our goal is to mask out pixels whose colours are unlikely to represent skin. We don't want to be too strict: on a video or photo of the real world, a number of factors will affect the final colour of the pixels, such as lighting conditions, lens, exposure and post-processing from the camera. In particular, indoor lighting is generally not neutral and will shift the hue of perceived colours quite a lot.

So rather than aiming to isolate skin colour exactly, we can use some heuristics to reduce the range of colours we pick based on the probability of a colour representing skin.

The possible range of human skin colour is well understood: every possible skin tone has a mix of red and yellow tints. That is, these tones are reflected by the skin, while the tones opposite on the colour wheel are essentially absorbed.

If we consider the plane defined by the a and b axes in Lab, there is clearly little chance skin will have a significant blue or green tint. We can discard colours with small a values and small b values. To get a smooth mask, we can use a smoothstep operator in a small range around 0.5, the middle of the colour axes.

Let's now consider the other boundary of the a and b axes: values close to 1 are the most saturated yellow/red. Skin doesn't reflect that much light (moderate albedo), so we can discard the highest values with a reversed smoothstep in the range [0.9, 1.0]. For the same reason, we can attenuate the highest luminance values on the L axis with a reversed smoothstep in the range [0.98, 1.02].

Putting this all together, we get the following GLSL snippet:

float skin_mask(in vec4 color)
{
    vec3 lab = rgb2lab(color.rgb); // returns values in [0, 1]: x/r = L, y/g = a, z/b = b
    float a = smoothstep(0.45, 0.55, lab.g);              // discard green tints (small a)
    float b = smoothstep(0.46, 0.54, lab.b);              // discard blue tints (small b)
    float c = 1.0 - smoothstep(0.9, 1.0, length(lab.gb)); // discard the most saturated red/yellow
    float d = 1.0 - smoothstep(0.98, 1.02, lab.r);        // attenuate the highest luminance
    return min(min(min(a, b), c), d);
}

The colour space conversion needs to be handled carefully. Photos and videos are typically encoded in the sRGB colour space; properly converting to CIELAB looks like this: sRGB to linear RGB to CIEXYZ to CIELAB.
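Here's a sketch of that chain in GLSL, assuming a D65 white point and sRGB primaries (the usual constants), with the output normalised to [0, 1] so that neutral grey lands at a = b = 0.5, as skin_mask() above expects:

vec3 srgb2linear(vec3 c)
{
    // Inverse sRGB transfer curve, applied per channel
    bvec3 cutoff = lessThan(c, vec3(0.04045));
    vec3 low = c / 12.92;
    vec3 high = pow((c + 0.055) / 1.055, vec3(2.4));
    return mix(high, low, vec3(cutoff));
}

vec3 rgb2lab(vec3 srgb)
{
    vec3 rgb = srgb2linear(srgb);

    // Linear RGB to CIEXYZ (sRGB primaries, D65 white point, column-major matrix)
    mat3 M = mat3(0.4124, 0.2126, 0.0193,
                  0.3576, 0.7152, 0.1192,
                  0.1805, 0.0722, 0.9505);
    vec3 xyz = M * rgb;

    // Normalise by the D65 reference white
    xyz /= vec3(0.9505, 1.0, 1.089);

    // CIEXYZ to CIELAB non-linearity
    bvec3 small = lessThan(xyz, vec3(0.008856));
    vec3 f = mix(pow(xyz, vec3(1.0 / 3.0)),
                 7.787 * xyz + vec3(16.0 / 116.0),
                 vec3(small));

    float L = 116.0 * f.y - 16.0;  // [0, 100]
    float a = 500.0 * (f.x - f.y); // roughly [-128, 127]
    float b = 200.0 * (f.y - f.z);

    // Normalise everything to [0, 1], neutral grey at a = b = 0.5
    return vec3(L / 100.0, a / 255.0 + 0.5, b / 255.0 + 0.5);
}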

Comparison of original and smoothed skin
Final result

Now we can use the skin mask to mix in the smoothed skin; that looks much better! Note how the hair, eyes and mouth look more natural.
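The final composite is then one more mix driven by the mask; a minimal sketch, with smoothed being the frequency separation output from earlier:

// Apply the smoothed skin only where the mask says there is skin
vec4 final_color = mix(original, smoothed, skin_mask(original));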

One last thing about colour spaces: it's generally a good idea to work in linear RGB space when manipulating colours, so we should convert the input image from sRGB to linear RGB, perform all our processing in linear space, including the blur/filter, and convert the final result back to sRGB.

Ok that looks great, but the bilateral filter is slow!

Indeed, if you've ever tried the Selective Blur in Gimp, you'll have noticed it takes its time to compute. A brute-force implementation of the bilateral filter looks like this pseudocode:

For each pixel m in image:
    result = 0
    total_weight = 0
    For each neighbour pixel n:
        weight = spatial_filter(spatial_distance(m, n))
        weight *= range_filter(range_distance(image(m), image(n)))
        total_weight += weight
        result += weight * image(n)
    filtered_image(m) = result / total_weight

We can use a Gaussian filter kernel for both the spatial_filter() and range_filter() operators; the former returns a weight based on pixel distance in space (coordinates), and the latter based on distance in pixel luminance (we can use the L axis of the CIELAB colour space). Both kernels are defined with their respective, fixed, standard deviation σ.
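For illustration only, here's what that brute-force version could look like as a fragment shader; radius, sigma_s, sigma_r and texel are hypothetical uniforms, and rgb2lab() is the conversion from earlier:

uniform sampler2D src;
uniform vec2 texel;    // 1.0 / image resolution
uniform int radius;    // neighbourhood radius in pixels, e.g. 10
uniform float sigma_s; // spatial standard deviation, in pixels
uniform float sigma_r; // range standard deviation, on CIELAB L
in vec2 uv;
out vec4 frag_color;

float gauss(float x, float sigma)
{
    return exp(-(x * x) / (2.0 * sigma * sigma));
}

void main()
{
    vec4 centre = texture(src, uv);
    float centre_L = rgb2lab(centre.rgb).x;

    vec4 result = vec4(0.0);
    float total_weight = 0.0;
    for (int j = -radius; j <= radius; ++j)
    {
        for (int i = -radius; i <= radius; ++i)
        {
            vec4 n = texture(src, uv + vec2(i, j) * texel);
            // Spatial weight (distance in pixels) times range weight (distance in luminance)
            float w = gauss(length(vec2(i, j)), sigma_s)
                    * gauss(rgb2lab(n.rgb).x - centre_L, sigma_r);
            result += w * n;
            total_weight += w;
        }
    }
    frag_color = result / total_weight;
}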

In the worst case (large neighbourhood), for each pixel we would read every other pixel of the image, for a quadratic complexity of O(N²)! Even with a moderate neighbourhood of 10 pixels, that takes too much time.

Fortunately, researchers found a way to compute a bilateral filter… in real time! The paper Real-Time O(1) Bilateral Filtering by Q. Yang et al. describes a simple technique where bilateral filtering is computed as a linear combination of a constant number of spatially filtered images. Note that O(1) here refers to the complexity of the filter in terms of kernel size; applying the filter to an image still has linear complexity O(N) with respect to the number of pixels.

Basically, we divide the luminance range into a constant number K of image 'slabs' (called 'PBFICs' in the paper) and apply a spatial kernel filter (a.k.a. blur) to each, for which we know an O(1) implementation. The result is obtained by linear interpolation of the filtered slabs based on the luminance of the original image. The pseudocode for this approach looks like this:

For each slab l in range(0, K):
    slab_range = l / (K - 1)
    For each pixel m in image:
        weights(m) = range_filter(range_distance(image(m), slab_range))
    blurred_slabs[l] = gaussian_blur(image * weights) / gaussian_blur(weights)
combine_slabs(blurred_slabs, image)
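As a GLSL sketch, the slab preparation pass for one slab could look like this; storing the range weight in the alpha channel, so that a single Gaussian blur pass filters both the numerator (weight times colour) and the denominator (weight), is my own convenience, not something prescribed by the paper:

uniform sampler2D src;    // original image, in linear RGB
uniform float slab_range; // l / (K - 1) for this slab
uniform float sigma_r;    // range standard deviation, e.g. 0.1
in vec2 uv;
out vec4 frag_color;

void main()
{
    vec3 rgb = texture(src, uv).rgb;
    float L = rgb2lab(rgb).x; // CIELAB luminance in [0, 1]
    float d = L - slab_range;
    float w = exp(-(d * d) / (2.0 * sigma_r * sigma_r)); // Gaussian range weight
    // rgb holds the weighted colour, alpha holds the weight itself
    frag_color = vec4(w * rgb, w);
}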

The combine_slabs() operation linearly interpolates between the two nearest (in range) images from the blurred_slabs array, based on the luminance of the pixel in the original image and the slab_range value computed for each slab.
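A sketch of the combine pass under the same weight-in-alpha convention, assuming K = 5 and blurred slab textures slab0 through slab4 (hypothetical names):

uniform sampler2D original;
uniform sampler2D slab0, slab1, slab2, slab3, slab4;
in vec2 uv;
out vec4 frag_color;

void main()
{
    // Normalise each blurred slab: blurred (weight * colour) over blurred weight
    vec3 s[5];
    vec4 t0 = texture(slab0, uv); s[0] = t0.rgb / max(t0.a, 1e-5);
    vec4 t1 = texture(slab1, uv); s[1] = t1.rgb / max(t1.a, 1e-5);
    vec4 t2 = texture(slab2, uv); s[2] = t2.rgb / max(t2.a, 1e-5);
    vec4 t3 = texture(slab3, uv); s[3] = t3.rgb / max(t3.a, 1e-5);
    vec4 t4 = texture(slab4, uv); s[4] = t4.rgb / max(t4.a, 1e-5);

    // The luminance of the original pixel selects the two nearest slabs
    float L = rgb2lab(texture(original, uv).rgb).x;
    float t = clamp(L, 0.0, 1.0) * 4.0; // K - 1 = 4 intervals
    int lo = int(floor(t));
    int hi = min(lo + 1, 4);
    frag_color = vec4(mix(s[lo], s[hi], t - float(lo)), 1.0);
}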

For our application, we can get good results with 5 slabs, a Gaussian kernel of σ=3 (in pixels) as the spatial filter and σ=0.1 (on the CIELAB L axis) as the range filter.

Gaussian kernel filters have known O(1) implementations on both CPU and GPU, and since the number of slabs K is constant, applying this solution to an image consequently has linear O(N) complexity. The gaussian_blur() operation can be implemented efficiently on the GPU with the optimised separable Gaussian filter described by Filip Strugar on his Intel blog: https://software.intel.com/content/www/us/en/develop/blogs/an-investigation-of-fast-real-time-gpu-based-image-blur-algorithms.html
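For reference, a basic separable Gaussian pass looks like the sketch below; this is the naive form, not Strugar's optimised kernel, and weights, radius and texel are hypothetical uniforms. The vertical pass is identical with the offset on the y axis:

uniform sampler2D src;
uniform vec2 texel;        // 1.0 / texture resolution
uniform float weights[32]; // precomputed Gaussian weights, weights[0] is the centre
uniform int radius;
in vec2 uv;
out vec4 frag_color;

void main()
{
    vec4 acc = texture(src, uv) * weights[0];
    for (int i = 1; i <= radius; ++i)
    {
        vec2 off = vec2(float(i) * texel.x, 0.0); // horizontal pass
        acc += texture(src, uv + off) * weights[i];
        acc += texture(src, uv - off) * weights[i];
    }
    frag_color = acc;
}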

An OpenGL implementation

We need 11 textures: one for the original image, and two for each slab of the bilateral filter. Multiple GLSL render passes are required, so we use Framebuffer Objects to hold intermediate results.

Here's the order of operations; a separate pixel shader is used per pass and rendered over the entire image domain:

  1. Prepare 5 slab textures from the original image converted to linear RGB
  2. Run separable Gaussian blur horizontally on each slab texture
  3. Run separable Gaussian blur vertically on the output of previous step
  4. Combine slabs, apply frequency separation and convert back to sRGB

The Gaussian blur pass can render all slabs at the same time by writing to an array of outputs, to which we attach 5 of the allocated textures. That's why we need two textures per slab: one for the input, one for the output. The slabs can be rendered at half resolution without visible degradation, for a massive performance boost.

This is still a lot of texture reads, yet it runs on 720p video at 60fps, and on 1080p at over 30fps, on an average GPU (Radeon Pro 555). Note that this is with a Python/OpenGL implementation running on OSX; an optimised implementation in C++/Vulkan or Swift/Metal could probably perform much better.

Possible improvements

I tested this video filter on various images, videos and a live webcam, with different people. The result looks quite robust: the effect noticeably smooths the skin while not producing strange results elsewhere.

If we're picky, we could consider that the skin is smoothed a bit too much, looking like baby skin. To alleviate that, a popular Photoshop technique is to blur the high pass a little before layering it down with overlay, by just a pixel or two. This has the effect of smoothing out the medium-sized features while keeping the very thin ones. I tried that and got interesting results on sharp photographs, but consumer-grade videos from webcams and phones tend to be too blurry for it to be worth it. We can also simply mix a little of the original image back into the result to recover some sharpness, or apply a subtle sharpening filter to the end result.

Also, the effect can sometimes pick up things other than skin where the colour is similar. A hardwood floor, for example, will lose its fine patterns:

Smooth skin filter applied on wood

The wood still looks like wood, so it's not really a problem. If that's annoying, we could restrict the skin smoothing filter to faces by tracking them with a facial landmark detection technique… but that will be a topic for a future article!


Computer Graphics, Augmented Reality & Machine Learning Enthusiast - http://linkedin.com/in/dedun