Building a more efficient Background Segmentation model than Google

Sam
Vectorly
Aug 8, 2021 · 6 min read

Why background segmentation?

If you've been following our company, you'll notice we stopped working on Video Vectorization and shifted to AI video upscaling earlier this year.

We thought AI Upscaling could be used as a new form of 'AI-Compression' for video streaming platforms, but after releasing our AI Upscaler we got an unexpectedly large amount of interest from video-conferencing platforms.

It turns out the video-conferencing industry is starting to broadly adopt AI processing. After talking to our users, we realized that while video quality wasn't a big priority in video conferencing, most had Virtual Backgrounds on their product roadmaps.

Why build a custom model?

When we looked into this in May, the main option for doing segmentation in the browser was a library called BodyPix. While it worked well and was open source, we got consistent feedback that it was too slow and took up too much CPU.

CPU usage for BodyPix

We consistently heard that CPU is at a premium.

This shouldn't be surprising: Multiple video call participants = multiple video streams to decode = a lot of work on the CPU, so video-conferencing apps are rightly worried about anything adding to the CPU's workload.

Then, in June, Google released the Selfie Segmentation model.

With SIMD enabled, it can do background segmentation in WebAssembly at much lower CPU usage (~12%).

This is way better than BodyPix, which is why we used the Google Selfie model as the backbone of the first version of our Virtual Background SDK. We've also seen other companies start to use the Google Selfie model to implement virtual background features.

That said, given how much of an issue CPU usage was for our customers, two things kept bothering me:

  1. WebAssembly SIMD is still new, and only available on the latest versions of Chrome, Firefox and Edge (since July 2021); a quick runtime check is sketched after this list
  2. 12% CPU usage is still high, and if we had built this model ourselves, I'm sure we'd have gotten pushback on it
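For reference, here's one way a web app can check for WebAssembly SIMD support at runtime before committing to the SIMD path. This is a minimal sketch using the third-party wasm-feature-detect package, not anything from our SDK:

```typescript
import { simd } from 'wasm-feature-detect';

// Check at runtime whether this browser supports WebAssembly SIMD,
// so the app can fall back to another path when it doesn't.
simd().then((supported: boolean) => {
  if (!supported) {
    console.warn('WebAssembly SIMD unavailable; using a fallback path');
  }
});
```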

How do you build a better model than Google?

In truth, I don't think we could build a more computationally efficient model than Google.

That said, if you're going to do AI processing on user devices, you don't have to use the CPU (and arguably, you shouldn't).

Graphics cards (even integrated graphics cards) are designed for parallel computations such as image and video processing. Most of the options for running AI models on edge devices (TensorFlow.js, TensorFlow Lite, PyTorch Mobile) use a GPU backend for good reason: it's more efficient to run Convolutional Neural Networks on the GPU.
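For example, asking TensorFlow.js to run on its GPU-backed WebGL backend is just a couple of calls. This is a minimal sketch, shown only to illustrate that the heavy lifting is meant to happen on the graphics card:

```typescript
import * as tf from '@tensorflow/tfjs';

// Ask TensorFlow.js to run its ops as WebGL shader programs on the GPU.
async function useGpuBackend(): Promise<void> {
  await tf.setBackend('webgl');
  await tf.ready();
  console.log('Active backend:', tf.getBackend()); // "webgl" when available
}
```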

The problem, though, is that all of these libraries take in data from the CPU, send it to the GPU for processing, and then return the result to the CPU, adding unnecessary communication to the pipeline.

This round trip is hugely inefficient, and apparently we're not the only ones to notice. Avoiding the CPU / GPU communication is a small fix, and yet the TensorFlow.js folks and other AI libraries haven't enabled it yet.

Whereas what we want is something streamlined that works entirely on the GPU.
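To make the round trip concrete, here's roughly what the standard TensorFlow.js flow for a segmentation model looks like. This is an illustrative sketch; the resolution, the segmentFrame function, and the model handling are placeholders, not our actual code:

```typescript
import * as tf from '@tensorflow/tfjs';

// Illustrative only: shows where the CPU <-> GPU copies happen in the
// standard flow, not how our custom build works.
async function segmentFrame(
  video: HTMLVideoElement,
  model: tf.GraphModel
): Promise<Float32Array | Int32Array | Uint8Array> {
  const mask = tf.tidy(() => {
    // CPU -> GPU: the frame is read on the CPU and uploaded to a texture.
    const frame = tf.browser.fromPixels(video).toFloat().div(255);
    const input = tf.image.resizeBilinear(frame as tf.Tensor3D, [256, 256]).expandDims(0);
    // GPU: the model itself runs as WebGL shader programs.
    return model.predict(input) as tf.Tensor;
  });
  // GPU -> CPU: reading the mask back is the other copy we'd like to avoid.
  const maskData = await mask.data();
  mask.dispose();
  return maskData;
}
```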

So we fixed this ourselves, and made a build of TensorFlow.js that takes in GPU data and outputs GPU data. The problem is, even after doing all of that, here's the result we got when running the Google Selfie model:

TensorFlow.js, doing all of its calculations on the GPU, uses more CPU than WebAssembly, which only does CPU calculations. I can see why Google doubled down on MediaPipe.

After trying to re-architect TensorFlow.js function by function, we realized it just wasn't feasible to make it CPU-efficient enough for our purposes, so we gave up on existing frameworks altogether.

Getting into the weeds with WebGL

It's not that GPU computation doesn't work. It's just that TensorFlow.js has too much CPU overhead for real-time video communication, even when using a GPU backend.

The solution? Build our own Convolutional Neural Network purely in WebGL, the web API for interacting with graphics cards.
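To give a flavor of what "a CNN purely in WebGL" means: each layer becomes a fragment shader pass that reads the previous layer's output from a texture and writes its own output to another texture. Below is a heavily simplified, illustrative sketch of a single-channel 3x3 convolution shader; a real layer also has to handle multiple channels, weights packed into textures, and padding, so treat this as the idea rather than our actual code:

```typescript
// Single-channel 3x3 convolution as a WebGL fragment shader (GLSL ES 1.00).
// A trivial full-screen-quad vertex shader is assumed to supply v_texCoord.
const conv3x3FragmentShader = `
  precision highp float;

  uniform sampler2D u_input;   // previous layer's output texture
  uniform vec2 u_texelSize;    // 1.0 / texture resolution
  uniform float u_kernel[9];   // 3x3 convolution weights
  uniform float u_bias;
  varying vec2 v_texCoord;

  void main() {
    float sum = u_bias;
    for (int ky = -1; ky <= 1; ky++) {
      for (int kx = -1; kx <= 1; kx++) {
        vec2 offset = vec2(float(kx), float(ky)) * u_texelSize;
        float value = texture2D(u_input, v_texCoord + offset).r;
        sum += value * u_kernel[(ky + 1) * 3 + (kx + 1)];
      }
    }
    // ReLU activation, written out for the next layer to read
    gl_FragColor = vec4(max(sum, 0.0), 0.0, 0.0, 1.0);
  }
`;
```

Running one shader pass per layer keeps every intermediate activation in a GPU texture, so nothing has to travel back to the CPU until the final mask is used.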

We're no strangers to working in WebGL. In 2020, when we approached potential customers about our vector-based video compression technology, we got pushback that our demos, which were essentially just SVG animations, used too much CPU (~20%).

Our vectorization demos (Video converted to SVG animations)

SVG in the browser is 100% CPU-based, so we got around it by building our own WebGL-based SVG renderer, cutting CPU usage to ~3%.

When we switched to AI Upscaling, we wrote our first Neural Networks in WebGL, and got pretty good performance results there too.

It seemed audacious, but we thought: what if we built our own background-segmentation model, from scratch, in WebGL? We could probably get similar CPU numbers, though it'd involve a hell of a lot of work.

Me with the rest of our team

So that's what we did. Given our experience with the peculiarities of WebGL, we trained our own custom AI model in Python, designed specifically with WebGL in mind. And yeah, we hand-wrote dozens of custom Neural Network layers in C / WebGL / OpenGL.

After a bunch of heads-down work writing custom shaders, we did it! We built our own custom segmentation model from scratch, entirely in WebGL.

As we expected (and honestly, prayed for), our CPU usage was in fact only ~3%, in line with our AI Upscaler.

Even then, most of that CPU usage comes from the overhead of sending the original video stream to the graphics card (texImage2D).
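That per-frame upload is the standard WebGL texImage2D call. Roughly, it looks like this (an illustrative sketch with hypothetical names):

```typescript
// Per-frame upload of the camera feed into a GPU texture. This copy is where
// most of the remaining CPU time goes; everything after it stays on the GPU.
function uploadFrame(
  gl: WebGLRenderingContext,
  texture: WebGLTexture,
  video: HTMLVideoElement
): void {
  gl.bindTexture(gl.TEXTURE_2D, texture);
  gl.texImage2D(gl.TEXTURE_2D, 0, gl.RGBA, gl.RGBA, gl.UNSIGNED_BYTE, video);
}
```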

Next Steps

By all accounts, I'm pretty sure we've built the most CPU-efficient background-segmentation library currently available for web-based video conferencing.

While we've proved that you can do background segmentation with ultra-low CPU usage, we're still not done, as our current proof of concept still has a bunch of issues, namely:

  • We need to do a lot more training to improve model quality, especially on edge cases.
  • We need to optimize the WebGL implementation further. Although it's CPU-efficient, it's still doing work on your graphics card, and there's still work to be done to hit 60fps on older / slower devices.

~~We're hoping to make these improvements and roll out a production-ready version in the next few weeks.~~

September 4th, 2021 edit: After another month of work and a number of fixes / improvements, we pushed out the beta version of our WebGL background segmentation, with comparable quality to MediaPipe / Selfie Segmentation.

And of course, the final performance numbers:

Check out the live demo here!
