Working with the 3D Camera on the Samsung S10 5G

Luke Ma · Published in The Startup · Nov 1, 2019

“Say, that’s a nice new Samsung S10 5G device your user has got there,” he said with the least amount of subtlety he could muster. “It would be a shame if a cloud-based video conferencing service that also provided a great Android app didn’t use the 3D camera to blur the user’s background so that it provides more privacy,” he threatened. I personally think we should all cower before imaginary techno-mobsters so let’s dive right in.

Background (Pun Intended)

The concept of a “Privacy Mode” or background blur is pretty well understood. The visual effect is similar to bokeh, but the business value is privacy, preventing information leaks, and overall visual ambience. Something like this:

The key is generating a mask that separates the areas you want to blur from the areas you don’t. Intuitively, if you knew the distance of each pixel in the image you could generate this mask, but distance isn’t the only approach. You could also use a trained neural network to distinguish foreground and background without any distance information at all. But that’s a different post.
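To make the distance-based idea concrete, here’s a minimal sketch of a binary version of it. The names (distanceMm, thresholdMm) are made up for illustration; later in this post the mask will come from real ToF data and hold soft 0–255 values instead of booleans.

// Hypothetical sketch: mark every pixel beyond some cutoff distance as "background".
// distanceMm is a per-pixel distance array in millimeters, thresholdMm a made-up cutoff.
private boolean[] makeBackgroundMask(int[] distanceMm, int thresholdMm) {
    boolean[] background = new boolean[distanceMm.length];
    for (int i = 0; i < distanceMm.length; i++) {
        background[i] = distanceMm[i] > thresholdMm;
    }
    return background;
}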

We’re here to play around with the 3D Camera (hereafter ToF camera) on the Samsung S10 5G. Why? Because it’s there and it’s useful to evaluate all the tools at your disposal. The example app/code I used for this post is available on GitHub.

What is Time of Flight?

Time-of-Flight technology refers to measuring distance to a point by tracking the time it takes for a beam of light to travel to that point. The speed of light is constant, so once you have the time, you also have the distance. A Time-of-Flight camera is a system that can track distance over a sensor area using the Time-of-Flight principle. There are different ways of figuring out the elapsed time (the S10 5G uses phase-shift detection on an infrared carrier wave, 940 nm iirc) but the fundamental theory remains the same. There are pros and cons of this approach versus other popular approaches (e.g. Structured Light as used in Apple’s TrueDepth camera) but for our purposes, it’s just another source of distance data.
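The arithmetic behind the principle is trivial. Here’s a toy illustration of the textbook formula, not what the sensor’s phase-shift pipeline actually does:

// The pulse travels to the subject and back, so halve the round trip.
// For a sense of scale, a ~6.7 ns round trip works out to roughly 1 meter of distance.
static final double SPEED_OF_LIGHT_M_PER_S = 299_792_458.0;

static double distanceMeters(double roundTripSeconds) {
    return SPEED_OF_LIGHT_M_PER_S * roundTripSeconds / 2.0;
}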

The ToF Camera

The front-facing ToF sensor on the Samsung S10 5G is a Sony IMX316. It outputs frames in the DEPTH16 image format with a resolution of 240x180. It has a 75° field of view, which roughly matches the S10 5G’s front-facing camera’s field of view of 80°.

Watch out: the S10 5G (and the Note10+ 5G as well) returns two cameras through the Camera2 API. Both are actually derived from the same sensor, and the 6.5MP camera is just a crop of the 10MP camera. If you want to actually implement the mask yourself, make sure to use frames from the 10MP camera.
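One way to make sure you’re grabbing frames from the full 10MP camera is to compare the largest available output sizes while enumerating the front-facing cameras. Here’s a rough sketch of that heuristic (not necessarily how the example app does it):

// Sketch: among front-facing cameras, pick the one whose largest YUV output has the
// most pixels (the 10MP camera rather than the 6.5MP crop).
String bestFrontCamera = null;
long bestPixelCount = 0;
for (String id : cameraManager.getCameraIdList()) {
    CameraCharacteristics chars = cameraManager.getCameraCharacteristics(id);
    Integer facing = chars.get(CameraCharacteristics.LENS_FACING);
    if (facing == null || facing != CameraMetadata.LENS_FACING_FRONT) {
        continue;
    }
    StreamConfigurationMap map = chars.get(CameraCharacteristics.SCALER_STREAM_CONFIGURATION_MAP);
    if (map == null || map.getOutputSizes(ImageFormat.YUV_420_888) == null) {
        continue; // e.g. the ToF camera, which only outputs depth formats
    }
    for (Size size : map.getOutputSizes(ImageFormat.YUV_420_888)) {
        long pixelCount = (long) size.getWidth() * size.getHeight();
        if (pixelCount > bestPixelCount) {
            bestPixelCount = pixelCount;
            bestFrontCamera = id;
        }
    }
}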

You can find the ToF camera through CameraCharacteristics. Here’s an example:

for (String camera : cameraManager.getCameraIdList()) {
    CameraCharacteristics chars = cameraManager.getCameraCharacteristics(camera);
    final int[] capabilities = chars.get(CameraCharacteristics.REQUEST_AVAILABLE_CAPABILITIES);
    boolean facingFront = chars.get(CameraCharacteristics.LENS_FACING) == CameraMetadata.LENS_FACING_FRONT;
    boolean depthCapable = false;
    for (int capability : capabilities) {
        boolean capable = capability == CameraMetadata.REQUEST_AVAILABLE_CAPABILITIES_DEPTH_OUTPUT;
        depthCapable = depthCapable || capable;
    }
    if (depthCapable && facingFront) {
        // Note that the sensor size is much larger than the available capture size
        SizeF sensorSize = chars.get(CameraCharacteristics.SENSOR_INFO_PHYSICAL_SIZE);
        Log.i(TAG, "Sensor size: " + sensorSize);

        // Since sensor size doesn't actually match capture size and because it is
        // reporting an extremely wide aspect ratio, this FoV is bogus
        float[] focalLengths = chars.get(CameraCharacteristics.LENS_INFO_AVAILABLE_FOCAL_LENGTHS);
        if (focalLengths.length > 0) {
            float focalLength = focalLengths[0];
            double fov = 2 * Math.atan(sensorSize.getWidth() / (2 * focalLength));
            Log.i(TAG, "Calculated FoV: " + fov);
        }
        return camera;
    }
}

Once you have the camera, you can open it like any other camera. Since DEPTH16 is not a great format for a direct preview, we’ll want to attach an ImageReader to a preview session and read frames from it directly.
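Here’s a rough sketch of that wiring, assuming an already-opened cameraDevice and a backgroundHandler, with error handling and cleanup trimmed:

// Sketch: stream DEPTH16 frames into an ImageReader instead of a preview surface.
// WIDTH x HEIGHT is the ToF capture size (240x180 here).
ImageReader depthReader = ImageReader.newInstance(WIDTH, HEIGHT, ImageFormat.DEPTH16, 2);
depthReader.setOnImageAvailableListener(reader -> {
    Image image = reader.acquireLatestImage();
    if (image == null) return;
    int[] mask = getDepthMask(image); // defined in the next section
    // ...hand the mask off to whatever consumes it...
    image.close();
}, backgroundHandler);

// Once the ToF camera is opened, point a repeating request at the reader's surface.
cameraDevice.createCaptureSession(Collections.singletonList(depthReader.getSurface()),
        new CameraCaptureSession.StateCallback() {
            @Override
            public void onConfigured(CameraCaptureSession session) {
                try {
                    CaptureRequest.Builder builder =
                            cameraDevice.createCaptureRequest(CameraDevice.TEMPLATE_PREVIEW);
                    builder.addTarget(depthReader.getSurface());
                    session.setRepeatingRequest(builder.build(), null, backgroundHandler);
                } catch (CameraAccessException e) {
                    Log.e(TAG, "Failed to start depth stream", e);
                }
            }

            @Override
            public void onConfigureFailed(CameraCaptureSession session) {
                Log.e(TAG, "Depth session configuration failed");
            }
        }, backgroundHandler);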

Extracting Range Information

Once you have an image in DEPTH16 format, each pixel gives you both a range (distance) and a confidence measure. The DEPTH16 documentation tells you exactly what to do, but here is an example of generating an int[] mask from an Image.

private int[] getDepthMask(Image image) {
    ShortBuffer shortDepthBuffer = image.getPlanes()[0].getBuffer().asShortBuffer();
    int[] mask = new int[WIDTH * HEIGHT];
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int index = y * WIDTH + x;
            short depthSample = shortDepthBuffer.get(index);
            int newValue = extractRange(depthSample, 0.1f);
            mask[index] = newValue;
        }
    }
    return mask;
}

private int extractRange(short sample, float confidenceFilter) {
    int depthRange = (short) (sample & 0x1FFF);
    int depthConfidence = (short) ((sample >> 13) & 0x7);
    float depthPercentage = depthConfidence == 0 ? 1.f : (depthConfidence - 1) / 7.f;
    return depthPercentage > confidenceFilter ? depthRange : 0;
}

You can filter more aggressively on confidence, but for the privacy blur feature I found it was better to let everything except zero-confidence samples through and then do a bit of signal processing afterwards. Setting the confidence minimum higher reduces overall noise somewhat but removes too much useful information.

Visualizing Range Information

I have a bug in my brain where I can’t easily visualize an int[] to save my life. I need that #tradlife ARGB. So let’s convert the mask to something that looks good!

The approach here is to simply normalize the range to values between 0 and 255 and then assign that to the green channel of an ARGB pixel. Since I only really care about a section of the foreground, I’m going to clamp the ranges to an arbitrary min/max value and then scale everything else down. (In a real implementation, a FaceDetection routine would be useful as a way to home in on an area of the mask to establish your min/max.) Here’s an example:

private int normalizeRange(int range) {
    // Clamp to min/max
    float normalized = Math.max(RANGE_MIN, Math.min(RANGE_MAX, (float) range));
    // Normalize to 0 to 255
    normalized = (normalized - RANGE_MIN) / (RANGE_MAX - RANGE_MIN) * 255;
    return (int) normalized;
}

Once normalized, simply create a bitmap and loop through and assign the colors:

private Bitmap convertToRGBBitmap(int[] mask) {
    Bitmap bitmap = Bitmap.createBitmap(WIDTH, HEIGHT, Bitmap.Config.ARGB_4444);
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int index = y * WIDTH + x;
            bitmap.setPixel(x, y, Color.argb(255, 0, mask[index], 0));
        }
    }
    return bitmap;
}

Once you have the bitmap, you can render it onto a TextureView:

Canvas canvas = textureView.lockCanvas();
canvas.drawBitmap(bitmap, transform, null);
textureView.unlockCanvasAndPost(canvas);

The frame will come out in landscape orientation so make sure to rotate it to fit into the view with an appropriate Matrix (see example app for details). Once you’ve done all that, you get a preview.

Nice! 🍺 me.

But also VERY NOISY. The jitter from frame to frame is obvious, and you can see quite a few green pixels (indicating farther distances) around the hair and the sides of my face. If we used this as a blur mask it would look terrible. Let’s smooth it out.

Smooth Jazz

We are going to apply two very basic signal processing techniques to our noisy data. IANASPE, but this is just a basic example of what can be done.

The first technique, to get rid of the stray green pixels inside the boundaries of my profile, is a simple low-pass filter: in this case, a box blur. You could also use a Gaussian blur or any fast, ideally O(n), blurring algorithm you like. (The example app contains an O(n) gaussian blur implementation from here.) This is what it looks like blurred:

Not bad. We’ve traded some loss of detail for less noise within the boundaries. But there’s still quite a bit of jitter from frame to frame. To smooth out those deltas, we can also apply a simple moving average. Here’s the result of applying a moving average (just 3 frames) without blurring:

Much smoother compared to the original frame. Now we just combine it with the blur:

and we have a relatively usable mask. Again, I am not a signal processing engineer, so I’m sure there are much better/faster approaches out there, but this is a usable start.
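To make the smoothing concrete, here’s a rough sketch of both steps operating on the int[] mask. Treat it as illustrative; the example app’s actual implementation differs (it uses the O(n) gaussian blur linked above).

// Spatial smoothing: a naive 3x3 box blur over the mask values (0..255).
private int[] boxBlur(int[] mask) {
    int[] out = new int[mask.length];
    for (int y = 0; y < HEIGHT; y++) {
        for (int x = 0; x < WIDTH; x++) {
            int sum = 0;
            int count = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++) {
                    int nx = x + dx;
                    int ny = y + dy;
                    if (nx >= 0 && nx < WIDTH && ny >= 0 && ny < HEIGHT) {
                        sum += mask[ny * WIDTH + nx];
                        count++;
                    }
                }
            }
            out[y * WIDTH + x] = sum / count;
        }
    }
    return out;
}

// Temporal smoothing: average the current mask with the previous few masks.
private final Deque<int[]> recentMasks = new ArrayDeque<>();

private int[] movingAverage(int[] mask, int frames) {
    recentMasks.addLast(mask);
    while (recentMasks.size() > frames) {
        recentMasks.removeFirst();
    }
    int[] averaged = new int[mask.length];
    for (int[] m : recentMasks) {
        for (int i = 0; i < averaged.length; i++) {
            averaged[i] += m[i];
        }
    }
    for (int i = 0; i < averaged.length; i++) {
        averaged[i] /= recentMasks.size();
    }
    return averaged;
}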

If you want to play around with this, feel free to take a look at the example app, which shows these approaches side by side.

Privacy Mode

Applying a blur to a camera frame while respecting the depth mask, converting it for preview, encoding it, and sending it out to a reliable real-time video conferencing service involves a whole bunch more work, including:

  • Cropping the depth mask to 16:9 if you’re capturing the front camera with a 16:9 aspect ratio.
  • Scaling the depth mask up to match the dimensions of the image, and making sure your upscaling algorithm doesn’t result in a jagged image.
  • Selective blurring with the mask (my approach is to scale the image down to 1/2 width x 1/2 height, apply a blur, and scale back up, then copy the pixels of the original image back onto the blurred image according to the mask, while applying a mixing gradient for pixels along the edge so that the transition from blurred to unblurred doesn’t look jarring; a rough sketch of that mixing step follows this list).
  • Lots of messing around with bytebuffers and YUV/RGB formats.
  • Coordinating startup/shutdown of multiple cameras, as well as managing transforms and Textures as you switch in and out of Privacy Mode (rendering through drawBitmap is expensive so you only want to use it when necessary, and in addition to IANASPE, IAAlsoNAOpenGLE).
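As promised above, here’s a hedged sketch of the mask-mixing step from the selective-blurring bullet. It assumes the original and blurred frames are ARGB pixel arrays of the same size, and that the mask has already been scaled up and smoothed so that (in this sketch’s convention) 255 means “keep sharp” and 0 means “fully blurred”:

// Hypothetical sketch of composing the sharp and blurred frames with the mask.
private int[] composeWithMask(int[] original, int[] blurred, int[] mask) {
    int[] out = new int[original.length];
    for (int i = 0; i < original.length; i++) {
        float keep = mask[i] / 255f; // mixing gradient: soft edges instead of a hard cut
        int fg = original[i];
        int bg = blurred[i];
        int r = (int) (Color.red(fg) * keep + Color.red(bg) * (1 - keep));
        int g = (int) (Color.green(fg) * keep + Color.green(bg) * (1 - keep));
        int b = (int) (Color.blue(fg) * keep + Color.blue(bg) * (1 - keep));
        out[i] = Color.argb(255, r, g, b);
    }
    return out;
}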

If you do all of that though, you get to make a silly demo video with royalty-free dubstep and iMovie animations like this:
