Manipulating video in a browser
Have you ever wanted to modify the video streaming from your webcam?
Now it’s possible!
A new browser API recently landed in Chromium-based browsers: the “Insertable Streams for MediaStreamTrack API”. Its superpower is tapping into video streams, letting the application modify them frame by frame, without even having to display the video on screen.
Before you start experimenting, check out your browser’s support: https://caniuse.com/?search=MediaStreamTrackProcessor
Let’s look at a simple but practical example: a green screen effect ( https://en.wikipedia.org/wiki/Chroma_key ), which means we will replace the green (or blue) background with a chosen background image on every frame of the video stream.
Some real world use cases:
Transforming a video stream can be used to extend or modify the stream before sending it further, e.g. over WebRTC.
Possible operations could be:
- cropping or rotating,
- highlighting elements,
- adding various texts or graphical elements to the stream
Additionally, you can also “just” analyze the stream with a MediaStreamTrackProcessor to detect something in the frames.
You can also use a MediaStreamTrackGenerator to produce a video stream from computer-generated content.
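As a rough sketch of that last idea, computer-generated frames can be drawn on a canvas and written into a generator. This is illustrative only (the function name and parameters are my own, not from the article), and it assumes a browser that supports MediaStreamTrackGenerator:

```javascript
// Sketch: produce a MediaStream from computer-generated frames.
// Browser-only; createGeneratedStream is a hypothetical helper name.
function createGeneratedStream(width = 640, height = 360, fps = 30) {
  const canvas = new OffscreenCanvas(width, height);
  const ctx = canvas.getContext('2d');
  const generator = new MediaStreamTrackGenerator({ kind: 'video' });
  const writer = generator.writable.getWriter();
  let t = 0;

  setInterval(async () => {
    // Draw something computer-generated, e.g. a color cycle.
    ctx.fillStyle = `hsl(${t % 360}, 80%, 50%)`;
    ctx.fillRect(0, 0, width, height);
    // Timestamps are in microseconds.
    const frame = new VideoFrame(canvas, { timestamp: t * (1e6 / fps) });
    await writer.write(frame); // the generator consumes and closes the frame
    t++;
  }, 1000 / fps);

  // The generator behaves like a regular video track.
  return new MediaStream([generator]);
}
```

The returned stream can be assigned to a video element’s srcObject like any webcam stream.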
Show me the code!
The data flow
Before we dig into the details, check out how simply and elegantly we can use the corresponding browser API.
The MediaStreamTrackProcessor delivers the frames as a ReadableStream, which is piped through our custom transformer function. The result is consumed by the MediaStreamTrackGenerator, which then behaves like a standard video track.
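The plumbing described above might be wired up like this (a minimal sketch; the function name is illustrative, and the transformer itself is passed in):

```javascript
// Sketch: route a webcam track through a transform and back into a MediaStream.
// Browser-only; assumes MediaStreamTrackProcessor/Generator support.
async function startProcessedStream(transformer) {
  const media = await navigator.mediaDevices.getUserMedia({ video: true });
  const [track] = media.getVideoTracks();

  const processor = new MediaStreamTrackProcessor({ track });        // frames out
  const generator = new MediaStreamTrackGenerator({ kind: 'video' }); // frames in

  // ReadableStream -> our transform -> WritableStream, frame by frame.
  processor.readable.pipeThrough(transformer).pipeTo(generator.writable);

  return new MediaStream([generator]); // e.g. assign to video.srcObject
}
```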
But let’s dig deeper a little bit! I’ll show you how to implement a simple green screen effect.
⚠ This example is for demo purposes only. It’s not optimized for high performance!
The transformer function looks like this:
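The original snippet is embedded in the article; as a rough, unoptimized sketch it could look like the following. Here i420ToRgba and applyGreenScreen are hypothetical helpers, and backgroundRgba is assumed to hold the background image’s pixels:

```javascript
// Demo-quality transformer sketch; not the article's exact code.
// i420ToRgba, applyGreenScreen, and backgroundRgba are assumptions.
const transformer = new TransformStream({
  async transform(frame, controller) {
    const buffer = new Uint8Array(frame.allocationSize());
    const layout = await frame.copyTo(buffer); // plane offsets and strides

    // Convert I420 -> RGBA, then swap green pixels for the background.
    const rgba = i420ToRgba(buffer, frame.codedWidth, frame.codedHeight);
    applyGreenScreen(rgba, backgroundRgba);

    const newFrame = new VideoFrame(rgba, {
      format: 'RGBA',
      codedWidth: frame.codedWidth,
      codedHeight: frame.codedHeight,
      timestamp: frame.timestamp,
    });
    frame.close(); // always release the original frame
    controller.enqueue(newFrame);
  },
});
```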
First of all, we have to get the pixel data from the video frames. This can be achieved with the “copyTo” method. Inspecting the frames, we see that the webcam does not stream RGBA, but rather the “I420” pixel format, which turns out to be a YUV 4:2:0 layout (https://en.wikipedia.org/wiki/YUV).
The result of the “copyTo” method deserves some more explanation: it resolves to metadata describing the buffer layout. In our case, it looks like this:
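For an I420 frame, the resolved layout has one entry per plane. A small helper can predict it for the tightly packed case (an assumption on my part: real strides returned by copyTo may include padding):

```javascript
// Expected plane layout of a tightly packed I420 buffer.
// Real copyTo() results may use padded strides; this is the ideal case.
function i420Layout(width, height) {
  const ySize = width * height;                   // 1 byte per pixel
  const chromaSize = (width / 2) * (height / 2);  // 1 byte per 2x2 block
  return [
    { offset: 0, stride: width },                      // Y plane
    { offset: ySize, stride: width / 2 },              // U plane
    { offset: ySize + chromaSize, stride: width / 2 }, // V plane
  ];
}

// For a 640x360 frame:
// i420Layout(640, 360) →
// [ { offset: 0,      stride: 640 },
//   { offset: 230400, stride: 320 },
//   { offset: 288000, stride: 320 } ]
```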
The I420 buffer consists of three consecutive planes for the Y, U, and V channels. The “offset” property gives the starting byte of each plane within the buffer, and the “stride” property tells us the number of bytes per row.
An interesting fact: unlike RGBA, where each pixel stores its Red, Green, Blue, and Alpha (opacity) channels in one byte each, YUV420 stores the Y (luminance) channel as one byte per pixel, while the U and V chromaticity channels each get one byte per 2×2 pixel block, halving the color resolution in both directions.
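To make the subsampling concrete, here is a hedged sketch of an I420 → RGBA conversion using the common BT.601 full-range formulas (the article’s actual conversion may differ; the helper names are mine):

```javascript
// Convert one (Y, U, V) triple to RGB using BT.601 full-range formulas.
function yuvToRgb(y, u, v) {
  const clamp = (x) => Math.max(0, Math.min(255, Math.round(x)));
  const r = y + 1.402 * (v - 128);
  const g = y - 0.344 * (u - 128) - 0.714 * (v - 128);
  const b = y + 1.772 * (u - 128);
  return [clamp(r), clamp(g), clamp(b)];
}

// Expand a tightly packed I420 buffer to RGBA.
// Each U/V byte covers a 2x2 pixel block, hence the >> 1 indexing.
function i420ToRgba(buf, width, height) {
  const ySize = width * height;
  const cSize = ySize / 4;
  const rgba = new Uint8ClampedArray(width * height * 4);
  for (let row = 0; row < height; row++) {
    for (let col = 0; col < width; col++) {
      const y = buf[row * width + col];
      const ci = (row >> 1) * (width >> 1) + (col >> 1); // chroma index
      const [r, g, b] = yuvToRgb(y, buf[ySize + ci], buf[ySize + cSize + ci]);
      const o = (row * width + col) * 4;
      rgba[o] = r; rgba[o + 1] = g; rgba[o + 2] = b; rgba[o + 3] = 255;
    }
  }
  return rgba;
}
```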
Just for fun and experience, we will convert each frame to RGBA and use that format for the rest of the process.
But let’s get back on track and look at the transform operation! The green screen effect boils down to a simple condition: when a pixel’s green component is dominant ( G > 0.6 * (R + B) ), we swap it for the corresponding pixel of the background image.
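The condition is easy to express as a pure function. A minimal sketch (the 0.6 threshold is from the article; the helper names and the in-place replacement loop are my own illustration):

```javascript
// The article's green-dominance rule for a single pixel.
function isGreenDominant(r, g, b) {
  return g > 0.6 * (r + b);
}

// Replace green-dominant pixels in an RGBA buffer with the background's
// pixels, in place. Both buffers must have the same dimensions.
function applyGreenScreen(rgba, backgroundRgba) {
  for (let i = 0; i < rgba.length; i += 4) {
    if (isGreenDominant(rgba[i], rgba[i + 1], rgba[i + 2])) {
      rgba[i]     = backgroundRgba[i];
      rgba[i + 1] = backgroundRgba[i + 1];
      rgba[i + 2] = backgroundRgba[i + 2];
      rgba[i + 3] = backgroundRgba[i + 3];
    }
  }
  return rgba;
}
```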
Some thoughts about the performance…
It’s no surprise that CPU-based video processing is very performance-heavy, and the results of this experiment support that claim as well.
On a higher performance CPU, I had the following frame processing times:
640×360 pixel frames: 4–5 ms
1280×720 pixel frames: ~20 ms
This means that even with a fairly simple operation, the processing time gets close to the ~33 ms frame budget of a 30 fps video stream. We can easily run out of frame time!
In the next article ( Video manipulation with WebAssembly | by Szabolcs Damján | Jan, 2022 | Medium ), we will port the transformer function to WebAssembly and compare its performance!