On-the-Fly Video Rendering with Node.js and FFmpeg

Today, Carvana launched a vehicle valuation tool that uses dynamic video, rendered and streamed on-the-fly, to walk people through the factors contributing to a fair market trade-in value for their cars in a fun and approachable way.

Once given a few details about a vehicle, Carvana retrieves data from various third parties, calculates a value, constructs a timeline from over 100 video clips, and delivers a single video containing all kinds of personalized information.

If you want more fluff, go read the press release. The rest of this post is about the technology behind the scenes.

First, let’s get this out of the way. A variety of languages, tools, and combinations thereof can be used to render video on a server. This post is not about Node.js being the only way or even the best way to do this.

Even though Node is wicked fast, it won’t be as fast as something written in assembly or running on a GPU. Node’s performance, event-driven I/O, and seamless handling of stdin and stdout as native streams made it a valid choice, but one capability made it invaluable for a project like this: with the right abstractions, most Node code can be packaged to run in the browser with minimal effort. That gave this project a free visual IDE, just by placing a text editor next to a LiveReload-enabled web browser. Sure, I could not drag an item to change its position as in a truly interactive environment, but I could change its x- and y-coordinates in code, save, and immediately see the result.

The program powering each video segment is a function that takes three primary inputs: the dataset containing values to present in the video; the frame number to render; and a reference to a canvas to draw on.
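As a rough sketch of that shape, a segment function might look like the following. The function name, the price field, the 30 fps constant, and the slide-in animation are all made up for illustration; only the three inputs (dataset, frame number, canvas) come from the description above.

```javascript
// Hypothetical segment renderer. The name, data.price, and FPS are
// illustrative assumptions, not taken from the original project.
const FPS = 30;

function renderTitleFrame(data, frame, canvas) {
  const ctx = canvas.getContext('2d');
  const seconds = frame / FPS;

  // Paint over whatever the previous frame left behind.
  ctx.fillStyle = '#ffffff';
  ctx.fillRect(0, 0, canvas.width, canvas.height);

  // Slide the label in from the left over the first half second.
  const progress = Math.min(seconds / 0.5, 1);
  const x = -200 + progress * 240; // comes to rest at x = 40

  ctx.fillStyle = '#222222';
  ctx.font = '48px sans-serif';
  ctx.fillText('$' + data.price, x, 100);

  return x; // returned only to make the animation easy to inspect
}
```

Because the function only touches the 2D context it is handed, the same code can draw to a browser canvas or to a server-side canvas object.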

In a web browser, this outputs the desired frame onto a <canvas> element for visual preview.

On the server, node-canvas provides an identical drawing API backed by a 2D graphics library called Cairo. The resulting Canvas object can then generate an output stream that can be piped to the file system, a web server response, or in my case, a child process’s stdin.

FFmpeg, the Swiss Army knife of multimedia, can be spawned as a child process from within Node and configured to wait for data to be provided via stdin. From the shell, it’s more or less like doing this:

$ # in a directory full of cat photos
$ cat *.jpg | ffmpeg -f image2pipe -i - cats.gif

From Node, you can use the native child_process module’s spawn() method to create an object with a writable stream as its .stdin property and a readable stream as its .stdout property. To stream multiple frames, you call node-canvas’s Canvas.jpegStream() method, write the chunk from each ‘data’ event to childProcess.stdin, and listen for the ‘end’ event to start the next frame’s jpegStream. If you simply pipe the stream (“Canvas.jpegStream().pipe(childProcess.stdin)”), the end of the first frame’s stream will also end stdin, and FFmpeg will not receive any further frames. When you are finished with the last frame, you manually end the childProcess.stdin stream to tell FFmpeg the video is complete and to wrap it all up.
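The frame-by-frame handoff described above could be sketched roughly like this. The names writeFrames, makeFrameStream, and encodeToFile are hypothetical, and the sketch ignores backpressure to stay short; it is one way to arrange the events, not necessarily how the production code did it.

```javascript
const { spawn } = require('child_process');

// Write frameCount frames to destination, one readable stream at a time.
// makeFrameStream(n) is a hypothetical callback that returns a readable
// stream for frame n (for example, a node-canvas JPEG stream).
function writeFrames(makeFrameStream, frameCount, destination) {
  return new Promise((resolve, reject) => {
    let frame = 0;

    function next() {
      if (frame === frameCount) {
        destination.end(); // last frame done: tell FFmpeg to wrap up
        resolve();
        return;
      }
      const source = makeFrameStream(frame++);
      // Forward chunks by hand. For brevity this ignores the return value
      // of write(); a sturdier version would wait for 'drain'.
      source.on('data', (chunk) => destination.write(chunk));
      source.on('end', next); // start the next frame instead of ending stdin
      source.on('error', reject);
    }

    next();
  });
}

// Not called in this sketch: it needs an ffmpeg binary on the PATH.
function encodeToFile(makeFrameStream, frameCount, outputPath) {
  const ffmpeg = spawn('ffmpeg', ['-y', '-f', 'image2pipe', '-i', '-', outputPath]);
  return writeFrames(makeFrameStream, frameCount, ffmpeg.stdin);
}
```

With node-canvas, makeFrameStream would draw frame n and return the canvas’s JPEG stream. Node’s pipe() also accepts an { end: false } option, which is another way to keep the destination open between frames.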

It’s not rocket science, but the composition is clever enough that it’s still impressive to see in action.

Unlike the shell command above, writing to an output file might not be the end goal if you want on-the-fly streaming. Just as the child process can read its input from stdin, it can write its output to stdout.
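For illustration, here is one way to ask FFmpeg for stdout output and pipe it to a web response. The argument list is an assumption, not the configuration the project used: it requests fragmented MP4 because a regular MP4 cannot be written to a non-seekable pipe. The helper names streamingArgs and streamVideo are hypothetical.

```javascript
const { spawn } = require('child_process');

// Assumed arguments, not the original project's configuration.
function streamingArgs(fps) {
  return [
    '-f', 'image2pipe',        // input format: a run of concatenated images
    '-framerate', String(fps), // input frame rate
    '-i', '-',                 // read input from stdin
    '-f', 'mp4',               // output container
    '-movflags', 'frag_keyframe+empty_moov', // fragmented MP4, safe for pipes
    'pipe:1',                  // write output to stdout
  ];
}

// Not called in this sketch: it needs an ffmpeg binary on the PATH.
// response is expected to be an http.ServerResponse.
function streamVideo(response, fps) {
  const ffmpeg = spawn('ffmpeg', streamingArgs(fps));
  response.writeHead(200, { 'Content-Type': 'video/mp4' });
  ffmpeg.stdout.pipe(response); // the response ends when FFmpeg's stdout ends
  return ffmpeg.stdin;          // the caller writes frames here, then ends it
}
```

The returned stdin stream is where rendered frames would be written, and ending it lets FFmpeg finish the video and close the response.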

In Node, we can pipe the output from FFmpeg to disk, in case we get a head start or need to handle client reconnects. The nice thing about Node streams, however, is they’re quite malleable. You can arbitrarily start, stop, fork, join, or sequence streams. Say you take a video stream, split it into blocks, stream each block to a new file, and store the partial contents of the current block in memory. To fulfill a web request, you can stream the contents of each completed file from the file system to the web server’s response object one at a time (without ending the response after each file, like above), then synchronously write the contents of the current block, then fork the FFmpeg child process’s stdout and pipe it to the web server’s response object. The fact this works without missing or corrupting data is outright silly and amazing. It works because no new data can arrive between synchronously writing the current block and attaching to the live stream. There are perks to having a single-threaded event loop!
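A compact sketch of that catch-up-then-join sequence follows. BlockRecorder and its fields are hypothetical names, blocks are cut by byte count rather than at video-aware boundaries, and files are written synchronously to keep the example small, so treat it as an illustration of the ordering rather than a production design.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: records a live stream into block files while keeping
// the unfinished block in memory, so a late client can catch up.
class BlockRecorder {
  constructor(source, blockSize, dir) {
    this.source = source;
    this.blockSize = blockSize;
    this.dir = dir;
    this.files = [];       // paths of completed blocks, in order
    this.current = [];     // chunks of the block still being filled
    this.currentBytes = 0;
    this.ended = false;

    source.on('data', (chunk) => {
      this.current.push(chunk);
      this.currentBytes += chunk.length;
      if (this.currentBytes >= this.blockSize) this.flush();
    });
    source.on('end', () => { this.ended = true; });
  }

  flush() {
    const file = path.join(this.dir, `block-${this.files.length}.bin`);
    fs.writeFileSync(file, Buffer.concat(this.current)); // sync for simplicity
    this.files.push(file);
    this.current = [];
    this.currentBytes = 0;
  }

  async serve(response) {
    // 1. Replay completed blocks one at a time. The length is re-read on
    //    every pass because more blocks may finish while we replay.
    for (let i = 0; i < this.files.length; i++) {
      await new Promise((resolve, reject) => {
        const file = fs.createReadStream(this.files[i]);
        file.on('error', reject);
        file.on('end', resolve);
        file.pipe(response, { end: false }); // keep the response open
      });
    }
    // 2. Nothing below yields to the event loop, so no chunk can slip in
    //    between the catch-up write and the live subscription.
    if (this.currentBytes > 0) response.write(Buffer.concat(this.current));
    if (this.ended) return response.end();
    // 3. Join the live stream.
    this.source.on('data', (chunk) => response.write(chunk));
    this.source.on('end', () => response.end());
  }
}
```

In use, source would be the FFmpeg child process’s stdout and response the web server’s response object; steps 2 and 3 are safe only because they run in a single synchronous stretch.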

This post was intended to describe a novel use of a few technologies. Sadly, the path to making a production-ready service is littered with obstacles. FFmpeg requires some combination of luck and witchcraft to find the right set of parameters for atypical scenarios. Cairo, node-canvas, and Cairo’s dependencies can break with the wrong combination of platform and version numbers, especially when using custom fonts. HTML5 video is a garbage fire of different codecs and silent failures if anything isn’t quite right. Scaling and load balancing instances in AWS is easy compared to managing your own hardware, but it is still no small task (for example, avoiding volume initialization when starting instances based on a snapshot). And monitoring render performance, resource utilization, playback experience, and system health, while routing logs and exceptions from ephemeral instances into real-time visualizations and automated self-healing, takes a zoo of scripts, utilities, and plumbing.

At the end of the day, if you get it all just right, you can plug in some information and within a few seconds begin watching a video made just for you.
