An Introduction to the Airtime Media Architecture

Tech @ Airtime · 6 min read · Mar 8, 2019

by Jim Wong

In previous posts, we’ve talked about some of the techniques our application and DevOps teams have used to put together a flexible, scalable back-end for the Airtime application. For a change of pace, we’d like to dive into the more specialized software that makes the group video chat features of the application work. We’ll talk about some of the challenges we’ve faced and the architecture we’ve built to address them, and follow-up articles will dig into some of the more interesting aspects of the system in greater detail.

The Challenge of Real Time Video

Video on the Internet is pervasive. By some measures, video accounts for a staggering 70% of Internet traffic in North America during peak hours. People watch hundreds of millions of hours of video on YouTube every day. With all that video flying around, it might be tempting to think that it’s a solved problem: when you can catch an NFL game live on Amazon Prime Video or instantly stream a live feed to your friends on Facebook, how hard can video chat be? As it turns out, it’s actually really hard.

Live video on the Internet is typically distributed through Content Delivery Networks (CDNs), large networks of computers located around the world that take video data from its point of origin and pass it along until it reaches a server close to the viewer. This approach is highly effective when it comes to delivering video to thousands or millions of viewers, because it allows the associated video to fan out from its source to a much larger number of geographically distributed servers, and no single server is required to bear the brunt of the entire audience. However, each server the stream passes through within the CDN adds delay, so viewers see what’s happening sometime after it actually happened, in some cases by as much as tens of seconds.

These live video streams must also accommodate a wide range of network conditions, from barely adequate and highly variable 2.5G mobile networks to rock-solid 100 Mbps home fiber connections. As anyone who has waited impatiently for a stream to start while staring at a spinner knows, changes in network conditions can be papered over with buffering: the player downloads part of the stream before it starts playing, giving the network enough of a head start that playback isn’t interrupted when a hiccup occurs and some part of the stream can’t be retrieved right away. Of course, buffering also adds delay (in fact, delay is all it really is), so the viewer’s experience is pushed even further from the action taking place in real time.
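To make the tradeoff concrete, here is a minimal playout-buffer sketch in C++ (the language our media code is written in). It simply holds each incoming frame for a fixed target delay before handing it to the decoder; the class and its behavior are illustrative only, not our implementation or WebRTC’s. The larger the target delay, the bigger the network hiccup it can ride out, and the further behind real time the viewer ends up.

```cpp
#include <cstdint>
#include <map>
#include <optional>

// Illustrative playout buffer: frames are held for a fixed target delay
// after arrival before being released to the decoder. A larger delay
// absorbs bigger network hiccups, but everything the viewer sees is
// that much older.
class PlayoutBuffer {
 public:
  explicit PlayoutBuffer(int64_t target_delay_ms)
      : target_delay_ms_(target_delay_ms) {}

  // Called when a frame arrives from the network.
  void Insert(int64_t arrival_time_ms, int frame_id) {
    frames_.emplace(arrival_time_ms, frame_id);
  }

  // Called by the playout clock; releases the oldest frame only once it
  // has aged for the full target delay.
  std::optional<int> NextFrame(int64_t now_ms) {
    if (frames_.empty()) return std::nullopt;
    auto it = frames_.begin();
    if (now_ms - it->first < target_delay_ms_) return std::nullopt;
    int frame_id = it->second;
    frames_.erase(it);
    return frame_id;
  }

 private:
  int64_t target_delay_ms_;
  std::multimap<int64_t, int> frames_;  // ordered by arrival time
};
```

A streaming player can comfortably set that delay to several seconds; a conversation cannot.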

Thanks to these techniques, it’s relatively easy to distribute one-way video streams via CDNs, but at the cost of significant delay. This is fine if all you want to do is sit back and watch a YouTube stream, but intolerable if you’re trying to have a two-way conversation. Large delays lead to awkward pauses and cross-talk, as neither side knows when the other is done talking. This is the crux of the difference between streaming video and real-time video: whereas streaming applications can easily tolerate large delays, effective real-time communication requires latencies in the neighborhood of tens of milliseconds, a thousand-fold difference. Consequently, the server infrastructure, management tools, and network protocols for real-time applications are radically different from those used for simple streaming.

A Multiplicity of Devices

Further complicating matters is the tremendous variety of devices people carry around every day. You might have the latest and greatest flagship phone, but odds are that not all of your friends and family do. Encoding and decoding video make heavy use of a device’s processing resources, so the gap in horsepower between high-end and bargain-basement phones translates directly into differences in the quality of video they can support.

These differences can easily be mitigated for simple two-person calls: the callers’ devices can exchange information regarding their capabilities and mutually decide on a configuration that works well for each of them. The situation is much more complicated for multi-person video chat, however. In scenarios in which two or more powerful devices are participating in a session with one that’s significantly less capable, falling back to the capabilities of the weak device needlessly shortchanges the users with more powerful phones.
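As a rough illustration, a two-party negotiation can be as simple as intersecting the two devices’ limits: each side advertises what it can handle, and the call runs at the minimum of the two. The struct and field names below are made up for the example; in practice, WebRTC’s offer/answer signaling carries much richer information.

```cpp
#include <algorithm>

// Hypothetical capability description for a device; the fields are
// illustrative, not an actual signaling format.
struct VideoCaps {
  int max_width;
  int max_height;
  int max_fps;
  int max_bitrate_kbps;
};

// For a one-to-one call, a workable configuration is simply the
// intersection of what both devices can handle.
VideoCaps Negotiate(const VideoCaps& a, const VideoCaps& b) {
  return {
      std::min(a.max_width, b.max_width),
      std::min(a.max_height, b.max_height),
      std::min(a.max_fps, b.max_fps),
      std::min(a.max_bitrate_kbps, b.max_bitrate_kbps),
  };
}
```

That lowest-common-denominator approach is exactly what stops working well once a third, weaker participant joins.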

Making this work well for everyone requires a solution that allows powerful devices to send and receive high-quality streams, while simultaneously ensuring that less powerful devices aren’t overwhelmed with high-resolution video they can’t handle. Additionally, this must be done at a reasonable cost, which precludes using back-end resources to create bespoke versions of every stream for every participant in a conversation.
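One widely used way to thread this needle is simulcast with selective forwarding: each sender encodes a small number of quality layers once, and the server forwards whichever layer best fits each subscriber, without transcoding anything on the back end. The sketch below illustrates that general idea; the layer names and selection rules are assumptions for the example, not a description of our media server’s internals.

```cpp
#include <string>
#include <vector>

// A pre-encoded quality layer published by the sender (simulcast).
struct SimulcastLayer {
  std::string name;   // e.g. "low", "mid", "high"
  int height;         // encoded resolution
  int bitrate_kbps;   // encoded bitrate
};

// Forwarding-side selection: pick the richest layer that fits both the
// subscriber's decode capability and its estimated downlink bandwidth.
const SimulcastLayer* SelectLayer(const std::vector<SimulcastLayer>& layers,
                                  int max_decode_height,
                                  int available_kbps) {
  const SimulcastLayer* best = nullptr;
  for (const auto& layer : layers) {
    if (layer.height > max_decode_height) continue;
    if (layer.bitrate_kbps > available_kbps) continue;
    if (best == nullptr || layer.bitrate_kbps > best->bitrate_kbps) {
      best = &layer;
    }
  }
  return best;  // nullptr: even the smallest layer doesn't fit right now
}
```

A flagship phone on fast Wi-Fi ends up with the high layer, a budget phone on a congested cellular link gets the low one, and the sender pays the encoding cost for each layer exactly once.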

Toward a Real-Time Delivery Network

At Airtime, we’ve spent the last few years working to solve these problems, with the goal of creating a real-time delivery network that can deliver high-quality, real-time video around the world and seamlessly accommodate both the broad range of devices our users own and the diverse networks they’re connected to.

In doing so, we’ve also had to address the challenge of delivering these services in a way that is scalable and cost-effective. Processing video is computationally expensive, so we’re forced to eschew the popular web frameworks typically used to run large-scale sites, instead relying on an old standby: C++, which also happens to be the implementation language of the WebRTC framework that underlies our service. This approach gives us native performance and easier integration with the framework code. By prioritizing efficiency and minimizing external dependencies, we’ve put together a system that will support us as we grow to millions of users and beyond without breaking the bank.

To make all this work, we’ve built a number of independent components, including:

  • A high-performance media server that distributes streams and continuously tailors them to the needs of users and their devices. Written in C++ to maximize scalability, this service is built on top of the open-source WebRTC project and ensures that we can deliver real-time audio and video to users regardless of the type of devices they are using and the conditions of the networks they’re connected to.
  • Native client frameworks for iOS and Android that allow our apps to publish and subscribe to video using our infrastructure, built on a customized and tuned port of the WebRTC stack.
  • A corresponding JavaScript framework that allows web apps to fully participate in Airtime calls using browsers’ built-in WebRTC implementations.
  • A stream management service that enables clients to publish, discover, and subscribe to new streams (a rough interface sketch follows this list).
  • Globally distributed systems to support the discovery and allocation of media servers, working in concert with real-time monitoring of audio and video performance.
  • Automated systems to bring up additional capacity as needed and to manage the deployment of new software releases.
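To give a flavor of what the stream management piece looks like from a client’s perspective, here is a hypothetical interface sketch; the type names and method signatures are invented for illustration and don’t reflect our actual API.

```cpp
#include <string>
#include <vector>

// Hypothetical description of a published stream.
struct StreamInfo {
  std::string stream_id;
  std::string publisher_id;
};

// Illustrative client-side view of a stream management service.
class StreamManagementClient {
 public:
  virtual ~StreamManagementClient() = default;

  // Announce a new outgoing stream in a room and receive its id.
  virtual std::string Publish(const std::string& room_id) = 0;

  // Find out which streams are currently available in a room.
  virtual std::vector<StreamInfo> Discover(const std::string& room_id) = 0;

  // Ask the infrastructure to start forwarding an existing stream to us.
  virtual void Subscribe(const std::string& stream_id) = 0;
};
```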

From 50,000 feet, the system looks a little like this:

As you can imagine, there’s more to it than we can cover in this post, so we’ll dig into parts of the system in more detail down the road a bit. Stay tuned!

In the meantime, if this sounds interesting to you, check out our open engineering roles here: https://grnh.se/cdc2a0401.

Originally published at https://techblog.airtime.com on March 8, 2019.
