Why we’re coding an MCU from scratch #WebRTC

What we’ve learned after a year doing advanced video at Tribe

Timothée Le Borgne · Published in Inside Tribe · 7 min read · Jan 23, 2018

When we joined Tribe a year ago, after quite a lot of time spent building one-to-many, highly available video backends for the likes of Dailymotion, Kewego, Afrostream or NomadCast (to name a few), we were faced with a new challenge: building real-time video conferencing into an async app.

It goes without saying that traditional HLS/DASH streaming was a no-go, given the inherent latency those techniques imply.

So naturally, we started looking into WebRTC.

Now, setting up a proof of concept using WebRTC is really simple. It’s actually a matter of hours; you can get some pretty decent P2P video chat up and running in no time.
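
To give an idea of how little that takes, here’s a minimal sketch using nothing but the standard browser APIs; the signaling WebSocket endpoint and the `remote` video element are placeholders, not our actual setup:

```typescript
// Minimal two-peer WebRTC sketch (browser). The signaling endpoint and DOM
// element are illustrative, not Tribe's actual signaling protocol.
const signaling = new WebSocket("wss://example.com/signaling");
const pc = new RTCPeerConnection({
  iceServers: [{ urls: "stun:stun.l.google.com:19302" }],
});

// Relay our ICE candidates to the other peer through signaling.
pc.onicecandidate = ({ candidate }) => {
  if (candidate) signaling.send(JSON.stringify({ type: "candidate", candidate }));
};

// Render whatever the remote peer sends us.
pc.ontrack = ({ streams: [stream] }) => {
  (document.getElementById("remote") as HTMLVideoElement).srcObject = stream;
};

async function start() {
  // Capture the local camera/mic and add the tracks to the connection.
  const local = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
  local.getTracks().forEach((t) => pc.addTrack(t, local));

  // Caller side: create and send the SDP offer.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  signaling.send(JSON.stringify({ type: "offer", sdp: offer.sdp }));
}

// The answering side mirrors this with setRemoteDescription / createAnswer.
```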

Building video to scale, though, is a whole different matter!

We quickly ruled out P2P

The obvious POC option is a mesh, where every peer connects directly to every other peer:

  • Virtually no costs: peers in a call connect directly with each other, so there are no bandwidth costs; signaling is super lightweight and can run on small VMs
  • By default, WebRTC handles quality control in a crude but efficient manner and maximizes quality for the available bandwidth, so you get great quality with minimal tweaking
  • Peers automatically adapt their publisher stream’s bitrate/resolution/FPS to match each other peer’s network conditions, individually

But while it’s great for iterating on a concept very fast, the downsides absolutely forbid (at least, for us) deploying it to production:

  • Individually matching everyone’s quality needs in an n-peer conversation means each device has to encode up to n-1 different outbound streams (publishers), which is far too CPU-intensive for most devices (mobiles, in Tribe’s case), especially if n is high. The device also has to decode n-1 inbound streams (listeners), all of which leads to burning-hot phones and batteries drained in no time (see the sketch after this list)
  • No on-the-fly fine-tuning of quality: once the SDP negotiation with the signaling server is done, you can’t modify or force the bitrate/quality under specific conditions (say, more peers join and you want to lower the overall publisher quality)
  • You can’t record or process the streams server-side, so… limited added-value developments further down the road
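
To make the CPU point concrete, here’s a rough sketch of what a mesh client actually does: one RTCPeerConnection, and therefore one encode and one decode pipeline, per remote peer. The signaling callback and the video helper are hypothetical:

```typescript
// Mesh topology sketch: every remote peer gets its own RTCPeerConnection,
// so a device in an n-peer call encodes its camera feed up to n-1 times and
// decodes n-1 incoming streams.
const connections = new Map<string, RTCPeerConnection>();

async function connectToPeers(
  peerIds: string[],
  local: MediaStream,
  sendOffer: (peerId: string, offer: RTCSessionDescriptionInit) => void // your signaling
) {
  for (const peerId of peerIds) {
    const pc = new RTCPeerConnection();
    // Each addTrack below means another outbound encoding of the same camera feed.
    local.getTracks().forEach((t) => pc.addTrack(t, local));
    // ...and each remote peer means another inbound stream to decode.
    pc.ontrack = ({ streams: [stream] }) => attachRemoteVideo(peerId, stream);
    const offer = await pc.createOffer();
    await pc.setLocalDescription(offer);
    sendOffer(peerId, offer);
    connections.set(peerId, pc);
  }
  // With 7 remote peers: 7 outbound encodes + 7 inbound decodes on a single phone.
}

// Hypothetical helper that plugs a remote stream into a <video> element.
function attachRemoteVideo(peerId: string, stream: MediaStream) {
  const video = document.createElement("video");
  video.autoplay = true;
  video.srcObject = stream;
  document.body.appendChild(video);
}
```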

Our first approach: setting up an SFU

If you need a WebRTC crash course, head over to BlogGeek.me and check out the excellent articles about mesh/SFU/MCU, etc.!

Our first concern was to enable calls with many peers on relatively modest devices (think iPhone 5). It turns out the best way to do that is to set up an SFU (Selective Forwarding Unit).

After a quick benchmark of different open-source solutions, Janus seemed both the most powerful and the most flexible one to adapt, so we started customizing it, i.e. bridging our signaling and tweaking settings (a lot).
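
For the curious, bridging your own signaling onto Janus essentially means translating your app’s messages into the videoroom plugin’s JSON API. The sketch below shows the gist of the create/attach/join handshake over Janus’s WebSocket transport; transaction handling is simplified, keepalives and the actual SDP exchange are left out, and the room id is illustrative:

```typescript
// Hedged sketch of driving Janus's videoroom plugin over its WebSocket API
// ("janus-protocol" sub-protocol). Assumes the socket is already open.
const ws = new WebSocket("wss://janus.example.com/ws", "janus-protocol");

// Send one Janus request and resolve with the matching (non-ack) response.
function request(msg: Record<string, unknown>): Promise<any> {
  const transaction = Math.random().toString(36).slice(2);
  return new Promise((resolve) => {
    const onMessage = (e: MessageEvent) => {
      const reply = JSON.parse(e.data);
      if (reply.transaction === transaction && reply.janus !== "ack") {
        ws.removeEventListener("message", onMessage);
        resolve(reply);
      }
    };
    ws.addEventListener("message", onMessage);
    ws.send(JSON.stringify({ ...msg, transaction }));
  });
}

async function joinRoom(roomId: number, display: string) {
  // 1) Create a Janus session, 2) attach to the videoroom plugin,
  // 3) join the room as a publisher. SDP negotiation would follow via "jsep".
  const session = await request({ janus: "create" });
  const sessionId = session.data.id;

  const handle = await request({
    janus: "attach",
    plugin: "janus.plugin.videoroom",
    session_id: sessionId,
  });
  const handleId = handle.data.id;

  await request({
    janus: "message",
    session_id: sessionId,
    handle_id: handleId,
    body: { request: "join", room: roomId, ptype: "publisher", display },
  });
}
```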

That gave very satisfying results; as a matter of fact, we’re still using an SFU at the moment:

  • It works fine with up to 8 peers (it can actually handle way more in a single room; that’s just a product limitation we set), even on slow devices
  • CPU and battery consumption client-side is much better, since there’s only one stream to encode (instead of n-1 in an n-peer call)
  • Since an SFU merely forwards incoming streams to the other connected peers, its CPU consumption is very low, so it can run on relatively small VMs

The really complex part is designing your video backend so that it scales massively and automatically. That last part is especially important if your audience is worldwide; you do NOT want to wake up in the middle of the night every time there’s a bug, a glitch, or a need to scale up :)

Without getting too specific: ours is mostly made of a CoreOS / Docker Swarm cluster of unit groups, each comprising a signaling server, a Redis database (for coturn) and a Janus instance, all with their own heartbeat container; a coturn cluster serves all unit groups and responds in STUN/TURN mode. A clustered Go microservice (in-memory database based on protected maps, heartbeats, etc.) acts as both proxy and “orchestrator” and basically does the following (a simplified sketch follows the list):

  • Directs all incoming streams to the correct unit group
  • Checks unit groups’ health and reboots them if one of the sub-servers fails
  • Monitors and forecasts global activity, and spawns/removes unit groups whenever needed
  • Spawns new VMs in the CoreOS cluster if necessary
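
Our real orchestrator is the clustered Go service described above; the sketch below is only an illustration of the idea (in TypeScript, with made-up capacity figures and helpers): a heartbeat-refreshed registry of unit groups, used both to route rooms and to decide when to recycle or spawn capacity.

```typescript
// Illustrative orchestrator sketch (the real one is written in Go and clustered).
interface UnitGroup {
  id: string;
  janusUrl: string;
  signalingUrl: string;
  activeRooms: number;
  lastHeartbeat: number; // epoch ms, pushed by the group's heartbeat container
}

const groups = new Map<string, UnitGroup>();
const HEARTBEAT_TIMEOUT_MS = 10_000;
const MAX_ROOMS_PER_GROUP = 50; // illustrative capacity figure

// Route an incoming room to the healthiest, least-loaded unit group.
function pickUnitGroup(): UnitGroup | undefined {
  const healthy = [...groups.values()].filter(
    (g) =>
      Date.now() - g.lastHeartbeat < HEARTBEAT_TIMEOUT_MS &&
      g.activeRooms < MAX_ROOMS_PER_GROUP
  );
  return healthy.sort((a, b) => a.activeRooms - b.activeRooms)[0];
}

// Periodically reap dead groups and ask for more capacity if none is left.
setInterval(() => {
  for (const g of groups.values()) {
    if (Date.now() - g.lastHeartbeat > HEARTBEAT_TIMEOUT_MS) {
      groups.delete(g.id);
      rebootUnitGroup(g.id);
    }
  }
  if (pickUnitGroup() === undefined) spawnUnitGroup();
}, 5_000);

function rebootUnitGroup(id: string) {
  // Hypothetical: ask the CoreOS/Swarm layer to recycle this unit group.
}
function spawnUnitGroup() {
  // Hypothetical: schedule a new signaling + Redis + Janus group (and a VM if needed).
}
```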

This setup covers “the basics” (you might want to set up a STUN server as well, to deal with NATs and firewalls, and of course look into making the whole thing global and multi-geo, etc.) in a nice and (almost!) worry-free way, and should look like what most other platforms/apps use.
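
On the client side, all of that STUN/TURN machinery boils down to the iceServers block handed to RTCPeerConnection; hostnames and credentials below are placeholders:

```typescript
// Placeholder STUN/TURN configuration: hostnames and credentials are illustrative.
const pc = new RTCPeerConnection({
  iceServers: [
    { urls: "stun:turn.example.com:3478" }, // public address discovery
    {
      urls: [
        "turn:turn.example.com:3478?transport=udp",
        "turn:turn.example.com:443?transport=tcp", // relay when direct/UDP paths are blocked
      ],
      username: "user",
      credential: "secret",
    },
  ],
});
```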

While it’s the most cost-efficient solution for a mainstream audience, we found out it did not entirely match our product ambitions, mostly because of tech limitations:

  • Janus has a ‘lip-sync’ issue (audio/video delay), probably due to an RTP timestamping error. Not the end of the world, but it can get really annoying as the call goes on
  • Streams are merely forwarded, so if peers have different bandwidth capacities you need to make a choice: fixed “average” quality settings that should more or less fit everyone in the call; high quality that will look great but may lose struggling peers; or the lowest quality, which will run smoothly everywhere but with degraded picture (one common way to enforce the first option is sketched after this list)
  • You can’t have a ‘multi-codec’ room: all peers in a room have to use the same codec. That’s annoying when you have iOS and Android users in the same call, as support for H264/VP8 and hardware encoding varies from one device to another…
  • Because an SFU merely forwards and does not decode incoming streams, you can’t ‘modify’ them (e.g. process the audio or video to add fun effects)
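
As an illustration of the first option above (fixed “average” settings), a common trick is to cap the video bitrate by munging the SDP answer before applying it; the 500 kbps figure is arbitrary and this is just one way of doing it:

```typescript
// Illustrative "fixed average quality" approach for an SFU: insert a b=AS
// bandwidth line into the video section of the SDP we are about to apply.
function capVideoBitrate(sdp: string, kbps: number): string {
  const out: string[] = [];
  let inVideoSection = false;
  for (const line of sdp.split("\r\n")) {
    if (line.startsWith("m=")) inVideoSection = line.startsWith("m=video");
    // Drop any existing bandwidth line in the video section so ours wins.
    if (inVideoSection && line.startsWith("b=AS:")) continue;
    out.push(line);
    // SDP expects the b= line right after the media section's c= line.
    if (inVideoSection && line.startsWith("c=")) out.push(`b=AS:${kbps}`);
  }
  return out.join("\r\n");
}

// Usage sketch: cap what this client will send by munging the answer it receives.
// const answer = await getAnswerFromSignaling(); // hypothetical signaling call
// await pc.setRemoteDescription({ type: "answer", sdp: capVideoBitrate(answer.sdp, 500) });
```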

Now switching to building our own MCU

Our main reason for looking into an MCU is product-driven: the Tribe app features a lot of cool stuff, including pretty resource-intensive masks, filters and games. Couple this with high-quality video encoding and decoding, and you get yourself a phone that tends to get burning hot pretty fast ;)

So we needed a way to reduce stream encoding/decoding overhead, maintain cross-device compatibility (iOS/Android/browser), and tweak quality so as to get the best of what every connected device can handle.

Last but not least, the idea was also to allow for cool/exclusive future product enhancements: server-side face detection/tracking/recognition and machine learning on UGC, masks & filters processed server-side and burnt into the feed, support for a full-web pivot (using React, for instance), a fun voice-swapping feature, etc.

That’s a set of requirements best met with an MCU (Multipoint Control Unit). MCUs have been around for a while, so nothing new here, but we needed something fully integrated with Tribe’s product, with no overhead whatsoever; something we’d deeply understand and be able to tweak at each and every level. Which is why we set out to build our very own MCU, from scratch.

We’ll probably share more about the steps in a coming post, because it’s a long — and somewhat painful — process, with a lot of missteps to avoid, but it’s also exciting and challenging :-)

The MCU ensures all listener streams are output at the best possible quality

Unlike an SFU, an MCU can transcode the ingested streams and output them in a different (or identical) quality/format. For us at Tribe, it means that we can:

  • Send a hardware-encoded VP8 stream from an Android device, and hardware-decode an H264 transcode of that stream on an iOS device, resulting in a huge performance increase and battery savings device-wise. In short: multi-codec rooms with hardware acceleration
  • Fine-tune the stream quality: publisher streams (can) get transcoded to exactly match each listener’s maximum capacity, so connected peers always get the highest possible stream quality, whatever the setup (a rough sketch of that per-listener logic follows this list)
  • Enable stream ‘stitching’. This is especially exciting, since it means there’s only one listener stream to decode, whatever the number of people in the call… and it’s always much easier to decode one heavy stream than n smaller ones
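
Here’s a hedged illustration of that per-listener logic: given what each listener can decode (ideally in hardware) and how much bandwidth it has, pick an output codec and bitrate, and share an encoder whenever two listeners end up with the same target. The types, codec preferences and bitrate tiers below are assumptions, not our actual implementation:

```typescript
// Illustrative per-listener output selection for an MCU (not Tribe's actual code).
type Codec = "H264" | "VP8";

interface ListenerCaps {
  hardwareCodecs: Codec[]; // codecs the device can decode in hardware
  downlinkKbps: number;    // estimated available bandwidth
}

interface EncoderTarget {
  codec: Codec;
  bitrateKbps: number;
}

const TIERS_KBPS = [250, 500, 1000, 2000]; // illustrative quality tiers

function targetFor(listener: ListenerCaps): EncoderTarget {
  // Prefer a codec the device decodes in hardware (e.g. H264 on iOS, VP8 on
  // some Androids), then snap the bitrate to the highest tier it can sustain.
  const codec = listener.hardwareCodecs[0] ?? "VP8";
  const bitrateKbps =
    [...TIERS_KBPS].reverse().find((t) => t <= listener.downlinkKbps) ?? TIERS_KBPS[0];
  return { codec, bitrateKbps };
}

// Listeners that share a target share a transcode, which is the key to keeping
// the encoder count (and the bill) under control.
function encoderPool(listeners: ListenerCaps[]): Map<string, EncoderTarget> {
  const pool = new Map<string, EncoderTarget>();
  for (const l of listeners) {
    const t = targetFor(l);
    pool.set(`${t.codec}@${t.bitrateKbps}`, t);
  }
  return pool;
}
```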

Other cool things include:

  • Manipulating the audio/video stream. We can, for instance, isolate the audio channel, do some neat voice conversion using RNNs/CNNs (neural networks), and re-attach it during transcoding so you get a whole different voice
  • Processing images for specific shape detection (you know…), and automatically kicking/banning improper behavior, for advanced moderation
  • Processing the video stream to play some fun gesture-based games, without burdening the devices
  • And actually, “codec-agnostic calls” allow for really, really cool things, like using the (newly) built-in WebRTC capabilities of Safari on iOS 11 and Chrome on Android to initiate calls without even needing the app (a generic feature check is sketched after this list). This really paves the way for progressive web apps (vs. native apps), where one code base efficiently serves all platforms (iOS/Android/desktop)
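
The “no app needed” check mentioned above is, on the browser side, little more than a couple of feature tests; this is a generic sketch, not Tribe’s detection code:

```typescript
// Generic feature check (not Tribe's actual detection code): can this browser
// capture media and open a peer connection, i.e. join a call without the app?
function canJoinFromBrowser(): boolean {
  return (
    typeof RTCPeerConnection !== "undefined" &&
    typeof navigator !== "undefined" &&
    !!navigator.mediaDevices?.getUserMedia
  );
}
```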

That being said, there are obvious downsides to maintaining an MCU, the first one being that it’s super CPU-intensive (meaning it’s really going to cost you a LOT 💵) and much harder to scale than an SFU.

Just consider this for a moment: in an n-peer call, each publisher stream gets transcoded n-1 times to match the custom requirements of all the other peers, which means your MCU will spawn n*(n-1) encoders for a single room. Not very scalable if you don’t work out some sort of mutualization or stitching! ;-)
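
A quick worked example of that blow-up, and of why encoder sharing matters (the tier count is illustrative):

```typescript
// Worked example: naive per-listener transcoding vs. sharing encoders between
// listeners that end up with the same target quality (numbers are illustrative).
const naiveEncoders = (n: number) => n * (n - 1);     // every publisher, once per listener
const pooledEncoders = (n: number, tiers: number) =>  // at most one encode per tier per publisher
  n * Math.min(n - 1, tiers);

console.log(naiveEncoders(8));     // 56 encoders for a single 8-peer room
console.log(pooledEncoders(8, 3)); // 24 with three shared quality tiers
```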

So you need to be prepared and ask yourself: does my product really benefit from this? Do I need that much quality control (or just quality, for that matter)? Do I know how to scale? Will my product make use of transcoding?

For us, it was an instant go 👊

But if the answer is no, know that you’re way better off with an SFU. If it’s yes, well, hop on the train and start digging into the WebRTC RFCs: it’s a long and challenging road, but the possibilities are amazing ;-)

(With ❤️ from Tim, Sébastien, Marc & Germán @Tribe Video)
