WebRTC Architecture Basics: P2P, SFU, MCU, and Hybrid Approaches

Grasping the basic concepts, terminology, and architectures

Mukund Iyengar
SecureMeeting
9 min read · Mar 10, 2021


(Image source: SecureMeeting)

This post is a part of a long series of articles we’re releasing to the developer community to help everyone build amazing live streaming solutions. For Part 2 of this, click here. — Team SecureMeeting.

So you want to build a live streaming solution, and you are torn between different architectures? You understand the power of live streaming, but has the sheer volume of esoteric jargon bogged you down? We are here to help!

If you have a basic understanding of networking, we aim for this series of posts to be everything you need to grasp WebRTC live streaming.

What is WebRTC, and why is it awesome?

Despite decades of Internet evolution, live streaming was not very accessible to the developer community until around 2011, when Google open-sourced WebRTC, a project built on technology it had acquired from Global IP Solutions, to democratize live streaming. With solid backing from the IETF and W3C, and increasing adoption across web browsers, WebRTC is set to disrupt the very notion of live streaming. We firmly believe that much of live streaming traffic will be WebRTC-enabled in the next decade or so.

WebRTC takes care of transcoding, packetizing, networking, and security, all rolled into a few easily accessible APIs. It allows the developer to focus on building compelling live streaming applications without worrying about the underlying mechanics. The only thing WebRTC does not take care of is signaling.
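To make that API surface concrete, here is a minimal sketch (TypeScript) of what a caller does: capture media, hand it to a peer connection, and ship the resulting offer through your own signaling channel. The sendToPeer() helper is a placeholder for whatever signaling transport you choose.

    // Capture media and create an offer; WebRTC handles transcoding,
    // packetization, and encryption internally.
    async function startCall(sendToPeer: (msg: object) => void) {
      // Triggers the browser's camera/microphone consent prompt.
      const stream = await navigator.mediaDevices.getUserMedia({
        video: true,
        audio: true,
      });

      const pc = new RTCPeerConnection();
      stream.getTracks().forEach((track) => pc.addTrack(track, stream));

      // The only part left to you: relay ICE candidates and the offer.
      pc.onicecandidate = (e) => {
        if (e.candidate) sendToPeer({ candidate: e.candidate });
      };

      const offer = await pc.createOffer();
      await pc.setLocalDescription(offer);
      sendToPeer({ sdp: pc.localDescription });
    }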

WebRTC basics #1: Transcoding

Converting media from different sources to a common format

Transcoding is an umbrella term covering "encoding" and "decoding". It is the process that ensures a consistent media format during communication. When media is captured from a camera or microphone, it is often in a raw, uncompressed format, and different manufacturers use different raw formats to preserve capture quality. Encoding is the first step in reducing the data size to make low-latency streaming possible. To make media sources from different vendors interoperate, they must be encoded (and later decoded) using the same language; this process of reducing media to a common format is called "transcoding".

WebRTC uses a handful of codecs, namely VP8, VP9, and H.264 for video, and Opus for audio. Your choice of codec will depend on which browsers and devices your participants use to view the stream.

Browser Compatibility (Video & Audio codecs)

Chrome: Video (H.264, VP8, VP9), Audio: AAC, MP3, Vorbis, Opus

Firefox: Video (H.264, VP8, VP9), Audio: AAC, MP3, Vorbis, Opus

Internet Explorer: Video (H.264), Audio: MP3

Safari: Video (H.264), Audio: MP3, AAC

iPhone: Video (H.264), Audio: MP3, AAC

Android: Video (H.264, VP8), Audio: AAC, MP3, Vorbis

Chrome Android: Video (H.264, VP8, VP9), Audio: AAC, MP3, Vorbis, Opus

Video: Based on the breakdown above, H.264 is the most universally supported format across browsers. There is very little to gain by switching to VP8 or VP9 unless your end device requires them (the original version of WebRTC supported only VP8).

Audio: You might run into situations where the video works but the audio simply does not. In our experience, this can happen with Firefox, Safari, IE/Edge, or Opera. As the table above shows, AAC enjoys more universal adoption than the default, Opus.
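If you do need to steer a session toward a particular codec, modern browsers expose RTCRtpTransceiver.setCodecPreferences (support varies by browser, so treat this as a hedged sketch rather than a universal recipe). It must be called before creating the offer or answer:

    // Reorder codec capabilities so H.264 is preferred (TypeScript).
    function preferH264(pc: RTCPeerConnection): void {
      const caps = RTCRtpReceiver.getCapabilities("video");
      if (!caps) return; // capability query unsupported in this browser

      // Put H.264 entries first; list order expresses preference.
      const sorted = [...caps.codecs].sort(
        (a, b) =>
          Number(b.mimeType === "video/H264") -
          Number(a.mimeType === "video/H264")
      );

      for (const t of pc.getTransceivers()) {
        if (t.sender.track?.kind === "video") t.setCodecPreferences(sorted);
      }
    }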

WebRTC basics #2: Transport Layer

Once the computer gets the data from the webcam and microphone and it is transcoded, it must packetize this information to send it to the other people in the call. The foundational protocol that powers WebRTC is UDP. While UDP guarantees neither packet delivery nor ordering, it is the best delivery protocol for live streaming because it favors speed. On top of UDP, WebRTC layers several more protocols to establish and maintain peer-to-peer connections, secure data transfers, and provide congestion and flow control. The protocols that deal with peer-to-peer connectivity are ICE, STUN, and TURN; together they negotiate NAT traversal when establishing peer-to-peer sessions. DTLS secures and encrypts all data transfers between participants. Finally, SCTP (for data channels) and SRTP (for media) multiplex the streams and provide congestion and flow control.
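In practice, most of that protocol stack is wired up in one place: the RTCPeerConnection constructor. A minimal sketch, with placeholder server URLs and credentials (not real infrastructure):

    // Configure ICE with STUN (address discovery) and TURN (relay fallback).
    const pc = new RTCPeerConnection({
      iceServers: [
        { urls: "stun:stun.example.org:3478" }, // discover public IP/port
        {
          urls: "turn:turn.example.org:3478", // relay when direct paths fail
          username: "demo-user",              // placeholder credential
          credential: "demo-pass",
        },
      ],
    });

    // ICE gathers candidate routes; DTLS and SRTP then secure whichever
    // path wins, all without extra application code.
    pc.onicecandidate = (e) => console.log("candidate:", e.candidate?.candidate);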

WebRTC basics #3: Security

One of the primary tenets of WebRTC was to bring end-to-end encryption to all video calls, and it approaches security from several angles, from the protocol level up to the browser level. For a start, every stream sent between participants is encrypted with the Secure Real-time Transport Protocol (SRTP), which ensures that only the participants in a session can view its contents. To generate the session keys, WebRTC uses DTLS-SRTP, which exchanges keys directly between peers on the media plane. Additionally, WebRTC requires a secure signaling server to establish peer connections; the signaling server must use HTTPS, which encrypts everything sent through it.

At the browser level, several factors protect you. First, browsers require the user to give explicit consent before a page can access the microphone or webcam. Second, browsers grant access to WebRTC features only on sites served over HTTPS. Lastly, browsers anonymize your device information until the user has directly granted access.
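A small sketch of what those browser-level guarantees look like from application code, checking the secure context and relying on the consent prompt raised by getUserMedia:

    // Verify the page can use WebRTC, then request device access.
    async function checkSecurity(): Promise<void> {
      // WebRTC features require a secure (HTTPS) context.
      if (!window.isSecureContext) throw new Error("Serve this page over HTTPS");

      // getUserMedia resolves only after the user grants consent;
      // device details stay anonymized until then.
      const stream = await navigator.mediaDevices.getUserMedia({
        audio: true,
        video: true,
      });
      console.log("access granted to", stream.getTracks().map((t) => t.kind));
    }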

WebRTC basics #4: Signaling

Signaling is the process of maintaining a notion of “who” is part of an ongoing live stream session and who is not. WebRTC does not handle signaling — it is left to the developer to take care of. This turns out to be a great thing.

In the telephony world, signaling is the part where you place a phone call to a friend and their phone starts to ring. They must accept your call for the actual conversation to initiate. This part is the “signaling” — you are inviting (‘signaling’) folks to join a call, and this phase is distinctly different/separate from the actual data transfer during the call.

There are a variety of signaling protocols one can use. The simplest is plain WebSockets. A step up is socket.io, which adds reconnection and transport fallbacks on top of WebSockets. The most advanced and complete suite is the Session Initiation Protocol (SIP). Picking the right signaling layer will help you scale well and do more advanced things (like letting users dial in to a WebRTC session by phone).
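For illustration, a bare-bones WebSocket signaling client might look like the sketch below. The message shape ({ sdp } / { candidate }) and the server URL are assumptions, since WebRTC does not mandate any particular signaling format; it also assumes a pc (RTCPeerConnection) like the one created earlier.

    declare const pc: RTCPeerConnection; // from the earlier sketch

    const ws = new WebSocket("wss://signal.example.org"); // placeholder server

    // Relay offers, answers, and ICE candidates between the two peers.
    ws.onmessage = async ({ data }) => {
      const msg = JSON.parse(data);
      if (msg.sdp) {
        await pc.setRemoteDescription(msg.sdp);
        if (msg.sdp.type === "offer") {
          await pc.setLocalDescription(await pc.createAnswer());
          ws.send(JSON.stringify({ sdp: pc.localDescription }));
        }
      } else if (msg.candidate) {
        await pc.addIceCandidate(msg.candidate);
      }
    };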

We next turn to the popular topologies used in WebRTC architecture, namely P2P, MCU, SFU and Hybrid.

#1 Peer-to-peer mesh (messy) architecture

WebRTC's original implementation supported only peer-to-peer (P2P) communication. In a P2P mesh, each participant is directly connected to every other participant via an active connection/data channel. Each participant sends their video and audio individually to every single peer and downloads video and audio from every peer, so each peer handles a number of connections equal to the total number of peers minus one. For a session with N participants, the total number of connections grows as O(N²), which can get messy very quickly.
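The arithmetic is worth making explicit: each peer maintains N-1 links, so the session as a whole carries N(N-1)/2 distinct connections. A toy illustration:

    // Total distinct links in a full mesh of N participants.
    function meshLinks(participants: number): number {
      return (participants * (participants - 1)) / 2;
    }

    console.log(meshLinks(4));  // 6 links: still comfortable
    console.log(meshLinks(10)); // 45 links: each peer sends/receives 9 streams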

The pros of a P2P topology are extremely low latency (the lowest possible, in fact), high security by default (streams are end-to-end encrypted), and nearly zero operational costs, because no media servers are required. In our experience, it can cost as little as $50/month in hosting to support a global collection of callers at roughly 400 calls per day.

The cons of this topology, however, are manifold: (i) it scales poorly: with almost N² connections, communication can break down (it is hard to support more than 8 people in a call); (ii) connectivity problems: most callers sit behind a NAT box, which hides their true IP. Although this can be circumvented to an extent with STUN/TURN servers, many firewalls still actively block connections; (iii) heavy load on endpoints: each node has to actively send and receive N-1 video streams, which again adds to the scalability problem (picture trying to upload and download 8+ online videos at the same time; you get the drift).

#2 Multipoint Control Unit, or MCU

In a multipoint control topology, each participant connects to a server known as a multipoint control unit (MCU). The MCU receives media from each participant, decodes it, mixes the audio and video together into a single stream, and sends that composite back to each participant. Each participant therefore uploads their stream just once, and the server returns exactly one stream containing the mixed audio and video of all the participants.

The pros of using an MCU are that it works especially well in low bandwidth environments and scales very well with an increasing number of participants. The cons are that the server needs to utilize lots of CPU power to mix the streams of all the participants. Also note that in this set-up, transcoding is pushed entirely to the server — which dramatically increases the CPU load, while also increasing latency because of transcoding delays.

If you want to run media processing (like OpenCV) on your video streams, or introduce custom backgrounds and blurs, an MCU might be the best way forward. Note, however, that the MCU dictates the media layout, leaving end users little choice in the matter.

#3 Selective Forwarding Unit, or SFU

In a selective forwarding topology, each participant in a session connects to a server known as a selective forwarding unit (SFU). At its core, an SFU is just a "forwarder": little to no processing happens there.

Each node sends its encoded media to the SFU, which then forwards it to all the other nodes in the session. Unlike the MCU approach, transcoding happens at the edges, not at the server. Clients upload only one stream while actively receiving N-1 streams from the SFU. This turns out to be a healthy compromise: most residential connections have weaker uplink bandwidth but higher downlink speeds.

SFU-based approaches tend to scale very well while keeping server load to a minimum. Reducing the server to a forwarder also keeps deployment costs fairly low. Specific features like background blur and video layout are likewise pushed to the edge nodes.

The downside to SFU is security — by definition, media is not end-to-end encrypted. Why? Because the video stream has to be terminated and re-created at the SFU. There are some (experimental) approaches to overcoming this by means of double encryption (see here). The core idea is really neat: split the media packets into headers + data, and encrypt them independently. While the header continues to be decrypted at the SFU for forwarding, the media content is only decrypted at the edges, thereby ensuring end-to-end encryption despite the middleman.
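As a rough illustration of the idea, Chrome's experimental Insertable Streams API lets an application transform encoded frames before SRTP, so the payload can be encrypted end to end while headers remain readable to the SFU. The sketch below is illustrative only: xorPayload() stands in for a real cipher, and a production version must also leave codec-specific headers intact.

    // Requires the peer connection to be created with
    // { encodedInsertableStreams: true } (Chrome-specific, experimental).
    function protectSender(sender: RTCRtpSender): void {
      const { readable, writable } = (sender as any).createEncodedStreams();

      readable
        .pipeThrough(
          new TransformStream({
            transform(frame: any, controller: TransformStreamDefaultController) {
              // Encrypt only the payload; the SFU can still route the
              // packet without ever seeing the media itself.
              frame.data = xorPayload(frame.data); // placeholder cipher
              controller.enqueue(frame);
            },
          })
        )
        .pipeTo(writable);
    }

    // Demo-only XOR "cipher": NOT secure, purely illustrative.
    function xorPayload(buf: ArrayBuffer): ArrayBuffer {
      const bytes = new Uint8Array(buf);
      for (let i = 0; i < bytes.length; i++) bytes[i] ^= 0x55;
      return buf;
    }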

#4 Hybrid approaches: SFU trees and SFU/MCU/P2P on demand

A more realistic approach to planet-scale live streaming is to build a platform that dynamically selects the underlying architecture depending on the number of participants, their geographical locations, and their network bandwidth.

A simple strategy is to use P2P mode when there are fewer than 4 participants and switch to an SFU once the session crosses that threshold. Most open-source SFUs today scale well to about 20–30 participants.
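In code, that policy can be as small as the sketch below; the topology names and cutoffs are illustrative, borrowed from the thresholds above rather than from any standard.

    type Topology = "p2p" | "sfu" | "sfu-tree";

    // Pick a topology from participant count alone; real platforms also
    // weigh geography and measured bandwidth.
    function pickTopology(participants: number): Topology {
      if (participants < 4) return "p2p";   // lowest latency, e2e encrypted
      if (participants <= 30) return "sfu"; // a single forwarding server copes
      return "sfu-tree";                    // cascade SFUs beyond that
    }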

A complete set of directions on this is probably outside the scope of this Medium article, but we would love to release more on it in future articles.

Closing thoughts

For far too long, big corporations waged transcoding wars that kept media from being interoperable. The video industry has seen the likes of RealPlayer, QuickTime, Windows Media, Adobe Flash, and many other players create silos that only hurt the common user. With the proliferation of HTTP Live Streaming (HLS) and transcoding baked into our browsers, WebRTC and allied technologies hold the promise of truly democratizing live streaming. Whether your needs are telemedicine, work-from-home solutions, online retail, or education, now is the time to get your organization experimenting with WebRTC to bring the power of live streaming to your business.

We welcome comments and discussions about live streaming: hello@securemeeting.org

About SecureMeeting

SecureMeeting is a 501(c)(3) non-profit whose mission is to advance human rights and freedom of speech. We do this by designing, developing and deploying privacy preserving communication platforms for all of mankind.

By creating this technical series on video streaming technologies, our hope is to help developers all over the world build amazing live streaming solutions.
