This article is the third in the series 'Building Myntra's Video Platform'. If you haven't read the first and second parts, you may want to do so for better context.
In this article we will delve a bit deeper into the WebRTC side of things.
WebRTC enables two-way, real-time communication between multiple participants, typically fewer than 1,000. Running WebRTC for a very large number of users is very expensive. Latency is low, generally well under a second, and can be as low as a few milliseconds.
One question people generally have about WebRTC is whether it is the only solution that enables such communication. The answer is no; it is possible to build custom solutions too. As per 'How Zoom's web client avoids using WebRTC (DataChannel Update) — webrtcHacks', Zoom initially used WebSockets and WebAssembly to transfer and decode media, and as of 2019 moved to the WebRTC DataChannel, though it still doesn't use the complete WebRTC stack. There are many other options too.
However, using WebRTC has many advantages, the most important ones being:
- It is open source software
- It is free
- It has the support of a large developer community
- It is an end-to-end platform and is device independent
Some of the most popular firms and products that use, or have used, WebRTC in some form include:
- Google (Meet, Duo, Hangouts)
- Meta (Facebook Messenger, Instagram Live, WhatsApp)
- Microsoft (Teams)
- Amazon (Chime)
- Cisco (WebEx)
Let's talk about the core WebRTC constructs.
Signaling is the process by which two devices attempt to connect, using a server for coordination. Once the connection is established, the server is no longer needed and direct peer-to-peer communication can take place. WebRTC leaves this part for developers to implement in whatever way they deem fit.
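Since WebRTC leaves signaling unspecified, a common pattern is a small server that relays JSON envelopes (offer, answer, ICE candidates) between peers, often over WebSockets. Here is a minimal in-memory sketch of that relay logic; all class, field, and peer names are illustrative, not part of any WebRTC API:

```javascript
// Minimal in-memory signaling relay: peers register a message handler
// and messages addressed to them are delivered through it. In production
// this would typically be a WebSocket server doing the same relaying.
class SignalingRelay {
  constructor() {
    this.peers = new Map(); // peerId -> message handler
  }
  register(peerId, onMessage) {
    this.peers.set(peerId, onMessage);
  }
  send(toPeerId, message) {
    const handler = this.peers.get(toPeerId);
    if (handler) handler(message);
  }
}

const relay = new SignalingRelay();
const received = [];
relay.register("bob", (msg) => received.push(msg));

// Alice sends Bob an SDP offer envelope (SDP payload shortened here).
relay.send("bob", { type: "offer", from: "alice", sdp: "v=0 ..." });
console.log(received[0].type); // "offer"
```

In a real system Bob would reply with an `answer` envelope through the same relay, and both sides would also exchange `candidate` envelopes as ICE candidates are gathered.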
WebRTC allows direct peer-to-peer communication and is UDP based. Peer 1 captures and encodes video and audio and sends them across to Peer 2, which decodes them, and vice versa.
The content sent between the two clients can be video, audio, files, or arbitrary data, and it travels over a secure data channel. WebRTC uses the Opus audio codec and H.264/VP8 for video.
Session Traversal Utilities for NAT (STUN)
A client may be behind a NAT and may only know its local IP address. The client can reach out to a STUN server, which tells the client its public (NAT'd) IP address. In peer-to-peer (P2P) mode, a STUN server can be used to discover the IP addresses of both clients. This may not work well with restrictive firewalls, so STUN servers alone can't help in all cases.
Traversal Using Relays around NAT (TURN)
In cases where the STUN server is not able to produce a usable address, all the media may be relayed through a TURN server. This is more bandwidth intensive, and TURN servers may be costlier to run than STUN servers. Relaying can also introduce delays, because all traffic flows through the server.
However, to support everyone who is unable to get an address through STUN or unable to connect directly, TURN may be needed too.
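In the browser, STUN and TURN servers are handed to `RTCPeerConnection` through its `iceServers` configuration. A sketch of that configuration follows; the URLs and credentials are placeholders, not real servers:

```javascript
// RTCPeerConnection configuration listing STUN and TURN servers.
// The URLs and credentials below are placeholders, not real servers.
const rtcConfig = {
  iceServers: [
    // STUN: used to discover the client's public (NAT'd) address.
    { urls: "stun:stun.example.com:3478" },
    // TURN: relays media when a direct connection cannot be made.
    {
      urls: "turn:turn.example.com:3478",
      username: "user",
      credential: "secret",
    },
  ],
};
// In a browser: const pc = new RTCPeerConnection(rtcConfig);
console.log(rtcConfig.iceServers.length); // 2
```

Listing both kinds of servers lets the browser fall back to the TURN relay only when STUN-discovered addresses fail the connectivity checks.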
Interactive Connectivity Establishment (ICE) protocol
ICE generates the possible options (candidates) for media traversal. Multiple candidates are generated, which the clients can then try connecting with. It is used both in P2P and SFU modes (more on those later). Candidates may be gathered with the help of STUN or TURN servers.
The ICE flow in WebRTC can be classified in three main phases:
- Gathering candidates
- Verifying the candidates
- Final nomination
In the first phase, the ICE agents on both sides of the connection collect a list of candidate transport addresses that can be used for communication; the remote agent receives this list sorted by priority.
In the second phase, the agents exchange candidate lists and perform connectivity checks on each candidate pair to determine which ones are viable.
Finally, during the last phase, the agents nominate a single candidate pair for use in the connection.
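The priority used to sort candidates is computed from the candidate type: host candidates rank above server-reflexive (STUN-discovered) ones, which rank above relayed (TURN) ones. A sketch of the RFC 8445 priority formula, using the RFC's recommended type preference values:

```javascript
// ICE candidate priority per RFC 8445: higher values are tried first.
// Recommended type preferences: host 126, prflx 110, srflx 100, relay 0.
const TYPE_PREFERENCE = { host: 126, prflx: 110, srflx: 100, relay: 0 };

function candidatePriority(type, localPreference = 65535, componentId = 1) {
  return (
    2 ** 24 * TYPE_PREFERENCE[type] +
    2 ** 8 * localPreference +
    (256 - componentId)
  );
}

// Host (local) candidates outrank STUN-discovered (srflx) candidates,
// which outrank TURN-relayed candidates.
const sorted = ["relay", "srflx", "host"].sort(
  (a, b) => candidatePriority(b) - candidatePriority(a)
);
console.log(sorted); // [ 'host', 'srflx', 'relay' ]
```

This ordering is why a direct connection is attempted first and the TURN relay is only nominated as a last resort.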
The image below shows a sample ICE server response listing the STUN and TURN options.
Example of an ICE configuration for a meeting on Google Meet
ICE provides multiple transport options for clients to connect with each other and with an SFU (more on that later). The options, with their pros and cons, are listed below:
UDP
- Suits VoIP
- Lossy, so unreliable
- Can get blocked by firewalls
TCP
- More reliable
- More relaxed rules on firewalls
- Less suited than UDP for VoIP
TLS over TCP
- Similar benefits to TCP
- Overhead of double encryption
TLS over TCP means encrypting already-encrypted traffic: WebRTC is transport-encrypted by protocol definition, so another layer of TLS over it may be wasted effort.
Based on these pros and cons and the requirements, either a UDP-based or a TCP-based approach can be chosen.
Here is what a simplified WebRTC piece looks like:
Types of WebRTC connections
Peer to Peer (P2P)
Two clients connect directly to each other, aided by TURN servers in cases where they are unable to connect directly.
However, P2P mode with a large number of clients introduces multiple problems, described below.
Multiple Hosts in one live stream
If n clients all connect to each other, nC2 = n(n−1)/2 peer connections are needed, with media flowing in both directions on each, i.e. nC2 × 2 streams. This creates a very big mesh.
For six clients, the total number of streams stands at 30. As the number of clients grows, the number of connections increases quadratically, so this approach is not scalable.
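The quadratic growth above can be made concrete with a small helper (function names are illustrative):

```javascript
// In a full mesh every pair of clients needs a peer connection, and
// media flows in both directions, so streams grow quadratically with n.
function meshConnections(n) {
  return (n * (n - 1)) / 2; // nC2 peer connections
}
function meshStreams(n) {
  return n * (n - 1); // each connection carries media both ways
}

console.log(meshConnections(6)); // 15
console.log(meshStreams(6)); // 30
console.log(meshStreams(50)); // 2450 (clearly not scalable)
```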
Multipoint Control Unit (MCU)
It needs an intermediate system to manage the connections with the different clients.
In MCU mode, there is a central hub called the MCU. It receives media from all clients, arranges the streams into a single layout, and generates one combined media (video and audio) stream by stitching them together. This compositing is computationally expensive.
The MCU maintains one inbound and one outbound connection with each client, bringing the total number of connections down to 2n.
Pros:
- It reduces the number of connections needed
Cons:
- It is processor intensive
- Cost may increase
- Latency may increase, as the MCU stitches everyone's video together
- The layout and sizing of the video are fixed
- If the user's camera preview is shown alongside this video, they may end up seeing their own feed twice: once in the preview and again in the stitched video
Selective Forwarding Unit (SFU)
In Selective Forwarding Unit (SFU) mode, the number of connections is higher than with an MCU but lower than with all peers connected directly to each other. Every client sends its stream to the SFU and receives the streams of all the other clients; the SFU decrypts, re-encrypts, and relays the content.
With n clients in total, each client publishes one stream to the SFU and gets n−1 streams back from it.
Pros:
- A lot more flexibility in layouts on the client side
- Latency is low
Cons:
- Bandwidth intensive
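The three topologies can be compared by counting total media streams, under simplifying assumptions (MCU: one stream up and one composited stream down per client; SFU: one stream up and n−1 forwarded streams down per client):

```javascript
// Total media streams flowing through the system for n clients,
// under each topology (assumptions noted per case).
function totalStreams(topology, n) {
  switch (topology) {
    case "mesh": // every client sends to every other client directly
      return n * (n - 1);
    case "mcu": // 1 up + 1 composited stream down, per client
      return 2 * n;
    case "sfu": // 1 up + (n - 1) forwarded streams down, per client
      return n + n * (n - 1);
    default:
      throw new Error(`unknown topology: ${topology}`);
  }
}

for (const t of ["mesh", "mcu", "sfu"]) {
  console.log(t, totalStreams(t, 6)); // mesh 30, mcu 12, sfu 36
}
```

Note that although the SFU's total stream count is close to the mesh's, each client's upload load drops from n−1 streams to one, which is what makes the SFU practical at scale.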
SFU with Simulcast
Each client sends two versions of its stream: one in high quality and one in low/thumbnail quality, as needed for the layout.
Mode 1: Active and passive speaker mode
When client n is the active speaker, its high-quality stream is sent, and thumbnail quality is sent for all the others.
Mode 2: Pinned client
If client n pins some client to show in full-screen mode, that client's high-quality video is sent and thumbnail quality is sent for the others.
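The selection logic of the two modes above can be sketched as a small function; the function and field names are made up for illustration and not part of any SFU's API:

```javascript
// For a given viewer, decide which quality of each publisher's
// simulcast stream the SFU should forward. Names are illustrative.
function selectQuality(publisherId, viewer) {
  if (viewer.pinnedId) {
    // Mode 2: the pinned client gets high quality, everyone else thumbnails.
    return publisherId === viewer.pinnedId ? "high" : "thumbnail";
  }
  // Mode 1: the active speaker gets high quality.
  return publisherId === viewer.activeSpeakerId ? "high" : "thumbnail";
}

const viewer = { activeSpeakerId: "p2", pinnedId: null };
console.log(selectQuality("p2", viewer)); // "high"
console.log(selectQuality("p3", viewer)); // "thumbnail"

viewer.pinnedId = "p3"; // the viewer pins p3 to full screen
console.log(selectQuality("p3", viewer)); // "high"
console.log(selectQuality("p2", viewer)); // "thumbnail"
```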
Pros:
- Suits layouting
- Low latency
Cons:
- Harder to build than the other modes
Here is an end-to-end flow for two people making a WebRTC call.
- Signaling: both peers exchange signaling messages to establish a connection. This includes exchanging network information, such as IP addresses and ports, and session descriptions that describe the media streams that will be sent and received.
- ICE: gathering candidates, verifying the candidates, and final nomination, as described earlier.
- DTLS Handshake: Both peers perform a DTLS handshake to establish a secure connection.
- Media Stream Setup: Both peers set up their media streams, including audio and video tracks.
- Media Stream Exchange: Both peers exchange their media streams with each other over the established connection.
- Media Stream Teardown: When either peer wants to end the call, they tear down their media stream and close the connection.
In SFU mode, step 5 of the P2P flow is different: media is exchanged with the central SFU server instead of directly with the other peer.
Building a WebRTC tech stack
WebRTC exposes SDKs for web, mobile, and desktop platforms. The diagram above shows the different layers present in the SDK.
The web app has to be developed on our own, and all the constructs explained earlier have to be accounted for too.
As is evident, there is a lot to be built with WebRTC, and building it from scratch requires a lot of effort.
Thankfully, there are quite a lot of open source solutions which utilize the WebRTC SDK.
The open source solutions which were evaluated include:
Based on our evaluations, Jitsi was chosen because it:
- Is completely free
- Is battle tested in production: https://jitsi.org/jitsi-meet/
- Is fairly popular
- Has full-fledged Android and iOS apps and SDKs
- Has a web implementation
- Ships one of the best open source SFUs available: https://webrtchacks.com/sfu-load-testing/
- Makes multiple users in the same live stream possible
- Has a video capture module which can be extended and modified
On the performance front, 'Jitsi Videobridge Performance Evaluation | Performance Testing' discusses the performance of the Videobridge component of Jitsi.
This thread discusses performance too: https://github.com/jitsi/jitsi-meet/issues/446
We decided to use Jitsi for the core WebRTC stack. We evaluated the components of Jitsi's WebRTC implementation and decided to modify some of them to suit and build our use cases. In addition to the WebRTC part, there were multiple other pieces built from scratch at Myntra; we will discuss those in the next article.
That concludes this article on WebRTC. In the next one, we will focus on the live stream architecture at Myntra.