End-to-end (E2E) encryption is a bit of a buzzword these days. Everyone wants it and every company is jumping into the ring to claim that they have it. It makes sense. Who doesn’t want a completely unhackable application? However, end-to-end encryption (especially for media within a browser) is extremely new and some limitations are often glossed over.
What is End-to-End Encryption
End-to-end (E2E) encryption in video conferencing is a way to secure data that prevents third parties or intermediary servers (SFUs, TURN Servers, Gateway, etc.) from accessing or tampering with it at every hop along the media pipeline.
One easy way to think of true E2E encryption is as if all video data from the time it is captured by the camera to the time it is displayed on a screen is double encrypted. The content is encrypted first at the application layer and it is encrypted a second time at the network layer. The video platform provider generally takes care of the network layer, but the application developer is responsible for encrypting the application layer.
Due to its very nature, true end-to-end encryption can never be wholly supported out of the box in an SDK. To fully protect the data between the camera and all intermediary servers, the application developer has to be the one to encrypt the data at the application layer before it is even sent to the server and also decrypt it at an appropriate moment before display on the viewer’s screen.
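The two layers described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `keystream_xor` is a toy SHA-256-based stream cipher standing in for a vetted cipher such as AES-GCM, and `seal`, `unseal`, `app_key`, and `hop_key` are hypothetical names that do not come from any SDK. `hop_key` stands in for the transport keys a client shares with the server.

```python
import hashlib
import os


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher for illustration only (NOT for production use).

    XORs data with SHA-256(key | nonce | counter) blocks. Applying it twice
    with the same key and nonce returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def seal(key: bytes, data: bytes) -> bytes:
    """Encrypt data under key, prefixing a fresh random 12-byte nonce."""
    nonce = os.urandom(12)
    return nonce + keystream_xor(key, nonce, data)


def unseal(key: bytes, blob: bytes) -> bytes:
    """Reverse seal(): split off the nonce and decrypt the rest."""
    return keystream_xor(key, blob[:12], blob[12:])


app_key = os.urandom(32)  # assumption: known only to the participants
hop_key = os.urandom(32)  # assumption: transport key shared with the server

frame = b"encoded video frame"

# Sender: application layer first, then the network layer on top.
on_the_wire = seal(hop_key, seal(app_key, frame))

# Intermediary server: can remove only the network layer. (A real server
# would then re-encrypt with the next hop's transport key.)
seen_by_server = unseal(hop_key, on_the_wire)

# Viewer: removes the application layer just before display.
shown_to_viewer = unseal(app_key, seen_by_server)
```

In this sketch the server only ever holds `hop_key`, so what it sees after removing the network layer is still ciphertext under `app_key`.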
Network Encryption — The First Layer
Even without securing the application layer, WebRTC encryption at the network layer is very secure. Encryption is a mandatory part of the WebRTC security architecture and is enforced on all aspects of establishing and maintaining a connection.
Here are three different endpoints that are exposed to the public internet that can be secured:
- The Gateway
- Administration Console and REST API
- The Media Server
Application Encryption — The Second Layer
For media to be considered end-to-end encrypted, the application layer must also be encrypted. This means that you must encrypt the messages yourself before sending them to the WebRTC server. How this is accomplished depends on the type of media being sent and on whether you are using the native stack or a web browser.
Native Stack vs. Browser
When discussing E2E encryption it is important to differentiate between E2E encryption for native apps and E2E encryption for web apps. When you see companies saying that they support E2E encryption, they are generally referring to E2E encryption in a native app.
As of today, performing true E2E encryption in browsers is still wholly dependent on the browser vendor’s stack and hence cannot be done securely, with one exception. In May 2020, Google added an experimental API behind a flag in Chrome that could pave the way for broad E2E encryption support if adopted by other major browser vendors.
End-to-End Encryption — in the Native Stack
To have E2E encryption in your native stack, you need to encrypt at the application layer. How this is done depends on what type of message is being sent. Streaming media such as audio, video, and data channel traffic needs to be encrypted before it passes through the Media Server, and chat messages need to be encrypted before they pass through the Gateway.
End-to-End Encryption of Messages and Chat in the Native Stack
E2E encryption of text-based messages is fairly straightforward. For any application layer message to be E2E encrypted, you must first have a way of generating keys and sharing them with the participants out of band, that is, outside the conferencing environment itself. There are several different ways to do that.
The most straightforward method is to encrypt and decrypt every message using the same key. This allows anyone with the key to decrypt messages coming from anyone in the conference. Another method is to give each participant a unique key.
Regardless of how you choose to manage your keys, to be considered end-to-end encrypted the messages must be encrypted on the application side first before you call the API to send the message to the Gateway.
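As a rough sketch of the per-participant key approach, the Python below encrypts a chat message into a JSON envelope before it would be handed to the Gateway. Everything here is an assumption made for illustration: the envelope fields, the `encrypt_message` and `decrypt_message` helpers, and the toy SHA-256 keystream with an HMAC tag, which stands in for a vetted authenticated cipher such as AES-GCM. The call that actually sends the envelope is left out because it depends on your SDK.

```python
import base64
import hashlib
import hmac
import json
import os


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher for illustration only (NOT for production use)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def _subkeys(key: bytes) -> tuple[bytes, bytes]:
    """Derive separate encryption and authentication keys from one secret."""
    enc_key = hmac.new(key, b"enc", hashlib.sha256).digest()
    mac_key = hmac.new(key, b"mac", hashlib.sha256).digest()
    return enc_key, mac_key


def _b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


def encrypt_message(sender_id: str, key: bytes, text: str) -> str:
    """Encrypt text under the sender's key and return a JSON envelope."""
    enc_key, mac_key = _subkeys(key)
    nonce = os.urandom(12)
    ciphertext = keystream_xor(enc_key, nonce, text.encode("utf-8"))
    tag = hmac.new(mac_key, sender_id.encode("utf-8") + nonce + ciphertext,
                   hashlib.sha256).digest()
    return json.dumps({"sender": sender_id, "nonce": _b64(nonce),
                       "ciphertext": _b64(ciphertext), "tag": _b64(tag)})


def decrypt_message(keys: dict[str, bytes], envelope: str) -> str:
    """Look up the sender's key, verify the tag, then decrypt."""
    fields = json.loads(envelope)
    enc_key, mac_key = _subkeys(keys[fields["sender"]])
    nonce = base64.b64decode(fields["nonce"])
    ciphertext = base64.b64decode(fields["ciphertext"])
    expected = hmac.new(mac_key, fields["sender"].encode("utf-8") + nonce + ciphertext,
                        hashlib.sha256).digest()
    if not hmac.compare_digest(expected, base64.b64decode(fields["tag"])):
        raise ValueError("message failed authentication")
    return keystream_xor(enc_key, nonce, ciphertext).decode("utf-8")


# Keys are assumed to have been shared out of band beforehand.
participant_keys = {"alice": os.urandom(32), "bob": os.urandom(32)}
envelope = encrypt_message("alice", participant_keys["alice"], "hello room")
# Only now would the envelope be passed to the SDK call that sends it.
```

With a single shared key, the `participant_keys` lookup collapses to one entry that every participant uses. The per-participant variant lets receivers tie each message to a specific sender's key.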
End-to-End Encryption of Data Channels in the Native Stack
Encrypting data channels at the application layer is very similar to encrypting text-based messages. The main difference is that these messages are typically sent through a Media Server data channel rather than through the Gateway.
End-to-End Encryption of Audio in the Native Stack
If you want to end-to-end encrypt your audio streams, you first need access to each audio frame or packet after compression so that you can encrypt it before it is sent to the Media Server.
End-to-End Encryption of Video in the Native Stack
End-to-end video chat encryption is more complicated than audio. This is because compressed video frames often depend on other compressed video frames, while compressed audio frames can be decoded independently of each other.
The keyframe in a video stream acts as the foundation for subsequent frames. It is sent out first and is followed by “delta” frames that contain compressed information about the changes since the previous frame.
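Intermediary servers typically need to tell keyframes from delta frames, for example so that they can forward a keyframe to a participant who has just joined, and they have to do so without decrypting the payload. To make this possible, the first few bytes of each frame need to be left unencrypted. The exact number of bytes depends on which codec is being used (VP8, VP9, H.264). For example, with VP8, we need to leave 3–10 bytes unencrypted because those are the bytes that make up the frame header.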
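A minimal sketch of header-preserving encryption for VP8 might look like the following. It assumes the frame header layout from RFC 6386, in which the lowest bit of the first byte is 0 for a keyframe, and it assumes a split of 10 clear bytes for keyframes and 3 for delta frames. The helper names are hypothetical, and `keystream_xor` is again a toy cipher standing in for a vetted one such as AES-GCM.

```python
import hashlib
import os

NONCE_LEN = 12


def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher for illustration only (NOT for production use)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + nonce + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


def vp8_clear_prefix_len(frame: bytes) -> int:
    """Number of leading bytes to leave unencrypted (assumed 10 or 3)."""
    if not frame:
        raise ValueError("empty frame")
    is_keyframe = (frame[0] & 0x01) == 0  # RFC 6386: 0 means key frame
    return min(10 if is_keyframe else 3, len(frame))


def encrypt_frame(key: bytes, frame: bytes) -> bytes:
    """Keep the header in the clear, encrypt the rest, append the nonce."""
    n = vp8_clear_prefix_len(frame)
    nonce = os.urandom(NONCE_LEN)
    return frame[:n] + keystream_xor(key, nonce, frame[n:]) + nonce


def decrypt_frame(key: bytes, blob: bytes) -> bytes:
    """Strip the trailing nonce and decrypt everything after the header."""
    body, nonce = blob[:-NONCE_LEN], blob[-NONCE_LEN:]
    n = vp8_clear_prefix_len(body)
    return body[:n] + keystream_xor(key, nonce, body[n:])
```

Because the first byte is never encrypted, the receiver can recompute the same prefix length from the encrypted frame, and a server can still read the frame type without holding the key.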
End-to-End Encryption within a Browser — The Current State of Affairs
While end-to-end encryption in the native stack has been around for a while, E2E encryption in a browser is a completely different story. Historically, web browsers have not provided developers any means of modifying audio or video before sending it or playing it back, so E2E encryption in a browser was impossible.
Over the last few years, several IETF drafts have tried to create a standard to tackle the end-to-end encryption problem.
The most promising were:
- Privacy Enhanced RTP Conferencing (PERC) — Draft has expired
- PERC Lite
- Frame Marking — Google Chrome has discontinued support
However, none of the proposed methods gained the traction or the browser buy-in that they needed to be usable in the real world. Support for these drafts has now been discontinued. This left everyone needing E2E encryption in the browser with no options.
RTCRtpScriptTransform — a Promising but New Model
This may change soon with RTCRtpScriptTransform, a new WebRTC feature that, as of April 2021, exists only as a draft specification.
RTCRtpScriptTransform seeks to solve the problem of E2E encryption within a browser by allowing application-level code to access and manipulate the underlying RTP stream data in an RTCPeerConnection. The exposed data packets are encoded but not yet encrypted by the network layer, which makes it an ideal time to apply an extra layer of encryption using keys known only to the end-users. The encryption performed here can be totally custom and uniquely tailored to specific codecs, media servers, or security requirements.
Sounds promising, right? It is, but the specification is still an early draft and has yet to be implemented in any major browser. Google Chrome includes an experimental feature similar to RTCRtpScriptTransform called WebRTC Insertable Streams, but that API is already deprecated in favor of the official W3C draft.
Wrap it Up
So where does that leave us? End-to-end encryption is a complicated topic. While E2E encryption in the native stack is well developed and has been possible for years, E2E encryption in the browser is brand new and at the early stages of experimentation. There is no company out there today that can honestly claim to have full support for E2E encryption in the browser.
The industry is just now on the cusp of E2E encryption in the browser. It is an exciting time and companies are racing to add it to their product lineups. However, this environment is changing fast, and video platform providers will have to change with it.