C++ and WebRTC: Video and Audio to the Browser

Chris
Mar 10, 2022

The goal of this post is to describe some of the issues I encountered implementing WebRTC video and audio streaming to browsers in Monocle Security’s CCTV software.

Receiving the video and audio data

The simplest source of video and audio data for this example is RTSP, because it gives us an SDP that we can later compare directly with the browser's SDP. The same ideas apply when streaming from any other source, whether a file, another WebRTC server, a USB camera or anything else. I'll be focusing on H264 because the WebRTC standard guarantees support for it, which opens up the possibility of avoiding transcoding. The SDP in the response to the DESCRIBE request will contain something similar to this:

a=rtpmap:96 H264/90000
a=fmtp:96 profile-level-id=420029;packetization-mode=1;sprop-parameter-sets=Z01AM42NQB4AIf/gLcBAQFAAAD6AAAnEDoYAD0/rvLjQwAHp/XeXCg==,aO44gA==

The values that are relevant for our purposes here are:

profile-level-id: A base16 value split into three single-byte parts: the profile (Baseline, Main or High), the constraint flags and the level (parsed in the sketch after this list).

packetization-mode: Tells us how the encoder's NAL units will be packed into RTP packets; mode 1 (non-interleaved) is the most common.

sprop-parameter-sets: A comma-separated list of base64-encoded binary blobs, commonly known as the SPS and PPS, which carry detailed information about the encoded stream.
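
As a concrete illustration, profile-level-id is just three bytes packed into a hex string, so splitting it is straightforward. A minimal sketch (the struct and function names here are my own):

#include <cstdint>
#include <string>

// Splits profile-level-id into its three bytes. The example value
// "420029" yields profile_idc 0x42 (Baseline), no constraint flags
// and level_idc 0x29 (41 decimal, i.e. level 4.1).
struct ProfileLevelId {
  uint8_t profile_idc;       // 0x42 Baseline, 0x4d Main, 0x64 High
  uint8_t constraint_flags;
  uint8_t level_idc;
};

ProfileLevelId ParseProfileLevelId(const std::string& hex) {
  const unsigned long value = std::stoul(hex, nullptr, 16);
  return ProfileLevelId{static_cast<uint8_t>((value >> 16) & 0xff),
                        static_cast<uint8_t>((value >> 8) & 0xff),
                        static_cast<uint8_t>(value & 0xff)};
}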

Once the RTSP/RTP stream is set up, the data flows as RTP packets carrying H264 NAL units, which need to be depacketized into discrete frames.
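
With packetization-mode=1, a NAL unit can arrive in a single RTP packet, aggregated with others (STAP-A) or split across several packets as FU-A fragments that must be reassembled. A minimal sketch of the FU-A case, assuming the RTP header has already been stripped and packets arrive in order:

#include <cstddef>
#include <cstdint>
#include <vector>

// Reassembles an H264 NAL unit from FU-A fragments (RFC 6184).
// `payload` is the RTP payload with the RTP header already removed;
// `nal` accumulates the NAL unit across calls.
void HandleFuA(const uint8_t* payload, size_t size, std::vector<uint8_t>& nal) {
  if (size < 2) return;
  const uint8_t indicator = payload[0];  // F and NRI bits, type 28 = FU-A
  const uint8_t fu_header = payload[1];  // S and E bits, original NAL type
  if (fu_header & 0x80) {  // S bit: first fragment
    nal.clear();
    // Rebuild the original NAL header from the indicator's F/NRI bits
    // and the FU header's type bits.
    nal.push_back((indicator & 0xe0) | (fu_header & 0x1f));
  }
  nal.insert(nal.end(), payload + 2, payload + size);
  if (fu_header & 0x40) {  // E bit: last fragment
    // `nal` now holds a complete NAL unit ready to be handed on.
  }
}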

The peer connection

Currently the creation of WebRTC peer connections in native code is not defined by any standard, though this may change as more standards are introduced. There are many C++ libraries offering WebRTC functionality, but the only one I found that gave enough control for Monocle's purposes was libdatachannel. The main alternative is Google's WebRTC framework, which negotiates the SDP itself and forces the user to transcode the data, increasing latency and CPU load. There have been efforts to change this, but none have made it into the main branch as far as I am aware.

One of the requirements for initialising a WebRTC connection is a STUN server, and optionally TURN servers, which help determine the best route for traffic to the browser. There are publicly available servers that you can use, but you can also run your own.
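
With libdatachannel, that initialisation looks roughly like the sketch below. The STUN URL is just a public example server, and the exact API names may differ between library versions:

#include <memory>
#include "rtc/rtc.hpp"

std::shared_ptr<rtc::PeerConnection> CreatePeerConnection() {
  rtc::Configuration config;
  config.iceServers.emplace_back("stun:stun.l.google.com:19302");

  auto pc = std::make_shared<rtc::PeerConnection>(config);
  pc->onLocalDescription([](rtc::Description sdp) {
    // Deliver the SDP answer to the browser over your signalling channel.
  });
  pc->onLocalCandidate([](rtc::Candidate candidate) {
    // Deliver trickled ICE candidates to the browser the same way.
  });
  return pc;
}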

Browser SDP negotiation

Either the browser or the server can initiate the SDP negotiation. Monocle chose the browser as the “sender” because it allows the server, the “receiver”, to make the final choice of which format to use for video and audio. The browser will send a set of media descriptions, each representing a format; this should contain a variety of H264 and VP9 video formats at a minimum, as well as a suite of audio formats. The response should then contain a corresponding media description for each entry, enabling or disabling it as desired.

Most browsers will offer multiple H264 media descriptions, so we need to choose one. The simplest approach is to find the entries whose packetization mode matches our source, then pick the one with the highest profile and level; a more accurate choice can be made by matching the profile and level we received in the SDP of the RTSP stream. For audio, we should also try to match the original source format against the SDP; if we can't, we can either disable the audio or choose a format to transcode to. A sketch of the simple video selection follows.
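
Here the OfferedFormat struct is hypothetical; fill it in from the parsed a=rtpmap and a=fmtp lines of each media description in the browser's offer:

#include <cstdint>
#include <optional>
#include <vector>

struct OfferedFormat {
  int payload_type;            // from a=rtpmap
  int packetization_mode;      // from a=fmtp packetization-mode
  uint32_t profile_level_id;   // from a=fmtp profile-level-id, e.g. 0x420029
};

std::optional<OfferedFormat> ChooseH264(const std::vector<OfferedFormat>& offered,
                                        int source_packetization_mode) {
  std::optional<OfferedFormat> best;
  for (const auto& format : offered) {
    // Only consider entries whose packetization mode matches the source.
    if (format.packetization_mode != source_packetization_mode) continue;
    // Treat the 24-bit profile-level-id as a rough ordering: the profile
    // byte is most significant, the level byte least significant.
    if (!best || format.profile_level_id > best->profile_level_id)
      best = format;
  }
  return best;  // empty if no usable H264 entry was offered
}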

The SDP we send back also needs to include ICE candidates, the reachable addresses discovered with the help of the STUN server(s), and a DTLS fingerprint. The browser later uses the fingerprint to establish the DTLS connection, which provides the key exchange that allows it to decrypt the video and audio data we send via SRTP.
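
In the answer these appear as attributes similar to the following (illustrative values only):

a=candidate:1 1 UDP 2122317823 192.0.2.10 49152 typ host
a=fingerprint:sha-256 7B:8B:F0:65:5F:78:E2:51:3B:AC:6F:F3:3F:46:1B:35:DC:B8:5F:64:1A:24:C2:43:F0:A1:58:D0:A1:2C:19:08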

Sending the data

If the server was unable to negotiate a directly compatible format for video or audio, it will be necessary to decode the original stream and encode it into a suitable format; I found this more likely to occur with audio, but your browser may differ. Now you can repacketize the H264 data into RTP NAL format, encrypt it with AES and send it to the browser as SRTP. libdatachannel will do a lot of this for you, but you will need to fully depacketize the data into the format it expects. Another reason for depacketizing and parsing the H264 data is that it may be necessary to prefix key frames with the SPS and PPS collected from the original sprop-parameter-sets. There is very little harm in doing this, as they are small, and it makes the stream decode more reliably.
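
A sketch of that key frame prefixing, assuming the track was created with libdatachannel's H264 packetizer configured for 4-byte start codes, and that sps and pps hold the blobs decoded from sprop-parameter-sets:

#include <cstddef>
#include <vector>
#include "rtc/rtc.hpp"

// `frame` is a single depacketized NAL unit without a start code.
void SendFrame(rtc::Track& track, const std::vector<std::byte>& frame,
               bool key_frame, const std::vector<std::byte>& sps,
               const std::vector<std::byte>& pps) {
  static const std::byte kStartCode[4] = {std::byte{0}, std::byte{0},
                                          std::byte{0}, std::byte{1}};
  std::vector<std::byte> out;
  if (key_frame) {
    // Repeating the SPS/PPS ahead of every key frame is cheap and lets
    // viewers who join mid-stream start decoding immediately.
    out.insert(out.end(), kStartCode, kStartCode + 4);
    out.insert(out.end(), sps.begin(), sps.end());
    out.insert(out.end(), kStartCode, kStartCode + 4);
    out.insert(out.end(), pps.begin(), pps.end());
  }
  out.insert(out.end(), kStartCode, kStartCode + 4);
  out.insert(out.end(), frame.begin(), frame.end());
  track.send(out.data(), out.size());
}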

One of the difficulties of debugging WebRTC is that if something is set up incorrectly you are likely to get no output whatsoever, and no useful feedback about what the problem might be. One tool I found helpful here is Chrome's webrtc-internals page (chrome://webrtc-internals), which gives you lots of information and may help you decipher what is causing your issue.

Conclusion

WebRTC is an excellent method for sending video and audio data because it is currently the only way to deliver genuinely low-latency media to the browser. It provides a method, albeit a very complicated one, for forwarding data received over RTSP and other sources without any major repackaging into files, processor-intensive transcoding or loss of quality.

If you have any additional questions, please don’t hesitate to contact me on any of the platforms or places below.

Discord Patreon Reddit Monocle Github Twitter
