From Zero to Hero with WebRTC in JavaScript and Python in small snippets of code. Part 1
WebRTC is a great technology. I started playing with it on one of my personal projects and became more and more amazed by it as time went by.
But I felt it had a steep learning curve. There are tons of tutorials, all with different approaches. Some were full demos that required hours to understand, others were just incomplete snippets of code, and most of them had a lot of noise like HTML code, authentication, room ids, WebSockets, and more.
In this article, I will attempt to share a step-by-step understanding of the JavaScript API in snippets of code no longer than 50 lines.
What is WebRTC and why would you use it
WebRTC is an open-source project providing peer-to-peer, real-time communication capabilities to web browsers, mobile devices, and any other device that can run code using the available APIs for JavaScript, Python, .NET, Golang, and more.
Whether you want to build a chat or video conferencing application, or you want to build distributed systems that communicate efficiently without putting a huge strain on a central server, WebRTC is the right tool for you.
How it works. Short story
There are tons of materials on the internet with more detailed information about this, so I will only try to clarify some concepts that confused me when I began.
Peer-to-peer communication is possible, but rarely out of the box, because of NATs, private networks, and all that good stuff that makes IPv4 still usable. To handle this, three main types of servers are involved in the whole architecture of WebRTC:
- STUN server. In order for two peers to exchange packets directly, they first have to find out how they can be contacted directly. This information is gathered through the use of a STUN server. There are public and free STUN servers, and optionally you can configure one of your own. For this article, it is not going to be necessary, because our packets will not leave the local network.
Thus, this server is used only at the beginning of the connection, or during resets.
- Signaling Server. I would say this is the most important one. It allows two peers to exchange the information about how they can connect directly. There is no specification about how it should be implemented. It requires coding, and it can use Firebase, or it can be a Django app, it can run on WebSockets or just HTTP. In this article, we will act as the signaling server by copy-pasting strings.
This is used at the beginning of the connection, or for updating an already established one.
- TURN server. If two peers can’t exchange packets directly, the packets will have to be relayed through a TURN server. This is again something that requires just configuration.
If this is necessary, it will be used throughout the whole communication session and it will require bandwidth.
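Since STUN and TURN are "just configuration", it may help to see where that configuration goes: both are passed to the RTCPeerConnection constructor through the iceServers option, while the signaling server never appears there because it is your own code. Below is a minimal sketch. The helper name buildRtcConfiguration, the example.org addresses, and the credentials are placeholders I made up for illustration.

```javascript
// Builds the configuration object accepted by `new RTCPeerConnection(config)`.
// All server addresses and credentials used below are hypothetical placeholders.
function buildRtcConfiguration({ stunUrls = [], turnServer = null } = {}) {
  const iceServers = [];

  // STUN: only used to discover how this peer can be reached.
  if (stunUrls.length > 0) {
    iceServers.push({ urls: stunUrls });
  }

  // TURN: relays the actual traffic when no direct path exists,
  // so it normally requires credentials.
  if (turnServer) {
    iceServers.push({
      urls: turnServer.urls,
      username: turnServer.username,
      credential: turnServer.credential,
    });
  }

  return { iceServers };
}

const configuration = buildRtcConfiguration({
  stunUrls: ['stun:stun.example.org:3478'],
  turnServer: {
    urls: 'turn:turn.example.org:3478',
    username: 'demo-user',
    credential: 'demo-password',
  },
});
// In a browser: const peerConnection = new RTCPeerConnection(configuration);
```

For the local-network demo in this article, calling the constructor without any configuration is enough.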
Connection setup with YOU as signaling server
The nature of the WebRTC connection setup is asymmetric. For a few examples, we will use two branches of code, one for the caller peer, and one for the callee peer. Later on, we will upgrade the code to be identical for both of them.
In this initial demo, we will act as the signaling server by copy-pasting the information that needs to be exchanged between the two peers in order to connect them. To keep things simple I didn’t introduce any handlers for messages.
API
The WebRTC API is a bit verbose. What feels like it could be covered by one method is actually divided into two or three operations.
I will give a short overview of the API. Don’t waste too much time understanding it. Just come back here after you see the code. I won’t include every possible definition as these can be quickly googled up.
Every peer has an instance of RTCPeerConnection.
Every instance of RTCPeerConnection has a localDescription and a remoteDescription of type RTCSessionDescription (each is set from a plain RTCSessionDescriptionInit object).
A session description can be of type offer or answer, depending on who created it, the caller or the callee. It also contains a Session Description Protocol (SDP) string that carries information such as the RTCIceCandidates and the channels (such as text, media).
Here would be a class diagram for the above description.
NOTE! This is not the official class diagram, and the property type does not exist. It’s just something I made up to make the explanation easier.
Again, a localDescription contains info about what type of data the current peer can exchange (channels such as text, media), and how it can be reached (ICECandidates).
A remoteDescription contains info about what type of data the other peer can exchange and how it can be reached.
A localDescription can be of type offer if the current peer is the caller, and it can be of type answer if the current peer is the callee.
Similarly, a remoteDescription can be of type offer if the current peer is the callee, and it can be of type answer if the current peer is the caller.
Here is an object diagram for what I described above.
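The same role-to-description mapping can be written down as a tiny lookup. This is only an illustrative sketch; the function descriptionTypesFor and the role strings 'caller' and 'callee' are names I made up and are not part of the WebRTC API.

```javascript
// Returns which kind of description ends up in each slot for a given role.
// The role names are hypothetical labels used only for this illustration.
function descriptionTypesFor(role) {
  if (role === 'caller') {
    // The caller creates the offer and later receives the answer.
    return { localDescription: 'offer', remoteDescription: 'answer' };
  }
  if (role === 'callee') {
    // The callee receives the offer and creates the answer.
    return { localDescription: 'answer', remoteDescription: 'offer' };
  }
  throw new Error(`Unknown role: ${role}`);
}
```

Notice that one peer's localDescription is always the other peer's remoteDescription.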
Caller peer
1. The first step of the caller peer, in initializeBeforeCreatingOffer, is to create an instance of RTCPeerConnection, which handles much of the underlying protocol, and to add to it an instance of RTCDataChannel, which handles message sending.
It’s necessary to add at least one channel (in this case, a text channel); otherwise, exchanging the descriptions won’t establish any connection.
We will also add a handler for the connection state and log it in the console to monitor what happens.
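As a rough illustration of this first step, here is a minimal sketch of what a function like initializeBeforeCreatingOffer might look like. It is my own reconstruction under stated assumptions, not necessarily the code from the repository: the channel label 'chat' is an arbitrary name, and the function is meant to run in a browser, where RTCPeerConnection is a global.

```javascript
// Sketch of the caller's first step (browser environment assumed).
function initializeBeforeCreatingOffer() {
  // No iceServers are configured because, in this article,
  // the packets do not leave the local network.
  const peerConnection = new RTCPeerConnection();

  // At least one channel must exist before the offer is created.
  // The label 'chat' is an arbitrary, hypothetical name.
  const dataChannel = peerConnection.createDataChannel('chat');

  // Log every connection state change (for example 'connecting', 'connected').
  peerConnection.addEventListener('connectionstatechange', () => {
    console.log('connection state:', peerConnection.connectionState);
  });

  return { peerConnection, dataChannel };
}
```

Returning both objects keeps them reachable from the console for the next steps.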
2. The piece of information that will be sent to the other peer in order to connect with it is called the offer Session Description Protocol (SDP). It contains information about the data channels that will be used, and how the peer can be reached.
First, we create an offer, and we need to assign it to the connection itself using setLocalDescription. After this method is called, ICE candidates are gathered asynchronously. They specify how the peer can be reached, and we want to wait for them to be gathered so that the SDP of localDescription contains them. That will be the offer.
Since we take the role of the signaling server, the script will log the SDP to the console, and we will copy-paste it to the callee peer.
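The create, assign, and wait sequence described in this step can be sketched as follows. The function names waitForIceGatheringComplete and createOfferWithCandidates are my own; the sketch assumes that waiting for iceGatheringState to become 'complete' is an acceptable way to know that all candidates are in the SDP, which is only one of several possible approaches.

```javascript
// Resolves once the connection reports that ICE gathering has finished.
function waitForIceGatheringComplete(peerConnection) {
  return new Promise((resolve) => {
    if (peerConnection.iceGatheringState === 'complete') {
      resolve();
      return;
    }
    const onStateChange = () => {
      if (peerConnection.iceGatheringState === 'complete') {
        peerConnection.removeEventListener('icegatheringstatechange', onStateChange);
        resolve();
      }
    };
    peerConnection.addEventListener('icegatheringstatechange', onStateChange);
  });
}

// Creates the offer, stores it, and returns the description
// only after the ICE candidates have been added to its SDP.
async function createOfferWithCandidates(peerConnection) {
  const offer = await peerConnection.createOffer();
  // Setting the local description is what starts the ICE gathering.
  await peerConnection.setLocalDescription(offer);
  await waitForIceGatheringComplete(peerConnection);
  // Read localDescription again: it now includes the gathered candidates.
  return peerConnection.localDescription;
}
```

In the browser console, the returned description's sdp property is the string that would be copied over to the callee.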
3. The last step of the caller peer is to wait for an answer and, once it arrives, save it using setRemoteDescription.
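A minimal sketch of this last step, assuming the answer arrives as a raw SDP string. The function name acceptAnswer and the state check are my additions: an answer can only be applied while the connection is in the 'have-local-offer' signaling state, so checking it first gives a clearer error when the steps are run out of order.

```javascript
// Saves the callee's answer on the caller's connection.
async function acceptAnswer(peerConnection, answerSdp) {
  // An answer only makes sense after our own offer was set locally.
  if (peerConnection.signalingState !== 'have-local-offer') {
    throw new Error(
      `Cannot accept an answer in signaling state "${peerConnection.signalingState}"`
    );
  }
  await peerConnection.setRemoteDescription({ type: 'answer', sdp: answerSdp });
}
```

Once this promise resolves, the connection state handler should start logging the transition towards 'connected'.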
Here are the helper functions that are used in both peers.
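As one hedged illustration of what shared helpers of this kind might do (these are not necessarily the helpers from the repository, and the names encodeDescription and decodeDescription are hypothetical): an SDP string spans many lines, and line breaks can get lost when text is pasted into a single-line prompt, so it can be convenient to wrap the description in JSON before copying it around.

```javascript
// Turns a session description into a single-line string that is safe to copy-paste.
function encodeDescription(description) {
  // JSON escapes the line breaks inside the SDP, so the result has no raw newlines.
  return JSON.stringify({ type: description.type, sdp: description.sdp });
}

// Parses a pasted string back into an object usable with setRemoteDescription.
function decodeDescription(pastedText) {
  const parsed = JSON.parse(pastedText.trim());
  if (parsed.type !== 'offer' && parsed.type !== 'answer') {
    throw new Error(`Unexpected description type: ${parsed.type}`);
  }
  if (typeof parsed.sdp !== 'string') {
    throw new Error('The pasted description has no SDP string');
  }
  return parsed;
}
```

With helpers like these, one peer would log encodeDescription(peerConnection.localDescription) and the other would pass the pasted text through decodeDescription before calling setRemoteDescription.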
Callee peer
- As the callee peer, in initializeBeforeReceivingOffer we don’t have to declare any DataChannel. We will receive an offer containing one.
- We receive the offer by copy-pasting it in the prompt and saving it using setRemoteDescription.
- We create an answer and save it using setLocalDescription. Again, after calling this setter, the ICECandidates are gathered asynchronously, and we need to wait for them so that the localDescription SDP will contain them.
Next, we can copy-paste the answer in the caller peer.
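The callee steps above can be sketched as follows. This is an illustrative reconstruction, not necessarily the repository code: createAnswerForOffer is a name I made up, and this time the sketch waits for the icecandidate event whose candidate is null, which browsers fire when gathering has finished. It assumes a browser, where RTCPeerConnection is a global.

```javascript
// Sketch of the callee's setup: no data channel is declared here,
// because the offer already describes one.
function initializeBeforeReceivingOffer() {
  return new RTCPeerConnection();
}

// Saves the pasted offer, creates the answer, and returns the local
// description once its SDP contains the gathered ICE candidates.
async function createAnswerForOffer(peerConnection, offerSdp) {
  await peerConnection.setRemoteDescription({ type: 'offer', sdp: offerSdp });

  // Register the listener before setLocalDescription so that the
  // end-of-gathering signal cannot be missed.
  const gatheringDone = new Promise((resolve) => {
    const onCandidate = (event) => {
      // A null candidate marks the end of ICE gathering.
      if (event.candidate === null) {
        peerConnection.removeEventListener('icecandidate', onCandidate);
        resolve();
      }
    };
    peerConnection.addEventListener('icecandidate', onCandidate);
  });

  const answer = await peerConnection.createAnswer();
  await peerConnection.setLocalDescription(answer);
  await gatheringDone;
  return peerConnection.localDescription;
}
```

The sdp property of the returned description is the string to copy back to the caller peer.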
If things were not entirely clear by now, maybe a sequence diagram of the whole process can help. I attached the link below to the “Handbook of SDP for Multimedia Session Negotiations”. It includes other steps such as adding video tracks, but other than that it represents almost perfectly the process above.
Conclusions
In this tutorial, I have shared with you the most basic setup of WebRTC with asymmetric code, directly in the browser, without HTML elements, and without much interaction.
Here is the directory in my GitHub repository where the code for this article resides.
In the next tutorials, I will explain alternative approaches to connection setup that you may find on the internet.