Jakub Leszczyński
Nov 10, 2020 · 9 min read

Peer-to-peer video-calling app with WebRTC in under 15 minutes

Introduction

Creating video-calling apps can be very challenging, especially when starting out. There are many things that you need to understand in order to make everything work as intended and the amount of information can be overwhelming. I will try my best to explain the process and all of the important points and steps necessary to create a fully functional P2P video-calling app in your browser from scratch.

If you don’t feel like following the steps, you can download the complete app here: https://github.com/jakub-leszczynski/video-calling-app-example

Let’s start by understanding the foundations of our app.

What is WebRTC?

WebRTC (Web Real-Time Communication) is an open-source project that allows video, audio, and generic data communication between peers in real time. There are many potential uses of this technology, from simple voice recorders to screen sharing and video-calling applications. The WebRTC team has gathered a collection of samples over the years. We will use the WebRTC API to capture the microphone and camera of one peer and send the resulting stream over the internet to another peer in our P2P communication.

How does it work?

Let’s take a look at the API we will be using.

RTCPeerConnection: the starting point of any connection. It provides an interface for connecting a local peer to a remote one. After the connection is established, it is used to maintain and monitor the connection. It’s also responsible for closing it.

MediaDevices.getUserMedia(): an asynchronous method that requests access to the user’s media devices. It allows us to capture a microphone and camera, and it prompts the user for permission to use those devices. Keep in mind that it’s not strictly part of WebRTC; it belongs to the Media Capture and Streams API.

RTCSessionDescription: both peers need their local and remote descriptions set in order for the connection to be established. A session description consists of a type and the SDP descriptor of the session.

RTCIceCandidate: last but not least, a class that represents ICE candidates. ICE (Interactive Connectivity Establishment) is a technique that focuses on establishing peer-to-peer connections in the most efficient way. In short, to establish a connection, both peers need to agree on protocols and routing. When either one of them finds a potential candidate, we have to send it over the signaling server to the other peer. Peers send candidates back and forth to finally agree on one. Note that even after the peers have established the connection, ICE candidates can still be exchanged in order to find a better route.

Signaling transaction flow

MDN prepared a diagram that perfectly shows the basic signaling flow. Note that the signaling server is supposed to be handled by us. In our example, it will be an Express.js application with the help of Socket.io.

This diagram includes all of the necessary steps that need to be fulfilled in order to establish a P2P connection on a video-calling app. Note that this is just an example and the flow might differ if steps like authentication need to be included.

Another thing to mention is that the frontend side is split into two columns: Web App and Web Browser. What does it mean for us? Everything in the Web App layer has to be handled by us, meaning that we have to write the code that takes care of it. Everything in the Web Browser layer happens under the hood and is handled by the browser. ICE candidates are given to us via the onicecandidate event handler attached to our peers.

ICE candidates start to be sent out once both peers create their SDP descriptions and send them over the signaling server.

ICE candidates exchange process

Let’s take a look at another diagram.

It shows how sending and receiving ICE candidates should be handled. Keep in mind that each peer receives its ICE candidates from the Web Browser, and we should send them to the other peer. If the other peer accepts a candidate, the connection can be established.

Enough theory, let’s write some code

First of all, we need to create a basic server-side application. There is a lot of back-and-forth communication between peers to establish a connection, so we need some kind of messaging system in our application. It doesn’t really matter how you send the messages; even plain XHR requests would work fine. WebRTC isn’t concerned with how you deliver messages to your peers.

The most common choice seems to be WebSocket. WebSockets were created to handle real-time messaging, which fits our signaling server perfectly.

First of all, let’s initialize a project.

npm init 

Now add the following lines to the package.json file.

Those are all of the dependencies we will be using. We can now proceed with our code. Let’s create a server file and fill it with a basic express template that implements WebSocket connections.
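A minimal sketch of such a server file could look like the one below. It assumes Express and Socket.io v4 are installed; the file name, the startServer helper, and the log messages are made up for illustration and are not the original code of the example app.

```javascript
// server.js (hypothetical file name)
// Builds an Express app, wraps it in a plain HTTP server, and attaches Socket.io.
function startServer(port) {
  // Loaded lazily so the function can be defined before the packages are installed.
  const express = require('express');
  const http = require('http');
  const { Server } = require('socket.io'); // Socket.io v4 style import (assumption)

  const app = express();
  const httpServer = http.createServer(app);
  const io = new Server(httpServer);

  io.on('connection', (socket) => {
    console.log(`Client connected: ${socket.id}`);
    socket.on('disconnect', () => console.log(`Client disconnected: ${socket.id}`));
  });

  httpServer.listen(port, () => console.log(`Listening on port ${port}`));
  return { app, httpServer, io };
}
```

Calling startServer(3000) would then start the server on port 3000, which is an arbitrary choice.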

We also need to keep track of connected users in order to reach them directly. At this point, we can also send the list of connected users to our clients.
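One way to sketch this is to keep the socket ids in a Set and broadcast the whole list whenever it changes. The trackUsers helper, the 'update-user-list' event name, and the payload shape are assumptions; io is expected to behave like a Socket.io server instance.

```javascript
// Tracks connected socket ids and broadcasts the list on every change.
function trackUsers(io) {
  const users = new Set();

  io.on('connection', (socket) => {
    users.add(socket.id);
    // 'update-user-list' is a hypothetical event name.
    io.emit('update-user-list', { userIds: [...users] });

    socket.on('disconnect', () => {
      users.delete(socket.id);
      io.emit('update-user-list', { userIds: [...users] });
    });
  });

  return users;
}
```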

Enough of the server-side for now, let’s add all of the HTML code that is necessary for our app.

We don’t need more than that for a basic example. Note that I’ve already included the client library for Socket.io as well as a custom index.js script.

At this point, we also need a way to serve our markup. The simplest solutions are the best, so let our backend server serve our file.
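A small sketch of that idea, assuming the markup lives in a directory called public (a hypothetical name) and that express is the module returned by require('express'):

```javascript
// Registers static file serving on an Express app.
function serveClient(app, express, directory = 'public') {
  // express.static serves index.html for "/" by default.
  app.use(express.static(directory));
  return app;
}
```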

Let’s go ahead and start writing the client-side JavaScript.

We begin by reacting to the updates of the user list. This code is specific to our app and doesn’t require a lot of our attention.
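For illustration, the user list handling could be sketched as below. The function names, the list markup, and the 'update-user-list' payload are assumptions; doc defaults to the browser document so the function can also be exercised outside of a browser.

```javascript
// Returns every connected user except ourselves.
function otherUsers(userIds, selfId) {
  return userIds.filter((id) => id !== selfId);
}

// Rebuilds the list element with one clickable entry per remote user.
// `onSelect` receives the id of the user that was clicked.
function renderUserList(listEl, userIds, selfId, onSelect, doc = globalThis.document) {
  listEl.innerHTML = '';
  for (const id of otherUsers(userIds, selfId)) {
    const item = doc.createElement('li');
    item.textContent = id;
    item.addEventListener('click', () => onSelect(id));
    listEl.appendChild(item);
  }
}
```

On the client this might be wired up with something like socket.on('update-user-list', ({ userIds }) => renderUserList(listEl, userIds, socket.id, selectUser)), where listEl and selectUser are again hypothetical.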

Now, the fun part begins. First of all, we need to create a starting point for any video-calling P2P web application, which is the RTCPeerConnection.

We create an RTCPeerConnection. Later on, we will enhance it by adding video and audio tracks as well as local and remote SDP descriptions. We already covered what ICE candidates are, so with this knowledge you can guess what ICE servers do. We need to provide a list of servers that the ICE layer can use to attempt to establish the best route between the callee and the caller.
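As a rough sketch, creating the peer with a single STUN server could look like this. The createPeer wrapper and its optional constructor parameter are assumptions that only make the snippet easier to try out; in the browser the default RTCPeerConnection is used.

```javascript
// Creates the peer connection with a single STUN server.
// `PeerConnection` defaults to the browser's RTCPeerConnection constructor.
function createPeer(PeerConnection = globalThis.RTCPeerConnection) {
  return new PeerConnection({
    iceServers: [{ urls: 'stun:stun.stunprotocol.org' }],
  });
}
```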

You might have noticed this mysterious string.

stun:stun.stunprotocol.org

What does it mean exactly?

As I already mentioned, we need to establish a connection between two peers. In order to achieve that, we will be using STUN and TURN servers. Let’s step back and understand what they are.

STUN (Session Traversal Utilities for NAT) — a standardized set of methods, including a network protocol, for traversal of network address translator gateways in applications of real-time voice, video, messaging, and other interactive communications.

TURN (Traversal Using Relays around NAT) — a protocol that assists in the traversal of network address translators or firewalls for multimedia applications. It may be used with the Transmission Control Protocol and User Datagram Protocol. It is most useful for clients on networks masqueraded by symmetric NAT devices.

In other words, peers need to find each other over the internet. Ideally, they would connect directly, and STUN servers help with that by letting each peer discover its public address. But sometimes it’s not that simple. A computer might be hidden behind, for example, a restrictive firewall. In those cases, TURN servers work as a relay between the two peers.
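To make the difference concrete, here is a sketch of an ICE server list that contains both kinds of servers. The buildIceServers helper is made up for this example, and the TURN URL and credentials are placeholders rather than a real server.

```javascript
// Builds the iceServers array that is passed to RTCPeerConnection.
function buildIceServers(stunUrl, turn) {
  const servers = [{ urls: stunUrl }];
  if (turn) {
    // TURN servers normally require credentials.
    servers.push({ urls: turn.url, username: turn.username, credential: turn.credential });
  }
  return servers;
}

const iceServers = buildIceServers('stun:stun.stunprotocol.org', {
  url: 'turn:turn.example.org:3478', // hypothetical
  username: 'demo-user',             // hypothetical
  credential: 'demo-password',       // hypothetical
});
```

The result can be passed as new RTCPeerConnection({ iceServers }).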

You can find lists of public STUN/TURN servers on the internet. However, Mozilla recommends using only STUN and TURN servers that you own (at least for production). All you need is a Linux machine and some configuration.

Now, let’s connect to our WebSocket and get user media devices in order to start capturing our audio and video.

This code is executed on both peers. It is also the part where the browser asks for permission to record audio and video. Once we connect to the WebSocket, we take the stream from our media devices and attach it to our local video element and to our peer. The last step is to unlock the call button.
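The steps above can be sketched roughly as follows. The startLocalMedia name and its parameters are assumptions; localVideo stands for a video element and callButton for a button element on the page.

```javascript
// Captures camera and microphone, previews them locally, and attaches the tracks to the peer.
// `mediaDevices` defaults to the browser's navigator.mediaDevices.
async function startLocalMedia(
  peer,
  localVideo,
  callButton,
  mediaDevices = globalThis.navigator?.mediaDevices,
) {
  // This is the call that triggers the permission prompt.
  const stream = await mediaDevices.getUserMedia({ audio: true, video: true });
  localVideo.srcObject = stream;
  stream.getTracks().forEach((track) => peer.addTrack(track, stream));
  callButton.disabled = false; // unlock the call button
  return stream;
}
```

It could be triggered with socket.on('connect', () => startLocalMedia(peer, localVideo, callButton)).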

We start the call by creating an SDP offer, which includes information about the attached tracks, codecs, options supported by the browser, and any candidates gathered by the ICE agent so far. We use this offer to set the local description of the peer. Last but not least, we send the offer to our signaling server. Let’s see how the other peer handles this offer.
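For reference, the caller’s side described above might be sketched like this. The 'mediaOffer' event name and the payload shape are assumptions, and socket is expected to be a connected Socket.io client.

```javascript
// Caller side: create an offer, store it locally, and send it through the signaling server.
async function callUser(peer, socket, calleeId) {
  const offer = await peer.createOffer();
  await peer.setLocalDescription(offer);
  // 'mediaOffer' is a hypothetical event name.
  socket.emit('mediaOffer', { offer, from: socket.id, to: calleeId });
}
```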

Keep in mind that we have already initialized the peer and added tracks to it upon socket connection. The first thing that we need to do when responding to a call is to set the remote description for our peer. This way both peers gather information about each other. Now the other peer creates an answer. It’s analogous to the createOffer method but is handled by the receiving peer. Next, we set the local description for the callee and send the answer over the WebSocket to the caller. Let’s go back to our caller.
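A sketch of the callee’s side, under the same assumptions: the event names 'mediaOffer' and 'mediaAnswer' are made up, and the incoming data is expected to carry the offer and the caller’s id in a from field.

```javascript
// Callee side: store the caller's offer, create an answer, and send it back.
async function handleOffer(peer, socket, data) {
  await peer.setRemoteDescription(data.offer);
  const answer = await peer.createAnswer();
  await peer.setLocalDescription(answer);
  // 'mediaAnswer' is a hypothetical event name.
  socket.emit('mediaAnswer', { answer, from: socket.id, to: data.from });
}
```

On the client it could be registered with socket.on('mediaOffer', (data) => handleOffer(peer, socket, data)).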

The last thing left to do in this offer exchange flow is to set the SDP answer of the callee as the remote description of the caller.
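This step is short. A sketch, assuming the answer arrives in a 'mediaAnswer' message (a hypothetical event name) with the SDP under an answer field:

```javascript
// Caller side: store the callee's answer as the remote description.
async function handleAnswer(peer, data) {
  await peer.setRemoteDescription(data.answer);
}
```

It could be registered with socket.on('mediaAnswer', (data) => handleAnswer(peer, data)).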

Now the peers know all of the necessary information about each other. Does it mean that the development is over? Not really. We have already covered the flow of the first diagram, but the peers still need to agree on configuration, thus we need to send ICE candidates.

This is to be handled by both sides. Once a peer’s local description is set, we start receiving ICE candidates from the browser. It is highly discouraged to change the properties of candidates unless you really know what you’re doing. So now, all we have to do is to send each candidate over our signaling server to the other peer. Note that at this point we can’t limit ourselves to talking about either callee or caller, as both of them send candidates back and forth.
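Sending the candidates could be sketched as follows. The 'iceCandidate' event name and the remoteId parameter (the id of the peer we are talking to) are assumptions.

```javascript
// Forwards every locally gathered ICE candidate to the other peer, unchanged.
function forwardIceCandidates(peer, socket, remoteId) {
  peer.onicecandidate = (event) => {
    // A null candidate signals that gathering has finished; there is nothing to send.
    if (!event.candidate) return;
    // 'iceCandidate' is a hypothetical event name.
    socket.emit('iceCandidate', { candidate: event.candidate, to: remoteId });
  };
}
```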

We try to add the candidate to the peer. There’s a high chance that the peer won’t accept it and the promise will be rejected.
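The receiving side might look like the sketch below. The handleRemoteCandidate name and the boolean return value are assumptions added to make the outcome visible; the data is expected to carry the candidate exactly as the other peer sent it.

```javascript
// Tries to add a candidate received from the other peer.
// Resolves to true when the candidate was accepted and false when it was rejected.
async function handleRemoteCandidate(peer, data) {
  try {
    await peer.addIceCandidate(data.candidate);
    return true;
  } catch (error) {
    // A rejected candidate is not fatal; other candidates may still work.
    console.warn('Could not add ICE candidate:', error.message);
    return false;
  }
}
```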

Just to repeat, the connection is established once both peers agree on an ICE candidate.

We still have to add the remote stream once the connection is established.
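A sketch of that last client-side step, assuming remoteVideo is the video element meant for the other peer (the showRemoteStream name is made up):

```javascript
// Shows the remote peer's media once its tracks start arriving.
function showRemoteStream(peer, remoteVideo) {
  peer.ontrack = (event) => {
    // event.streams holds the streams the incoming track belongs to.
    const [stream] = event.streams;
    remoteVideo.srcObject = stream;
  };
}
```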

Last but not least, we need to finish our server-side code.

This is very straightforward. All we have to do right now is to receive messages from connected peers and forward them to the other peer.
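The forwarding could be sketched like this. The three event names are the hypothetical ones used for illustration, every message is assumed to carry the recipient’s socket id in a to field, and socket is a server-side Socket.io socket.

```javascript
// Events that the server passes along without looking inside them.
const RELAYED_EVENTS = ['mediaOffer', 'mediaAnswer', 'iceCandidate'];

// Forwards signaling messages from one peer to the peer named in `data.to`.
function relaySignaling(socket) {
  for (const eventName of RELAYED_EVENTS) {
    socket.on(eventName, (data) => {
      // Every Socket.io socket joins a room named after its own id,
      // so emitting to `data.to` reaches exactly that peer.
      socket.to(data.to).emit(eventName, data);
    });
  }
}
```

It would be called once per client, for example inside io.on('connection', (socket) => relaySignaling(socket)).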

When it comes to WebRTC connections, most of the work is handled by browsers. Our server-side application doesn’t need to understand SDP semantics, nor does it have to be aware of the peers’ intentions. Its main job is to pass messages from one peer to the other.

That concludes all of the code that was necessary to build a basic video-calling app. Feel free to reproduce those steps, open 2 tabs, and have a chat with yourself.

Conclusion

WebRTC is not easy to start with. It has a lot of different aspects that might be hard to understand, but once you get the hang of it, it actually makes a lot of sense. The topic as a whole is very complex. Fortunately, we are given a good API. Other than that, the browser and ICE servers do all of the dirty work for us, which makes WebRTC development not bad after all.

You can find the source code of the app that we created here: https://github.com/jakub-leszczynski/video-calling-app-example

I hope this article is a nice introduction for you to start building your own video-calling app. If you’d like to learn more I highly encourage you to read a guide on video-calling provided by MDN for more detailed information.

Liki Blog

A blog created by Liki team members with heart.


Written by Jakub Leszczyński

I’m a frontend developer. I like cooking and sharing my knowledge.
