Tools we developed to troubleshoot WebRTC at high scale

Denis Sicun
Kaltura Technology
Apr 16, 2023 · 6 min read

Video conferencing sometimes feels like a modern-day séance. “Hi Alice, can you hear me? Alice? Are you there?”

image taken from Twitter

With the rise and standardization of WebRTC, browser-based video conferencing solutions have become very popular in post-COVID years. But supporting such complex systems at the scale of millions of daily users is a different kind of challenge.

Why is it so hard to troubleshoot WebRTC systems?

WebRTC is tricky. It tries to make very complex processes simple, but with that simplicity comes a wide range of knowledge you need to master to build systems with this cool technology efficiently. You can break this knowledge into three main topics: media processing, networking, and signaling infrastructure. On top of that, there are more race conditions than you can imagine or reproduce.

In this post, I’ll point you toward the tools you might need to support your own WebRTC video-conferencing application. Everything here comes from my experience scaling and supporting our meeting solution to millions of users.

How did we start?

Our story begins 5 years ago at a small office in central Tel Aviv, where a small startup (Newrow, which Kaltura acquired 2 years later) was starting to get real customers into its virtual classroom solution.

With real customers came real bugs. At first, the bugs were fairly easy to reproduce. Slowly, they became more mysterious and harder to understand. We realized we needed something better than our product manager forwarding us angry emails.

Logs, logs, and more logs

The first thing you need to start any investigation is evidence. The more evidence you have, the better your conclusions will be.

For any developer, the best source of evidence is logs, so we decided we had to have a system that would collect and display them on demand. My focus here will be on client-side logs, as most of the fun with WebRTC starts in the browser.

There are a lot of products that can help you achieve the task of collecting client logs, but they get very expensive as your application grows. Unfortunately, with WebRTC, there are a lot of logs to collect.

We decided to start with a simple solution that could upload compressed zip log files to cloud storage (like AWS S3) and associate them with a customer support ticket.
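A minimal sketch of that first iteration is below. The endpoint path, the presigned-URL flow, and the use of gzip (rather than an actual zip archive) are illustrative assumptions, not our exact implementation:

```typescript
// In-memory log buffer, flushed to cloud storage when a support ticket is submitted.
const logBuffer: string[] = [];

export function log(level: string, message: string, extra?: unknown): void {
  // Keep the buffer bounded so long sessions don't exhaust memory.
  logBuffer.push(
    `${new Date().toISOString()} [${level}] ${message}${extra ? ' ' + JSON.stringify(extra) : ''}`,
  );
  if (logBuffer.length > 50_000) logBuffer.shift();
}

export async function uploadLogsForTicket(ticketId: string): Promise<void> {
  const raw = new Blob([logBuffer.join('\n')], { type: 'text/plain' });

  // Compress in the browser before uploading (gzip via CompressionStream, for illustration).
  const compressed = await new Response(
    raw.stream().pipeThrough(new CompressionStream('gzip')),
  ).blob();

  // Hypothetical backend endpoint that returns a presigned S3 URL tied to the ticket.
  const { uploadUrl } = await (
    await fetch(`/api/support/${ticketId}/log-upload-url`)
  ).json();

  // Upload the compressed log straight to cloud storage.
  await fetch(uploadUrl, { method: 'PUT', body: compressed });
}
```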

Our support ticket submission form

When a customer submits a support ticket, the client sends us the data relevant to that session. Additionally, we calculate the WebRTC statistics for that specific moment in time from all the participants. We learned the hard way that the reporting user is not necessarily the user with the problem.

Here are some pieces of data we receive from each peer connection at the time of an incident (a rough collection sketch follows the list):

  1. Signaling state — useful for knowing whether the SDPs were set
  2. ICE connection state + connection state — usually both show the same value, but the connection state also reflects the DTLS state
  3. Creation and destruction time of the peer connection — from these, you can also calculate the duration of the peer connection
  4. getUserMedia constraints and errors — these errors usually help support engineers work directly with the client without your involvement
  5. List of all detected ICE candidate types
  6. Is this a publishing or a viewing connection?
  7. WebRTC stats such as packets lost, RTT, jitter, and more
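Roughly, a snapshot like that can be assembled from the RTCPeerConnection API and getStats. The sketch below is illustrative; the field names and the exact stats we pick here are choices made for this post, not a fixed schema:

```typescript
interface PeerConnectionSnapshot {
  signalingState: RTCSignalingState;
  iceConnectionState: RTCIceConnectionState;
  connectionState: RTCPeerConnectionState;
  createdAt: number;              // ms since epoch, recorded when the connection was created
  durationMs: number;
  gumError?: string;              // last getUserMedia error for this connection, if any
  localCandidateTypes: string[];  // host / srflx / relay ...
  direction: 'publisher' | 'viewer';
  stats: Record<string, unknown>; // packets lost, RTT, jitter, ...
}

async function snapshotPeerConnection(
  pc: RTCPeerConnection,
  meta: { createdAt: number; direction: 'publisher' | 'viewer'; gumError?: string },
): Promise<PeerConnectionSnapshot> {
  const stats: Record<string, unknown> = {};
  const candidateTypes = new Set<string>();

  (await pc.getStats()).forEach((report) => {
    if (report.type === 'local-candidate') candidateTypes.add(report.candidateType);
    if (report.type === 'inbound-rtp') {
      stats[`${report.kind}PacketsLost`] = report.packetsLost;
      stats[`${report.kind}Jitter`] = report.jitter;
    }
    if (report.type === 'candidate-pair' && report.nominated) {
      stats.currentRttMs = (report.currentRoundTripTime ?? 0) * 1000;
    }
  });

  return {
    signalingState: pc.signalingState,
    iceConnectionState: pc.iceConnectionState,
    connectionState: pc.connectionState,
    createdAt: meta.createdAt,
    durationMs: Date.now() - meta.createdAt,
    gumError: meta.gumError,
    localCandidateTypes: Array.from(candidateTypes),
    direction: meta.direction,
    stats,
  };
}
```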

On top of the raw data, we added a simple UI to display the log contents, very similar to a terminal. Nothing fancy, but efficient 🙂

Caution: make sure you clean up logs once in a while. Otherwise, you will be storing, and paying for, a lot of ancient and useless data.

Taking it up a notch

When our system reached tens of thousands of concurrent users, our team could no longer keep up with all the support tickets being opened. We decided to take our bug detection one step further and started building a tool to proactively collect RTC stats and other information.

The browser gives us so many stats for free, so why not use them to look for bugs?

One of the strongest data points we decided to collect was the reason a peer connection was closed. Was it normal? Was it interrupted? Was the connection even established?

Additionally, for each peer connection you can collect stats such as the total audio/video bytes, the peer connection state, the ICE connection state, whether it was a viewer or a publisher connection, how long the connection took to set up, and how long the SDP offer/answer exchange took.

This helped us categorize our data and find really simple yet interesting trends.
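The close reason in particular can be derived with a small wrapper around the connection lifecycle. Here is a minimal sketch; the labels and wiring are illustrative, and the real payload also carries the byte counts, timings, and states listed above:

```typescript
type CloseReason = 'normal' | 'interrupted' | 'never-connected';

// Watches a peer connection and reports a single close reason when it ends.
function watchFinalState(pc: RTCPeerConnection, report: (reason: CloseReason) => void) {
  let everConnected = false;
  let reported = false;

  const finish = (reason: CloseReason) => {
    if (!reported) {
      reported = true;
      report(reason);
    }
  };

  pc.addEventListener('connectionstatechange', () => {
    if (pc.connectionState === 'connected') everConnected = true;
    // A failed connection that had worked before is an abnormal interruption.
    if (pc.connectionState === 'failed') {
      finish(everConnected ? 'interrupted' : 'never-connected');
    }
  });

  // The application's own hangup path should call this so deliberate closes count as normal.
  return {
    closeNormally() {
      pc.close();
      finish(everConnected ? 'normal' : 'never-connected');
    },
  };
}
```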

How do we collect the stats?

Diagram of our beacon collection infrastructure

When a peer connection reaches a final state, we calculate the relevant data and send it to a cluster of NGINX servers. Each NGINX server logs the request and resolves it. Next, we use td-agent to extract, transform, and load the data into Elasticsearch. Last, we explore the data and build visualizations and dashboards.
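To make the contract concrete, here is a stand-in for the collector written in Node/TypeScript. It is purely illustrative: in our setup this role is played by plain NGINX, which writes the request to its access log and answers immediately, with td-agent tailing that log into Elasticsearch. The /rtc-beacon path and log location are assumptions for this sketch:

```typescript
import { createServer } from 'node:http';
import { appendFileSync } from 'node:fs';

createServer((req, res) => {
  if (req.method === 'POST' && req.url === '/rtc-beacon') {
    let body = '';
    req.on('data', (chunk) => { body += chunk; });
    req.on('end', () => {
      // One JSON line per beacon, trivially parseable by the log shipper.
      appendFileSync('/var/log/rtc-beacons.log', body.replace(/\n/g, ' ') + '\n');
      res.writeHead(204).end(); // resolve fast; the client never waits for a response anyway
    });
  } else {
    res.writeHead(404).end();
  }
}).listen(8080);
```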

There are a few benefits to this design:

  1. It was pretty easy to scale, since we were effectively just scaling NGINX access logging. A few small pods in a K8s cluster can handle even the highest load.
  2. It’s cheaper than a fully managed solution. You pay only for some storage and a running Elasticsearch cluster.
  3. It takes very little maintenance after the initial setup. Simply clear old data once in a while.

Lessons learned:

  1. Use the window.onunload event to collect data on peer connections that are still active when the user closes the browser tab/window. Without it, you lose a big portion of your data, usually the peer connections that connected successfully and worked well through the entire call.
  2. Make the payload calculation synchronous, since once the browser starts closing a tab you won’t be able to complete async operations. This is tricky because getStats is async, but with the right design and implementation you can get good enough results. You don’t have to be accurate to the byte!
  3. Use the navigator.sendBeacon method to send the beacons. This guarantees the data is sent even if the browser tab is closing. This API doesn’t wait for a response, so it fits perfectly with lessons 1 and 2 (see the sketch after this list).
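Putting the three lessons together, the unload-time reporting pattern could look roughly like the following. The sampling interval, payload shape, and BEACON_URL are assumptions made for illustration:

```typescript
const BEACON_URL = '/rtc-beacon'; // hypothetical collection endpoint

// Lesson 2: keep a synchronously readable stats cache, refreshed in the background,
// because getStats() is async and won't resolve once the tab is closing.
let lastStatsSnapshot: Record<string, unknown> = {};

function trackStats(pc: RTCPeerConnection): () => void {
  const timer = setInterval(async () => {
    const snapshot: Record<string, unknown> = {};
    (await pc.getStats()).forEach((report) => {
      if (report.type === 'inbound-rtp') snapshot[`${report.kind}BytesReceived`] = report.bytesReceived;
      if (report.type === 'outbound-rtp') snapshot[`${report.kind}BytesSent`] = report.bytesSent;
    });
    lastStatsSnapshot = snapshot; // good enough; no need to be accurate to the byte
  }, 5_000);
  return () => clearInterval(timer); // call this when the connection ends
}

// Lessons 1 + 3: on unload, build the payload synchronously from the cache and hand it
// to sendBeacon, which delivers it even while the tab is going away.
function reportOnUnload(pc: RTCPeerConnection): void {
  window.onunload = () => {
    const payload = JSON.stringify({
      connectionState: pc.connectionState,
      iceConnectionState: pc.iceConnectionState,
      closeReason: 'tab-closed',
      stats: lastStatsSnapshot,
    });
    navigator.sendBeacon(BEACON_URL, payload);
  };
}
```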

How do we combine both tools?

At this point, our system has 2 very strong tools for diagnostics and research. We can now discover interesting trends and bugs in our system. We can assess the quality of a newly released feature. And, of course, we can investigate the data and decide on the next courses of action for our future tasks and development efforts.

Here are 2 examples of how we use those mechanisms.

  1. Hanging peer connections after getUserMedia fails
    We saw a trend of publishing peer connections getting stuck in connectionState ‘new.’ This is strange, since it means we never got to the point of setting a local SDP. After investigating those beacons, we found that many of those peer connections had failed getUserMedia. We had a bug in one of our application flows that created a peer connection even though getUserMedia had failed (see the sketch after this list).
  2. Monitoring
    After every release, we monitor our stats to see if there is an unexpected trend. One example was when we saw periodic spikes in abnormal disconnections. We discovered that we had introduced a bug in our autoscaling mechanism that scaled down SFUs too early, while they still had some usage. Without our RTC stats infrastructure, this kind of bug would’ve been very challenging to discover and fix.
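The first example boils down to a simple ordering guard: only create the peer connection once getUserMedia has actually produced a stream. A sketch of the corrected flow, with names chosen for illustration:

```typescript
async function startPublishing(constraints: MediaStreamConstraints): Promise<RTCPeerConnection | null> {
  let stream: MediaStream;
  try {
    stream = await navigator.mediaDevices.getUserMedia(constraints);
  } catch (err) {
    // Report the getUserMedia error (this is also part of the support ticket data)
    // instead of silently creating a connection that will sit in state 'new' forever.
    console.error('getUserMedia failed', err);
    return null;
  }

  const pc = new RTCPeerConnection();
  stream.getTracks().forEach((track) => pc.addTrack(track, stream));
  // ...continue with createOffer / setLocalDescription / signaling...
  return pc;
}
```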

Final words and thoughts

These tools are now part of my team’s and my day-to-day work. Every morning, I check the recent trends I’m following. This is how we “live the system.” In the long run, this infrastructure has saved us a lot of time and money, in terms of solved bugs and customer satisfaction.

Finally, collecting so much data opens some interesting doors to solving other kinds of problems in the world of video conferencing, such as quality assessment. I believe that in the near future, we will also expand our efforts in that direction.
