Drilling down to a single WebRTC session

As much as I talk about doing analytics and metrics at the service level, sometimes even I need to drill down and look at the raw data of a single peer-to-peer session. Tuesday last week was such a day. Andreas Pehrson poked me about a session he had earlier which was not behaving as he expected:

Serious drilling was required for this hook. Credits: Nina Kirchner
Look at throughput — first it was high out from Telenor Moterom, then with screenshare it got negative, then after they went back it got stuck at a much lower bitrate? A pretty solid 400k. Do you know why that would happen?

Since he works on Firefox's WebRTC media stack, and therefore knows what he is doing, I dropped all other tasks (fortunately it was beer time already) and started digging into it.

After identifying the data record in our WebRTC logging backend (more on that topic soon) I got a full dump of the session, including all peerconnection calls and getStats data at one-second resolution. Just to give you an impression: uncompressed, the data (in a relatively optimized format) for this sixty-minute call takes up about four megabytes of disk space. Importing and visualizing it takes half a minute in my browser.

I looked at the bitrate of video sent to Andreas. The Chrome client in the meeting room started with a bitrate of 1.7Mbps. Then the participants in that room shared the screen, and after the screensharing session ended the client was sending 384kbps. It roughly looked like this:

It took us a while to figure out what happened. Of course, we quickly found an issue with the visualization: it didn’t deal with Chrome occasionally resetting the bytesSent counters, such as when sharing the screen, which is why the graph showed a negative bitrate at that point. I filed a Chrome bug about this in December last year, so I knew about it. Not the issue I was looking for...
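A minimal sketch of how a visualization could cope with such counter resets. The `Sample` shape and the `bitrates` function are hypothetical names for illustration, not our actual visualization code, and treating a backwards jump as "reset to zero" is an assumption:

```typescript
// One getStats sample: timestamp in milliseconds plus the cumulative
// bytesSent counter. The field names are illustrative.
interface Sample {
  timestampMs: number;
  bytesSent: number;
}

// Returns the bitrate in bits per second for each interval between samples.
// If the counter goes backwards we assume it was reset to zero and count the
// new value as the bytes sent in that interval (a lower bound), instead of
// reporting a negative bitrate.
function bitrates(samples: Sample[]): number[] {
  const result: number[] = [];
  let prev: Sample | undefined;
  for (const cur of samples) {
    if (prev !== undefined) {
      const seconds = (cur.timestampMs - prev.timestampMs) / 1000;
      if (seconds > 0) {
        const delta =
          cur.bytesSent >= prev.bytesSent
            ? cur.bytesSent - prev.bytesSent
            : cur.bytesSent; // counter was reset
        result.push((delta * 8) / seconds);
      }
    }
    prev = cur;
  }
  return result;
}
```

A naive `cur.bytesSent - prev.bytesSent` would go negative at the reset, which matches the dip Andreas saw in the graph.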

But that did not explain why he was restricted to roughly 400kbps after screensharing stopped. However, that number looked quite familiar, especially when looking at the bandwidth estimate graph:

In appear.in we use an adaptive variant of full mesh which restricts the bandwidth through SDP manipulation, depending on the number of participants. For 1–1 sessions which don’t use a TURN server we let Chrome choose the bandwidth freely, which is what we see in the first part of the session.

We do restrict the video bandwidth to 384kbps per participant for three participants. So I looked in the peerconnection API calls and found the corresponding setRemoteDescription call which inserted the following line into the SDP:

b=AS:384

This line restricts the bandwidth sent out to the peer to 384kbps.
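As a rough sketch, this kind of SDP manipulation could look like the following. `setVideoBandwidth` is a hypothetical helper, not the appear.in implementation, and it assumes the video section carries its own `c=` line (as Chrome's SDP does) so that the `b=` line can be placed directly after it:

```typescript
// Inserts (or replaces) a b=AS line in the video media section of an SDP.
// Assumption: the video section has its own c= line; the b= line goes
// right after it, which is the order SDP expects.
function setVideoBandwidth(sdp: string, kbps: number): string {
  const out: string[] = [];
  let inVideo = false;
  for (const line of sdp.split("\r\n")) {
    if (/^m=/.test(line)) {
      inVideo = /^m=video/.test(line); // track which media section we are in
    }
    if (inVideo && /^b=AS:/.test(line)) {
      continue; // drop any previous limit so we never end up with two
    }
    out.push(line);
    if (inVideo && /^c=/.test(line)) {
      out.push("b=AS:" + kbps);
    }
  }
  return out.join("\r\n");
}
```

In the browser, the modified SDP would then be handed to setRemoteDescription in place of the original one.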

Andreas told me someone else had briefly entered the room because he seemed to be running into issues with the shared screen. In that case our code is supposed to remove the b=AS line once that person has left. Did that not work? I found the corresponding removal of that line as well.

So why did Chrome stick to sending 384kbps? Then I vaguely recalled an issue about the WebRTC sample for setting bandwidth from more than a month ago. A comment there said:

Removing the restriction seems to not be honored by Chrome 54

So we had run into this bug. Fortunately it turned out to be easy to work around until the Chrome issue is fixed: instead of removing the limit, we just set a high maximum bandwidth. It was a three-line fix, but it took Andreas and me an hour each to figure out what went wrong...
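The idea behind the workaround, sketched under assumptions: `applyVideoLimit` and `UNLIMITED_KBPS` are hypothetical names, the ceiling value is arbitrary, and this is not the actual three-line change. The point is only that "no limit" gets written as a large `b=AS` value rather than as a missing line:

```typescript
// Hypothetical ceiling used in place of "no limit". The value is an
// assumption; it only needs to be well above what the client would send.
const UNLIMITED_KBPS = 10000;

// Rewrites the b=AS line of the video section. A limit of null means
// "unrestricted", which is expressed as a high ceiling instead of deleting
// the line, since the removal did not seem to be honored by Chrome.
// Assumption: the video section has its own c= line.
function applyVideoLimit(sdp: string, limitKbps: number | null): string {
  const kbps = limitKbps === null ? UNLIMITED_KBPS : limitKbps;
  const out: string[] = [];
  let inVideo = false;
  for (const line of sdp.split("\r\n")) {
    if (/^m=/.test(line)) {
      inVideo = /^m=video/.test(line);
    }
    if (inVideo && /^b=AS:/.test(line)) {
      continue; // replace the old value rather than stacking lines
    }
    out.push(line);
    if (inVideo && /^c=/.test(line)) {
      out.push("b=AS:" + kbps);
    }
  }
  return out.join("\r\n");
}
```

Calling `applyVideoLimit(sdp, 384)` when a third participant joins and `applyVideoLimit(sdp, null)` once they leave keeps a `b=AS` line in the SDP at all times.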

Sometimes this kind of debugging is necessary and well worth the effort. Being able to retroactively get as much logging as possible, covering both the WebRTC API calls and the statistics from the getStats API, is essential to resolve such issues.

Are you interested in debugging such issues? Maybe our customer support engineer position is right for you 😉