The Ørteren lake on the Hardangervidda, one of Norway's premier snowkite spots, is frozen from December until May.

When a browser update breaks… your native app (for a change)

tl;dr: a Chrome 56 update broke our iOS and Android apps: video started freezing. A post-mortem with an interesting twist at the end. And I got waffles.

As I pointed out a couple of times before, browser updates sometimes break things. And of course, it happened again. Shortly after Chrome 56 rolled out, we got some reports of video in our iOS app freezing. I am very much in favor of things freezing when it comes to lakes, but not so much with video.

Luckily, our very own Ingrid experienced this and reported it to us. This allowed us to gather information and reproduce it in our local environment.

What happened?

Chrome stopped sending video at some point during the session. Audio was still flowing and video was still being received, so the ICE connection was still up. Initially, the Chrome logs made it look like the bandwidth estimate had gone to 0, which seems like a bad thing:

[54753:34551:0222/153138.975001:INFO:webrtcvideoengine2.cc(1363)] Call stats: 384720583, {send_bw_bps: 49273, recv_bw_bps: 497332, max_pad_bps: 0, pacer_delay_ms: 0, rtt_ms: 291}
...
[54753:34551:0222/153149.138986:INFO:webrtcvideoengine2.cc(1363)] Call stats: 384730747, {send_bw_bps: 13744, recv_bw_bps: 373259, max_pad_bps: 0, pacer_delay_ms: 8014, rtt_ms: 292}
...
[54753:34551:0222/153142.881818:VERBOSE1:rtcp_receiver.cc(980)] Incoming REMB: 0
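When comparing many of these logs, it can help to pull the numbers out of the "Call stats" lines instead of reading them by eye. The following is a minimal sketch that assumes the line format shown in the excerpt above; the function names and the 2000 ms pacer delay threshold are made up for illustration and are not part of Chrome or of our tooling.

```typescript
// Fields of a "Call stats" log line, named as they appear in the excerpt above.
interface CallStats {
  send_bw_bps: number;
  recv_bw_bps: number;
  max_pad_bps: number;
  pacer_delay_ms: number;
  rtt_ms: number;
}

const CALL_STATS_FIELDS = [
  "send_bw_bps",
  "recv_bw_bps",
  "max_pad_bps",
  "pacer_delay_ms",
  "rtt_ms",
] as const;

// Parse one log line. Returns null for lines that are not "Call stats"
// entries or that lack one of the expected fields.
function parseCallStats(line: string): CallStats | null {
  const match = /Call stats: \d+, \{([^}]*)\}/.exec(line);
  if (!match) return null;

  const raw: Record<string, number> = {};
  for (const pair of (match[1] ?? "").split(",")) {
    const [key, value] = pair.split(":");
    if (key !== undefined && value !== undefined) {
      raw[key.trim()] = Number(value);
    }
  }

  const stats = {} as CallStats;
  for (const field of CALL_STATS_FIELDS) {
    const v = raw[field];
    if (v === undefined || isNaN(v)) return null;
    stats[field] = v;
  }
  return stats;
}

// Hypothetical helper: collect the entries whose pacer queue delay exceeds
// maxPacerDelayMs (an arbitrary threshold chosen for this sketch).
function findPacerBacklog(logLines: string[], maxPacerDelayMs = 2000): CallStats[] {
  const flagged: CallStats[] = [];
  for (const line of logLines) {
    const stats = parseCallStats(line);
    if (stats !== null && stats.pacer_delay_ms > maxPacerDelayMs) {
      flagged.push(stats);
    }
  }
  return flagged;
}
```

Run against the two "Call stats" lines above, only the second one (pacer_delay_ms: 8014) would be flagged.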

But we found some cases where it was freezing and the bandwidth estimate looked normal. Looking at the graphs from chrome://webrtc-internals we found another commonality:

The bitsSentPerSecond, framesEncoded and packetsSentPerSecond values dropped (or did not increase) at the point when a packet was lost and Chrome 56 did not send any more video.
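The symptom described above can be detected mechanically: the cumulative send counters stop increasing. Below is a rough sketch of such a check. The sample shape, the function name and the three-second window are assumptions made for this example; the samples themselves would come from whatever statistics polling you already have.

```typescript
// One snapshot of the outgoing video stream's cumulative counters.
// The shape is an assumption for this sketch, not a browser API type.
interface VideoSendSample {
  timestampMs: number;
  framesEncoded: number; // cumulative count
  packetsSent: number;   // cumulative count
}

// Returns the timestamp of the last sample before the counters stopped
// increasing, if neither counter moved for at least minStallMs.
// Returns null if no such stall is found.
function findSendStall(samples: VideoSendSample[], minStallMs = 3000): number | null {
  let stallStart: VideoSendSample | null = null;
  for (let i = 1; i < samples.length; i++) {
    const prev = samples[i - 1];
    const cur = samples[i];
    const progressed =
      cur.framesEncoded > prev.framesEncoded || cur.packetsSent > prev.packetsSent;
    if (progressed) {
      stallStart = null; // sending resumed, reset the window
      continue;
    }
    if (stallStart === null) stallStart = prev;
    if (cur.timestampMs - stallStart.timestampMs >= minStallMs) {
      return stallStart.timestampMs;
    }
  }
  return null;
}
```

A check like this only says that sending stopped, not why. It would also trigger when video is deliberately muted, so it should be combined with the application's own state.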

When there is packet loss, the typical behaviour is to send a NACK packet, asking the sender to resend a certain packet. The resend of the packet uses a mechanism called RTX which stands for retransmission.

Now, when something changes between browser versions, it is reasonable to assume that the change is documented in the change log. And the M56 WebRTC release notes highlight the following:

Old RED/RTX workarounds removed
This release removes some old workarounds relating to RED over RTX, which were temporarily introduced to mitigate a bug in older Chrome versions. The workaround removal does not affect developers using the default SDP. Developers that munge the SDP with respect to RED and/or RTX, however, now need to ensure that RED over RTX is explicitly enabled, if so desired.

I read that earlier and asked a couple of questions but was not concerned since we do not munge the SDP with respect to RED or RTX.

And yet I had a gut feeling that this might be the cause of the issue we were seeing. Also, the error did not reproduce with Firefox or with our SFU. So in order to verify whether it was related to RTX/RED, we mangled the SDP and removed the RTX and RED “codecs” from the offer sent by our web client to the iOS application. And that stopped the video freezes.
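For the curious, this is roughly what such a modification looks like. It is a simplified sketch and not the exact code we shipped: the function name is made up, and it assumes a Chrome-style offer in which RTX streams are announced through a=ssrc-group:FID lines. It drops the RTX and RED payload types from the video m-line, removes their rtpmap/fmtp/rtcp-fb lines, and removes the FID groups along with the retransmission SSRCs they reference.

```typescript
// Strip the "rtx" and "red" codecs from all video sections of an SDP blob.
function stripRtxAndRed(sdp: string): string {
  const lines = sdp.split(/\r?\n/);
  const payloadTypes: string[] = []; // payload types mapped to rtx or red
  const rtxSsrcs: string[] = [];     // second SSRC of each FID group

  // Pass 1: find what needs to go, looking at video sections only.
  let inVideo = false;
  for (const line of lines) {
    if (/^m=/.test(line)) inVideo = /^m=video /.test(line);
    if (!inVideo) continue;
    const rtpmap = /^a=rtpmap:(\d+) ([^/]+)\//.exec(line);
    if (rtpmap) {
      const codec = rtpmap[2].toLowerCase();
      if (codec === "rtx" || codec === "red") payloadTypes.push(rtpmap[1]);
    }
    const fid = /^a=ssrc-group:FID \d+ (\d+)/.exec(line);
    if (fid) rtxSsrcs.push(fid[1]);
  }

  // Pass 2: rewrite the video sections, copy everything else unchanged.
  const out: string[] = [];
  inVideo = false;
  for (const line of lines) {
    if (/^m=/.test(line)) inVideo = /^m=video /.test(line);
    if (!inVideo) {
      out.push(line);
      continue;
    }
    if (/^m=video /.test(line)) {
      // m=video <port> <proto> <pt> <pt> ...
      const parts = line.split(" ");
      const kept = parts.slice(3).filter((pt) => payloadTypes.indexOf(pt) === -1);
      out.push(parts.slice(0, 3).concat(kept).join(" "));
      continue;
    }
    const attr = /^a=(?:rtpmap|fmtp|rtcp-fb):(\d+)/.exec(line);
    if (attr && payloadTypes.indexOf(attr[1]) !== -1) continue;
    if (/^a=ssrc-group:FID /.test(line)) continue;
    const ssrc = /^a=ssrc:(\d+) /.exec(line);
    if (ssrc && rtxSsrcs.indexOf(ssrc[1]) !== -1) continue;
    out.push(line);
  }
  return out.join("\r\n");
}
```

One would apply this to the offer's sdp string before handing it to signaling. Other attributes that depend on the removed codecs (ulpfec, for example, which Chrome sends wrapped in RED) are left alone here and may need their own handling.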

Yay, time for waffles. Thank you Ingrid :-)

After that it was just a matter of rolling this modification out to the clients. Which took some time… also I got some very weird looks when explaining the SDP manipulations necessary to a coworker:

SDP mangling on the whiteboard

Not sure why, this makes perfect sense to me… :-)

What went wrong

The appear.in iOS (and Android) applications use the libjingle_peerconnection cocoapod, which is a very popular alternative to building the webrtc.org library yourself. Unfortunately, builds stopped quite a while ago and we were stuck with the last revision we considered stable, which has a version number of 10642 and is somewhat outdated. That kind of interoperability issue is not that common, though in the past we have seen Firefox break DTLS interoperability with old versions of the native libraries.

We probably should have seen this earlier, but it only happens when there is packet loss, which rarely occurs when testing locally.

Wait a second…

Thanks for bearing with me on a mind-numbing topic of such boredom that I had to take a nap just writing this sentence (thanks Chris). I noticed another thing in the statistics:

The googBucketDelay value increases linearly when Chrome stops sending video (scroll up to the other picture). Now that was something that had been giving me a slight headache for a while. When Chrome 54 rolled out, an old friend poked me about an issue he had during a video call, and I found a similar increase in the bucket delay. So I filed a bug and checked if I was measuring how often this happened:

As we can see in the picture, this has been affecting some sessions since the middle of October. Enough to file a bug and monitor the situation but not an issue that was widespread enough that it warranted dropping everything else. And we didn’t hear many complaints about video freezes at the same time so this may have just been an artifact in the statistics.

I remembered that bug when I got Ingrid’s report and we were able to reproduce the issue. We have been continuing to monitor the situation since then. After we rolled out the workaround in our iOS application, the number of times this happened dropped down to a much lower number:

So a hotfix which disabled RTX/RED to counter a change in Chrome 56 removed an issue we have seen since Chrome 54 (and possibly earlier). However, the Chrome 54 WebRTC release notes indicate nothing related to RTX or RED.

There is a chance that this is caused by two different issues. The numbers we have been tracking since Chrome 54 show a very strong weekday-weekend pattern in Chrome 55 (red in the graph below; note the drop in the number of times this occurs every weekend) which is no longer present in Chrome 56 (yellow) and which is not present in our iOS app:

Being present all the time also meant this graph didn’t get any particular attention with the M56 rollout either.

It is not fully clear why this change would stop Chrome from sending video either. I would expect receiving the packet from Chrome in the iOS app to be the issue if this is a resend in a new format not understood by the old library. Maybe there is a WebRTC packet of death…