Working around TURN bugs

Chrome 58 contains a bug which makes TURN/TCP and TURN/TLS almost unusable. Here is how to work around it and how to measure the impact.

My rule of thumb is that every second Chrome release is bad. After the issues I had with Chrome 56, there was Chrome 58, which came with a crash so severe that I got my own minor release and a CVE number. And then it turned out that the bitrate estimator was broken when TURN was used over TCP or TLS.

“So you broke TURN again…” — I admire this Svalbard reindeer for its serenity

The bug was reported on the discuss-webrtc list on May 22nd. The ticket that was filed gave nice and easy instructions for reproducing the issue, too. And the effect is clearly visible both in the statistics and in the video itself.

The number of bits sent per second drops from 200kbps to less than 50kbps instantly. The video is quite pixelated and delayed.

The fix did not take long (thanks Taylor!) and came with a suggested workaround: remove the transport-cc feedback (which is part of the new send-side bandwidth estimation) from the SDP. The fix is currently under consideration for uplifting to Chrome 59 and to Chrome 58, the current stable version. If that happens, no workaround will be needed for those versions at least.

There are two variants of the workaround. The first is to munge the SDP in any offer you send to or receive from a peer. That falls back to the previous REMB-based estimation, which is still pretty good.
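Here is a minimal sketch of such a munging helper. The function name is mine, and it strips both the transport-wide-cc header extension and any a=rtcp-fb transport-cc feedback lines, which is slightly more than the renegotiation snippet further below does:

function removeTransportCC(sdp) {
  // Drop the transport-wide-cc header extension (matched via its URI,
  // which is more robust than hardcoding the extmap id) as well as any
  // "a=rtcp-fb:<pt> transport-cc" feedback lines.
  return sdp.split('\r\n')
    .filter(line => line.indexOf('transport-wide-cc-extensions') === -1 &&
        line.indexOf('transport-cc') === -1)
    .join('\r\n');
}

For example, before applying a remote offer you would call pc.setRemoteDescription({type: 'offer', sdp: removeTransportCC(offer.sdp)}).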

The more complicated workaround is to do this fallback only when it’s necessary. To do so you need to check what type of connection you are using. This piece of code shows what to do in the iceconnectionstatechange handler:

pc.oniceconnectionstatechange = () => {
  if (pc.iceConnectionState === 'connected' /* or 'completed' */) {
    pc.getStats(null)
      .then(stats => {
        stats.forEach(report => {
          if (report.type === 'transport') {
            // Look up the selected candidate pair and its local candidate.
            const pair = stats.get(report.selectedCandidatePairId);
            const local = stats.get(pair.localCandidateId);
            // The candidate type preference is encoded in the top eight
            // bits of the candidate priority (RFC 5245, section 4.1.2.1).
            const priority = local.priority;
            const typePreference = priority >> 24;
            console.log(priority, typePreference);
          }
        });
      });
  }
};

If the local connection turns out to be using TURN/TCP (indicated by a type preference of 1) or TURN/TLS (type preference 0), you need to renegotiate. If you were the one offering initially, you basically do this:

pc.createOffer()
  .then(offer => {
    // Remove the transport-cc header extension to fall back to REMB.
    const sdp = offer.sdp.replace('a=extmap:5 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01\r\n', '');
    return pc.setLocalDescription({type: offer.type, sdp: sdp});
  })
  .then(() => {
    // Use the current remote description as the new remote description.
    // Not ideal, but it requires no communication.
    return pc.setRemoteDescription(pc.remoteDescription);
  });
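If you were the answerer initially, pc.remoteDescription holds an offer instead, so the counterpart should look like this. This is only a sketch, I have not battle-tested this variant:

pc.setRemoteDescription({
  type: 'offer',
  // Strip the extension from the stored remote offer so the new local
  // answer will not negotiate it either; again no signaling is needed.
  sdp: pc.remoteDescription.sdp.replace('a=extmap:5 http://www.ietf.org/id/draft-holmer-rmcat-transport-wide-cc-extensions-01\r\n', ''),
})
  .then(() => pc.createAnswer())
  .then(answer => pc.setLocalDescription(answer));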

Not pretty but it worked out quite well for us.

Now… Chrome 58 rolled out at the end of April, which means the bug went unnoticed in the stable version for almost a month. That brings us to my other favorite topic: playing with data gathered by rtcstats (which, tada, is now open source).

The bug shows up as very low send bandwidth. The most promising feature turned out to be the mean value of the googTransmitBitrate statistic for any session that uses TURN/TCP or TURN/TLS and therefore has a local type preference of 0 or 1. I came up with this terrifying (sic!) query:

SELECT
    timestamp 'epoch' + date / 1000 * interval '1 second' AS day,
    MIN(medianmean) AS mean, browsermajorversion
FROM (
    SELECT date, browsermajorversion,
        MEDIAN(bwegoogtransmitbitratemean) OVER
            (PARTITION BY date, browsermajorversion) AS medianmean
    FROM features_permanent
    WHERE iceconnectedorcompleted = 't' AND
        (firstcandidatepairlocaltypepreference = 0 OR
         firstcandidatepairlocaltypepreference = 1) AND
        bwegoogtransmitbitratemean IS NOT NULL AND
        browsername = 'Chrome' AND
        browsermajorversion >= 56 AND browsermajorversion <= 58) AS medians
GROUP BY day, browsermajorversion
ORDER BY day DESC, browsermajorversion ASC;

While the daily median of a mean value is typically very hard to interpret, it showed quite a significant difference:

Both Chrome 56 (blue) and Chrome 57 (black) show a much higher mean bandwidth usage. Chrome 58 (green) is suspiciously low; a mean bandwidth of 30kbps means not much is being transmitted. In hindsight the issue is pretty obvious in the data, but this is not a metric I usually look at either.

The much more interesting question is whether the workaround described above is effective. For that I needed to go down to hourly resolution:
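That is mostly a matter of swapping the daily bucket for an hourly one. Here is a sketch of how the query above changes, assuming the date column is a plain millisecond Unix timestamp (the actual column semantics may differ) and narrowing to the affected Chrome 58:

SELECT hour, MIN(medianmean) AS mean
FROM (
    SELECT date_trunc('hour',
            timestamp 'epoch' + date / 1000 * interval '1 second') AS hour,
        MEDIAN(bwegoogtransmitbitratemean) OVER
            (PARTITION BY date_trunc('hour',
                timestamp 'epoch' + date / 1000 * interval '1 second')) AS medianmean
    FROM features_permanent
    WHERE iceconnectedorcompleted = 't' AND
        (firstcandidatepairlocaltypepreference = 0 OR
         firstcandidatepairlocaltypepreference = 1) AND
        bwegoogtransmitbitratemean IS NOT NULL AND
        browsername = 'Chrome' AND browsermajorversion = 58) AS hourly
GROUP BY hour
ORDER BY hour DESC;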

As you can see, the mean bandwidth went up significantly, from 50kbps to 100kbps, after the workaround was deployed around 7am on that timescale.

For some bugs there are workarounds. Typically they are neither pretty nor easy to implement, as we have seen in the past. But they shorten the time users are affected by a bug, which is what matters in the end.

Visit the appear.in blog to read more about how we tackle those itchy bugs! And be kind and 💚 this story if you liked it!