Catching TURN bugs with monitoring

I have spent quite some time data nerding recently. It is my spare-time project, and it helps me file bugs. One example is a recent regression in Firefox 47 on Linux that affected how Firefox gathers TURN/TCP candidates. The bugzilla entry has the full story, including the rapid response by Mozilla (thanks Nils for fixing it).

One evening I pasted the following snippet into the JS console:

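// Only a TURN/TCP server is configured, so any relay candidate must come from it.
var pc = new RTCPeerConnection({
  iceServers: [{urls: 'turn:turn.appear.in:443?transport=tcp', username: ..., credential: ...}]
});
// Log every gathered candidate; null marks the end of gathering.
pc.onicecandidate = function(event) {
  console.log(event.candidate ? event.candidate.candidate : null);
};
// Creating and applying an offer starts candidate gathering.
pc.createOffer({offerToReceiveAudio: 1})
  .then((offer) => pc.setLocalDescription(offer));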

This creates a peerconnection and gathers TURN candidates from one of our TURN servers. Now… it didn’t work on my Linux machine. I grabbed another machine (Windows) and it worked just fine. On the Linux machine it worked when specifying the IP address instead of the hostname. Maybe it’s just me?
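To turn the console output into a yes/no answer, the logged candidate lines can be checked for a relay candidate. The following is a minimal sketch, not what our platform actually runs: candidateType and gatheredRelay are hypothetical helper names, and it assumes (as in the snippet above) that only a TURN/TCP server is configured, so any candidate of type relay must have been gathered via TURN/TCP.

```javascript
// Hypothetical helper: extract the candidate type ("host", "srflx",
// "relay", ...) from an ICE candidate line, or null if there is none.
function candidateType(candidateLine) {
  var match = / typ (\w+)/.exec(candidateLine || '');
  return match ? match[1] : null;
}

// Hypothetical helper: true if at least one relay candidate is in the list.
// Assumption: only a TURN/TCP server was configured, so "relay" implies
// the candidate was gathered via TURN/TCP.
function gatheredRelay(candidateLines) {
  return candidateLines.some(function(line) {
    return candidateType(line) === 'relay';
  });
}
```

Collecting event.candidate.candidate into an array inside onicecandidate and calling gatheredRelay once the null candidate arrives would give the same signal as eyeballing the console.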

Checking our platform metrics (specifically gatheredTURNTCP from previous blog posts) it turned out that on Linux only very, very few Firefox calls gathered a TURN/TCP candidate. The failure rate was about 90%. On Windows and OSX it was less than five percent. Clearly something was wrong here. This is quite hard to detect, since we have a pretty low number of Firefox calls, and then only some of those are Linux users who also require TURN/TCP to work… the fraction of a fraction of a fraction of a pie. But if the pie is big enough…
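The comparison above boils down to grouping calls by browser and operating system and counting how often no TURN/TCP candidate was gathered. Here is a rough sketch of that aggregation. The record shape (browser, os and a boolean gatheredTURNTCP per call) and the function name turnTcpFailureRates are assumptions made for illustration, not our actual pipeline.

```javascript
// Hypothetical record shape, one per call:
//   {browser: 'Firefox', os: 'Linux', gatheredTURNTCP: false}
// Returns a map from "browser/os" to the fraction of calls (0..1)
// in which no TURN/TCP candidate was gathered.
function turnTcpFailureRates(calls) {
  var stats = {};
  calls.forEach(function(call) {
    var key = call.browser + '/' + call.os;
    if (!stats[key]) {
      stats[key] = {total: 0, failed: 0};
    }
    stats[key].total++;
    if (!call.gatheredTURNTCP) {
      stats[key].failed++;
    }
  });
  var rates = {};
  Object.keys(stats).forEach(function(key) {
    rates[key] = stats[key].failed / stats[key].total;
  });
  return rates;
}
```

Adding the browser version to the grouping key would show which release a regression arrived with. With small groups such as Firefox on Linux, the total count is worth keeping next to the rate, since a handful of calls can swing the percentage a lot.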

This is the point where I filed a bug and poked people on IRC. The #media channel on irc.mozilla.org gives you pretty good response times. After some back and forth in the bug tracker it turned out to be a regression which surfaced in Firefox 47. In hindsight, this turned out to be very visible. Kudos to getstats.io for providing us with the data for this graph:

The error rate for Firefox 46 on Linux (shown in blue) is pretty normal. When Firefox 47 rolls out around June 7th we see a pretty constant error rate of roughly 100% on Linux (orange). On Windows, the error rates for Firefox 46 and 47 are similar. Note that this does not mean that the call failed. Just that no TURN/TCP candidate was gathered.

It took almost two months to get this reported back to Mozilla from the point where it was visible in this graph. But as a wise data scientist (hi Eric!) once told me: “I can’t have you and Gustavo look at graphs all day”.

Yet another example of how monitoring WebRTC improves your service. Success in that field is measured in bug reports to the browser vendors.