Catching TURN bugs with monitoring

Philipp Hancke
Sep 23, 2016 · 2 min read

I spent quite some time data nerding recently. It is my spare time project, and doing so helps me file bugs. One example of the results I get from this is a recent regression in Firefox 47 on Linux that affected how Firefox gathers TURN/TCP candidates. The Bugzilla entry has the full story, including the rapid response by Mozilla (thanks Nils for fixing it).

One evening I pasted the following snippet into the JS console:

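// username and credential are elided here.
var pc = new RTCPeerConnection({
  iceServers: [{urls: 'turn:turn.appear.in:443?transport=tcp', username: ..., credential: ...}]
});
// Log each gathered candidate; null marks the end of gathering.
pc.onicecandidate = function(event) {
  console.log(event.candidate ? event.candidate.candidate : null);
};
// Create an offer and set it as the local description, which starts ICE gathering.
pc.createOffer({offerToReceiveAudio: 1})
  .then((offer) => pc.setLocalDescription(offer));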

This creates a peer connection and gathers TURN candidates from one of our TURN servers. Now… it didn’t work on my Linux machine. I grabbed another machine (Windows) and it worked just fine. On the Linux machine it worked when specifying the IP address instead of the hostname. Maybe it’s just me?
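To turn this console experiment into a yes/no check, the logged candidate lines can be inspected for a relay candidate. Since the snippet configures only a single TURN/TCP server, any candidate of type relay must have been gathered through it. The following is a minimal sketch: the helper names candidateType and gatheredRelay are made up for illustration, and the sample candidate lines use documentation addresses rather than real output.

```javascript
// Hypothetical helpers, not part of any WebRTC API.
// Extract the candidate type (host, srflx, prflx, relay) from an ICE candidate line.
function candidateType(candidateLine) {
  const parts = candidateLine.trim().split(/\s+/);
  const typIndex = parts.indexOf('typ');
  if (typIndex === -1 || typIndex + 1 >= parts.length) {
    return null;
  }
  return parts[typIndex + 1];
}

// True if at least one of the gathered candidate lines is a relay candidate.
function gatheredRelay(candidateLines) {
  return candidateLines.some((line) => candidateType(line) === 'relay');
}

// Made-up sample lines using documentation IP ranges.
const sampleLines = [
  'candidate:0 1 UDP 2122252543 192.0.2.10 54321 typ host',
  'candidate:1 1 UDP 92217087 203.0.113.5 50000 typ relay raddr 203.0.113.5 rport 50000',
];
console.log(gatheredRelay(sampleLines));
```

In the browser, one would push each event.candidate.candidate into an array inside onicecandidate and call gatheredRelay once the null candidate arrives.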

Checking our platform metrics (specifically gatheredTURNTCP from previous blog posts) it turned out that on Linux only very, very few Firefox calls gathered a TURN/TCP candidate. The failure rate was about 90%. On Windows and OSX it was less than five percent. Clearly something is wrong here. This is quite hard to detect since we have a pretty low number of Firefox calls, and Linux users who also require TURN/TCP to work are a fraction of those… the fraction of a fraction of a fraction of a pie. But if the pie is big enough…
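The per-platform comparison can be sketched as a small aggregation. This assumes each call is recorded as an object with browser, os and a boolean gatheredTURNTCP field. That record shape, the function name failureRateByPlatform and the sample data are assumptions for illustration, not our actual pipeline.

```javascript
// Hypothetical aggregation: share of calls per browser/OS that gathered no TURN/TCP candidate.
function failureRateByPlatform(calls) {
  const totals = new Map();
  for (const call of calls) {
    const key = call.browser + '/' + call.os;
    const entry = totals.get(key) || {calls: 0, failures: 0};
    entry.calls += 1;
    if (!call.gatheredTURNTCP) {
      entry.failures += 1;
    }
    totals.set(key, entry);
  }
  const rates = {};
  for (const [key, entry] of totals) {
    rates[key] = entry.failures / entry.calls;
  }
  return rates;
}

// Made-up records, only to show the shape of the input.
const sampleCalls = [
  {browser: 'Firefox', os: 'Linux', gatheredTURNTCP: false},
  {browser: 'Firefox', os: 'Linux', gatheredTURNTCP: false},
  {browser: 'Firefox', os: 'Windows', gatheredTURNTCP: true},
  {browser: 'Firefox', os: 'Windows', gatheredTURNTCP: true},
];
console.log(failureRateByPlatform(sampleCalls));
```

With real data the groups are very uneven in size, so it is worth keeping the call count next to each rate before drawing conclusions from a small slice.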

This is the point where I filed a bug and poked people on IRC. The #media channel on irc.mozilla.org gives you pretty good response times. After some back and forth in the bug tracker it turned out to be a regression which surfaced in Firefox 47. In hindsight, this turned out to be very visible. Kudos to getstats.io for providing us with the data for this graph:

The error rate for Firefox 46 on Linux (shown in blue) is pretty normal. When Firefox 47 rolls out around June 7th we see a pretty constant error rate of roughly 100% on Linux (orange). On Windows, the error rates for Firefox 46 and 47 are similar. Note that this does not mean that the call failed, just that no TURN/TCP candidate was gathered.

It took almost two months to get this reported back to Mozilla from the point where it was visible in this graph. But as a wise data scientist (hi Eric!) once told me: “I can’t have you and Gustavo look at graphs all day”.
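Since nobody can look at graphs all day, a check like this could be automated. The sketch below compares the failure rate of each browser version against the previous version on the same OS and flags jumps above a threshold. The input shape, the function name findRegressions, the 0.5 threshold and the sample numbers are all assumptions for illustration.

```javascript
// Hypothetical check: flag OS/version pairs whose failure rate jumped versus the previous version.
function findRegressions(ratesByOs, threshold) {
  const regressions = [];
  for (const os of Object.keys(ratesByOs)) {
    // Sort versions numerically so each one is compared with its predecessor.
    const versions = Object.keys(ratesByOs[os]).map(Number).sort((a, b) => a - b);
    for (let i = 1; i < versions.length; i++) {
      const previous = ratesByOs[os][versions[i - 1]];
      const current = ratesByOs[os][versions[i]];
      if (current - previous > threshold) {
        regressions.push({os: os, version: versions[i], previous: previous, current: current});
      }
    }
  }
  return regressions;
}

// Made-up failure rates, keyed by OS and then by Firefox major version.
const sampleRates = {
  Linux: {46: 0.05, 47: 0.95},
  Windows: {46: 0.04, 47: 0.05},
};
console.log(findRegressions(sampleRates, 0.5));
```

With these sample numbers only Firefox 47 on Linux is flagged, which is the kind of signal that would have been worth an alert instead of a chance discovery in the JS console.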

Yet another example of how monitoring WebRTC improves your service. Success in that field is measured in bug reports to the browser vendors.
