The WebRTC browser lifecycle

Chrome 53 has been rolling out since the beginning of September.

Ever since I started working with WebRTC, I have dreaded browser upgrades. Usually something breaks. Yes, you can test with upcoming versions of Chrome (Beta and Canary) and Firefox (Developer Edition and Nightly). I also strongly recommend having automated tests running against those versions, as we do for adapter.js. You probably also want some tests covering basic networking scenarios like “UDP is blocked”. Also, always make a point of reading the release notes, which are posted to the discuss-webrtc mailing list.

And yet, you will run into issues when you use WebRTC on a day-to-day basis, and so will your users. I have seen it happen too often. In January 2015, a Chrome upgrade broke the DTLS handshake with Bouncy Castle. In March 2015, a Firefox upgrade broke interoperability with the native Android and iOS apps of several vendors. In December 2015, we saw an increase in ICE failures. January 2016 brought video desync. More recently, Firefox 47 broke TURN/TCP on Linux. Updates are not always bad, though: Chrome 52 improved the jitter buffer considerably.

The key thing to understand is the browser release cycle. To make things more fun, the Chrome and Firefox release cycles differ significantly. Let's look at Firefox first, with data I gathered from appear.in over the last few weeks:

Firefox 48 slowly rolled out during August 2016. This was great news because it added ICE restarts (well, mostly…), and we saw that behaviour roll out in production. As the illustration shows, the rollout takes more than a month, and 20% of users are still on Firefox 47.

Chrome is very different in this respect. I happened to start gathering full production stats just as Chrome 52 started rolling out at the end of June:

During the first phase, the new version is rolled out to roughly 10% of calls for a week. The next week, that rises to about 15%. During this phase, there is still a chance to fix critical issues. This is followed by an increase to 30% for a short while, after which usage quickly jumps to 75% and then 90%. If you only notice your bug at this point… good luck!
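One way to make use of that first low-percentage week is to compare the failure rate of sessions on the new version against those on the current stable version. The sketch below does this with a standard pooled two-proportion z-test. It is an illustration under assumptions, not how appear.in monitors its calls: what counts as a "failed" session (an ICE failure, no media flowing, and so on), the function names, and the 3-sigma threshold are all hypothetical.

```python
import math


def two_proportion_z(fail_new: int, total_new: int,
                     fail_old: int, total_old: int) -> float:
    """Pooled two-proportion z-score; positive means the new version fails more often."""
    p_new = fail_new / total_new
    p_old = fail_old / total_old
    pooled = (fail_new + fail_old) / (total_new + total_old)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_new + 1 / total_old))
    if se == 0:
        # No failures at all (or only failures): nothing to compare.
        return 0.0
    return (p_new - p_old) / se


def looks_like_regression(fail_new: int, total_new: int,
                          fail_old: int, total_old: int,
                          z_threshold: float = 3.0) -> bool:
    # Hypothetical threshold: 3 sigma keeps false alarms rare when the
    # check runs daily for every browser version.
    return two_proportion_z(fail_new, total_new, fail_old, total_old) > z_threshold


# Hypothetical numbers for the ~10% phase: 100k sessions on the new version
# with 600 failed calls versus 900k sessions on the old one with 3,600.
print(looks_like_regression(600, 100_000, 3_600, 900_000))  # True
```

Running the check during the 10% and 15% phases, rather than once the new version reaches 75%, leaves time to report the bug while it can still be fixed.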

With Chrome 53, we are now starting to see similar behaviour. And so far, nothing has broken!

What don't we see here? The Beta and Canary versions, where it is still possible to fix minor issues. On September 1st, when Chrome 53 started rolling out, the distribution looked as follows:

percentage | browsermajorversion
-----------+--------------------
4.0%       | 51
92.0%      | 52
4.3%       | 53
0.083%     | 54
0.056%     | 55
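The query behind the percentages above isn't shown. As a rough sketch of the idea, per-session User-Agent strings can be bucketed by Chrome major version. The input format and function name are hypothetical, and real User-Agent parsing has more edge cases than this (Edge and Opera, for example, also include a `Chrome/` token, so this sketch skips them).

```python
import re
from collections import Counter

# Hypothetical input format: one User-Agent string per session.
CHROME_RE = re.compile(r"Chrome/(\d+)\.")


def chrome_version_shares(user_agents):
    """Map Chrome major version -> percentage of Chrome sessions."""
    counts = Counter()
    for ua in user_agents:
        # Edge and Opera also carry a "Chrome/" token; skip them here.
        if "Edge/" in ua or "OPR/" in ua:
            continue
        match = CHROME_RE.search(ua)
        if match:
            counts[int(match.group(1))] += 1
    total = sum(counts.values())
    return {version: 100.0 * n / total for version, n in sorted(counts.items())}


sessions = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36",
] * 3 + [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/53.0.2785.89 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:48.0) Gecko/20100101 Firefox/48.0",
]
print(chrome_version_shares(sessions))  # {52: 75.0, 53: 25.0}
```

The Firefox session is ignored, so the shares are relative to Chrome sessions only.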

Let's say you have a bug that affects one in 250 sessions. If you look at daily numbers, you need a significant number of occurrences to be able to spot the bug, say 50. And one in a thousand calls happens on Canary. So you need a call volume of 50 * 250 * 1000, or 12.5 million daily sessions, to spot 50 instances of the bug in Canary. Aggregating over a longer period, such as a week, increases your chance of spotting trends but lowers your reaction time. And keep in mind that your service might have a different usage pattern during the week than at the weekend.
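The arithmetic above generalises to a small helper. This is a sketch with a hypothetical function name. It computes the *expected* number of hits, so real daily counts will fluctuate around it. It also ignores the weekday/weekend differences just mentioned.

```python
def required_daily_sessions(min_occurrences: int, bug_one_in: int,
                            channel_one_in: int, window_days: int = 1) -> int:
    """Daily sessions needed to *expect* `min_occurrences` buggy sessions
    on one release channel within `window_days` days.

    Expected hits = sessions * (1 / channel_one_in) * (1 / bug_one_in),
    so sessions = min_occurrences * bug_one_in * channel_one_in.
    """
    total = min_occurrences * bug_one_in * channel_one_in
    # Spread over the aggregation window, rounding up to whole sessions.
    return -(-total // window_days)


print(required_daily_sessions(50, 250, 1000))                 # 12500000
print(required_daily_sessions(50, 250, 1000, window_days=7))  # 1785715
```

Aggregating over a week cuts the required daily volume by a factor of seven, but you only get a signal once the whole week is in.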

It is quite possible to catch issues in these upcoming browser versions. This issue is a great example of a bug caught by an engineering manager dogfooding her product!