So I read that 20% of WebRTC calls fail

No way. I have talked about the topic of dealing with failing WebRTC connections in the past. It is a sufficiently serious problem that you need to deal with it. But 20% (which was later corrected to 12%) was enough for me to make fun of the WebRTC project co-founder Serge Lachapelle. If these numbers were true, his team would have done a bad job. Well… they did not. Let me show you.

Data nerding with WebRTC

I learned a thing from Badri Rajasekar: be data-driven. Also, he dislikes Chrome updates breaking stuff again and again and again as much as I do, and he allowed me to build the tools to ‘fight back’. For this, we are instrumenting data gathered from the RTCPeerConnection API and the getStats API.

First, you need a sufficiently large number of calls to look at. I used 100,000 calls as the sample size, only including calls that had reached an ICE connection state of ‘connected’, ‘completed’ or ‘failed’.

create or replace view dataset as select * from features
where iceconnectedorcompleted = 't' or icefailure = 't'
order by date desc limit 100000;
select icefailure, count(*) from dataset
where iceconnectedorcompleted = 't' or icefailure = 't'
group by icefailure
order by count desc;
 icefailure | count
------------+-------
 f          | 93303
 t          |  6697

Tooling is everything. Being able to analyze your platform performance with a bunch of SQL queries is really helpful. It also allows data scientists to ask very interesting questions.

So 6.7% of calls failed. Even though that is lower than the 12% given in the article, it is far higher than what I expected.

Our definition of ‘icefailure’ is ‘does the ICE connection state ever change to failed’. Now… welcome to the land of WebRTC bugs. We did see a big increase in ICE failures last December when Chrome 47 rolled out. The issue has still not been resolved. Hence we need to distinguish between calls that never worked and calls where the ICE connection state goes to failed after being connected or completed.
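To make that distinction concrete, here is a minimal sketch in Python. It assumes each call is available as an ordered list of ICE connection state strings; the function name and the input shape are hypothetical, and only the three feature names are taken from the queries in this post.

```python
from typing import Dict, Iterable


def classify_ice_states(states: Iterable[str]) -> Dict[str, bool]:
    """Derive the three boolean features from an ordered list of ICE states.

    Hypothetical helper: the real feature extraction is not shown in this
    post, so this only mirrors the definitions given in the text.
    """
    connected = False           # ever reached 'connected' or 'completed'
    failure = False             # ever reached 'failed'
    failure_subsequent = False  # reached 'failed' after having been connected
    for state in states:
        if state in ("connected", "completed"):
            connected = True
        elif state == "failed":
            failure = True
            if connected:
                failure_subsequent = True
    return {
        "iceconnectedorcompleted": connected,
        "icefailure": failure,
        "icefailuresubsequent": failure_subsequent,
    }


# A call that never connected: an initial failure.
print(classify_ice_states(["new", "checking", "failed"]))
# A call that worked and failed later: a subsequent failure.
print(classify_ice_states(["checking", "connected", "disconnected", "failed"]))
```

Under these assumptions the first call counts as icefailure only, while the second one sets both icefailure and icefailuresubsequent.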

select icefailure, icefailuresubsequent, count(*)
from dataset
where iceconnectedorcompleted = 't' or icefailure = 't'
group by icefailure, icefailuresubsequent
order by count desc;
 icefailure | icefailuresubsequent | count
------------+----------------------+-------
 f          | f                    | 93303
 t          | t                    |  4797
 t          | f                    |  1900

So the number of subsequent failures is pretty large. There are a number of possible explanations for those. One is that the other client went away silently. And things might have started working again after doing an ICE restart. Mixing this up with initial failures skews the picture. We will ignore that data for the time being.

Subtracting those, we get 1.9% of calls that fail, about a sixth of the number quoted in the article. But still… too many, even if we consider that our data set contains both caller and callee, so users probably see about half of that.
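The arithmetic behind those percentages is easy to check. The sketch below uses only the counts from the query output above; the variable names are made up for illustration.

```python
# Counts taken from the query results in this post.
TOTAL_CALLS = 100_000
ALL_FAILURES = 6_697         # icefailure = 't'
SUBSEQUENT_FAILURES = 4_797  # icefailure = 't' and icefailuresubsequent = 't'

# Failures where the call never reached 'connected' or 'completed'.
initial_failures = ALL_FAILURES - SUBSEQUENT_FAILURES

overall_rate = 100.0 * ALL_FAILURES / TOTAL_CALLS
initial_rate = 100.0 * initial_failures / TOTAL_CALLS

# Compare against the 12% quoted in the article.
ratio_to_article = initial_rate / 12.0

print(f"all ICE failures:     {overall_rate:.1f}%")      # 6.7%
print(f"initial failures:     {initial_rate:.1f}%")      # 1.9%
print(f"fraction of the 12%:  {ratio_to_article:.2f}")   # 0.16
```

A ratio of about 0.16 is where "a sixth" comes from.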

How many of those ICE failures can be explained, e.g. by determining that a client was on a network which blocked access to the TURN servers? I’ve talked about this in the techtoks talk mentioned earlier. At the API level (where we gather data) this can be determined by looking at whether relay candidates were gathered by either side. We can even (for the local side) determine whether we gathered a relay candidate via TURN/UDP, TURN/TCP or TURN/TLS. For the remote side, determining the type of the candidate is not that easy (not impossible but…) so we are just interested in whether there was a remote relay candidate:

select gatheredTURNUDP, gatheredTURNTCP, gatheredTURNTLS,
hadRemoteturnCandidate, count(ICEFailure)
from dataset
where ICEFailure = 't' and ICEFailureSubsequent = 'f'
group by gatheredTURNUDP, gatheredTURNTCP,
gatheredTURNTLS, hadRemoteturnCandidate
order by count desc, hadRemoteturnCandidate;
 gatheredturnudp | gatheredturntcp | gatheredturntls | hadremoteturncandidate | count
-----------------+-----------------+-----------------+------------------------+-------
 f               | f               | f               | t                      |   544
 t               | t               | f               | t                      |   318
 t               | t               | t               | t                      |   316
 t               | t               | t               | f                      |   299
 t               | t               | f               | f                      |   170
 f               | t               | t               | f                      |    70
 t               | f               | f               | t                      |    52
 f               | f               | f               | f                      |    42
 f               | t               | t               | t                      |    27
 f               | t               | f               | t                      |    15
 t               | f               | f               | f                      |    14
 f               | t               | f               | f                      |    12
 t               | f               | t               | t                      |     9
 f               | f               | t               | t                      |     6
 t               | f               | t               | f                      |     5
 f               | f               | t               | f                      |     1

Let’s look at the big numbers here. If the local or remote side did not gather TURN candidates, one (or both) browsers are in an environment where WebRTC cannot work. That is true for all cases where hadRemoteturnCandidate is false as well as the cases where all of gatheredTURNUDP, gatheredTURNTCP and gatheredTURNTLS are false.

select hadremoteturncandidate, 
(gatheredturnudp or gatheredturntcp or gatheredturntls)
as hadlocalturncandidate,
count(*)
from dataset
where ICEFailure = 't' and ICEFailureSubsequent = 'f'
group by hadremoteturncandidate, hadlocalturncandidate
order by count desc;
 hadremoteturncandidate | hadlocalturncandidate | count
------------------------+-----------------------+-------
 t                      | t                     |   743
 f                      | t                     |   571
 t                      | f                     |   544
 f                      | f                     |    42

Together, the three rows where at least one side gathered no TURN candidates amount to 1157 calls, or roughly 60% of the failures. There is not much we can do about this: the environment is blocking the browser from doing WebRTC. One common scenario is a proxy which requires authentication, which is a pretty big issue in enterprise environments. Chrome does not implement this and we suspect that fixing issue 439560 would help in some of those cases. How many… we don’t know. We would like to be able to give feedback with concrete numbers (because doing that helps a lot in any discussion) but there is no way we can measure this.
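As a sanity check, the sixteen-row breakdown can be collapsed into the four-row one without touching the database. This is a small Python sketch that hard-codes the counts from the first TURN query; it is not part of the actual tooling, and the variable names are illustrative.

```python
from collections import Counter

# (gatheredTURNUDP, gatheredTURNTCP, gatheredTURNTLS, hadRemoteturnCandidate, count)
# copied from the query output in this post.
ROWS = [
    (False, False, False, True, 544),
    (True, True, False, True, 318),
    (True, True, True, True, 316),
    (True, True, True, False, 299),
    (True, True, False, False, 170),
    (False, True, True, False, 70),
    (True, False, False, True, 52),
    (False, False, False, False, 42),
    (False, True, True, True, 27),
    (False, True, False, True, 15),
    (True, False, False, False, 14),
    (False, True, False, False, 12),
    (True, False, True, True, 9),
    (False, False, True, True, 6),
    (True, False, True, False, 5),
    (False, False, True, False, 1),
]

# Group by (hadremoteturncandidate, hadlocalturncandidate), where the local
# flag is true if any of the three TURN transports produced a candidate.
grouped = Counter()
for udp, tcp, tls, remote, count in ROWS:
    had_local = udp or tcp or tls
    grouped[(remote, had_local)] += count

total = sum(grouped.values())
# Failures where at least one side had no relay candidate at all.
missing_relay = total - grouped[(True, True)]
share = 100.0 * missing_relay / total

print(total)             # 1900
print(missing_relay)     # 1157
print(f"{share:.1f}%")   # 60.9%
```

The grouped counts match the second query (743, 571, 544 and 42), and the share of failures with a missing relay candidate comes out just under 61%.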

So there are still 40% of failures that we haven’t explained yet. In particular, two rows stand out:

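 gatheredturnudp | gatheredturntcp | gatheredturntls | hadremoteturncandidate | count
-----------------+-----------------+-----------------+------------------------+-------
 t               | t               | f               | t                      |   318
 t               | t               | t               | t                      |   316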

In both rows, TURN/UDP and TURN/TCP worked locally and the remote side had a relay candidate, yet the connection still failed. That should not happen…

Another advantage of our data collection is that we automatically get a full dump that is equivalent to the data that you get from chrome://webrtc-internals. So I went ahead and looked at some of the data. With mixed results. At least some of these sessions were labelled wrongly because the initial connection attempt did not have relay candidates and those only appeared after an ICE restart. Other logs turned out to be quite worrisome, with issues like a fifteen second delay between the setLocalDescription call and the first ICE candidate.

This isn’t unexpected: all the things that are easy to diagnose have already been fixed over the last few years. What remains is a bunch of problems that happen only very rarely and are hard to reproduce. Even when they happen, how should fixing them be prioritized?

tl;dr

So what initially looked like 6.7% of failures boiled down to much less. And is far from the 12% claimed. Talking about such numbers is very difficult. Without a lot of context, they don’t make sense and are easy to misinterpret. As such, please don’t quote me.

Also I bet our numbers differ from those of other services because we use WebRTC in a certain way. We use full mesh and peer-to-peer connections. How do those numbers change if you have all sessions relayed via an SFU? Let Gustavo from Tokbox tell us…

Do you want to work on improving those numbers? Join me at appear.in!