Where to deploy TURN (or other relay) servers

I read an interesting article on NoJitter which described the way Cisco is deploying Spark. By rolling out media servers in data centers close to the users, Cisco wants to ensure a sub-150 millisecond delay for anyone participating.

Following up on Dag-Inge’s blog post, which looks at connection setup time depending on TURN servers, I wondered if we can measure something in the same spirit that tells us whether we are deploying TURN servers (and in the future SFUs) in the right locations or if we need to optimize our deployment. After all, as I wrote a while back, this kind of operations infrastructure is one of the most important aspects of running a WebRTC service.

Our current deployment strategy at appear.in is load-based. In every AWS data center available to us, we deploy one or more TURN servers, look at the load and fire up another instance if the individual instances get too much traffic. We then use Route 53 to route clients to the nearest TURN server by letting them resolve ‘turn.appear.in’.

But how fast and effective is that?

For peer-to-peer connections the time it takes to gather a relay candidate from a TURN server is a pretty good indicator of the round trip time between the client and that TURN server. This is easy to measure by looking at the delay between the RTCPeerConnection.setLocalDescription call and the first icecandidate event where the candidate type is ‘relay’.
Roughly like this:

var start; // timestamp taken just before setLocalDescription
var pc = new RTCPeerConnection({iceServers: [/* … */]});
pc.onicecandidate = function(event) {
  // Only relay candidates tell us about the TURN server.
  if (event.candidate &&
      event.candidate.candidate.indexOf('typ relay') !== -1) {
    console.log('took', Date.now() - start);
  }
};
pc.createOffer({offerToReceiveAudio: 1}).then(function(offer) {
  start = Date.now();
  return pc.setLocalDescription(offer);
});

Unlike call setup time, this measurement isn’t affected by clients calling across regions. What happens under the hood is that (unless you are using some tricks like FaceTime) the browser asks the TURN server to create an allocation, gets back a 401 (Unauthorized) error and asks again with credentials. So this takes two round trips (unless there is packet loss). The relay address in that candidate also gives us an indication of which relay server was used. We did not gather that before but it was easy to add.
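To make the last point concrete, here is a minimal sketch of pulling the relay address out of a candidate line. The helper name parseCandidate and the example candidate are hypothetical and not taken from our code; the field order follows the standard ICE candidate attribute syntax, and the addresses are documentation-only ranges.

```javascript
// Hypothetical helper (not from the original post): extract the type and
// address from an ICE candidate line.
function parseCandidate(line) {
  // Strip an optional "a=" prefix and the "candidate:" tag, then split on whitespace.
  var parts = line.replace(/^a=/, '').replace(/^candidate:/, '').trim().split(/\s+/);
  // Expected order: foundation component transport priority address port "typ" type ...
  if (parts.length < 8 || parts[6] !== 'typ') {
    return null;
  }
  return {
    protocol: parts[2].toLowerCase(),
    address: parts[4],
    port: parseInt(parts[5], 10),
    type: parts[7]
  };
}

// Made-up example candidate using documentation-only IP addresses.
var example = 'candidate:1 1 udp 41885439 192.0.2.10 50000 typ relay ' +
    'raddr 198.51.100.7 rport 40000';
var parsed = parseCandidate(example);
if (parsed && parsed.type === 'relay') {
  console.log('relay address', parsed.address);
}
```

For a relay candidate the address field is the one allocated on the TURN server, so it can be mapped to a server location with a lookup table.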

Since the data again contained quite a lot of outliers we’re cutting things off at 5000ms:

SELECT MIN(medianvalue) AS median, datacenter
FROM (
  SELECT relayaddress, datacenter,
    MEDIAN(gatheringtimeturnudp) OVER (PARTITION BY datacenter)
      AS medianvalue
  FROM dataset LEFT JOIN turnlocations USING (relayaddress)
  WHERE gatheringtimeturnudp < 5000) t1
GROUP BY datacenter ORDER BY median;

 median | datacenter
--------+------------
  198.0 | Oregon
  212.5 | San Jose
  257.0 | Dublin
  268.0 | Ashburn
  276.0 | Tokyo
  280.0 | Sydney
  302.0 | Sao Paulo
  313.0 | Frankfurt
  332.0 | Seoul
  406.0 | Singapore
  442.0 | Mumbai

Intuitively we knew that running TURN servers close to users is a good thing, and these results look pretty good. Remember that we are measuring two round trips, so we can divide by two to get a (very) rough estimate for end-to-end latency.
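As a quick illustration of that division, the sketch below halves a few of the medians from the table above. The variable names are made up for this example. The result should be read as an upper bound on a single round trip rather than a precise value, since the gathering time also includes whatever the browser does before the first request goes out (DNS resolution, for instance) and the processing time on the server.

```javascript
// Medians copied from the table above (in ms, covering two round trips).
var medians = {Oregon: 198.0, Dublin: 257.0, Mumbai: 442.0};

// Halve each value to approximate a single client-to-TURN round trip.
var roughRtt = {};
Object.keys(medians).forEach(function(dc) {
  roughRtt[dc] = medians[dc] / 2;
});
console.log(roughRtt);
```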

Now we can measure to a degree where intuition does not cut it anymore. We can use these metrics to measure whether deploying another TURN server in a particular region improves latency or whether we should run servers in another datacenter. With the highest medians in the table, the Singapore and Mumbai datacenters are two examples of this.
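One way such a comparison could look is sketched below: compute the median gathering time per datacenter for the period before and after a change, using the same 5000ms cutoff as the query above. The function names, the sample format and the numbers are all made up for illustration; they are not measurements from appear.in.

```javascript
// Median of a numeric array; returns null for an empty array.
function median(values) {
  if (values.length === 0) return null;
  var sorted = values.slice().sort(function(a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Group samples by datacenter and take the median gathering time,
// dropping anything at or above the cutoff as an outlier.
function medianByDatacenter(samples, cutoffMs) {
  var groups = {};
  samples.forEach(function(s) {
    if (s.gatheringTime < cutoffMs) {
      (groups[s.datacenter] = groups[s.datacenter] || []).push(s.gatheringTime);
    }
  });
  var result = {};
  Object.keys(groups).forEach(function(dc) {
    result[dc] = median(groups[dc]);
  });
  return result;
}

// Made-up sample data for illustration only.
var before = [
  {datacenter: 'Singapore', gatheringTime: 400},
  {datacenter: 'Singapore', gatheringTime: 420},
  {datacenter: 'Singapore', gatheringTime: 9000} // outlier, dropped by the cutoff
];
var after = [
  {datacenter: 'Singapore', gatheringTime: 300},
  {datacenter: 'Singapore', gatheringTime: 340},
  {datacenter: 'Singapore', gatheringTime: 320}
];
var change = medianByDatacenter(after, 5000).Singapore -
    medianByDatacenter(before, 5000).Singapore;
console.log('median change in ms:', change);
```

A negative change means the median gathering time went down. With real data the two periods would need enough samples for the difference to be meaningful.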

One of the odder results was that gathering candidates takes roughly 20ms longer in Firefox than in Chrome. Pending further investigation, it looks like there is an opportunity for Firefox to get faster :-)

Hear more about this at ClueCon in Chicago next week!