Five challenges for high availability of real time media in the cloud
We all know that enterprise services and applications are moving to the cloud. At this stage it is not a matter of if, but when. However, before real time communications, including voice, video and contact centre applications, can move completely to the cloud there are some challenges that need to be addressed, in particular when we are trying to deliver real time two-way voice and video using the Real-time Transport Protocol (RTP).

Traditionally RTP failover works by sharing a virtual IP address across multiple servers. Typically a pair of servers or appliances work in Active-Standby mode, sharing state information and using VRRP to manage the virtual IP address. When the active node fails the standby takes over the IP address and continues to route all the media flows. When we try to move this model into the cloud we run into a number of challenges.
- Security Controls: The Virtual Router Redundancy Protocol (VRRP) tends to be blocked by security controls. Cloud providers, even for private clouds, don’t like letting multiple NICs share the same MAC address or letting a NIC change its MAC address while active.
- Physical Separation: To ensure high availability we must ensure that each instance runs on a separate physical server, so we need the cloud provider to support anti-affinity settings for servers. The problem here is that it is not uncommon to suffer an outage that impacts multiple servers within a single cloud provider data centre, such as the recent Amazon Web Services UPS fault in Sydney.
- Logical Separation: One way to overcome the above challenge is to use separate Zones from the same cloud provider. This reduces the risk of faults related to a single data centre, like the one above, but it doesn’t remove the risk of a fault in the cloud control layer causing outages, such as the 18 minute disconnect of all Google Compute Engine instances in April this year.
- Control Layer Separation: The only way to avoid the above is to spread the server instances across multiple cloud providers, ensuring control layer separation. However, this presents a significant interconnect problem. In fact all the anti-affinity measures suffer from this challenge: how do you quickly and reliably share a layer 2 network and state information across the nodes?
- Synchronisation Connections: The synchronisation of state between the high availability nodes typically requires low latency links with minimal jitter. This is challenging enough to achieve when you have direct control of the network routing elements, but as you move into shared infrastructure in the cloud the HA systems will need to be re-engineered to tolerate much higher latency and jitter.
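
To make the last point a little more concrete, here is a minimal Python sketch of the failure detector a standby node might run. It is not taken from any particular product: the `HeartbeatMonitor` class, the `false_failovers` helper and all of the timing figures are hypothetical, chosen only to illustrate how the jitter on the synchronisation link feeds into the failover decision.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Standby-side failure detector for an Active-Standby pair (hypothetical)."""
    interval_ms: float        # how often the active node sends a heartbeat
    jitter_budget_ms: float   # worst-case delay variation we agree to tolerate
    missed_allowed: int = 3   # heartbeat intervals of silence before failover

    @property
    def timeout_ms(self) -> float:
        # Silence longer than this is treated as a failure of the active node.
        return self.missed_allowed * self.interval_ms + self.jitter_budget_ms

    def should_fail_over(self, last_heartbeat_ms: float, now_ms: float) -> bool:
        return (now_ms - last_heartbeat_ms) > self.timeout_ms


def false_failovers(monitor: HeartbeatMonitor, arrival_times_ms: list) -> int:
    """Count gaps between consecutive heartbeats that would trigger a takeover."""
    return sum(
        1
        for prev, cur in zip(arrival_times_ms, arrival_times_ms[1:])
        if monitor.should_fail_over(prev, cur)
    )


# Illustrative numbers only: a tight budget suited to a dedicated link,
# and a much looser one for shared cloud infrastructure.
dedicated = HeartbeatMonitor(interval_ms=100, jitter_budget_ms=10)
shared = HeartbeatMonitor(interval_ms=100, jitter_budget_ms=400)

# Heartbeat arrival times (ms) with one 445 ms stall on a congested path.
# The active node never actually failed.
arrivals = [0, 100, 205, 650, 700, 800]

print(dedicated.timeout_ms, false_failovers(dedicated, arrivals))
print(shared.timeout_ms, false_failovers(shared, arrivals))
```

With these made-up figures the tightly tuned monitor takes over during the stall even though the active node is healthy, while the looser one rides it out. The cost of the looser setting is that a genuine failure goes undetected for longer, which means a longer gap in the media flows before the standby starts routing them.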
These are just some of the challenges faced when delivering real time communications in the cloud. I’ll investigate more in coming posts.
Views expressed here are my own and do not necessarily reflect the views of my employer Oracle.