How to build a scalable and flexible VoIP architecture

Paolo Rovelli · Making Tuenti · Dec 22, 2017

At Calling, our job is enriching customers’ communications by providing innovative call services through the cloud. And we do it end to end.

We spend a lot of our time every quarter trying to improve the overall call experience. We want our customers to have a fluid and engaging experience when they make calls, whether it’s pure VoIP (app2app) or bridging from VoIP to GSM (VozDigital).

However, from time to time, we shift our focus slightly by taking a look at the state of the art of our VoIP infrastructure or seeing whether we can create a new killer feature on top of it. And we managed to do just that when we developed Proofs of Concept (PoCs) of features like call saving and call filters, which went on to become full-fledged products released into the wild. And we did it again last quarter, when we decided it was time to work on a PoC for video calls in Tuenti.

As usual, it started with some market research that showed us that a considerable percentage of all calls over IP (using either mobile data or WiFi) were actually video and, what’s more, that our customers really wanted video calls, badly.

Afterwards, with that data in hand, we started talking about architecture. We should also say we took this opportunity to look into what we wanted our VoIP architecture to look like in the next few years.

Thus, we ended up deciding to keep the video call PoC completely decoupled from the VoIP calls. This means that, for the time being, VoIP and video calls have different and completely separate architectures.

With the new architecture, we decided at the outset to keep all the logic, as well as the current call state, server-side. The client-side logic only handles what’s actually local to the device, such as permissions management, verifying the minimum required network quality and monitoring ongoing GSM calls. This way, the logic common to our Android, iOS and Web clients lives in a single place, and we don’t have to wait for client release cycles to update it. The result is a “dumb” client that just sends actions to and receives state changes from our back-end.
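To illustrate what “dumb” means here, the following sketch reduces the client to exactly two responsibilities: sending actions upstream and rendering whatever state the back-end pushes down. All the names and types are illustrative, not our actual API.

```typescript
// Hypothetical sketch of a "dumb" client: it only sends actions to the
// back-end and renders whatever call state the back-end pushes down.

type CallAction = "start" | "answer" | "hangup" | "toggle-audio" | "toggle-video";

interface CallState {
  callId: string;
  status: "ringing" | "in-progress" | "ended";
  peers: string[];
}

class DumbCallClient {
  constructor(
    private sendToServer: (action: CallAction, callId: string) => void,
    private render: (state: CallState) => void,
  ) {}

  // User intent goes straight to the back-end; no local state transitions.
  perform(action: CallAction, callId: string): void {
    this.sendToServer(action, callId);
  }

  // The server is the single source of truth for call state.
  onStateUpdate(state: CallState): void {
    this.render(state);
  }
}
```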

On the client-side, we split the core of our video call implementation into two main components, one called “VideoPhone” and the other called “VideoClient”.

The VideoPhone can be seen as the physical phone device you might have on your desk. All the logic and high-level steps needed to make and receive calls happen there, but it knows nothing about what’s actually going on underneath. The VideoClient, on the other hand, can be seen as the cables that run from the phone device to the wall. And that’s where all the technical details and low-level steps of signalling and the media transport implementation live.

This architecture gives us a lot of flexibility, because the business logic does not depend on the actual signalling and media transport implementation we’re using under the hood. In other words, we’ve made it possible to change the cables without touching the phone device.
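A minimal sketch of that split could look like the following, where the VideoPhone drives the high-level call flow against a narrow VideoClient interface. The interface methods are assumptions chosen for illustration, not our real component contracts.

```typescript
// Illustrative sketch of the VideoPhone / VideoClient split.
// The VideoClient interface hides signalling and media transport details.

interface VideoClient {
  connect(callId: string): Promise<void>;
  publishLocalMedia(audio: boolean, video: boolean): Promise<void>;
  subscribeToPeer(peerId: string): Promise<void>;
  disconnect(): Promise<void>;
}

// The VideoPhone knows the high-level steps of a call, nothing more.
class VideoPhone {
  constructor(private client: VideoClient) {}

  async makeCall(callId: string): Promise<void> {
    await this.client.connect(callId);
    await this.client.publishLocalMedia(true, true);
  }

  async answerCall(callId: string, callerId: string): Promise<void> {
    await this.client.connect(callId);
    await this.client.publishLocalMedia(true, true);
    await this.client.subscribeToPeer(callerId);
  }

  async hangUp(): Promise<void> {
    await this.client.disconnect();
  }
}
```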

With that in mind, and since we wanted a working PoC up and running with a minimum of time and resources, we decided to initially opt for the OpenTok video platform from TokBox as our media transport implementation.

As we said, thanks to our architecture it’s easy to change the signalling and media stack if we need to. For example, if tomorrow we decide to replace OpenTok with our own VozDigital WebRTC-based stack, we could do it without having to change any of the business or presentation logic above it.
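In practice, “changing the cables” just means providing a different implementation of the VideoClient interface from the previous sketch. The adapters below are hypothetical and deliberately elide the vendor-specific calls.

```typescript
// Hypothetical adapters: each one implements the same VideoClient interface
// (from the earlier sketch), so the VideoPhone never knows which transport
// is underneath.

class OpenTokVideoClient implements VideoClient {
  async connect(callId: string) { /* OpenTok session setup would go here */ }
  async publishLocalMedia(audio: boolean, video: boolean) { /* publish local streams */ }
  async subscribeToPeer(peerId: string) { /* subscribe to the peer's stream */ }
  async disconnect() { /* tear down the session */ }
}

class WebRtcVideoClient implements VideoClient {
  async connect(callId: string) { /* custom signalling + RTCPeerConnection */ }
  async publishLocalMedia(audio: boolean, video: boolean) { /* getUserMedia + addTrack */ }
  async subscribeToPeer(peerId: string) { /* handle remote tracks */ }
  async disconnect() { /* close the peer connection */ }
}

// Swapping transports is a one-line change for the business logic:
const phone = new VideoPhone(new OpenTokVideoClient());
// const phone = new VideoPhone(new WebRtcVideoClient());
```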

On the server side, we created a micro-service that handles the call state transitions. When a client sends an action, the micro-service retrieves the current state of the call and performs the corresponding use case, which depends on both the action sent and the current call state. If the desired action can be performed, the micro-service updates the call state and broadcasts the new state to all the peers of the video call; otherwise, an alert is raised.
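A condensed sketch of that loop, reusing the CallAction and CallState types from the client sketch and with assumed helper names, might look like this: load the state, apply the use case for the action, then either store and broadcast the new state or raise an alert.

```typescript
// Hypothetical sketch of the call-state micro-service's main loop.

interface CallStateStore {
  get(callId: string): Promise<CallState | null>;
  set(callId: string, state: CallState): Promise<void>;
}

// Returns the next state, or null if the transition is not allowed.
function applyUseCase(current: CallState | null, action: CallAction): CallState | null {
  if (current === null) return null;
  switch (action) {
    case "answer":
      return current.status === "ringing" ? { ...current, status: "in-progress" } : null;
    case "hangup":
      return { ...current, status: "ended" };
    default:
      return current;
  }
}

function raiseAlert(message: string): void {
  console.error(message);
}

async function handleAction(
  store: CallStateStore,
  broadcast: (peers: string[], state: CallState) => Promise<void>,
  callId: string,
  action: CallAction,
): Promise<void> {
  const current = await store.get(callId);
  const next = applyUseCase(current, action);

  if (next === null) {
    raiseAlert(`Invalid action "${action}" for call ${callId}`);
    return;
  }

  await store.set(callId, next);
  await broadcast(next.peers, next); // push the new state to every peer
}
```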

For now, we’ve chosen to store the current state of the calls in Memcached. This way it can be shared among several instances of the same micro-service, which gives us horizontal scalability.
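As a sketch, the store from the previous snippet could be backed by Memcached along these lines. The client interface, key format and TTL below are assumptions standing in for whatever memcached library and conventions are actually used.

```typescript
// Hypothetical memcached-backed implementation of the CallStateStore above.
// "MemcachedClient" stands in for the actual client library.

interface MemcachedClient {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

class MemcachedCallStateStore implements CallStateStore {
  constructor(private cache: MemcachedClient) {}

  async get(callId: string): Promise<CallState | null> {
    const raw = await this.cache.get(`call:${callId}`);
    return raw === null ? null : (JSON.parse(raw) as CallState);
  }

  async set(callId: string, state: CallState): Promise<void> {
    // Call state is transient, so it can expire on its own after a while.
    await this.cache.set(`call:${callId}`, JSON.stringify(state), 3600);
  }
}
```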

Whenever a call state change happens, like a peer answering a call or enabling/disabling audio or video, the server notifies all the peers by sending a state-update signal to the clients.
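The state-update signal itself can stay small; something along these lines, where every field name is an assumption made for the sake of the example:

```typescript
// Hypothetical shape of the state-update signal pushed to every peer.
interface StateUpdateSignal {
  callId: string;
  status: "ringing" | "in-progress" | "ended";
  peers: Array<{
    userId: string;
    audioEnabled: boolean;
    videoEnabled: boolean;
  }>;
}
```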

Clients only receive state-update signals. They don’t emit them, ever. However, they do store in a cache the last call state received, so that it can be replayed to the presentation layer at any time.
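A sketch of that replay cache, reusing the StateUpdateSignal shape from above, could be as simple as this:

```typescript
// Hypothetical client-side cache: keep the last state received and replay
// it to the presentation layer on demand (e.g. when a screen is re-created).
class CallStateCache {
  private lastState: StateUpdateSignal | null = null;

  onStateUpdate(state: StateUpdateSignal): void {
    this.lastState = state;
  }

  replayTo(render: (state: StateUpdateSignal) => void): void {
    if (this.lastState !== null) {
      render(this.lastState);
    }
  }
}
```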

Here follows a rough, high-level example where Alice makes a video call to Bob, who answers it.
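Sketched as the signals exchanged, the flow looks roughly like this. The action and signal names are illustrative, not the exact protocol.

```typescript
// Rough, illustrative sequence for Alice calling Bob. Each step is either an
// action sent to the back-end or a state-update pushed back down.
//
// 1. Alice  -> server : action "start"            (call Bob)
// 2. server -> Alice  : state-update "ringing"    (outgoing, waiting for Bob)
// 3. server -> Bob    : state-update "ringing"    (incoming call from Alice)
// 4. Bob    -> server : action "answer"
// 5. server -> Alice  : state-update "in-progress"
// 6. server -> Bob    : state-update "in-progress"
// 7. Media flows between Alice and Bob through the VideoClient transport.
// 8. Either peer -> server : action "hangup"
// 9. server -> both  : state-update "ended"       (with the terminate reason)
```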

Finally, when the server sends the last state-update, the End call event, it also includes details like the terminate reason (e.g. Success, Busy, Cancel, Decline, …) and the corresponding actions available to the user (e.g. retry the call, offer a GSM fallback, offer a chat/SMS message or offer to rate the call).
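That final signal might carry something like the sketch below. The reason values follow the examples just listed; everything else is an assumption for illustration.

```typescript
// Hypothetical shape of the final "End call" state-update.
type TerminateReason = "Success" | "Busy" | "Cancel" | "Decline";

type EndCallAction = "retry-call" | "gsm-fallback" | "send-message" | "rate-call";

interface EndCallUpdate {
  callId: string;
  status: "ended";
  reason: TerminateReason;
  // Actions the client should offer; the client itself decides nothing.
  availableActions: EndCallAction[];
}
```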

Here again, clients just present what the server tells them.

In conclusion, we’re still experimenting and testing the video call PoC. We’re working on making sure we made the right architecture choices and validating the integration of video calls with the rest of the app’s features. But, so far, the results are looking pretty good… ;)
