Not All CPaaS Are Created Equal

Alexey Aylarov
May 6, 2016

Cloud communication platforms began appearing a long time ago. What started with REST APIs for SMS/MMS soon moved into the voice and video calling space. WebRTC has proven to be a big accelerator of this evolution: the project has helped wrangle the complex issues related to media processing on the client side and has provided a free, open source technology (and a standard) that is still being worked on and improved every month.

Tech behemoths like Google, Cisco, and Microsoft have joined efforts to deliver a technology stack that developers like. We’ve been participating in the process as well, but when we started working on the VoxImplant platform in 2012, we decided that it was a good idea to be different from other players on the market. This meant being different not only in our pricing and marketing, but also in delivering technology that we believed would really make a difference for developers.

Call processing

If you’re familiar with platforms like Twilio or Plivo, you know that call processing looks rather similar on each of them. You either write specific XML (TwiML or Plivo XML) or write code that generates that XML. Then, when a call comes into the platform, either from PSTN or from SIP / SDK, their backend communicates with your backend to find out what to do with it. (And those of you familiar with VoIP still remember VoiceXML.)
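To make the XML round trip concrete, here is a minimal sketch of the kind of code a developer's web service might run when the platform asks what to do with a call. The `<Response>`, `<Say>`, and `<Dial>` elements follow TwiML; the helper names `escapeXml` and `buildForwardResponse`, and the phone number, are assumptions made up for illustration.

```javascript
// Escape characters that would otherwise break the XML document.
function escapeXml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: build the XML body a webhook would return
// to greet the caller and then forward the call to another number.
function buildForwardResponse(greeting, forwardTo) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<Response>",
    "  <Say>" + escapeXml(greeting) + "</Say>",
    "  <Dial>" + escapeXml(forwardTo) + "</Dial>",
    "</Response>",
  ].join("\n");
}

console.log(buildForwardResponse("Connecting you now", "+15550100"));
```

Every decision point in the call (greeting played, forwarding failed, caller pressed a key) typically means another HTTP request to the web service and another XML document like this one.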

This approach can work, but we’d argue it has a number of critical downsides. These include:

  • A lot of unnecessary interactions between the platform’s backend and the developer’s webservice
  • If something goes wrong, it’s a heck of a lot more difficult to process the call in any way (or end it gracefully)
  • It requires generating XML, which is ugly (of course, the vendors do offer a number of wrappers for all kinds of languages to help with that)
  • Debugging == checking request logs
  • There’s a considerable lack of flexibility for complex scenarios
  • It requires more resources on the web service level
  • You’re faced with a longer time-to-market

We decided to use the power of a cloud application engine to avoid these problems and to provide developers with the flexibility they deserve. Keeping in mind how popular JavaScript has become, it really was a no-brainer for us. We chose JavaScript as the language in which call control scenarios are written. And then we gave our engine a name: VoxEngine.

Most web developers are plenty familiar with JavaScript, and many of them are also familiar with Node.js. Our engine doesn’t work exactly like Node.js, though: when a call arrives at the platform, we need to start a session. In the context of this session, the developer decides what to do with the call (forward it somewhere, enable call recording, make an HTTP request to some external web service to get data, etc.).

A VoxEngine scenario looks like a standard JavaScript app. It has event handlers (events are fired asynchronously), and a full set of ECMAScript 5 functions is available, as well as specific VoxEngine classes and functions that allow developers to control calls and use different platform features like recording, conferencing, etc.

See the example below:
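As a rough illustration of the event-driven style described above, the sketch below forwards an incoming call to the dialed number and bridges the two legs once the outbound leg connects. It does not use the real VoxEngine API: `createMockEngine`, `makeCall`, `dialOut`, `bridge`, and the event names are all hypothetical stand-ins so that the example runs in plain Node.js.

```javascript
// Hypothetical call object: records actions in the engine log.
function makeCall(engine, name) {
  const handlers = {};
  return {
    name: name,
    on(event, fn) { handlers[event] = fn; },
    emit(event) { if (handlers[event]) handlers[event](); },
    answer() { engine.log.push("answer " + name); },
    hangup() { engine.log.push("hangup " + name); },
  };
}

// Hypothetical stand-in for the platform runtime (not the real VoxEngine).
function createMockEngine() {
  const handlers = {};
  const engine = {
    log: [],
    outbound: [],
    on(event, fn) { handlers[event] = fn; },
    dialOut(number) {
      engine.log.push("dial " + number);
      const call = makeCall(engine, "out:" + number);
      engine.outbound.push(call);
      return call;
    },
    bridge(a, b) { engine.log.push("bridge " + a.name + " <-> " + b.name); },
    // Simulates a call arriving at the platform, which starts the scenario.
    simulateIncoming(from, to) {
      const call = makeCall(engine, "in:" + from);
      if (handlers.incomingCall) handlers.incomingCall({ call: call, destination: to });
      return call;
    },
  };
  return engine;
}

// The scenario itself: plain event handlers, no XML, no webhook round trips.
function scenario(engine) {
  engine.on("incomingCall", function (e) {
    const outbound = engine.dialOut(e.destination);
    outbound.on("connected", function () {
      e.call.answer();
      engine.bridge(e.call, outbound);
    });
    // If the outbound leg fails, end the inbound leg gracefully.
    outbound.on("failed", function () { e.call.hangup(); });
    e.call.on("disconnected", function () { outbound.hangup(); });
  });
}

const engine = createMockEngine();
scenario(engine);
engine.simulateIncoming("+15550100", "+15550199");
engine.outbound[0].emit("connected");
console.log(engine.log.join("\n"));
```

The point of the sketch is the shape of the code: the failure branch lives next to the success branch in the same session, rather than in a separate request to a web service.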

This approach allows you to add business logic to your call control scenarios. That doesn’t mean you should move all of your logic there, though, since a VoxEngine session has its own limits on how much memory and other resources can be allocated for its processing. In many cases, you will find that development effort is significantly reduced and your time-to-market gets shorter.

Call processing happens on the media server in real time and is controlled from the app engine’s side. This let us implement a real-time debugger, similar to the JavaScript debugger you can use in any contemporary web browser.

VoxEngine Debugger

Independent control for each call leg

Within the context of a VoxEngine session, each call object represents a connection between the platform and some endpoint (PSTN, SIP, SDK), and each one can be controlled independently. You usually have one incoming call per session (unless the session was launched via the HTTP API or is a conference session), and you can have any number of outbound calls. The inbound and outbound audio streams of each call object can be controlled independently as well.
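One way to picture independent stream control is as a set of one-directional audio routes between call legs. The sketch below models a session where a supervisor listens to a customer and an agent without being heard by either. The `CallLeg` and `Session` classes and their methods are a hypothetical model written for this illustration, not the platform's own classes.

```javascript
// Hypothetical model of a call leg: tracks which legs receive its audio.
class CallLeg {
  constructor(name) {
    this.name = name;
    this.sendsTo = new Set(); // legs that currently hear this leg
  }
  sendMediaTo(other) { this.sendsTo.add(other); }
  stopMediaTo(other) { this.sendsTo.delete(other); }
}

// Hypothetical session: one inbound leg, any number of outbound legs.
class Session {
  constructor(inbound) {
    this.inbound = inbound;
    this.outbound = [];
  }
  dial(name) {
    const leg = new CallLeg(name);
    this.outbound.push(leg);
    return leg;
  }
  // Names of the legs that can currently hear the given leg.
  listenersOf(leg) {
    return Array.from(leg.sendsTo, function (l) { return l.name; });
  }
}

const session = new Session(new CallLeg("customer"));
const agent = session.dial("agent");
const supervisor = session.dial("supervisor");

// Two-way audio between customer and agent.
session.inbound.sendMediaTo(agent);
agent.sendMediaTo(session.inbound);

// The supervisor hears both sides, but sends audio to nobody.
session.inbound.sendMediaTo(supervisor);
agent.sendMediaTo(supervisor);

console.log(session.listenersOf(session.inbound));
console.log(session.listenersOf(supervisor));
```

Because each direction is a separate route, changing who hears whom is a matter of adding or removing a single route rather than rebuilding the whole call.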

VoxEngine Session Example

Why flexibility matters

We started receiving requests from developers who wanted to implement particularly interesting and unusual scenarios where the standard approaches offered by most CPaaS vendors wouldn’t work. VoxImplant’s flexibility and feature set have helped solve their problems. One example is chained audio conferences for walkie-talkie-like services.

While it might take a little upfront time to learn VoxImplant’s architecture and development patterns, developers are soon using the platform to implement very complex scenarios quickly. We believe that cloud platforms for 21st century developers must be both powerful and flexible, and we will keep adding functionality to push VoxImplant further in that direction.
