Scaling OpenVidu

Rodrigo Botti
Published in Nexa Digital
Jun 1, 2020

Using AWS & OpenVidu Pro — Part 2

Cover photo: a camera in the foreground filming a woman in front of a bookshelf (by StockSnap)

Recap

Last time we gave a short introduction to OpenVidu, looked at how we deployed OpenVidu Pro to our AWS account, and covered its cluster architecture and its scalability features.
After that we figured out a way to add auto-scaling capabilities to its media nodes by using OpenVidu Pro's REST API and some features and services from AWS.

Introduction

This time we'll actually look at some of the code of the auto-scaling solution. I'll try to explain in depth some of the code bits and configuration values used.

The code used will span these technologies:

  • AWS CloudFormation — infrastructure-as-code tool for creating AWS resources
  • Node.js — the language/platform of our lambdas
  • AWS CLI — just a small bit for customizing the media node launch script
  • Serverless Framework — for deploying our lambdas and their event subscriptions

Upscaling

Upscaling high-level architecture diagram

From the last part we know that the upscaling solution consists of the following:

Customized Launch Script

To customize the launch script we're gonna need three things:

  • private IP of the OpenVidu Server EC2 instance
  • ssh access to the instance
  • the location of the launch script on the instance

Note: we need the private IP because we changed the instance's Security Group to allow ssh access only from within our VPC and VPN. OpenVidu does not do that by default when it creates its Security Groups through its CloudFormation stack, so if you have the instance's public IP and/or DNS you might still be able to ssh through those, but it is not advisable to leave ssh access open to the internet.

OpenVidu Server information from AWS EC2 Console

Now we can access the server:

ssh ubuntu@$PRIVATE_IP -i $PATH_TO_KEY

Once there we have to edit the script that the server uses to launch the media nodes, located at /opt/openvidu/cluster/aws/openvidu_launch_kms.sh.

vim /opt/openvidu/cluster/aws/openvidu_launch_kms.sh

Now we can enable detailed CloudWatch monitoring every time a media node is launched:

Customized media node launch script
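
I can't reproduce OpenVidu's full launch script here, but the customization boils down to one extra AWS CLI call. Below is a minimal sketch of that addition; the INSTANCE_ID variable name is an assumption, since the script's internals may differ between OpenVidu Pro versions:

```bash
# Sketch of the customization added to openvidu_launch_kms.sh.
# INSTANCE_ID is an assumption: the real script launches the media node
# instance and captures its ID before reaching this point.

# Enable detailed (1-minute granularity) CloudWatch monitoring on the new media node
aws ec2 monitor-instances --instance-ids "$INSTANCE_ID"
```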

Now every time a media node is launched — either by the OpenVidu inspector web UI or by the REST API — the media node instance will have detailed CloudWatch monitoring enabled.

Media Node from AWS EC2 panel — detailed monitoring enabled

CloudWatch Alarm

Now we can create the infrastructure elements of our alarm system:

  • CloudWatch Alarm: monitoring average CPU usage
  • SNS Topic: alarm event destination
  • Permission: gives our topic permission to trigger the upscaling lambda when receiving the alarm event

We will be using a CloudFormation template to define our resources.

Note: it is not in scope of this story to explain how to organize and deploy CloudFormation stacks. At Nexa Digital we generally use nested stacks and this is the case for the telemedicine project — the alarm resources are defined in a nested stack template of the telemedicine project.

CloudFormation template creating the alarm's infrastructure
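
The actual nested-stack template is private, but a minimal sketch of the three resources could look like the snippet below. Resource names, the UpscaleLambdaArn parameter and the way the CPU metric is scoped to the media nodes are assumptions; the threshold and evaluation periods match the values discussed next:

```yaml
# Sketch only. How you scope CPUUtilization to the media nodes (per-instance
# dimensions, metric math over tagged instances, etc.) depends on your setup.
Parameters:
  UpscaleLambdaArn:
    Type: String # ARN of the upscaling lambda (assumed to be passed in)

Resources:
  UpscaleTopic:
    Type: AWS::SNS::Topic # destination of the alarm event

  UpscaleLambdaPermission:
    Type: AWS::Lambda::Permission # lets the topic invoke the upscaling lambda
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !Ref UpscaleLambdaArn
      Principal: sns.amazonaws.com
      SourceArn: !Ref UpscaleTopic

  MediaNodeCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: Average media node CPU above 50% for 3 consecutive minutes
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Statistic: Average
      Period: 60            # one-minute collections (requires detailed monitoring)
      EvaluationPeriods: 3  # three consecutive breaches before alarming
      Threshold: 50
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref UpscaleTopic
```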

The alarm is triggered when the average CPU usage stays above 50% for three consecutive one-minute collection periods.
The right alarm configuration will depend on many factors specific to your business, your expected load and your media node instance configuration.
In our case:

  • we do not expect bursts/spikes of load. Thankfully, given the nature of the business, the number of sessions should increase and decrease smoothly, or stay the same, during office hours, so we can afford three minutes of telemetry collection before scaling the cluster
  • the media node instances we use are c5.2xlarge. In our tests, each should be able to withstand a maximum of roughly 100 concurrent sessions, and since we're not expecting a burst of connections, scaling out at 50 sessions per node (about 50% CPU usage) leaves more than enough spare capacity while the load grows

CloudWatch Alarm from AWS CloudWatch console

Lambda

Now we can code our lambda responsible for triggering the upscale using OpenVidu Pro's REST API.

Note: the following code is written in JavaScript and relies heavily on a functional programming library called Ramda; hopefully it won't look too foreign to you, and I will try to comment as much as possible. One other thing: the complete code is private intellectual property, which is why I'm only showing incomplete code snippets and not uploading the full source to a public GitHub project.

Let's recap our upscaling logic:

code snippets of the lambda responsible for launching new media nodes
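
The complete source is private, but a plain Node.js sketch of the core idea follows. It skips Ramda and most error handling; the OPENVIDU_URL and OPENVIDU_SECRET environment variables, the node-fetch dependency and the /pro/media-nodes endpoint path are assumptions, so check the media node REST API reference of your OpenVidu Pro version:

```javascript
// Sketch of the upscaling lambda, not the real implementation.
const fetch = require('node-fetch') // assumed HTTP client dependency

const { OPENVIDU_URL, OPENVIDU_SECRET } = process.env

// OpenVidu authenticates REST calls with HTTP Basic auth (OPENVIDUAPP:<secret>)
const authHeader = () =>
  'Basic ' + Buffer.from(`OPENVIDUAPP:${OPENVIDU_SECRET}`).toString('base64')

// Asks OpenVidu Pro to launch one additional media node
const launchMediaNode = async () => {
  const response = await fetch(`${OPENVIDU_URL}/pro/media-nodes`, {
    method: 'POST',
    headers: { Authorization: authHeader(), 'Content-Type': 'application/json' }
  })
  if (!response.ok) {
    throw new Error(`Failed to launch media node: ${response.status}`)
  }
  return response.json()
}

// Entry point: invoked by the SNS topic the CloudWatch alarm publishes to
module.exports.handler = async (event) => {
  const alarm = JSON.parse(event.Records[0].Sns.Message)
  console.log(`Alarm ${alarm.AlarmName} changed to ${alarm.NewStateValue}`)
  const mediaNode = await launchMediaNode()
  console.log('Launched new media node', mediaNode)
}
```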

Now that we have the code we can use the serverless framework to create the lambda and the subscription to the alarm topic.

serverless.yml of the upscaling lambda
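
Not the real file, but a sketch of what that serverless.yml could look like; the service and handler names, runtime and environment variables are assumptions:

```yaml
service: openvidu-autoscaling

provider:
  name: aws
  runtime: nodejs12.x
  environment:
    OPENVIDU_URL: ${env:OPENVIDU_URL}
    OPENVIDU_SECRET: ${env:OPENVIDU_SECRET}

functions:
  upscale:
    handler: src/upscale.handler
    events:
      # subscribes the lambda to the alarm topic created by the CloudFormation stack
      - sns:
          arn: ${env:UPSCALE_TOPIC_ARN}
```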

Downscaling

Downscaling high-level architecture diagram

From the last part we know that the downscaling solution consists of the following:

  • Periodic CloudWatch Event triggered every day after the system's "office hours"
  • Lambda that downscales the cluster to a single media node (the one with the most sessions)
  • Media nodes are deleted using the when-no-sessions deletion strategy — "if there’s any OpenVidu session initialized inside this Media Node, then it will not be immediately deleted, but instead will be set to waiting-idle-to-terminate status. This status means […] no more sessions will be initialized inside of it. Whenever the last session of this Media Node is destroyed (no matter the reason), then it will be automatically deleted."

Fortunately we can define both in the same project thanks to the serverless framework, i.e. we don't have to define the CloudWatch Event separately in a CloudFormation template; we can declare it directly in serverless.yml.

We will be using a single source code repository to host both our upscaling and downscaling lambdas.

code snippets of the lambda responsible for removing media nodes
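
Again, the actual code is private; here is a plain Node.js sketch of the downscaling idea. The response shape (content, sessions, id fields), the environment variables and the deletion-strategy query parameter spelling are assumptions based on OpenVidu Pro's media node REST API, so double-check them against your version's documentation:

```javascript
// Sketch of the downscaling lambda, not the real implementation.
const fetch = require('node-fetch') // assumed HTTP client dependency

const { OPENVIDU_URL, OPENVIDU_SECRET } = process.env

const authHeader = () =>
  'Basic ' + Buffer.from(`OPENVIDUAPP:${OPENVIDU_SECRET}`).toString('base64')

// Lists the media nodes currently in the cluster
const listMediaNodes = async () => {
  const response = await fetch(`${OPENVIDU_URL}/pro/media-nodes`, {
    headers: { Authorization: authHeader() }
  })
  const body = await response.json()
  return body.content // assumed response shape: { content: [ ...mediaNodes ] }
}

// Removes a media node with the when-no-sessions strategy: nodes still hosting
// sessions only terminate after their last session is destroyed
const removeMediaNode = (id) =>
  fetch(`${OPENVIDU_URL}/pro/media-nodes/${id}?deletion-strategy=when-no-sessions`, {
    method: 'DELETE',
    headers: { Authorization: authHeader() }
  })

// Entry point: invoked by the scheduled CloudWatch Event after office hours
module.exports.handler = async () => {
  const nodes = await listMediaNodes()
  // keep the node with the most sessions, remove every other node
  const sorted = [...nodes].sort((a, b) => b.sessions.length - a.sessions.length)
  const toRemove = sorted.slice(1)
  await Promise.all(toRemove.map((node) => removeMediaNode(node.id)))
  console.log(`Kept 1 media node, scheduled ${toRemove.length} node(s) for removal`)
}
```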

Now that we have the code we can use the serverless framework to create the lambda, the periodic CloudWatch Event and the trigger to call the lambda.

serverless.yml updated to include the downscaling lambda
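
A sketch of the addition to serverless.yml; the handler path and the cron expression (here 22:00 UTC on weekdays, i.e. after office hours) are assumptions:

```yaml
functions:
  # ...upscale function from the previous snippet...
  downscale:
    handler: src/downscale.handler
    events:
      # periodic CloudWatch Event that fires once a day after office hours
      - schedule: cron(0 22 ? * MON-FRI *)
```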

Conclusion

We managed to create an auto-scaling solution for OpenVidu's media nodes by using AWS features and OpenVidu Pro's REST API.

If you are using OpenVidu in production, I hope this helps shed some light on how to achieve the elasticity your business needs. I've shown a very specific scenario where elasticity is applied, but hopefully the techniques used here can give you a starting point for creating your own auto-scaling solution.

Comments

Before using this solution in production I thought it was a good idea to run an end-to-end test of the system. The biggest challenge was triggering the alarm, given that the media node instances are quite powerful CPU-wise and simulating real video streaming load would require connecting multiple clients across multiple sessions. To get around that, here is what I did:

  • Lowered the alarm threshold from 50% to 20%: current production load on the single media node was about 5%, which meant I only had to simulate a spike of 15% CPU usage
  • Instead of creating a bunch of video call sessions, I connected to the media node over ssh and ran a script to consume a lot of CPU (see the sketch below)
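
For reference, the CPU-consuming script can be as simple as the sketch below; it is not the exact script I used, just an illustration of the idea:

```bash
# Start one busy loop per vCPU, keep the CPU pegged long enough for the alarm
# to evaluate its three one-minute periods, then clean up.
for _ in $(seq "$(nproc)"); do
  yes > /dev/null &
done
sleep 300
kill $(jobs -p)
```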

Once the alarm tripped, a new media node was launched and the average CPU usage reported by the alarm metric was halved, i.e. the test was a success.

Once again, thank you for reading!

I would also like to thank Adrian Shiokawa and Thaissa Candella for the support and for proofreading and commenting on the article before its publication.

If you haven't, please check Part 1 of this series where I explain the motivation and thought process for designing this solution.
