Illustration created by Yufei McLaughlin

The future of meetings is here (part 1): how Machine Learning is applied in RTC optimization and computer vision

Yufei M
ringcentral-ux
5 min read · Feb 24, 2020


It’s been almost 6 months since I started working on an RTC (Real Time Communications) product. In that time, I was fortunate enough to attend a few conferences to observe what’s happening in the industry and what RTC’s future looks like. I realized that the future of RTC is actually happening right now. With the incorporation of A.I. — once considered the next-generation technology — RTC hardware and software are revolutionizing enterprise team communication and collaboration. In this article and the next, I will share some findings on how this is happening.

In the past few years, RTC vendors have been leveraging new Machine Learning (ML) techniques, such as Deep Learning and Neural Networks, to improve productivity in video conferencing. These algorithms learn complex models from rich data on their own instead of relying on linear rules and well-structured inputs and outputs. They produce more accurate results in many RTC areas, such as RTC optimization, computer vision, speech analytics, and voice bots. These improvements, in turn, boost enterprise productivity and improve team collaboration, saving time and reducing the effort it takes to get things done. We can see these benefits before, during, and after meetings.

RTC optimization

Illustration created by Yufei McLaughlin

The first and most direct area where Machine Learning is making an impact is in building models that optimize media quality (both voice and image). Improving the quality of voice and images sent over a choppy network is a critical differentiator for RTC vendors. With ML, an RTC product can now estimate how much bandwidth is available based on the packet loss reported on the network, and choose which filter to apply to the microphone audio to remove background noise. On the image side, the new models not only compress, encode, and decode, but also use machine learning to decide what to do next or what to replace a degraded image with.
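To make the bandwidth-estimation idea concrete, here is a minimal sketch of a loss-based estimator in the spirit of what WebRTC's congestion control does: probe upward when reported packet loss is low, hold steady in a middle band, and back off multiplicatively when loss is high. The specific thresholds and multipliers below are illustrative constants, not any vendor's actual tuning.

```python
class LossBasedEstimator:
    """Toy loss-based bandwidth estimator.

    Loosely modeled on the heuristic used by WebRTC-style congestion
    control: raise the estimate under low loss, hold it in a middle
    band, and back off in proportion to loss when loss is high."""

    def __init__(self, initial_bps=1_000_000):
        self.estimate_bps = initial_bps

    def update(self, loss_fraction):
        # loss_fraction: packets lost / packets sent in the last report
        if loss_fraction < 0.02:
            # low loss: probe for more bandwidth
            self.estimate_bps *= 1.05
        elif loss_fraction > 0.10:
            # high loss: back off in proportion to the loss rate
            self.estimate_bps *= (1 - 0.5 * loss_fraction)
        # between 2% and 10%: hold the current estimate
        return self.estimate_bps
```

In a real product this loss-based signal is only one input; delay gradients and receiver-side feedback are combined with it before the encoder's target bitrate is adjusted.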

Quality improvements such as noise suppression, packet loss concealment, bandwidth estimation, and real-time image correction are not just more checkmarks on a long feature list; they are the competitive baseline. Delivering twice the image resolution on the same devices, or an intelligible conversation over a bad network, is what pulls a vendor ahead of its competitors.

Computer vision

Illustration created by Yufei McLaughlin

In the past decade, video has come to make up the majority of traffic on the internet, and visual consumption has transformed how the internet is used.

According to the Cisco VNI report, consumer Internet video traffic will be 82% of consumer Internet traffic by 2022, up from 73% in 2017 (globally), while business Internet video traffic will be 73% of business Internet traffic by 2022, up from 56% in 2017 (globally). On top of this, live Internet video traffic will increase 15-fold globally between 2017 and 2022 (72.7% CAGR).

For the end-user, messaging applications such as Instagram, Facebook Messenger, and Snapchat are leading examples of computer vision technology. ML-based filters allow users to put objects such as bunny ears or mustaches on the video once the person’s face is located, adding a new emotional dimension to communications.

Computer vision also plays an important role in RTC, and with the incorporation of ML, a diverse range of computer vision technologies has emerged. ML can now process, analyze, and understand what appears in images and videos, from low-level image and video processing to high-level 3D understanding. Microsoft Skype and Teams use this technology for background blurring: instead of adding virtual objects to the image, Microsoft subtracts content, then filters and blurs the background.
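The "subtract, then blur" idea can be sketched in a few lines. In a real product the person/background mask comes from a segmentation network; in this toy version the mask is passed in as input so the sketch stays self-contained, and the image is a plain 2D list of grayscale values rather than a real video frame.

```python
def blur_background(image, fg_mask):
    """Blur only the background of a grayscale image (2D list of ints).

    fg_mask[y][x] is True for foreground (person) pixels, which stay
    sharp; every other pixel gets a simple 3x3 box blur. Real products
    infer fg_mask with a segmentation model; here it is given."""
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if fg_mask[y][x]:
                continue  # leave the person untouched
            # average the 3x3 neighbourhood, clamped at the frame edges
            total, count = 0, 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        total += image[ny][nx]
                        count += 1
            out[y][x] = total // count
    return out
```

Production blurs are of course stronger (large-kernel or depth-aware), run on the GPU, and feather the mask edge so the person does not get a hard halo.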

In business use cases, computer vision is being applied to more and more RTC subjects. Examples include:

(1) Image manipulation in real-time

In addition to Facebook adding objects to images and Microsoft blurring the background, we see Zoom replacing the background altogether with a new photo or video, and Cisco identifying people and overlaying their name tags on the video.

(2) Object detection

Image classification with a machine learning model can identify what is in an image, but not where it is. Achieving that requires a different architecture with more advanced recognition capability. Real-time object detection, which locates multiple objects within live RTC video streams, has taken RTC to a new level. For example, the font size on a browser screen can be adjusted based on how far the user's face is from the monitor.
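The font-size example only needs the detector's bounding box: with a pinhole-camera model, a wider face box means a closer face. The sketch below assumes the detector has already run; the focal length and real face width are illustrative calibration constants, not values from any real device.

```python
def estimate_distance_cm(face_width_px, focal_length_px=600,
                         real_face_width_cm=15):
    """Pinhole-camera distance estimate from a detected face box.

    distance = focal_length * real_width / apparent_width.
    focal_length_px and real_face_width_cm would be calibrated
    per camera in a real application."""
    return focal_length_px * real_face_width_cm / face_width_px

def font_size_for_distance(distance_cm, base_px=16, base_distance_cm=50):
    """Scale the font linearly with viewing distance so text stays
    the same apparent size: twice as far away, twice as large."""
    return round(base_px * distance_cm / base_distance_cm)
```

Chaining the two gives the behavior described above: as the user leans back, the face box shrinks, the estimated distance grows, and the page font scales up.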

(3) Facial recognition

With real-time object detection in place, RTC vendors started using face detection for identification by inserting AI inference into a live video stream. In classroom use cases, facial identification makes it possible to judge whether students are engaged; in the meeting room, it can count in-person attendees to help understand room utilization. With these new ML capabilities, when a user is speaking in the room, the system can even auto-zoom to highlight the face of the active speaker.
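Once faces are detected, the auto-zoom itself is simple geometry: expand the active speaker's face box by some padding and clamp it to the frame. Deciding *which* face is active (from microphone-array direction of arrival or lip-motion cues) is the hard ML part and is assumed as input in this sketch; the padding factor is arbitrary.

```python
def speaker_crop(face_box, frame_w, frame_h, pad=0.5):
    """Compute a zoom crop around the active speaker's face box.

    face_box is (x, y, w, h) from a face detector. The crop expands
    the box by `pad` of its size on each side and is clamped to the
    frame bounds. Returns (x, y, w, h) of the crop region."""
    x, y, w, h = face_box
    px, py = int(w * pad), int(h * pad)
    x0 = max(0, x - px)
    y0 = max(0, y - py)
    x1 = min(frame_w, x + w + px)
    y1 = min(frame_h, y + h + py)
    return (x0, y0, x1 - x0, y1 - y0)
```

A real room system would also smooth the crop over time so the virtual camera pans gently instead of jumping between speakers.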

(4) Scene understanding

Object detection not only offers the ability to count objects in a scene, but also to track motion and locate an object's position. By observing people's movements, like sitting, standing, and walking around, RTC vendors can now understand users' movement and environment and optimize the video accordingly.
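As a toy illustration of how much can be read off plain bounding boxes, the sketch below guesses posture from a person box's aspect ratio (standing people produce tall boxes, seated ones squarer boxes) and flags motion from centroid shift between frames. The thresholds are made up for illustration; real scene understanding uses pose-estimation models, not box shapes.

```python
def classify_posture(box):
    """Crude posture heuristic from a person bounding box (x, y, w, h).

    Tall, narrow boxes suggest standing; squarer boxes suggest
    sitting. The 2.0 ratio threshold is illustrative only."""
    x, y, w, h = box
    return "standing" if h / w > 2.0 else "sitting"

def is_moving(prev_box, cur_box, min_shift_px=10):
    """Flag motion when the box centroid shifts more than min_shift_px
    (Manhattan distance) between consecutive frames."""
    def centroid(b):
        x, y, w, h = b
        return (x + w / 2, y + h / 2)
    (px, py), (cx, cy) = centroid(prev_box), centroid(cur_box)
    return abs(cx - px) + abs(cy - py) > min_shift_px
```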

(5) Others

The RTC data channel has opened up many new and unique use cases, such as Animoji, mixed/augmented reality, gaming, and tele-surgery, by enabling synchronization between real-time data flows and video and audio streams.

In Apple's case, a user can have an Animoji emulate their facial expressions and send it in place of a real video stream, so the other user sees the Animoji instead of the camera feed. In tele-operation, the synchronization of real-time data flows and video streams allows a remote surgeon to conduct surgery over video with a robot. In online gaming, RTC data channel transport enables peer-to-peer multiplayer games, with data passed back and forth in real time between players and servers. Furthermore, with the integration of virtual reality and RTC, a developer can create fully immersive experiences for conference rooms or classrooms. Instead of using a mouse to attend meetings on a 2-D monitor, users can wear a VR headset and see everything presented and tracked via head movement, including other people in the room, screen sharing, and other materials shared within the space.
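The synchronization all of these use cases depend on can be sketched as a small alignment step: each data-channel event (a face-expression update, a game input, a robot command) is paired with the nearest media frame by timestamp. This is a toy version of the idea, assuming both streams carry comparable millisecond timestamps; real systems use RTP timestamps and jitter buffers.

```python
import bisect

def align_events_to_frames(frame_ts, events):
    """Pair each data-channel event with the nearest video frame.

    frame_ts: sorted list of frame timestamps (ms).
    events:   list of (timestamp_ms, payload) tuples.
    Returns a list of (frame_timestamp, payload) pairs."""
    out = []
    for ts, payload in events:
        i = bisect.bisect_left(frame_ts, ts)
        if i == 0:
            best = frame_ts[0]
        elif i == len(frame_ts):
            best = frame_ts[-1]
        else:
            # pick whichever neighbouring frame is closer in time
            before, after = frame_ts[i - 1], frame_ts[i]
            best = before if ts - before <= after - ts else after
        out.append((best, payload))
    return out
```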

In the next article, I will summarize how ML is applied to speech analytics and voice bots. Stay tuned!

(Please click here to read “The future of meetings is here (part 2): how Machine Learning is applied in speech analytics and voice bots”)
