An online class being conducted during lockdown (Image by Greg Baker / AFP via Getty Images)

COVID-19 Influencing a New Era of AI-based Video Conferencing

Faiyaj Bin Amin
Published in IEEE SB KUET
6 min read · Jan 12, 2021


The pandemic has compelled organizations to close their offices, forced schools to shut down, and pushed our real-life meetings and gatherings into virtual events. It has changed the way we work: everything has gone from on-site to online, and our homes have become the new office and the new classroom. Hundreds of millions of people now hold conversations online every day through video conferences. But can anyone rest assured that a video call will not be interrupted by poor bandwidth?

Globally, more than 1.2 billion children are out of the classroom, and 88% of organizations have required or encouraged their employees to work from home, according to Gartner, the world's leading research and advisory company. A recent study of more than 525 organizations by Nemertes found that 72% of the total workforce is now home-based, compared with just 34% before the pandemic. In these unprecedented times, video conferencing has emerged as essential technology, facilitating face-to-face interaction from anywhere, on any device, at any time, without direct contact between people. For work, school, doctor visits, and virtual events, video conferencing has become the go-to way to connect with remote employees, students, and patients.

The unprecedented surge in online meetings has brought its own limitations and given rise to new problems. Disruption during live video conferences has become the number one obstacle to e-learning and working from home. Slow internet and weak connectivity hamper the effectiveness of online meetings for users on both ends, and the meeting experience has not been smooth for most people. Reduced video quality or resolution, lag during screen sharing and PowerPoint presentations, loss of sound quality, crackling and muffled audio, and participants being auto-disconnected are among the problems that appear as soon as the bandwidth available to current conferencing platforms like Zoom, Microsoft Teams, and Google Meet narrows.

Today, video conferencing is the number one source of traffic on the internet. On the Microsoft Teams platform alone, a record-breaking 2.7 billion minutes of meetings were held in a single day. Yet the effectiveness and quality of virtual meetings have been falling behind. Fortunately, a solution may not be far off. A team of researchers from Nvidia has developed a new technology for video conferencing that uses advanced Artificial Intelligence (AI) to cut video bandwidth usage to as little as one-tenth of what the H.264 codec requires. The technology, called 'Maxine', significantly improves video conferencing quality while reducing cost. It uses GANs, a deep learning technique, for video resolution improvement and compression, background noise cancellation, real-time translation, transcription, facial alignment, and more, all while consuming minimal bandwidth.

But what is Deep Learning and what are GANs?

Deep Learning is a sub-field of Artificial Intelligence that uses hierarchical Artificial Neural Networks (ANNs): algorithms inspired by the biological neural networks of the human brain, built so that computers can identify patterns and make decisions in a more human-like manner. The models are trained on enormous amounts of data through network architectures that stack many layers on top of one another; hence the name 'Deep Learning'. Using deep learning, computers mimic the way humans think and learn: they repeatedly perform a task on images, sound, or text and tweak the model a little each time to improve the outcome. Deep learning systems can also learn without human supervision, from diverse, unlabeled, and unstructured data. None of this was possible in the past; today, the roughly 2.5 quintillion bytes of data we generate online daily, combined with readily available cloud computing power, have made deep learning feasible. Deep learning powers the image processing needed for video compression during online meetings.
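
To make the idea of stacked layers concrete, here is a minimal sketch, assuming PyTorch as the framework, of a tiny network trained for one step on random stand-in data. Every size and value in it is an arbitrary illustrative choice, not anything taken from Nvidia's systems.

```python
# A minimal deep learning sketch: a small multi-layer network trained for
# one step on random placeholder data. All sizes are illustrative choices.
import torch
import torch.nn as nn

# Hierarchical "deep" model: each Linear + ReLU pair is one layer;
# stacking several of them is what makes the network "deep".
model = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),            # 10 output classes, chosen arbitrarily
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Random stand-ins for the images, sound, or text features mentioned above.
x = torch.randn(32, 64)            # a batch of 32 examples, 64 features each
y = torch.randint(0, 10, (32,))    # stand-in labels

# One training step: predict, measure the error, and nudge the weights a
# little -- the "tweak it a little each time" loop described in the text.
logits = model(x)
loss = loss_fn(logits, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"training loss after one step: {loss.item():.3f}")
```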

AI Video Compression in real time (Image by Nvidia)

Dr. Ian Goodfellow and his colleagues at the University of Montreal introduced Generative Adversarial Networks (GANs) in a 2014 paper. GANs are a deep-learning architecture in which two neural networks compete against each other to generate new data that can pass as real. For example, GANs trained on videos can generate new videos that look authentic. The architecture consists of two deep networks: the Generator model and the Discriminator model. The generator takes random input and learns the patterns of the training data in order to produce new examples as output. The discriminator then decides whether a given sample is real or fake (generated), treating the decision as a binary classification problem; its verdict, 0 or 1, is passed back to the generator as feedback. The generator adjusts itself in whatever direction maximizes the probability of fooling the discriminator, so its output becomes nearly indistinguishable from the original samples. The two models keep training together in this zero-sum game until the discriminator is fooled most of the time, which means the generator is producing convincing examples. GANs are used to generate text, images, and videos, but their potential for optimizing video calls goes far beyond that.

Figure 1: Flowchart of GAN operation (drawn by the author)
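
The loop in Figure 1 can be written down quite compactly. The sketch below, again assuming PyTorch, is an illustrative toy GAN rather than Nvidia's code: a generator learns to produce vectors resembling a stand-in "real" distribution while the discriminator performs the binary real-vs-fake classification described above.

```python
# A toy GAN training loop (illustrative only): the discriminator's 0/1
# verdicts are the feedback signal the generator trains against.
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 32  # arbitrary sizes for the sketch

# Generator: maps random noise to a fake sample.
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
# Discriminator: outputs the probability (0..1) that its input is real.
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()
real_label = torch.ones(8, 1)   # target meaning "this is real data"
fake_label = torch.zeros(8, 1)  # target meaning "this is generated data"

for step in range(200):
    # Stand-in "real" distribution the generator must learn to imitate.
    real = torch.randn(8, data_dim) + 3.0
    fake = G(torch.randn(8, latent_dim))

    # Discriminator step: binary classification of real vs. generated.
    d_loss = bce(D(real), real_label) + bce(D(fake.detach()), fake_label)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator step: push D's verdict on fakes toward "real", i.e. fool it.
    g_loss = bce(D(fake), real_label)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(f"final losses  D: {d_loss.item():.3f}  G: {g_loss.item():.3f}")
```

In practice the two losses must be balanced carefully so neither network overpowers the other; the sketch omits that tuning.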

Ming-Yu Liu and Arun Mallya, AI researchers at Nvidia, collaborated with Ting-Chun Wang to use a neural network in place of video codec software to compress and decompress video for transmission over the internet. As a result, video calls become possible with one-tenth of the network bandwidth that traditional platforms need. In this process, the sender first transmits a reference video frame as the initial image. A neural network then analyzes the image to extract and encode key facial points, including the user's eyes, nose, and mouth. Instead of an entire screen of pixels, only the locations of these key facial points are transmitted, cutting bandwidth consumption to a minimum; this is far more efficient than compressing pixels, and much less data crosses the network. A GAN on the receiver's end projects the key facial points onto the initial reference frame to reconstruct each subsequent image. With computer vision techniques, the person's head can be located from many angles and aligned automatically, making the call feel more like a face-to-face conversation. Moreover, the well-trained neural networks, together with specialized algorithms, upscale the video by generating filler pixels for better resolution. The deep learning models are deployed in the cloud to make the process accessible to end users, and the technology has been trained for thousands of hours on Nvidia systems.

Processing of key-points from sender to receiver (Image by Nvidia)
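
To see why sending keypoints instead of pixels saves so much bandwidth, consider the back-of-the-envelope sketch below. The frame size, keypoint count, and both helper functions are hypothetical placeholders: the actual Maxine encoder and receiver-side generator are not public, so the real networks are reduced here to stubs.

```python
# Illustrative accounting of keypoint-based video compression (not Nvidia's
# actual Maxine code): one full reference frame is sent once, then each
# subsequent frame is reduced to a handful of facial keypoint coordinates.
import numpy as np

FRAME_SHAPE = (720, 1280, 3)   # a 720p RGB frame (hypothetical choice)
NUM_KEYPOINTS = 68             # e.g., a standard facial-landmark count

def extract_keypoints(frame: np.ndarray) -> np.ndarray:
    """Stub for the sender-side encoder network: returns (x, y) keypoints."""
    rng = np.random.default_rng(0)
    return rng.uniform(0, 1, size=(NUM_KEYPOINTS, 2)).astype(np.float32)

def reconstruct(reference: np.ndarray, keypoints: np.ndarray) -> np.ndarray:
    """Stub for the receiver-side generator (GAN) that would warp the
    reference frame to match the new keypoint positions."""
    return reference  # a real system would synthesize a new frame here

# --- Sender side ---
reference_frame = np.zeros(FRAME_SHAPE, dtype=np.uint8)  # sent once, in full
per_frame_payload = extract_keypoints(reference_frame)   # sent every frame

# --- Receiver side ---
rebuilt = reconstruct(reference_frame, per_frame_payload)

full_frame_bytes = reference_frame.nbytes    # ~2.8 MB of raw pixels
keypoint_bytes = per_frame_payload.nbytes    # ~0.5 KB of coordinates
print(f"raw frame: {full_frame_bytes:,} bytes")
print(f"keypoints: {keypoint_bytes:,} bytes")
print(f"reduction: ~{full_frame_bytes // keypoint_bytes:,}x per frame")
```

Even in this crude accounting, a frame's worth of keypoints is thousands of times smaller than the raw pixels, which is where the headroom for the tenfold savings over H.264 comes from once real-world encoding overheads are included.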

Today's companies are going through a transformative era, and live video compression technology is having its renaissance. Avaya was the first company to adopt Nvidia's Maxine software development kit, and users of Avaya's video conferencing software are already benefitting from the new technology: background noise removal, virtual green-screen backgrounds, and live transcription. All major cloud service providers have also started offering the technology. Companies like Netflix are likewise striving to lead in AI-based video compression; their artificial intelligence technique compresses and tweaks each scene individually without affecting quality, making streaming possible even on slow internet. The applications stretch beyond video conferencing into areas such as the film industry, where deep learning is used to remaster old movies to higher resolution and colorize them.

Over the last decade, progress in Artificial Intelligence, and in the GPU hardware it depends on, has made this advance in compression technology a reality. The day is not far off when video conferencing will no longer be dreaded and meetings will proceed flawlessly. With further development, deep learning promises to cut bandwidth consumption many times over in the near future. While much of this technology is still at an early stage of development, engineers can be expected to explore neural-network video compression widely, and researchers and software developers will be able to further improve the performance and functionality of video conferencing software through in-depth study of the technique. Before long, online meetings should be uninterrupted and reliable.
