Sentiment analysis with Agora’s video call
Sentiment analysis is an integral part of understanding human behavior and reactions to conversations.
In simpler terms, it is about understanding how a person reacts to a conversation. Each individual can react to a conversation, textual or verbal, in a number of ways, based on their beliefs, mindset, and other individual preferences.
With a video call, understanding and analyzing an individual’s reaction becomes much simpler. Facial features and muscular reactions, such as smiling and frowning, make this possible.
Check out the live demo here.
Architecture breakdown
The sentiment analysis integration is a follow-up to the 1-to-1 Video call tutorial using Agora’s JavaScript SDK. It can be broken down into the following parts:
Canvas
Canvas is an HTML element that is used to draw graphics as shown below.
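A minimal, illustrative example (the element id and the rectangle are placeholders, not part of the project):

```html
<canvas id="demo-canvas" width="200" height="100"></canvas>
<script>
  // Draw a filled rectangle on the canvas.
  const ctx = document.getElementById('demo-canvas').getContext('2d');
  ctx.fillStyle = 'green';
  ctx.fillRect(10, 10, 150, 80);
</script>
```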
We’ll be making use of this canvas element to create a copy of the video call for further processing. The canvas allows us to break up the video into frames and send individual frames for processing to the server.
API Setup
The sentiment analysis algorithm is a deep learning model built with Keras on a TensorFlow backend. Its main purpose is to identify the face in a frame and then classify the emotional reaction it shows.
The server is programmed in Python and hosted on a cloud service. We’ll walk through each step of the integration.
Integration
So how does all this integrate with the previous demo project? Let’s have an in-depth look at the overall model.
Let’s get started
Canvas setup
The video stream has to be copied onto a canvas element so that we can take timely snapshots of the stream, send them to the server, and run them through the model.
We create a simple div with id="canvas-container" that will display the video streams rendered on the canvas.
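The markup is just an empty container:

```html
<div id="canvas-container"></div>
```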
Next, we need to draw the video stream onto a canvas element and add that canvas to the div id="canvas-container" created above. The script below does exactly that.
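A sketch of that script. The video selector is an assumption: Agora’s SDK plays the stream into a container you name, and the underlying `<video>` tag inside it can be queried; adjust the selector to match your own stream container.

```javascript
// Render the playing video onto a canvas inside #canvas-container,
// refreshing every animation frame so the copy stays in sync.
function startCanvasCopy(videoSelector, containerId, width, height) {
  const canvas = document.createElement('canvas');
  canvas.width = width;
  canvas.height = height;
  document.getElementById(containerId).appendChild(canvas);
  const ctx = canvas.getContext('2d');

  (function draw() {
    const video = document.querySelector(videoSelector);
    // readyState >= 2 means the video has data to draw.
    if (video && video.readyState >= 2) {
      ctx.drawImage(video, 0, 0, width, height);
    }
    requestAnimationFrame(draw);
  })();

  return canvas;
}

// In the browser, call this once the remote stream starts playing, e.g.:
if (typeof document !== 'undefined') {
  startCanvasCopy('#remote-stream video', 'canvas-container', 640, 480);
}
```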
API calls
The server connects to the front end with web sockets. The front end sends image data (timely snapshots) to the server, which runs them through the model and returns an annotated image with the face detection and emotion prediction.
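The client side of that loop can be sketched as below. The endpoint URL and the 500 ms snapshot interval are illustrative assumptions, not values from this project:

```javascript
// Periodically snapshot the canvas, send it over a WebSocket, and draw
// the annotated image the server sends back onto the same canvas.
function startSnapshotLoop(canvas, socketUrl, intervalMs) {
  const socket = new WebSocket(socketUrl);

  socket.onopen = () => {
    setInterval(() => {
      // A JPEG data URL keeps the payload smaller than PNG.
      socket.send(canvas.toDataURL('image/jpeg', 0.7));
    }, intervalMs);
  };

  socket.onmessage = (event) => {
    // The server replies with the annotated frame as an image.
    const img = new Image();
    img.onload = () => canvas.getContext('2d').drawImage(img, 0, 0);
    img.src = event.data;
  };

  return socket;
}
```

In the browser this would be wired up as, for example, `startSnapshotLoop(canvas, 'wss://your-server/ws', 500)`, where both arguments depend on your own deployment.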
How the algorithm works
The deep learning model accomplishes a couple of tasks:
1. Face detection
Face detection refers to identifying human faces in images. Using a range of facial features, such as the eyes, nose, cheekbones, ears, and lips, and their orientation, the algorithm is able to detect human faces in images.
2. Emotion classification
Once a face is detected in the image, the algorithm has to classify the type of emotion depicted by the individual. Humans have 43 facial muscles controlling reactions such as frowning, smiling, and laughing. The deep learning model is trained on a massive dataset of facial reactions, and it provides a probabilistic estimate of the emotion depicted by the individual.
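That probabilistic output is reduced to a single label by taking the most likely class. A minimal sketch; the label set below is the common FER-style set, an assumption rather than something taken from this project:

```python
import numpy as np

# Assumed label set -- the actual model may use different classes.
EMOTIONS = ["angry", "disgust", "fear", "happy", "sad", "surprise", "neutral"]

def top_emotion(probabilities):
    """Return the most probable emotion label and its probability."""
    probs = np.asarray(probabilities, dtype=float)
    idx = int(np.argmax(probs))
    return EMOTIONS[idx], float(probs[idx])
```

For example, `top_emotion([0.05, 0.0, 0.05, 0.7, 0.1, 0.05, 0.05])` returns `("happy", 0.7)`.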
Algorithm breakdown
The two parts of the deep learning model are combined in the function displayed below.
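The original function is not reproduced here; the sketch below shows the same control flow, with the detector and classifier passed in as callables standing in for the project’s OpenCV/Keras components (their names and signatures are assumptions):

```python
import numpy as np

def annotate_frame(gray_frame, detect_faces, classify_emotion, labels):
    """Detect faces, classify each one, and return bounding boxes with labels.

    detect_faces(gray)   -> iterable of (x, y, w, h) bounding boxes
    classify_emotion(im) -> per-class probabilities for one face crop
    """
    results = []
    for (x, y, w, h) in detect_faces(gray_frame):
        face = gray_frame[y:y + h, x:x + w]        # crop the detected face
        probs = np.asarray(classify_emotion(face))  # emotion probabilities
        results.append(((x, y, w, h), labels[int(np.argmax(probs))]))
    return results
```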
First, data is extracted from the input image; this includes converting the image to grayscale and RGB. The model then detects all the possible faces in the frame and lists the coordinates of their bounding boxes for further processing.
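The grayscale conversion is a standard luminance-weighted sum over the color channels; a sketch:

```python
import numpy as np

def rgb_to_grayscale(rgb):
    """Collapse an (H, W, 3) RGB image to (H, W) using ITU-R BT.601 weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])
```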
Each detected face is processed individually and run through the emotion detection model; the face is scaled up to optimize performance.
The emotion prediction model returns the classification probability of each emotion, and the class with the highest probability is added as an annotation to the face, along with a bounding box.
Once all the faces are processed individually, the final image, containing bounding boxes and annotations, is constructed and returned to the front end.