AutomEditor: Video blooper recognition and localization for automatic monologue video editing

Multimodal video action recognition and localization methods for spatio-temporal feature fusion using face, body, audio, and emotion features

Carlos Toxtli
HCI@WVU
15 min read · May 1, 2019

--

Abstract

Video blogs are increasingly popular thanks to online streaming platforms. Anyone can post content regardless of their video editing skills, but novice video bloggers have to acquire these skills to publish quality content. Video editing is usually a time-consuming task that discourages users from publishing content regularly. The most common format for individual video bloggers is the monologue. Monologues have fixed conditions, such as one person on screen at a time and a fixed camera position, which makes them a perfect setting for automatic video editing (AVE). In this article, we present AutomEditor, a system that automates monologue video editing. AutomEditor uses multimodal video action recognition techniques to detect video bloopers. It extracts body skeleton, face, emotion, and audio features from video clips. Our model implements early feature fusion over recurrent neural networks and multi-layer perceptrons. The model was trained and evaluated on BlooperDB, a manually collected and annotated dataset. Our model reached 100% accuracy on the validation set and 90% on the test set. We also propose a blooper localization algorithm for untrimmed videos, based on the frequency of predictions, and we implement a web interface to visualize the blooper fragments. AutomEditor was able to locate and visualize all the bloopers in the untrimmed test videos. We conclude by presenting implications for design.

Introduction

A blooper is a short clip from a film or video production containing a mistake made by a person on screen. Detecting bloopers is not a trivial task even with a sophisticated tool. Existing video editing systems, such as Adobe Premiere, are a great help for editing video, but the task is still tedious and time-consuming, requiring significant editing skills and an aesthetic sense. In this article, we present a system that automates monologue video editing. By automated video editing (AVE), we refer to a process that automatically selects suitable or desirable segments from an original video source to create an edited video segment. Watching a long unedited video generally requires a great deal of patience and time. An effective way to attract a viewer is to present a video that is as compact as possible, yet preserves the most critical features and has no evident errors.

The proper detection of actions in videos relies on long-term contextual information modeling and cross-modality analysis. Since gestures, emotions, and voice tone levels normally change gradually under the same context, analyzing the long-term dependency of these factors will stabilize the overall predictions. Meanwhile, humans perceive others’ mistakes by combining information across multiple modalities simultaneously. Combining different modalities will yield better recognition with more human-like computational models.

In developing our multimodal system, we have been inspired by many previous works, such as those combining visual and audio features, as well as speech content. Others have combined physiological signals for emotion recognition tasks. Methods for combining cues from each modality can be categorized into early or late fusion. In early fusion, features from different modalities are projected into the same joint feature space before being fed into the classifier. In late fusion, classifications are made on each modality and their decisions or predictions are later merged, e.g., by taking the mean or another linear combination. Some works have even implemented a hybrid fusion strategy to take advantage of both early and late fusion.

In this article, we investigated a number of feature extraction, classification, and fusion methods. Our final quadmodal method aggregates face, body, audio, and emotion features for single-shot, clip-level classification using early fusion. To verify the effectiveness of multimodal fusion, we compared it with ten unimodal methods. Our proposed multimodal approach outperformed the unimodal ones as well as the baseline methods, achieving a validation accuracy of 1.0 and a test accuracy of 0.9.

Video localization has been broadly studied for unimodal methods, mainly from image-only analysis. Video action localization with multimodal methods is challenging, especially when temporal and non-temporal features are mixed. For this reason, we propose a localization method based on the analysis of prediction sequences. The method was effective at localizing all the bloopers from the test set that we inserted into untrimmed videos.

We deployed AutomEditor as a web interface where users are able to submit their videos and visualize the fragments that contain bloopers. In this article, we describe the whole process of enabling a video action recognition and localization system, from the database creation to the deployment of a web application.

AutomEditor

AutomEditor is an end-to-end solution that automates monologue video editing. Figure 1 shows the six stages of the solution: 1) Blooper DB: a bloopers dataset; 2) Feature Extractor: extracts the features from the video clips; 3) Learner: trains and evaluates the model; 4) Predictor: retrieves a sequence of predictions; 5) Locator: localizes the blooper clips in untrimmed videos; 6) Server: exposes a web service and shows the results in a web interface.

Figure 1 AutomEditor diagram

Blooper DB

The lack of datasets focused on bloopers forced us to build our own database, which we titled Blooper DB. Blooper DB is a long-term multimodal corpus for blooper recognition. It was constructed by picking out videos that contain bloopers from YouTube, using keywords like "bloopers", "green screen", etc. The videos have multiple resolutions and multiple languages. The dataset is split into training, validation, and testing sets, with 464 videos in the training set, 66 videos in the validation set, and 66 videos in the testing set. Each video clip is annotated with a categorical label: 0 (no blooper) or 1 (blooper). Each video clip lasts between 1 and 3 seconds. The dataset is stratified and has an equal number of samples per category. Figure 2 shows some examples from the dataset.

Figure 2 Blooper video examples

Different criteria were taken into account to select the videos. The videos were monologues where only one person was on screen, the camera was fixed, the shoulders were visible, and there were no face pictures in the background. We split long bloopers (more than 2 seconds) into two video clips: a non-blooper (before the mistake) and a blooper (the clip that contains the mistake). For short bloopers (1 to 2 seconds), we looked for non-blooper clips of about the same length from the same video. The video clips do not start or end in truncated phrases. We avoided, as much as possible, cases where the blooper clips were green-screen and the non-blooper clips were not.

Feature extractor

The main aim of our feature extraction process is to remain invariant to person descriptors (e.g., gender, age), scale, position, background, and language. Our approach considers four types of features: face, body, audio, and emotion.

1) Face features: Visual features consist of OpenFace estimators computed on the whole frames and a VGG Face representation computed on the facial regions. For the OpenFace features, we use the OpenFace toolkit to extract the estimated 68 facial landmarks in both 2D and 3D world coordinates, the eye gaze direction vector in 3D, head pose, rigid head shape, and Facial Action Unit intensities indicating facial muscle movements. The detailed feature descriptions can be found here. These visual descriptors are regarded as strong indicators of human emotions and sentiments. For the VGG Face representation, the facial region in each frame is cropped and aligned using a 3D Constrained Local Model. We zero out the background according to the face contour indicated by the facial landmarks. The cropped faces are then resized to 224x224x3 and fed into a VGG Face model pre-trained on a large face dataset. We take the 4096-dimensional feature vector from the fc6 layer and concatenate it with the visual features extracted by OpenFace. The total dimension of the concatenated features is 4805. Specifically, 20 frames are uniformly sampled from each video clip and fed into the network for training and testing. If a video clip is shorter, we duplicate the last frame to fill the gap.
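As a concrete illustration of the sampling step, here is a minimal sketch (not the project's actual code) that uniformly samples 20 frames from a clip using OpenCV and duplicates the last frame when the clip is too short; the function name and the decode-everything-then-index approach are assumptions.

```python
# Minimal sketch: uniformly sample 20 frames from a clip, padding short clips.
import cv2
import numpy as np

def sample_frames(video_path, num_frames=20):
    cap = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()

    if not frames:
        raise ValueError(f"No frames decoded from {video_path}")

    if len(frames) >= num_frames:
        # Uniformly spaced indices over the whole clip.
        idx = np.linspace(0, len(frames) - 1, num_frames).astype(int)
        sampled = [frames[i] for i in idx]
    else:
        # Shorter clip: duplicate the last frame to fill the gap.
        sampled = frames + [frames[-1]] * (num_frames - len(frames))
    return sampled
```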

2) Body features: We used the Body-25 model from OpenPose (a pose estimation framework), which extracts 25 joints of a person's skeleton from an image. We computed the joint angles of the shoulders, arms, neck, and nose, as well as a binary flag per joint to indicate whether it was detected. In total, we extracted and normalized 11 handcrafted features from the OpenPose output. We only used the joints of the nose, ears, eyes, neck, shoulders, and arms, which are the body parts usually visible in monologues. From the filtered joints we draw a skeleton with the neck joint fixed at the center and the dimensions normalized, and this skeleton is inserted into a 224x224 frame with a black background. This visual representation is passed through a VGG16 network to extract deep features. We take the 4096-dimensional feature vector from the fc6 layer and concatenate it with the 11 features computed from the OpenPose joints. The total dimension of the concatenated features is 4107. As with the face, 20 frames were uniformly sampled per clip.
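The handcrafted part of these features can be sketched as follows. This is illustrative only: the index constants follow OpenPose's documented BODY_25 keypoint order, but the specific angle set, normalization, and confidence threshold are assumptions and may differ from AutomEditor's actual 11 features.

```python
# Minimal sketch: joint angles plus presence flags from an OpenPose BODY_25 pose.
import numpy as np

NOSE, NECK, R_SHOULDER, R_ELBOW, R_WRIST = 0, 1, 2, 3, 4
L_SHOULDER, L_ELBOW, L_WRIST = 5, 6, 7

def joint_angle(a, b, c):
    """Angle (radians) at point b formed by the segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.arccos(np.clip(cos, -1.0, 1.0))

def body_features(keypoints, conf_thresh=0.1):
    """keypoints: (25, 3) array of (x, y, confidence) from OpenPose BODY_25."""
    xy, conf = keypoints[:, :2], keypoints[:, 2]
    present = (conf > conf_thresh).astype(float)
    angles = [
        joint_angle(xy[NECK], xy[R_SHOULDER], xy[R_ELBOW]),    # right shoulder
        joint_angle(xy[R_SHOULDER], xy[R_ELBOW], xy[R_WRIST]), # right elbow
        joint_angle(xy[NECK], xy[L_SHOULDER], xy[L_ELBOW]),    # left shoulder
        joint_angle(xy[L_SHOULDER], xy[L_ELBOW], xy[L_WRIST]), # left elbow
        joint_angle(xy[NOSE], xy[NECK], xy[R_SHOULDER]),       # neck/nose
    ]
    # Normalize angles to [0, 1] and append presence flags for selected joints.
    used = [NOSE, NECK, R_SHOULDER, R_ELBOW, R_WRIST, L_SHOULDER]
    return np.concatenate([np.array(angles) / np.pi, present[used]])
```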

3) Emotion features: We used EmoPy, a machine learning toolkit for emotional expression recognition, to extract a score for each of the seven basic emotions (anger, fear, disgust, happiness, sadness, surprise, and contempt) typically used for Facial Expression Recognition (FER). The same seven scores were extracted from four other emotion recognition models (priya-dwivedi, petercunha, oarriaga). We concatenated all the predictions from the models into a 35-dimensional feature vector per frame. The same 20 faces used for computing the face features were used to compute the temporal emotion features. We condensed the temporal features into a general feature vector by performing a normalized sum over each prediction segment, resulting in a vector of 35 elements.
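A minimal sketch of this condensation step is shown below, assuming a 20x35 matrix of per-frame scores (5 models x 7 emotions) and interpreting "normalized sum per prediction segment" as normalizing each model's 7-emotion block; the exact normalization used by the project may differ.

```python
# Minimal sketch: condense per-frame emotion predictions into one 35-d vector.
import numpy as np

def condense_emotions(temporal_preds, num_models=5, num_emotions=7):
    """temporal_preds: (20, 35) array of per-frame scores (5 models x 7 emotions)."""
    summed = temporal_preds.sum(axis=0).reshape(num_models, num_emotions)
    # Normalize each model's 7-emotion segment so it sums to 1 (assumption).
    segments = summed / (summed.sum(axis=1, keepdims=True) + 1e-8)
    return segments.reshape(-1)  # back to a 35-element vector
```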

4) Audio features: Audio features are extracted using the openSMILE toolkit, and we use the same feature set as suggested in the INTERSPEECH 2010 paralinguistics challenge. The set contains Mel Frequency Cepstral Coefficients (MFCCs), ∆MFCC, loudness, pitch, jitter, etc. These features describe the prosodic pattern of different speakers and are consistent indicators of their states. For each video clip, we extract a 1582-dimensional feature vector from the audio signal. We computed the general features from the whole clip's audio, and the temporal features from 20 fragments of the clip's audio.
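Extraction of this feature set is typically done by calling the openSMILE command-line tool; the sketch below assumes a local openSMILE installation and the INTERSPEECH 2010 paralinguistics configuration file, and is not the project's exact extraction script.

```python
# Minimal sketch: extract the INTERSPEECH 2010 paralinguistics feature set
# with the openSMILE command-line tool (paths and config name are assumptions
# about the local installation).
import subprocess

def extract_is10_features(wav_path, out_csv,
                          smile_bin="SMILExtract",
                          config="config/IS10_paraling.conf"):
    subprocess.run(
        [smile_bin, "-C", config, "-I", wav_path, "-O", out_csv],
        check=True,
    )
```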

Learner

Figure 3 shows the architecture of our proposed model. Our deep neural network consists of three parts: (1) the sub-networks for each single modality; (2) the early fusion layer, which concatenates the four unimodal representations; and (3) the final decision layer, which produces the blooper prediction.

Figure 3 AutomEditor architecture

1) Sub-networks: There are six sub-networks. Two correspond to the face, one with handcrafted features and the other with deep features, both as sequences of 20 samples; the body features follow the same scheme. The emotion features are the predictions of existing models and are also a sequence of 20 samples. The handcrafted audio features are taken from a single sample per video clip.

2) Early fusion layer: This part concatenates the four unimodal representations; specifically, the face-related and body-related features are joined first. The concatenated features from a single video clip are fed into an LSTM layer with 64 hidden units followed by a dense layer with 256 hidden neurons for temporal modeling. The audio and emotion features are fed into a fully connected layer with 256 units.

3) Fusion and Decision Layers: We combine the cues from the four modalities using an early fusion strategy. The aggregated feature vector is fully connected to a two-layer neural network with 1024 hidden units and a single output neuron, activated by softmax. We use MSE as the loss function for joint training.
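To make the architecture concrete, here is a minimal Keras sketch of the network described above, assuming the feature dimensions from the Feature Extractor section. Layer sizes follow the text; activations and other details are assumptions rather than the project's exact code (in particular, a sigmoid is used for the single output neuron so the sketch yields a usable 0-to-1 score).

```python
# Minimal Keras sketch of the early-fusion quadmodal network (assumptions noted).
from tensorflow.keras import layers, models, optimizers

SEQ_LEN = 20
FACE_DIM, BODY_DIM = 4805, 4107        # per-frame face / body feature sizes
AUDIO_DIM, EMOTION_DIM = 1582, 35      # per-clip audio / emotion feature sizes

def build_quadmodal():
    face_in = layers.Input((SEQ_LEN, FACE_DIM), name="face")
    body_in = layers.Input((SEQ_LEN, BODY_DIM), name="body")
    audio_in = layers.Input((AUDIO_DIM,), name="audio")
    emo_in = layers.Input((EMOTION_DIM,), name="emotion")

    # Temporal branch: face + body sequences -> LSTM(64) -> Dense(256).
    temporal = layers.Concatenate()([face_in, body_in])
    temporal = layers.LSTM(64)(temporal)
    temporal = layers.Dense(256, activation="relu")(temporal)

    # Static branch: audio + emotion vectors -> Dense(256).
    static = layers.Concatenate()([audio_in, emo_in])
    static = layers.Dense(256, activation="relu")(static)

    # Early fusion and decision layers: 1024 hidden units, single output.
    fused = layers.Concatenate()([temporal, static])
    fused = layers.Dense(1024, activation="relu")(fused)
    fused = layers.Dropout(0.5)(fused)
    # The article mentions softmax; with one output neuron a sigmoid is used
    # here so the sketch produces a meaningful 0-1 blooper score.
    out = layers.Dense(1, activation="sigmoid")(fused)

    model = models.Model([face_in, body_in, audio_in, emo_in], out)
    model.compile(optimizer=optimizers.Adam(1e-3), loss="mse",
                  metrics=["accuracy"])
    return model
```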

Predictor

This module takes an untrimmed video and splits it into fragments of the same length (2 seconds by default). The precision of our algorithm depends on the separation between consecutive clips; this parameter can be configured in the platform and is defined in milliseconds (500 milliseconds by default). Once the full-length video has been divided into multiple overlapping video clips, we extract the features of each clip and predict its blooper score. We define the blooper score as the value retrieved by the model for the blooper category.
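The windowing itself reduces to a simple computation; the sketch below uses the defaults mentioned above (2-second clips every 500 ms) and returns (start, end) times in seconds. It is an illustration, not the project's actual splitting code.

```python
# Minimal sketch: split an untrimmed video into overlapping clip windows.
def split_into_clips(duration_s, clip_len_s=2.0, step_ms=500):
    step_s = step_ms / 1000.0
    clips, start = [], 0.0
    while start + clip_len_s <= duration_s:
        clips.append((start, start + clip_len_s))
        start += step_s
    return clips

# Example: a 10-second video yields windows (0.0, 2.0), (0.5, 2.5), ...
```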

The Predictor output is a sequence of predictions, one per configured time step. These scores are further analyzed in the Locator component to find the ranges where bloopers occur.

Locator

This module takes the sequence of previously computed blooper scores. Instead of using the 0-to-1 scale, it quantizes each score into the values 0, 1, and 2: 0 stands for a blooper score of 0, 1 for intermediate values within a threshold, and 2 for a blooper score of 1.

The main goal of the Locator module is to find ranges of contiguous high values. To facilitate this task, the prediction sequence is first preprocessed. A sliding window of configurable size sums each point's value with those of its neighbors, condensing the scores while keeping some context around each point. We call this new list of summed values the condensed list. From the condensed list, the Locator computes the three highest values and stores them in a top-3 list. It then defines another sliding window (of configurable size) and calculates the percentage of elements within the window that are among the top-3 values. If this percentage exceeds a threshold, the index of the window is added to a range. The grouped range indices correspond to the times of the video that contain the bloopers. The localization steps are illustrated in Figure 4.
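A minimal sketch of these steps is shown below: quantize the blooper scores, condense them with a neighbor-sum window, pick the top-3 condensed values, keep window positions whose share of top-3 values exceeds a threshold, and group consecutive positions into ranges. Window sizes and thresholds are assumptions; the project's defaults may differ.

```python
# Minimal sketch of the Locator: quantize, condense, and group blooper scores.
import numpy as np

def locate_bloopers(scores, low=0.3, high=0.7,
                    condense_w=2, group_w=3, pct_thresh=0.5):
    # 1) Quantize each 0-1 score into 0, 1, or 2 (thresholds are assumptions).
    q = np.array([0 if s <= low else (2 if s >= high else 1) for s in scores])

    # 2) Condensed list: each point plus its neighbors within the window.
    condensed = np.array([q[max(0, i - condense_w): i + condense_w + 1].sum()
                          for i in range(len(q))])

    # 3) Top-3 condensed values.
    top3 = set(np.sort(np.unique(condensed))[-3:])

    # 4) Sliding window: keep positions with enough elements in the top 3.
    hits = []
    for i in range(len(condensed) - group_w + 1):
        window = condensed[i: i + group_w]
        if np.mean([v in top3 for v in window]) >= pct_thresh:
            hits.append(i)

    # 5) Group consecutive indices into (start, end) ranges.
    ranges = []
    for i in hits:
        if ranges and i <= ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], i)
        else:
            ranges.append((i, i))
    return ranges
```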

Figure 4 Localization steps

Server

This component is a Python web server built with Flask. The server returns to the browser a user interface that displays a video player with its controls, an extra timeline bar, and a file input control. Once a file is chosen, the platform plays it locally in the web player and the user can proceed to send it. The interface shows a status bar with the current state of the file upload. Figure 5 shows some examples processed through the interface.

Figure 5 Web interface

When a file is uploaded to the server via a multipart form-data HTTP request, it is stored temporarily and renamed with a unique id (UUID) as the file name to prevent duplicates. The server implements size and length filters to avoid exceeding its processing and storage capacity. The server then invokes the Predictor, followed by the Locator, which takes the predicted sequence of values, computes the blooper ranges, and returns them in JSON (JavaScript Object Notation) format.
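The flow can be sketched as a single Flask endpoint. The route name, size limit, module names, and the predict_scores helper below are assumptions introduced for illustration; locate_bloopers refers to the Locator sketch shown earlier, not to the project's actual API.

```python
# Minimal Flask sketch: store the upload under a UUID name, run prediction
# and localization, and return the blooper ranges as JSON.
import os
import uuid
from flask import Flask, request, jsonify

from locator import locate_bloopers  # the Locator sketch above (assumed module)

app = Flask(__name__)
UPLOAD_DIR = "/tmp/autom_editor"
app.config["MAX_CONTENT_LENGTH"] = 200 * 1024 * 1024  # size filter (assumed limit)

def predict_scores(video_path):
    """Hypothetical wrapper around the Predictor: split the video into
    overlapping clips, extract features, and return a list of blooper scores."""
    raise NotImplementedError

@app.route("/upload", methods=["POST"])
def upload():
    video = request.files["video"]
    os.makedirs(UPLOAD_DIR, exist_ok=True)
    # Rename with a UUID to prevent duplicate file names.
    path = os.path.join(UPLOAD_DIR, f"{uuid.uuid4()}.mp4")
    video.save(path)

    scores = predict_scores(path)      # Predictor: sequence of blooper scores
    ranges = locate_bloopers(scores)   # Locator: grouped blooper ranges
    return jsonify({"ranges": ranges})
```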

Once the web client receives the serialized ranges, it deserializes and processes them to display an extra timeline below the player. Clicking on a blooper range in the timeline navigates the video to that point. The tool is also able to edit the video if the user wants it.

Experiments

We evaluated the blooper recognition and localization modules of the tool.

Video blooper recognition

We trained and evaluated the multimodal network on Blooper DB. The model was trained for at most 300 epochs. To prevent overfitting, we applied an early-stopping policy with a patience of 20 epochs, which stops training when the validation loss has not decreased for 20 epochs, and we applied dropout with a ratio of 0.5 to each fully connected layer. The learning rate was 1e-3.
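This training setup maps directly onto a standard Keras configuration; the sketch below reuses build_quadmodal from the earlier architecture sketch, and the data variables (face_train, etc.) are placeholders rather than real loaders.

```python
# Minimal training sketch: 300 epochs max, early stopping with 20-epoch patience.
from tensorflow.keras.callbacks import EarlyStopping

model = build_quadmodal()  # from the earlier architecture sketch
early_stop = EarlyStopping(monitor="val_loss", patience=20,
                           restore_best_weights=True)

history = model.fit(
    x=[face_train, body_train, audio_train, emotion_train],  # placeholders
    y=y_train,
    validation_data=([face_val, body_val, audio_val, emotion_val], y_val),
    epochs=300,
    callbacks=[early_stop],
)
```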

A. Unimodal Approach

We first evaluated the performance of a model trained with a single modality. For each unimodal model, the same decision layer was used. For the visual unimodal model, we investigated the effectiveness of the VGG Face and OpenFace features separately in an ablation test. The comparison results are shown in Table 1. Our results demonstrate that the VGG Face features outperform the OpenFace features under the same model architecture. The same behavior can be observed in the body features when comparing handcrafted and deep features.

For the audio network, we focused on studying the importance of temporal modeling in video clips. We implemented another LSTM-based network for the audio modality. Specifically, we divided each audio file into audio frames equally spaced in time and extracted openSMILE features for every frame. Those features are then fed into a 64-cell LSTM layer followed by the decision layer. We compared this LSTM-based model with the audio unimodal model described above. The results show that the model without an LSTM performs better than the audio model with an LSTM; the LSTM layer does not benefit the estimation. Table 1 shows a comparison of all the evaluated models.

B. Multimodal Approach

We call the fusion of all the modalities the quadmodal network. The quadmodal model combines the Face Handcrafted, Face Deep, Body Handcrafted, Body Deep, Audio General, and Emotion Temporal features. We trained the quadmodal network using the concatenated multimodal features. Figure 6 shows the training performance metrics.

Figure 6 Quadmodal training performance metrics

With respect to fusion strategies, we compared the early and late feature fusion strategies in Table 2.

The results demonstrate that learning benefits more from the early fused representation. Table 1 shows the comparison of the unimodal and multimodal performances. We also compared the quadmodal model with a fusion of the three most accurate unimodal models, which contained only face and audio features, and the quadmodal model outperformed it. The quadmodal model performs better than any of the tested unimodal and multimodal models.

We then examined the confusion matrices of the quadmodal model to assess its performance per category over the different sets. Figure 7 shows the confusion matrix for each set.

Figure 7 Quadmodal confusion matrices

Video blooper localization

To test whether the localization algorithm retrieved the correct time ranges, we created two videos of 70 seconds each and inserted six clips (3 bloopers and 3 non-bloopers) from the test set at random positions, as shown in Figure 8.

Figure 8 Localization video test

We then ran the Locator to see how many ranges were detected and to verify whether the bloopers were within them. Figure 9 shows that three ranges were retrieved for each video and that these contained the video bloopers.

Figure 9 Localization results

Limitations and Future Work

Our dataset is small compared to multi-action datasets designed for robust analysis. The limited number of subjects makes it difficult to generalize to other people. Future work can explore augmenting the database by collecting more videos from people with different levels of arousal. Artificial augmentation through generative techniques would also be an interesting area of study.

The aim of this work was to show an end-to-end approach for a solution of this type, so it does not cover every block of the solution in depth. Future work can explore each component in detail to understand why it performed the way it did and to explore improvements.

The multimodal video action localization area would benefit from models with built-in mechanisms to locate actions from mixed temporal and non-temporal features. The precision of this approach (500 ms) is not enough for professional standards or for fully automated mechanisms. Future work can explore high-precision localization methods for video bloopers.

The Human-Computer Interaction area would benefit from exploring users' expectations of automatic video editing interfaces and how novice and expert users want to be assisted in the editing process. Whether to only visualize the suggested edits or to apply them automatically is an interesting question to resolve before AVEs are widely deployed.

Conclusion

In this article, we presented AutomEditor, a system that recognizes and localizes bloopers in monologue videos. To achieve that goal, we created a dataset from online videos collected and annotated manually, which we called BlooperDB. We chose a multimodal approach that analyzes face, body, audio, and emotion features extracted from videos, and deployed an early feature fusion strategy to combine the different modalities. Our multimodal models outperform the unimodal methods significantly. Our results show that multimodal information greatly benefits the estimation of bloopers in videos. We also presented a localization method based on the prediction sequence over sub-clips, which proved effective at localizing bloopers. We deployed AutomEditor as a web application to help users take advantage of the tool and to help developers easily deploy automatic video editing tools.

Links

AutomEditor Repository

BlooperDB dataset

Blooper recognition contest

Presentation slides

Research paper

Docker container

Colaboratory Notebook
