Silence to Sound: Video Sound Generator

Yuqin (Bailey) Bai
8 min read · Dec 13, 2023


This article was produced as part of the final project for Harvard’s AC215 Fall 2023 course.

Team Members: Yuqin (Bailey) Bai, Danning (Danni) Lai, Tiantong Li, Yujan Ting, Yong Zhang, Hanlin Zhu
GitHub Repo: AC215_S2S

Table of Contents

  1. Introduction
  2. Integrated Data Pipeline
  3. Model Development and Deployment
  4. DevOps Pipeline
  5. Demo
  6. References

Introduction

Have you ever come across a video with no sound? Its soundtrack might have been lost to storage issues, or the footage might be silent by nature. Converting silent video into audio is a captivating idea that leverages our ability to associate sensory experiences across domains. It can revitalize silent films by adding matching soundtracks and enhance personal vlogs with synchronized audio, creating more immersive experiences. Moreover, generating sound for videos and images improves accessibility: for people who depend on screen readers, immersive audio can stand in for alternative text, which is often missing or not applicable.

  • Problem Statement: We aim to develop an application that generates sound from images or silent videos, leveraging computer vision and multimodal models. Our goal is to enrich the general user experience by creating a harmonized visual-audio ecosystem and to facilitate immersive multimedia interactions for individuals with visual impairments.
  • Solution Architecture: We will walk through each element of the proposed Silence to Sound (S2S) app solution architecture (see Figure 1) and elaborate on our strategies for handling the various scenarios that may arise during user interactions.
Figure 1. Solution architecture proposed for Silence to Sound (S2S)
Figure 1. Solution architecture proposed for Silence to Sound (S2S)

Integrated Data Pipeline

We use a public dataset from the Visual Geometry Group at the University of Oxford: VGG-Sound[1], a large-scale dataset containing 200k+ audio-visual short clips covering 300+ types of sound. With each clip approximately 10s long, this amounts to 550+ hours of video. The combination of audio and visual data makes it highly relevant for multimodal research.

Figure 2. Example video frame and audio pairs of VGGSound classes [1]
  • Data Collection: To retrieve videos from YouTube by their IDs, our data collection container accepts source and destination Google Cloud Storage (GCS) locations. It then uses the pytube library to download the videos listed in the dataset and uploads them to the designated bucket, with multi-threading to speed up the collection process (see the sketch after this list).
  • Data Preprocessing: This container supports multi-processing and handles video clipping, audio extraction, and reduction of the video frame rate (fps).
  • Feature Extraction: To extract meaningful features for model training, this container processes video frames into both optical flow (motion) and RGB (color) features. For the associated audio, the Librosa library is used to compute mel spectrograms. Extracted features are saved to a designated output bucket, with the script optimized to process multiple files in parallel.
  • Data Versioning: We also set up a data registry to keep track of and manage various versions of the dataset. This ensures traceability, facilitates collaboration, and enhances the reproducibility of data-related processes throughout the development lifecycle.
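To make the collection and feature-extraction steps more concrete, below is a minimal sketch of the core logic. The bucket name, clip IDs, and spectrogram parameters are illustrative placeholders, not the exact values used in our containers.

```python
# Illustrative sketch of the data collection + feature extraction steps.
# Bucket name, video IDs, and spectrogram parameters are placeholders.
from concurrent.futures import ThreadPoolExecutor

import librosa
import numpy as np
from google.cloud import storage
from pytube import YouTube

DEST_BUCKET = "s2s-raw-data"  # hypothetical GCS bucket name


def download_clip(video_id: str) -> str:
    """Download one YouTube clip by ID and upload it to the destination bucket."""
    yt = YouTube(f"https://www.youtube.com/watch?v={video_id}")
    stream = yt.streams.filter(progressive=True, file_extension="mp4").first()
    local_path = stream.download(output_path="/tmp", filename=f"{video_id}.mp4")

    bucket = storage.Client().bucket(DEST_BUCKET)
    bucket.blob(f"videos/{video_id}.mp4").upload_from_filename(local_path)
    return local_path


def extract_mel_spectrogram(audio_path: str) -> np.ndarray:
    """Compute a log-mel spectrogram with Librosa (parameter values are illustrative)."""
    y, sr = librosa.load(audio_path, sr=22050)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
    )
    return librosa.power_to_db(mel)


if __name__ == "__main__":
    video_ids = ["abc123XYZ"]  # placeholder: real IDs come from the VGG-Sound csv
    # Multi-threaded download, as in the data collection container
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(download_clip, video_ids))
```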

Model Development and Deployment

Figure 3. Our training & inference pipeline

In this section, we outline the architecture and processes for both training and inference of our model, which is based on RegNet[2], as illustrated in Figure 3. During training, the RegNet model takes video and audio features as input and learns to generate visually-aligned sounds, represented as a spectrogram, while filtering out sound elements that are not visually relevant. At inference time, RegNet generates sound that is in sync with the visual content, again as a spectrogram, using only visual inputs. The generated spectrogram is then fed into a WaveNet[3], which transforms it into audible sound.

Model Architecture

  • RegNet, as shown in Figure 4, is composed of three main components: a visual encoder, an audio forwarding regularizer, and a generator. The visual encoder converts the extracted RGB and flow frame features into visual features. The audio forwarding regularizer takes the actual sound, a mixture of visually relevant and irrelevant sounds, as input and outputs features of the visually irrelevant sound. Finally, the generator combines the visual features with the identified irrelevant-sound features to reconstruct the visually-aligned sound as a spectrogram (a simplified sketch follows this list).
Figure 4. Schematic of RegNet training [2]
  • WaveNet: Once RegNet produces a visually-aligned spectrogram, we employ WaveNet[3], a state-of-the-art deep neural network for generating raw audio waveforms, to transform the spectrogram into an audio waveform.
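To make the three components concrete, here is a heavily simplified PyTorch sketch of their interfaces. All layer sizes and dimensions are placeholders; the temporal alignment, convolutional backbones, and GAN losses from [2] are omitted.

```python
# Heavily simplified sketch of the three RegNet components described above.
# Dimensions and layers are placeholders; see [2] for the real architecture.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Maps concatenated RGB + optical-flow frame features to visual features."""

    def __init__(self, in_dim: int = 2048, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )

    def forward(self, frame_feats):   # (batch, time, in_dim)
        return self.net(frame_feats)  # (batch, time, hidden)


class AudioForwardingRegularizer(nn.Module):
    """Squeezes the ground-truth audio through a bottleneck so it can only
    carry the visually irrelevant part of the sound."""

    def __init__(self, n_mels: int = 80, bottleneck: int = 16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_mels, bottleneck), nn.ReLU())

    def forward(self, spectrogram):   # (batch, time, n_mels)
        return self.net(spectrogram)  # (batch, time, bottleneck)


class Generator(nn.Module):
    """Reconstructs a spectrogram from visual features plus the audio code.
    At inference time the audio code is simply a zero tensor."""

    def __init__(self, hidden: int = 512, bottleneck: int = 16, n_mels: int = 80):
        super().__init__()
        self.net = nn.Linear(hidden + bottleneck, n_mels)

    def forward(self, visual_feats, audio_code):
        # Assumes visual and audio features are aligned on the same time axis.
        return self.net(torch.cat([visual_feats, audio_code], dim=-1))
```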

Serverless Training

Our training regimen focuses on the ‘Playing bongo’ sound class. We train the RegNet model from scratch on a dataset of 302 bongo videos. To convert spectrograms to waveforms, we use a WaveNet model previously trained on the ‘Drum’ sound class, as referenced in [2]. To streamline training, we package and upload our training code to a GCS bucket and then leverage Vertex AI for serverless training. For experiment tracking, and to visualize and analyze training status, we use Weights & Biases (WandB), a powerful tool for monitoring model performance and logging experiments; Figure 5 shows one of our training runs. Furthermore, we incorporate Quantization-Aware Training (QAT) into our training pipeline, which is crucial for optimizing model performance in environments with limited computational resources.
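As a rough illustration, submitting the packaged trainer to Vertex AI from the Python SDK might look like the following. The project ID, bucket, package URI, container image, and machine configuration are all placeholders rather than our exact settings.

```python
# Illustrative Vertex AI serverless training submission.
# All identifiers (project, bucket, package URI, image, machine specs)
# are placeholders, not our actual configuration.
from google.cloud import aiplatform

aiplatform.init(
    project="s2s-project",
    location="us-central1",
    staging_bucket="gs://s2s-trainer",
)

job = aiplatform.CustomPythonPackageTrainingJob(
    display_name="s2s-regnet-bongo",
    python_package_gcs_uri="gs://s2s-trainer/trainer.tar.gz",
    python_module_name="trainer.task",
    # Placeholder for a prebuilt (or custom) PyTorch GPU training image
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",
)

job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=1,
    args=["--epochs", "500"],
)
```

Inside the packaged training script itself, a `wandb.init(project=...)` call plus periodic `wandb.log({...})` calls are enough to stream metrics to the dashboard shown in Figure 5.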

Figure 5. Experiment tracking using WandB

Model Deployment

In our initial deployment, we used a customized container image with TorchServe to register our PyTorch model in the Vertex AI Model Registry. While we successfully deployed the serving container to a Vertex AI endpoint, we ran into timeout issues because of our model’s relatively long inference time. To address this, we opted to serve the model via TorchServe on a GPU-equipped Google Compute Engine VM instance.
The deployment process is similar across the VM and Vertex AI. We first build an image with the dependencies required for model serving, archive the model artifacts into a MAR file, and serve the model by exposing a port and IP address for public access.
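For reference, a TorchServe custom handler for a model like ours could look roughly like the sketch below. The class name, payload format, and post-processing are assumptions for illustration, not our exact handler.

```python
# Sketch of a TorchServe custom handler for the S2S model.
# Payload format and pre/post-processing are illustrative assumptions.
import io

import torch
from ts.torch_handler.base_handler import BaseHandler


class S2SHandler(BaseHandler):
    """Turns extracted visual features into a spectrogram for later vocoding."""

    def preprocess(self, data):
        # TorchServe passes a list of request dicts; here we assume the payload
        # is a torch-serialized tensor of concatenated RGB + flow features.
        payload = data[0].get("body") or data[0].get("data")
        feats = torch.load(io.BytesIO(payload))
        return feats.unsqueeze(0).to(self.device)

    def inference(self, feats):
        # self.model is loaded by BaseHandler.initialize() from the MAR file.
        with torch.no_grad():
            return self.model(feats)  # RegNet output: a spectrogram

    def postprocess(self, spectrogram):
        # WaveNet vocoding and audio/video muxing happen downstream.
        return [spectrogram.squeeze(0).cpu().numpy().tolist()]
```

The MAR file mentioned above is typically produced with the torch-model-archiver CLI, pointing it at the serialized weights and a handler such as this one, before the serving image is built.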

ML Pipeline in Kubeflow

Figure 6. Machine learning mega pipeline of S2S

Bringing together the containers discussed earlier, our end-to-end ML pipeline runs on Kubeflow. The built-in automation ensures that newly collected training data triggers our Continuous Integration/Continuous Deployment (CI/CD) workflow via GitHub Actions, orchestrating a unified and efficient ML workflow, as visualized in Figure 6.
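Conceptually, wiring the containers into a pipeline looks something like the sketch below (KFP v2-style container components). The image URIs, commands, and component names are placeholders, not the exact definitions in our repo.

```python
# Minimal sketch of chaining the pipeline containers with the KFP v2 SDK.
# Image URIs, commands, and component names are placeholders.
from kfp import compiler, dsl


@dsl.container_component
def data_collector():
    return dsl.ContainerSpec(image="gcr.io/s2s-project/data-collector",
                             command=["python", "cli.py"], args=["--download"])


@dsl.container_component
def feature_extractor():
    return dsl.ContainerSpec(image="gcr.io/s2s-project/feature-extractor",
                             command=["python", "cli.py"], args=["--extract"])


@dsl.container_component
def model_trainer():
    return dsl.ContainerSpec(image="gcr.io/s2s-project/model-trainer",
                             command=["python", "cli.py"], args=["--train"])


@dsl.pipeline(name="s2s-mega-pipeline")
def s2s_pipeline():
    collect = data_collector()
    extract = feature_extractor().after(collect)  # enforce step ordering
    model_trainer().after(extract)


if __name__ == "__main__":
    compiler.Compiler().compile(pipeline_func=s2s_pipeline,
                                package_path="s2s_pipeline.yaml")
```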

DevOps Pipeline

Front-End

The front-end of “Silence to Sound” is where the magic becomes visible and interactive for our users. It is the part of the app that users see and interact with directly on their devices, be it a smartphone, tablet, or computer. This is where the app’s journey from concept to reality becomes tangible.

Figure 7. S2S front-end interface
  • User Interaction: Once our Kubernetes cluster is set up and an external IP address is obtained, users access the application through a user-friendly interface. Navigating to the provided IP address reveals an upload box where users can seamlessly upload their videos.
  • Core Functionality: After video upload, users initiate the core functionality by clicking the “Generate Sound” button. This action triggers communication with the backend, launching the intricate process of making predictions and transforming silence into sound.
  • Download Capability: Upon completion of backend processing, a “Download Button” materializes, indicating that the processed video is ready for download. Users can effortlessly retrieve and save the transformed video.

API Service

The API service is the backbone of “Silence to Sound,” responsible for handling requests, making predictions, and orchestrating the transformation of silence into sound.

  • Communication with Front-End: The front-end triggers the API service by initiating a request upon the user clicking the “Generate Sound” button. This sets in motion the sophisticated process of making predictions and handling the audio transformation.
  • Downloadable Results: Upon completion, the API service notifies the front-end, prompting the appearance of the “Download Button.” This indicates that the processed video is ready, and users can download the result for local viewing (a minimal sketch of this contract follows below).
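As a minimal sketch of this upload/download contract, assuming a FastAPI backend: the route names, the in-memory job registry, and the `run_s2s_inference` helper below are hypothetical stand-ins for our actual service code.

```python
# Hedged sketch of the API-service contract. Route names, the in-memory job
# registry, and run_s2s_inference are hypothetical stand-ins.
import shutil
import uuid

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import FileResponse

app = FastAPI()
RESULTS: dict[str, str] = {}  # job_id -> path of the processed video


def run_s2s_inference(video_path: str) -> str:
    """Hypothetical helper: call the deployed model, mux the generated audio
    back into the video, and return the output path (stubbed for the sketch)."""
    return video_path


@app.post("/predict")
async def predict(video: UploadFile = File(...)):
    """Receive a silent video and kick off sound generation."""
    job_id = uuid.uuid4().hex
    in_path = f"/tmp/{job_id}.mp4"
    with open(in_path, "wb") as f:
        shutil.copyfileobj(video.file, f)
    RESULTS[job_id] = run_s2s_inference(in_path)
    return {"job_id": job_id}


@app.get("/download/{job_id}")
def download(job_id: str):
    """Serve the processed video back to the front-end."""
    return FileResponse(RESULTS[job_id], media_type="video/mp4",
                        filename="video_with_sound.mp4")
```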

Deployment Using Ansible Playbooks

This section outlines how we use Ansible playbooks to automate deployment of the API service and front-end. The process involves the following steps:

  1. Set up GCP service accounts for deployment and write the credentials into an inventory YAML file.
  2. Push the API and front-end images to GCP’s Container Registry (GCR).
  3. Create and provision a VM instance with necessary settings (e.g., pip, curl, Docker).
  4. Pull images from GCR to the instance.
  5. Start the containers and the NGINX web server.

All of these steps are executed via Ansible playbooks written as YAML files, which allows for quick, repeatable deployment with all necessary requirements in place and highlights the effectiveness and speed of Ansible.

Scaling Using Kubernetes

We also use Kubernetes to automate deployment and scaling. With Kubernetes, we organize the necessary containers into a cluster; by scaling the number of container replicas and relying on Kubernetes’ self-healing capabilities, we ensure that failed containers are automatically restarted or replaced.

Figure 8. S2S App Kubernetes cluster

CI/CD with GitHub Actions

To integrate code changes (e.g., model architecture optimizations or bug fixes) into the current deployment, we use GitHub Actions to automatically trigger the deployment, or the whole machine learning pipeline if necessary.
To enable continuous integration and continuous deployment with GitHub Actions, we:

  1. Set the credentials using deployment.json in GitHub settings.
  2. Add .github/workflows/ci-cd.yml, the file that specifies which actions to perform based on the commit message.

Below is a successful CI/CD run that was triggered after a commit was pushed to GitHub:

Demo

Below, we present a sample video with bongo sounds generated by our “Silence to Sound” app:

Next Steps

Beyond our initial focus on the ‘Playing bongo’ sound class, we plan to expand training to a variety of sound classes. This will involve curating datasets for each new sound category, such as raindrops, traffic noise, animal sounds, and human voices, and using them to train specialized models or a single comprehensive model capable of generating a wide range of sounds. We will also continue refining our models to improve the quality of the generated sound by experimenting with different hyperparameters and introducing more advanced training techniques.

Photo by Janay Peters on Unsplash

References

[1] Chen, H., Xie, W., Vedaldi, A., & Zisserman, A. (2020). VGGSound: A large-scale audio-visual dataset. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). https://doi.org/10.1109/icassp40776.2020.9053174
[2] Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (2020). Generating visually aligned sound from videos. IEEE Transactions on Image Processing (TIP). https://arxiv.org/abs/2008.00820
[3] van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499. https://arxiv.org/abs/1609.03499
