UnetSourceSeparation: A machine learning model to remove audio noise and extract voices
This is an introduction to「UnetSourceSeparation」, a machine learning model that can be used with ailia SDK. You can easily use this model to create AI applications using ailia SDK as well as many other ready-to-use ailia MODELS.
UnetSourceSeparation is an audio separation model released in March 2019. It suppresses background noise in an input audio file and extracts the voice.
Phase-aware Speech Enhancement with Deep Complex U-Net
Most deep learning-based models for speech enhancement have mainly focused on estimating the magnitude of spectrogram…
The official demo of voice separation is shown below. The original voice and the processed voice are played alternately.
In speech processing, it is common to apply the Short-Time Fourier Transform (STFT) to the input speech and then run a CNN in the frequency domain. The output of the Fourier Transform (FT) is complex-valued and consists of a magnitude and a phase.
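To make the magnitude and phase split concrete, here is a minimal NumPy sketch of a single STFT frame. The 16 kHz sampling rate, the 512-sample frame, and the 440 Hz test tone are assumptions chosen for illustration, not values taken from the model.

```python
import numpy as np

sample_rate = 16000                       # assumed sampling rate in Hz
t = np.arange(512) / sample_rate
frame = np.sin(2 * np.pi * 440.0 * t)     # toy 440 Hz tone standing in for speech

# One STFT frame is the FFT of a short, windowed slice of the signal.
spectrum = np.fft.rfft(frame * np.hanning(len(frame)))

magnitude = np.abs(spectrum)              # how strong each frequency component is
phase = np.angle(spectrum)                # where each component sits in its cycle

# Magnitude and phase together recover the original complex values.
rebuilt = magnitude * np.exp(1j * phase)
print(np.allclose(rebuilt, spectrum))
```

A full STFT repeats this on overlapping frames, giving one complex spectrum per frame.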
Phase is difficult to estimate, so conventional speech separation estimates only the magnitude and reuses the phase of the noisy input as it is. This works when the Signal-to-Noise Ratio (SNR) is high (little noise), but when the SNR is low (heavy noise), noise remains in the output.
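The effect can be illustrated on a single time-frequency bin. The sketch below assumes the best case for a magnitude-only method (the clean magnitude is known exactly) and still reuses the noisy phase. The helper name `magnitude_only_estimate` and the toy values are hypothetical.

```python
import numpy as np

def magnitude_only_estimate(clean, noisy):
    """Best case for a magnitude-only method: exact clean magnitude, noisy phase."""
    return np.abs(clean) * np.exp(1j * np.angle(noisy))

clean = 1.0 + 0.0j                        # one clean time-frequency bin (toy value)
for noise in (0.1j, 1.0j):                # weak noise (high SNR), strong noise (low SNR)
    noisy = clean + noise
    estimate = magnitude_only_estimate(clean, noisy)
    # The residual error comes only from the phase that was copied from the noisy bin.
    print(abs(noise), abs(estimate - clean))
```

Even with a perfect magnitude, the error grows with the noise level, because the copied phase drifts further from the clean phase.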
UnetSourceSeparation introduces Complex Value Convolution, which makes it possible to predict the phase as well.
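One common way to build a convolution over complex values is to combine four real convolutions, following (a + ib)(c + id) = (ac - bd) + i(ad + bc). The sketch below shows this in 1-D with NumPy. It is an illustration only: the function name `complex_conv1d` is hypothetical, and the actual model presumably uses 2-D layers inside a deep learning framework.

```python
import numpy as np

def complex_conv1d(x, w):
    """Complex convolution of signal x with kernel w, using only real convolutions."""
    xr, xi = x.real, x.imag
    wr, wi = w.real, w.imag
    real = np.convolve(xr, wr) - np.convolve(xi, wi)   # ac - bd
    imag = np.convolve(xr, wi) + np.convolve(xi, wr)   # ad + bc
    return real + 1j * imag

x = np.array([1 + 2j, 3 - 1j, -2 + 0.5j])   # toy complex spectrogram values
w = np.array([0.5 - 1j, 2 + 1j])            # toy complex kernel

# NumPy can also convolve complex arrays directly, which gives a reference result.
print(np.allclose(complex_conv1d(x, w), np.convolve(x, w)))
```

Note that `np.convolve` flips the kernel, whereas convolution layers in deep learning frameworks usually compute a cross-correlation. The real and imaginary bookkeeping is the same in both cases.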
The architecture of UnetSourceSeparation is as follows. The input audio is converted with the STFT to obtain its frequency components, which are passed through a U-Net architecture built from complex convolutions. The network creates a mask, which is used to remove the noise, and the result is converted back to a waveform with the Inverse Short-Time Fourier Transform (ISTFT).
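The STFT, mask, ISTFT flow can be sketched in plain NumPy. This is a sketch under stated assumptions, not the model's implementation: `N_FFT`, `HOP`, `separate`, and `toy_mask` are hypothetical names and values, the mask is applied by element-wise multiplication, and a fixed hand-made mask stands in for the one the U-Net would predict.

```python
import numpy as np

N_FFT, HOP = 256, 64   # assumed frame length and hop size, chosen for this sketch only

def stft(x):
    """Slice x into overlapping Hann-windowed frames and take the FFT of each."""
    window = np.hanning(N_FFT)
    padded = np.pad(x, N_FFT)                      # zero-pad so the edges are fully covered
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    frames = np.stack([padded[i * HOP:i * HOP + N_FFT] * window
                       for i in range(n_frames)])
    return np.fft.rfft(frames, axis=1)             # complex array, shape (frames, bins)

def istft(spec, length):
    """Invert stft() by windowed overlap-add, normalised by the summed squared window."""
    window = np.hanning(N_FFT)
    frames = np.fft.irfft(spec, n=N_FFT, axis=1)
    total = N_FFT + HOP * (len(frames) - 1)
    out, norm = np.zeros(total), np.zeros(total)
    for i, frame in enumerate(frames):
        out[i * HOP:i * HOP + N_FFT] += frame * window
        norm[i * HOP:i * HOP + N_FFT] += window ** 2
    out /= np.maximum(norm, 1e-8)
    return out[N_FFT:N_FFT + length]               # drop the padding added in stft()

def toy_mask(spec):
    """Hypothetical stand-in for the network: keep the lower half of the bins."""
    mask = np.zeros(spec.shape, dtype=complex)
    mask[:, :spec.shape[1] // 2] = 1.0
    return mask

def separate(x, mask_fn):
    """STFT -> mask -> ISTFT, with mask_fn playing the role of the trained model."""
    spec = stft(x)
    return istft(spec * mask_fn(spec), len(x))

sample_rate = 8000                                 # assumed sampling rate in Hz
t = np.arange(sample_rate) / sample_rate
voice = np.sin(2 * np.pi * 250 * t)                # toy low-frequency "voice"
noise = 0.5 * np.sin(2 * np.pi * 3000 * t)         # toy high-frequency "noise"
cleaned = separate(voice + noise, toy_mask)

# Compare away from the edges, where the zero padding affects the frames.
print(np.max(np.abs(cleaned - voice)[512:-512]))
```

In the real model the mask depends on the input and is complex-valued, so it can adjust the phase of each bin as well as its magnitude. The fixed low-pass mask here only demonstrates where the mask sits in the flow.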
The DSD100 dataset is used for training.
DSD100 | SigSep
The dsd100 is a dataset of 100 full lengths music tracks of different styles along with their isolated drums, bass…
Given an input audio file, the script generates an output audio file. The default model is the noise reduction model for voice separation in general speech.
$ python3 unet_source_separation.py --input WAV_PATH --savepath SAVE_WAV_PATH
To extract the voice from a music track, add the --arch large parameter.
$ python3 unet_source_separation.py --input WAV_PATH --savepath SAVE_WAV_PATH --arch large