Speech Projects — Acoustical Work

Prim Wong
Super AI Engineer
4 min read · Apr 28, 2021
WaveNet

Communication is extremely important! The easiest and quickest way to communicate and understand each other is speech. :)

1. Generate

2. Recognize

3. Analyze

1. How to apply machine learning and deep learning methods to audio analysis

Audio Analysis ->

Machine Learning for Audio: Digital Signal Processing, Filter Banks, Mel-Frequency Cepstral Coefficients

Example waveform of an audio dataset sample from UrbanSound8k

DCT for Speech Signal Compression

https://www.researchgate.net/publication/301552643_Audio_and_Speech_Compression_Using_DCT_and_DWT_Techniques
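To make the DCT compression idea concrete, here is a minimal, dependency-free sketch (not the method from the linked paper): transform one frame with an orthonormal DCT-II, keep only the largest coefficients, and transform back. The frame contents and the `keep` value are made-up assumptions; in practice you would likely reach for a library DCT instead of these hand-written loops.

```python
import math

def dct2(x):
    """Orthonormal DCT-II of a real sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct2(c):
    """Inverse of the orthonormal DCT-II above (a DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def compress(frame, keep):
    """Keep the `keep` largest-magnitude DCT coefficients and zero the rest."""
    coeffs = dct2(frame)
    order = sorted(range(len(coeffs)), key=lambda k: abs(coeffs[k]), reverse=True)
    kept = set(order[:keep])
    return [c if k in kept else 0.0 for k, c in enumerate(coeffs)]

# Hypothetical 64-sample frame: two sine components standing in for speech.
frame = [
    math.sin(2 * math.pi * 3 * t / 64) + 0.5 * math.sin(2 * math.pi * 7 * t / 64)
    for t in range(64)
]
sparse = compress(frame, keep=16)   # only 16 of 64 coefficients survive
restored = idct2(sparse)            # lossy reconstruction
rmse = math.sqrt(sum((a - b) ** 2 for a, b in zip(frame, restored)) / len(frame))
```

Because the transform is orthonormal, the reconstruction error is exactly the energy of the dropped coefficients, so keeping the largest ones is the best choice for a fixed budget.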

Mel-frequency cepstrum (MFCC)

Mel Frequency Cepstral Coefficients (MFCCs)

MFCCs are used for feature extraction: they give a more compact, less redundant representation of the input voice.

Filter bank: a compressed spectrogram that mimics how our ear perceives frequency
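A rough sketch of what such a filter bank can look like: triangular filters spaced evenly on the mel scale, using the common `2595 * log10(1 + f / 700)` mel formula. The filter count, FFT size and sample rate below are illustrative assumptions, not values from any of the linked articles.

```python
import math

def hz_to_mel(f):
    """Convert frequency in Hz to mels (common 2595 * log10 formula)."""
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Return n_filters triangular filters, each with n_fft // 2 + 1 weights."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    # n_filters + 2 points evenly spaced in mel: the edges and centers of the triangles.
    mel_points = [low + (high - low) * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int(math.floor((n_fft + 1) * mel_to_hz(m) / sample_rate)) for m in mel_points]
    bank = []
    for j in range(1, n_filters + 1):
        left, center, right = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(left, center):       # rising edge
            row[k] = (k - left) / (center - left)
        for k in range(center, right):      # peak and falling edge
            row[k] = (right - k) / (right - center)
        bank.append(row)
    return bank

# Assumed settings: 26 filters, 512-point FFT, 16 kHz audio.
bank = mel_filterbank(n_filters=26, n_fft=512, sample_rate=16000)
```

Multiplying each row with a frame's power spectrum and summing gives one mel band energy per filter. The filters are narrow at low frequencies and wide at high ones, which is the "ear-like" compression.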

MFCC

Speech recognition is still a growing field. … The Fast Fourier Transform (FFT) is the traditional technique for analyzing the frequency spectrum of the signal in speech recognition.
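As a small illustration of FFT-based spectrum analysis, here is a textbook recursive radix-2 FFT in plain Python, used to find the strongest frequency in a signal. The test tone, sample rate and the `dominant_frequency` helper are assumptions made up for this sketch; real pipelines would normally call an optimized library FFT.

```python
import cmath
import math

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def dominant_frequency(samples, sample_rate):
    """Frequency (Hz) of the largest-magnitude bin, ignoring the DC bin."""
    spectrum = fft(samples)
    half = len(samples) // 2            # a real signal's spectrum is mirrored
    mags = [abs(c) for c in spectrum[:half]]
    peak = max(range(1, half), key=lambda k: mags[k])
    return peak * sample_rate / len(samples)

# Hypothetical input: a pure 500 Hz tone sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 500 * t / sample_rate) for t in range(1024)]
print(dominant_frequency(tone, sample_rate))  # prints 500.0
```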

WaveNet

Conditional WaveGAN Explained

NGC

Real Time Cloning

Dog Voice Identification

https://www.researchgate.net/publication/261394450_Dog_voice_identification_ID_for_detection_system/link/548e989a0cf214269f244515/download

Automatic Cry Recognition

Baby Voice Detection

Voice Synthesis

Mean Opinion Score (MOS) for each voice. Test subjects rated each voice on a scale of 1–5 according to how much it sounded like natural speech.
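Computing a MOS is just averaging the listeners' ratings per voice. The sketch below also adds a normal-approximation 95% interval, which is a common but optional extra and an assumption of this example. The voice names and ratings are invented for illustration, not results from any study.

```python
import math
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Return (MOS, 95% half-width) for a list of 1-5 ratings."""
    if not ratings or any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be a non-empty list of values in 1..5")
    mos = mean(ratings)
    # Normal approximation; with very few listeners this interval is only rough.
    half_width = 1.96 * stdev(ratings) / math.sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return mos, half_width

# Hypothetical listener ratings for two synthetic voices.
ratings_by_voice = {
    "voice_a": [4, 5, 4, 4, 3],
    "voice_b": [2, 3, 3, 2, 3],
}
scores = {name: mean_opinion_score(r) for name, r in ratings_by_voice.items()}
```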

Conditional Voice Synthesis

Pixel Recurrent Neural Networks

Keywords from the Meeting

Low pass feature

Fourier Transform and then transform back
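One way to read these two keywords together: low-pass a signal by going to the frequency domain, zeroing everything above a cutoff, and transforming back. Here is a minimal sketch with a naive O(n²) DFT so that it needs no libraries. The sample rate, cutoff and test tones are assumptions for illustration, and this "brick-wall" filter is only one of many low-pass designs.

```python
import cmath
import math

def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]

def low_pass(samples, sample_rate, cutoff_hz):
    """Zero every frequency bin above cutoff_hz, then transform back."""
    n = len(samples)
    spectrum = dft(samples)
    for k in range(n):
        # Bins k and n - k hold the positive and negative copy of the same frequency.
        freq = min(k, n - k) * sample_rate / n
        if freq > cutoff_hz:
            spectrum[k] = 0j
    return [value.real for value in idft(spectrum)]

# Hypothetical signal: a 50 Hz tone plus a 400 Hz tone, sampled at 1 kHz.
sample_rate = 1000
n = 200
low = [math.sin(2 * math.pi * 50 * t / sample_rate) for t in range(n)]
high = [0.5 * math.sin(2 * math.pi * 400 * t / sample_rate) for t in range(n)]
mixed = [a + b for a, b in zip(low, high)]
filtered = low_pass(mixed, sample_rate, cutoff_hz=100)  # keeps only the 50 Hz part
```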

THAI SER

IEMOCAP

Speech Emotion Recognition IEMOCAP

---

CSTR Voice Cloning Toolkit (VCTK)

44 hours from 109 speakers

https://www.researchgate.net/publication/346248936_Non-parallel_Voice_Conversion_based_on_Hierarchical_Latent_Embedding_Vector_Quantized_Variational_Autoencoder

TensorFlow TTS (Text to Speech)
