Speech Projects — Acoustical Work

Published in

Super AI Engineer

4 min read · Apr 28, 2021

Communication is extremely important! The easiest and quickest way to communicate and understand each other is speech. :)

1. Generate
2. Recognize
3. Analyze

1. How to apply machine learning and deep learning methods to audio analysis

Audio Analysis:

Machine Learning for Audio: Digital Signal Processing, Filter Banks, Mel-Frequency Cepstral Coefficients

Example waveform of an audio dataset sample from UrbanSound8k

DCT for Speech Signal Compression

This example shows how to compress a speech signal using the discrete cosine transform (DCT). Load a file containing…

www.mathworks.com

https://www.researchgate.net/publication/301552643_Audio_and_Speech_Compression_Using_DCT_and_DWT_Techniques
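
The idea behind DCT-based compression can be sketched in a few lines: transform the signal, keep only the largest coefficients, and transform back. This is a minimal sketch using NumPy only, not the MathWorks example itself. The synthetic two-tone signal, the `keep_ratio` parameter, and the function names are assumptions made for illustration.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n)[:, None]   # frequency index (rows)
    t = np.arange(n)[None, :]   # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # the DC row uses a different scale
    return m

def compress_dct(signal, keep_ratio=0.1):
    """Keep only the largest-magnitude DCT coefficients, then reconstruct."""
    n = len(signal)
    d = dct_matrix(n)
    coeffs = d @ signal
    n_keep = max(1, int(round(keep_ratio * n)))
    keep = np.argsort(np.abs(coeffs))[-n_keep:]  # indices of the largest coefficients
    sparse = np.zeros_like(coeffs)
    sparse[keep] = coeffs[keep]
    reconstructed = d.T @ sparse  # the matrix is orthonormal, so its inverse is its transpose
    return sparse, reconstructed

# Synthetic stand-in for a speech recording (assumption: two sine tones)
sample_rate = 8000
t = np.arange(1024) / sample_rate
signal = np.sin(2 * np.pi * 220 * t) + 0.5 * np.sin(2 * np.pi * 440 * t)

sparse, reconstructed = compress_dct(signal, keep_ratio=0.1)
rel_error = np.linalg.norm(signal - reconstructed) / np.linalg.norm(signal)
print(f"kept {np.count_nonzero(sparse)} of {len(signal)} coefficients, "
      f"relative error {rel_error:.3f}")
```

Real speech is less sparse than this toy signal, so the trade-off between `keep_ratio` and reconstruction error would need to be checked on actual recordings.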

Mel-Frequency Cepstrum (MFCC)

Mel Frequency Cepstral Coefficients (MFCCs)

MFCC is used for feature extraction: it produces a more compact and less redundant representation of the input voice.

Filter bank: a compressed spectrogram that mimics how our ear perceives sound
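
To make the filter bank and MFCC steps concrete, here is a rough NumPy sketch for a single audio frame: window, FFT power spectrum, triangular mel filters, log, then a DCT. The filter count, frame length, sample rate, and function names are assumptions chosen for illustration. Libraries such as librosa use their own defaults and normalization, so the numbers will not match theirs exactly.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sample_rate).astype(int)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):    # rising edge of the triangle
            bank[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):   # falling edge of the triangle
            bank[i - 1, k] = (right - k) / max(right - center, 1)
    return bank

def mfcc_frame(frame, sample_rate, n_filters=26, n_coeffs=13):
    """MFCCs for one frame: window -> power spectrum -> mel energies -> log -> DCT-II."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2 / n_fft
    log_energies = np.log(mel_filter_bank(n_filters, n_fft, sample_rate) @ power + 1e-10)
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))  # unnormalized DCT-II basis
    return basis @ log_energies

# Hypothetical input: one 512-sample frame of a 440 Hz tone at 16 kHz
sample_rate = 16000
frame = np.sin(2 * np.pi * 440 * np.arange(512) / sample_rate)
bank = mel_filter_bank(26, 512, sample_rate)
coeffs = mfcc_frame(frame, sample_rate)
print(bank.shape, coeffs.shape)
```

The filters are narrow at low frequencies and wide at high frequencies, which is what compresses the spectrogram in a way loosely modeled on human hearing.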

Speech recognition is still a growing field. … The Fast Fourier Transform (FFT) is the traditional technique for analyzing the frequency spectrum of a signal in speech recognition.
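
As a small illustration of FFT-based spectrum analysis, the sketch below finds the strongest frequency in a signal with NumPy. The test signal (a 440 Hz tone plus a weaker 1200 Hz tone) and the helper name `dominant_frequency` are assumptions for the example.

```python
import numpy as np

def dominant_frequency(signal, sample_rate):
    """Return the frequency (Hz) of the largest peak in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return freqs[np.argmax(spectrum)]

# One second of audio at 8 kHz, so each FFT bin is exactly 1 Hz wide
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t) + 0.3 * np.sin(2 * np.pi * 1200 * t)

dominant_hz = dominant_frequency(signal, sample_rate)
print(dominant_hz)
```

For real speech, the spectrum is usually computed frame by frame (a short-time Fourier transform) rather than over the whole recording at once.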

WaveNet

Conditional WaveGAN Explained

A lot of things happened after my participation in Deep Learning Camp Jeju last summer. First and foremost, I graduated…

medium.com

chaeyoung-lee/cwavegan

Official implementation of CWaveGAN | paper | slides Chae Young Lee, Anoop Toffy. In this paper, we developed…

github.com

NGC

NVIDIA NGC

ngc.nvidia.com

Real Time Cloning

CorentinJ/Real-Time-Voice-Cloning

This repository is an implementation of Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech…

github.com

Dog voice Identification

Dog voice identification (ID) for detection system

Voice recognition systems have become the important applications for speech recognition technology. In this paper, an…

ieeexplore.ieee.org

https://www.researchgate.net/publication/261394450_Dog_voice_identification_ID_for_detection_system/link/548e989a0cf214269f244515/download

Automatic Cry Recognition

Baby voice Detection

Voice Synthesis

Mean Opinion Score (MOS) for each voice. Test subjects rated each voice on a scale of 1–5 according to how much it sounded like natural speech.
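
Computing a MOS is simple arithmetic: average the 1–5 ratings per voice. The sketch below also reports a rough 95% interval using a normal approximation, which is an assumption that only holds loosely for small panels. The ratings and voice names are made up for illustration and are not results from any study.

```python
from math import sqrt
from statistics import mean, stdev

def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("MOS ratings must be on the 1-5 scale")
    mos = mean(ratings)
    # Normal approximation; with one rating there is no spread to estimate
    half_width = 1.96 * stdev(ratings) / sqrt(len(ratings)) if len(ratings) > 1 else 0.0
    return mos, half_width

# Hypothetical listener ratings
ratings = {
    "voice_a": [4, 5, 4, 4, 3],
    "voice_b": [2, 3, 3, 2, 3],
}
scores = {name: mean_opinion_score(r) for name, r in ratings.items()}
for name, (mos, hw) in scores.items():
    print(f"{name}: MOS {mos:.2f} ± {hw:.2f}")
```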

Conditional Voice Synthesis

Pixel Recurrent Neural Networks

Modeling the distribution of natural images is a landmark problem in unsupervised learning. This task requires an image…

arxiv.org

Keywords from the Meeting

Low pass feature

Fourier Transform and then transform back
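
One way to read these two keywords together is a low-pass filter done in the frequency domain: take the Fourier transform, zero out everything above a cutoff, and transform back. The sketch below is that interpretation under stated assumptions. The cutoff, the test tones, and the function name `fft_low_pass` are illustrative, and a hard cutoff like this can cause ringing on real audio.

```python
import numpy as np

def fft_low_pass(signal, sample_rate, cutoff_hz):
    """Zero all frequency components above cutoff_hz and transform back."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    spectrum[freqs > cutoff_hz] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))

# A 200 Hz tone we want to keep, plus a 3000 Hz tone we want to remove
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
low = np.sin(2 * np.pi * 200 * t)
high = 0.5 * np.sin(2 * np.pi * 3000 * t)

filtered = fft_low_pass(low + high, sample_rate, cutoff_hz=1000)
print(np.max(np.abs(filtered - low)))
```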

THAI SER

IEMOCAP

Speech Emotion Recognition IEMOCAP

CSTR Voice Cloning Toolkit (VCTK)

44 hours from 109 speakers

https://www.researchgate.net/publication/346248936_Non-parallel_Voice_Conversion_based_on_Hierarchical_Latent_Embedding_Vector_Quantized_Variational_Autoencoder

Unsupervised speech representation learning using WaveNet autoencoders

We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding…

arxiv.org

Generating Diverse High-Fidelity Images with VQ-VAE-2

We explore the use of Vector Quantized Variational AutoEncoder (VQ-VAE) models for large scale image generation. To…

arxiv.org
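
The core operation in a VQ-VAE is the quantization step: each encoder output vector is replaced by its nearest entry in a learned codebook. The sketch below shows only that lookup, with a tiny hand-written codebook standing in for a learned one. Training, the straight-through gradient, and the hierarchy used in VQ-VAE-2 are left out.

```python
import numpy as np

def vector_quantize(latents, codebook):
    """Map each latent vector to the index and value of its nearest codebook entry."""
    # Squared Euclidean distance between every latent and every code: shape (n, k)
    distances = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    indices = distances.argmin(axis=1)
    return indices, codebook[indices]

# Hypothetical 2-D codebook with three entries and three encoder outputs
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 0.7]])

indices, quantized = vector_quantize(latents, codebook)
print(indices)
```

The discrete indices are what a downstream prior models. For speech, the same lookup is applied to frame-level latent vectors rather than image patches.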

Uncovering Latent Style Factors for Expressive Speech Synthesis

Prosodic modeling is a core problem in speech synthesis. The key challenge is producing desirable prosody from textual…

arxiv.org

Audio samples from "Uncovering Latent Style Factors for Expressive Speech Synthesis"

Authors: Yuxuan Wang, RJ Skerry-Ryan, Ying Xiao, Daisy Stanton, Joel Shor, Eric Battenberg, Rob Clark, Rif A. Saurous…

google.github.io

WaveNet: A Generative Model for Raw Audio

This paper introduces WaveNet, a deep neural network for generating raw audio waveforms. The model is fully…

arxiv.org
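
One detail of WaveNet that is easy to try out is its mu-law companding, which maps each audio sample to one of 256 discrete values so the network can predict a categorical distribution per sample. The sketch below is a minimal NumPy version of that transform and its inverse. The function names and the rounding scheme are assumptions for illustration, not code from the paper.

```python
import numpy as np

def mu_law_encode(audio, mu=255):
    """Compress audio in [-1, 1] and quantize it to integer codes 0..mu."""
    audio = np.clip(audio, -1.0, 1.0)
    compressed = np.sign(audio) * np.log1p(mu * np.abs(audio)) / np.log1p(mu)
    return ((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)

def mu_law_decode(codes, mu=255):
    """Invert mu_law_encode, up to quantization error."""
    compressed = 2.0 * (codes.astype(np.float64) / mu) - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu

# Round-trip a synthetic tone (stand-in for a speech waveform)
t = np.arange(1600) / 16000
audio = 0.8 * np.sin(2 * np.pi * 220 * t)
codes = mu_law_encode(audio)
restored = mu_law_decode(codes)
print(codes.min(), codes.max(), np.max(np.abs(audio - restored)))
```

Mu-law spends more of the 256 levels on quiet samples than on loud ones, which suits speech, where most samples are near zero.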

TensorFlow TTS (Text to Speech)

chaeyoung-lee/cwavegan

Conditional WaveGAN: Generating audio samples conditioned on class labels - chaeyoung-lee/cwavegan

github.com

ljspeech | TensorFlow Datasets

TensorFlow Lite for mobile and embedded devices

www.tensorflow.org

Tacotron: Towards End-to-End Speech Synthesis

A text-to-speech synthesis system typically consists of multiple stages, such as a text analysis frontend, an acoustic…

arxiv.org

TensorSpeech/TensorFlowTTS

Based on the script train_multiband_melgan.py. This example code shows you how to train MelGAN from scratch with…

github.com

Google Colaboratory

colab.research.google.com

Written by Prim Wong