My Work on Speech Processing
It started in 2018 when I joined my current company. We were discussing where speech processing could be applied and how the technology could improve people's lives. The problem was that I had never studied it before: I had zero knowledge of speech processing and only some working understanding of deep learning. A couple of colleagues had already started on it, but it was still a chance for me to explore something new. I had spent most of my time on tabular data such as customer profiles, transactions, and web event logs, and I had never seen what speech data looks like. Month after month I dug into the field, learning a lot from free courses, video lectures, tutorials, and blogs. At some point I felt I knew how to work with this kind of data. I will not talk about the implementation details here, because that is the boring part.
Voice Activity Detection
Let's start with this one. Voice Activity Detection (VAD) plays an important role in an automatic speech recognition system: it detects whether a human voice is present. When a voice is detected, the system starts recording and sends the audio downstream. This is useful because we should not send all of the audio to the system.
I used wiseman's software from GitHub to get started.
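For anyone curious what this looks like in practice, here is a minimal sketch assuming the library in question is py-webrtcvad; the sample rate, frame length, and the speech_frames helper are my own illustrative choices rather than part of the original setup.

```python
# A minimal sketch of frame-level voice activity detection with py-webrtcvad.
# Assumes 16 kHz, 16-bit mono PCM audio and 30 ms frames.
import webrtcvad

vad = webrtcvad.Vad(2)          # aggressiveness from 0 (least) to 3 (most)

SAMPLE_RATE = 16000             # webrtcvad supports 8k/16k/32k/48k Hz
FRAME_MS = 30                   # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2   # 16-bit samples = 2 bytes each

def speech_frames(pcm: bytes):
    """Yield only the frames that contain speech."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm[start:start + FRAME_BYTES]
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame   # e.g. buffer these and send them on to the recognizer
```

The idea is simply to drop the silent frames on the client side so only frames that actually contain speech go to the recognition system.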
Speech Enhancement
This task had been assigned to me a year before I actually worked on it. I am not sure why, perhaps because I did not yet have an approach for tackling it with deep learning. I started the research once it became crucial for the acoustic model. The environment we wanted to deploy in is not clean: noise is everywhere and it changes constantly, and that noise hurts the acoustic model's performance. So the goal of the task was to improve the acoustic model's accuracy in converting speech to text.
At first, I tried a Generative Adversarial Network (GAN), which is mostly used for image processing. It uses two deep learning models: one acts as a generator (producing samples) and the other as a discriminator. I learned GANs the hard way (jumping straight to the code), which you can download here. I realized this was not quite what I wanted. But before giving up on GANs, I found an extension of the idea: the Conditional Generative Adversarial Network (cGAN), where you give the network a conditioning input so you can control what it generates, depending on your objective. Then I applied a cGAN to speech enhancement, and... the result was just about acceptable to human ears.
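To make the generator/discriminator idea concrete, here is a rough PyTorch sketch of one cGAN training step on spectrogram frames, where the noisy frame is the conditioning input. The layer sizes, the L1 term, and the Generator / Discriminator classes are illustrative assumptions, not the exact model I used.

```python
# A rough sketch of one cGAN training step for spectrogram enhancement.
# Shapes and layer sizes are illustrative assumptions, not a real configuration.
import torch
import torch.nn as nn

FREQ = 161  # STFT magnitude bins per frame

class Generator(nn.Module):
    # maps a noisy frame (the condition) to an enhanced frame
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(FREQ, 512), nn.ReLU(),
                                 nn.Linear(512, FREQ), nn.ReLU())
    def forward(self, noisy):
        return self.net(noisy)

class Discriminator(nn.Module):
    # scores a (condition, candidate) pair: real clean frame vs. generated one
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * FREQ, 512), nn.LeakyReLU(0.2),
                                 nn.Linear(512, 1))
    def forward(self, noisy, candidate):
        return self.net(torch.cat([noisy, candidate], dim=-1))

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

noisy = torch.rand(8, FREQ)   # dummy batch: noisy magnitude frames
clean = torch.rand(8, FREQ)   # dummy batch: paired clean frames
ones, zeros = torch.ones(8, 1), torch.zeros(8, 1)

# discriminator step: push real pairs toward 1, generated pairs toward 0
fake = G(noisy).detach()
loss_d = bce(D(noisy, clean), ones) + bce(D(noisy, fake), zeros)
opt_d.zero_grad(); loss_d.backward(); opt_d.step()

# generator step: fool the discriminator (plus an L1 term toward the clean target)
fake = G(noisy)
loss_g = bce(D(noisy, fake), ones) + nn.functional.l1_loss(fake, clean)
opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

The conditioning is simply that the discriminator always sees the noisy frame next to the candidate, so it judges pairs rather than frames in isolation.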
I did not stop there, because it is hard to control two architectures at once, and the objective function also bothered me; I felt it would be better to have a separate loss for each network, but I did not dig too deeply into that. Instead, I found another approach to the task: an encoder-decoder network. I read a lot of papers from Interspeech, the conference that specializes in speech processing; close to a hundred of them. One paper struck me as applicable: “A Convolutional Recurrent Neural Network for Real-Time Speech Enhancement” by Ke Tan and DeLiang Wang from Ohio State University. Their work relies on a convolutional encoder-decoder network, although I am not sure it is a true encoder-decoder (a conclusion I reached after finishing the speech recognition project, which also uses one). Basically, the signal is transformed into a 161-dimensional STFT representation and fed into convolutional layers; the representation is large at the input and shrinks toward the middle, where it passes through a 2-layer LSTM, and then deconvolution layers bring it back up to the same size as the input. If you are asking whether it worked: I can say it was better than the GAN.
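Here is a loose PyTorch sketch of that overall shape: a convolutional encoder over 161-bin STFT frames, a 2-layer LSTM in the middle, and a deconvolutional decoder back to the input size. The kernel sizes and channel counts are simplified guesses and the paper's skip connections are left out, so treat it as an outline rather than the paper's model.

```python
# A loose sketch of a convolutional encoder / 2-layer LSTM / deconvolutional decoder
# over 161-bin STFT magnitude frames. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class CRNSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: shrink the frequency axis, keep the time axis intact
        self.enc1 = nn.Conv2d(1, 16, kernel_size=(1, 3), stride=(1, 2))   # 161 -> 80
        self.enc2 = nn.Conv2d(16, 32, kernel_size=(1, 3), stride=(1, 2))  # 80 -> 39
        # bottleneck: 2-layer LSTM over time on the flattened encoder output
        self.lstm = nn.LSTM(32 * 39, 32 * 39, num_layers=2, batch_first=True)
        # decoder: mirror the encoder back up to the input size
        self.dec1 = nn.ConvTranspose2d(32, 16, kernel_size=(1, 3), stride=(1, 2),
                                       output_padding=(0, 1))             # 39 -> 80
        self.dec2 = nn.ConvTranspose2d(16, 1, kernel_size=(1, 3), stride=(1, 2))  # 80 -> 161

    def forward(self, x):                      # x: (batch, 1, time, 161)
        h = torch.relu(self.enc1(x))
        h = torch.relu(self.enc2(h))           # (batch, 32, time, 39)
        b, c, t, f = h.shape
        h, _ = self.lstm(h.permute(0, 2, 1, 3).reshape(b, t, c * f))
        h = h.reshape(b, t, c, f).permute(0, 2, 1, 3)
        h = torch.relu(self.dec1(h))
        return self.dec2(h)                    # (batch, 1, time, 161) enhanced frames

model = CRNSketch()
noisy = torch.rand(2, 1, 100, 161)             # dummy batch: 100 noisy STFT frames
print(model(noisy).shape)                      # torch.Size([2, 1, 100, 161])
```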
Within the time I was given, I still had room for improvement (that is, to try another method). Some researchers tried a fully convolutional approach in “A Fully Convolutional Neural Network for Speech Enhancement” by Se Rim Park and Jin Won Lee. It uses only convolutional layers with the STFT as the feature, but it forms the features differently: the input frames differ from the output frame. You should check out the paper for the details.
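To illustrate the input-versus-output-frame idea, here is a tiny fully convolutional sketch that takes a window of noisy STFT frames and predicts a single enhanced frame; the window size of 8 and the layer sizes are my assumptions, not necessarily what the paper uses.

```python
# Sketch of the "context frames in, single frame out" idea with a fully
# convolutional network. Window size and channel counts are assumptions.
import torch
import torch.nn as nn

FREQ, CONTEXT = 161, 8                    # frequency bins, input frames per example

fcn = nn.Sequential(
    # treat the context window as channels and convolve along frequency
    nn.Conv1d(CONTEXT, 32, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(32, 32, kernel_size=9, padding=4), nn.ReLU(),
    nn.Conv1d(32, 1, kernel_size=9, padding=4),   # one enhanced output frame
)

noisy_window = torch.rand(4, CONTEXT, FREQ)       # dummy batch of 4 context windows
print(fcn(noisy_window).shape)                    # torch.Size([4, 1, 161])
```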
I also ran an “unusual experiment”: I stacked 6 LSTM layers with a hidden size of 256 and fed the data into the network. It did not work. Just don't try what I did, folks.
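For the curious, the experiment boiled down to roughly this; a sketch of the configuration (which did not work for me), not the exact training code.

```python
# Roughly the "unusual experiment": 6 stacked LSTM layers, hidden size 256,
# mapping noisy STFT frames to enhanced frames. Projection layer is an assumption.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=161, hidden_size=256, num_layers=6, batch_first=True)
proj = nn.Linear(256, 161)            # project back to the STFT dimension

noisy = torch.rand(2, 100, 161)       # dummy batch: 100 noisy frames
out, _ = lstm(noisy)
enhanced = proj(out)                  # (2, 100, 161)
```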
Speech Recognition
The last thing I did in speech processing was speech recognition. I tried CMUSphinx at first, and it did not work well. In the end, I used an encoder-decoder network for speech to text. For this task I also learned about attention and sequence-to-sequence models from here. At that time I was moving to PyTorch from Keras and its buddy TensorFlow. The challenge was that it requires a huge amount of annotated data. I used speech data from BPPT, and we also developed our own dataset in-house! Most acoustic models need 80–300 hours of audio and roughly 100 speakers. The result was pretty stunning, but it had some drawbacks that I cannot go into here.
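To give a flavor of the approach, here is a bare-bones PyTorch sketch of an attention-based encoder-decoder for speech to text. The feature dimension, vocabulary size, and layer sizes are made-up assumptions, and a real acoustic model is much larger and trained on the kind of data volumes mentioned above.

```python
# A bare-bones sketch of an attention-based encoder-decoder for speech to text.
# All sizes are illustrative assumptions, not a production configuration.
import torch
import torch.nn as nn

FEAT, HID, VOCAB = 80, 256, 32       # e.g. 80 log-mel bins, 32 output characters

class Seq2SeqASR(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.LSTM(FEAT, HID, num_layers=2, batch_first=True,
                               bidirectional=True)
        self.embed = nn.Embedding(VOCAB, HID)
        self.decoder = nn.LSTMCell(HID + 2 * HID, HID)
        self.attn_query = nn.Linear(HID, 2 * HID)
        self.out = nn.Linear(HID + 2 * HID, VOCAB)

    def forward(self, feats, targets):
        enc, _ = self.encoder(feats)                   # (B, T, 2*HID)
        B = feats.size(0)
        h = feats.new_zeros(B, HID)
        c = feats.new_zeros(B, HID)
        logits = []
        for t in range(targets.size(1)):
            # dot-product attention: score each encoder frame against the decoder state
            query = self.attn_query(h).unsqueeze(1)                   # (B, 1, 2*HID)
            weights = torch.softmax((enc * query).sum(-1), dim=-1)    # (B, T)
            context = (weights.unsqueeze(-1) * enc).sum(1)            # (B, 2*HID)
            # teacher forcing: feed the previous ground-truth character
            inp = torch.cat([self.embed(targets[:, t]), context], dim=-1)
            h, c = self.decoder(inp, (h, c))
            logits.append(self.out(torch.cat([h, context], dim=-1)))
        return torch.stack(logits, dim=1)              # (B, L, VOCAB)

model = Seq2SeqASR()
feats = torch.rand(2, 120, FEAT)                       # dummy batch: 120 acoustic frames
targets = torch.randint(0, VOCAB, (2, 15))             # dummy character targets
print(model(feats, targets).shape)                     # torch.Size([2, 15, 32])
```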
Closing
If you share an interest in any of the topics above, feel free to contact me through whatever channel you have. I would also be glad to have a small discussion about further projects. I have also started working on modeling Indonesian songs for a Jakarta AI Research project.