Speech-to-Text for C3 AI Assistant

Leslie Lim
Published in d*classified
Mar 23, 2023

Speech is a natural way of interacting with an AI Assistant. Automatic speech recognition (ASR) models are used in commercial AI Assistant products to transcribe users’ speech to text, which is then interpreted as commands or queries. We investigated adapting ASR models to domain-specific lingo and the local accent, and their use with a C3 (Command, Control, and Communication) AI Assistant in an offline environment. This was a project by Pang Kai Lin for her internship at C3 Development, supervised by Meo Kok Eng and Wong En Teng.

Photo by Ivan Bandura on Unsplash

Background

AI Assistants help operators of C3 systems with tasks such as information retrieval and content recommendation. While operators typically interact with AI Assistants via a text interface, speech is a more natural mode of interaction and allows operators to give commands to the AI Assistant at any time. However, pre-trained ASR models may not recognise military domain-specific lingo and the local accent. There is therefore a need to adapt these ASR models and continuously improve them for our use.

In this project, there were three goals for the trained ASR model:

  1. Recognise speech in a Singaporean accent
  2. Recognise common C3 vocabulary (e.g. ‘incident reporting’)
  3. Recognise lingo not found in English corpora (e.g. non-English location names such as ‘Yishun’ and ‘Bishan’)

To achieve the three project goals, pre-trained ASR models were fine-tuned. In addition, synthetic audio data was generated using Text-to-Speech (TTS) models to provide training data from text conversations with the AI Assistant. The following sections describe the experiments and results.

Model Training Workflow Overview

An on-premise AI development platform was used for the experiments. Model training was orchestrated with ClearML, which automatically built the required container using code from GitLab, a Docker image from the Harbor registry, and data from an S3 bucket. The container was then executed on a local GPU cluster over Kubernetes. ClearML tracked the experiments and versioned the models.
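As a rough illustration of this workflow, a training script registered with ClearML might look like the sketch below. The project, task, queue, dataset and Docker image names are placeholders, not the actual configuration used in the project.

```python
from clearml import Task, Dataset

# Register this script as a ClearML task (project and task names are placeholders).
task = Task.init(project_name="c3-assistant/asr", task_name="quartznet-finetune")

# Docker image the agent should pull from the registry (illustrative tag).
task.set_base_docker("harbor.local/ml/nemo-train:latest")

# Hand execution off to an agent on the GPU cluster; the local run stops here
# and the task is re-launched inside the built container (queue name is a placeholder).
task.execute_remotely(queue_name="gpu-cluster")

# Inside the container: fetch the versioned training data from the S3-backed dataset store.
data_dir = Dataset.get(
    dataset_project="c3-assistant/asr", dataset_name="synthetic-speech"
).get_local_copy()

# ...training code using data_dir runs here; ClearML tracks metrics and model artefacts...
```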

ASR Model

The baseline ASR model chosen was the NVIDIA QuartzNet 15x5 model from the NeMo (Neural Modules) toolkit. It was chosen because it was trained on 2000 hours of Singapore’s National Speech Corpus (NSC), so the pre-trained model would already be relatively well accustomed to the Singaporean accent.

Transfer learning was conducted on the pre-trained model, with both the encoder and decoder frozen and the batch normalization layers unfrozen. Freezing these layers helps prevent the model from overfitting, since the dataset used for fine-tuning is far smaller than the 7000 hours of speech used to train the pre-trained model.
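A minimal sketch of this freezing scheme in NeMo is shown below, assuming a local checkpoint of the NSC pre-trained QuartzNet model; the checkpoint path is a placeholder.

```python
import torch
import nemo.collections.asr as nemo_asr

# Load the pre-trained QuartzNet checkpoint (placeholder path).
model = nemo_asr.models.EncDecCTCModel.restore_from("quartznet15x5_nsc.nemo")

# Freeze the encoder and decoder so the small fine-tuning set cannot shift them much.
model.encoder.freeze()
model.decoder.freeze()

# Unfreeze only the batch normalization layers so their statistics and affine
# parameters can still adapt to the new audio conditions.
for module in model.modules():
    if isinstance(module, torch.nn.BatchNorm1d):
        module.train()
        for param in module.parameters():
            param.requires_grad = True
```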

TTS and Vocoder Models

The baseline models chosen were the NVIDIA FastPitch transformer TTS model, which converts text to spectrograms, and the NVIDIA HiFiGAN vocoder, which converts spectrograms into audio waveforms.

The pre-trained FastPitch TTS model was trained on LJSpeech, a 24-hour corpus of American English speech, and fine-tuned with NSC Part 1 and Part 2. Fine-tuning ensures that the synthetic voices generated have a Singaporean tone and accent, supporting the first project goal listed above. The Part 1 corpus consists of standard English without any non-English vocabulary, while Part 2 consists of Singlish, including non-English terms such as Singaporean names, locations and food. Each speaker has approximately 40–60 minutes of speech recorded in the NSC. A separate speech model was trained for each individual speaker, and all of the models were used in speech generation to increase the dataset size and speech diversity.
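As a rough illustration, generating one synthetic utterance from a fine-tuned FastPitch checkpoint and a HiFiGAN vocoder in NeMo might look like the following sketch; both checkpoint paths and the example sentence are placeholders.

```python
import soundfile as sf
from nemo.collections.tts.models import FastPitchModel, HifiGanModel

# Load a speaker-specific FastPitch model and the HiFiGAN vocoder (placeholder paths).
spec_generator = FastPitchModel.restore_from("fastpitch_speaker91.nemo").eval()
vocoder = HifiGanModel.restore_from("hifigan.nemo").eval()

# Text -> mel-spectrogram -> waveform.
tokens = spec_generator.parse("Report the latest incident at Yishun.")
spectrogram = spec_generator.generate_spectrogram(tokens=tokens)
audio = vocoder.convert_spectrogram_to_audio(spec=spectrogram)

# FastPitch operates at 22.05 kHz, so the waveform is written out at that rate.
sf.write("synthetic_0001.wav", audio.squeeze().detach().cpu().numpy(), samplerate=22050)
```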

Synthetic Speech Experiment

Performance of Synthetic Speech

The TTS model was fine-tuned separately with speech data from 10 speakers in each part of the NSC, producing 10 speaker models per part. A script consisting only of standard English was used to generate synthetic speech, which was then transcribed by the ASR model as test data. The metric used for comparison is the word error rate (WER).
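WER counts the substitutions, insertions and deletions in a transcription, divided by the number of words in the reference transcript. A quick way to compute it is sketched below using the jiwer package; this is an assumed tool for illustration, not necessarily what was used in the project.

```python
from jiwer import wer

reference = "proceed to the incident reporting checkpoint at bishan"
hypothesis = "proceed to the incident reporting checkpoint at bee shan"

# WER = (substitutions + insertions + deletions) / number of reference words
print(f"WER: {wer(reference, hypothesis):.3f}")
```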

The WER of the ASR model’s transcriptions was 23.254% when tested with synthetic speech generated by speaker models from NSC Part 1, and 28.862% for speaker models from NSC Part 2. The quality of speech was better for Part 1, as expected, because that corpus contains only standard English phrases. By comparison, the LibriSpeech dev-other dataset, a natural American speech dataset, had a WER of 11.3% when transcribed, so the synthetic speech did not perform as well as natural speech.

Optimizing Quality of Synthetic Speech

To improve the quality of the synthetic speech, the techniques listed below were explored. Data from speaker 91 in NSC Part 1 was used to fine-tune the TTS model. The WER using the original synthetic speech generated was 22.133%.

  1. Fine-tuning the vocoder with spectrograms generated from natural speech data. The usual industry approach to generating personalised synthetic speech is to fine-tune the TTS model, but fine-tuning the vocoder might also improve results. WER increased to 32.173%. Since there is currently little literature on fine-tuning vocoders, my approach may have been inaccurate and resulted in overfitting.
  2. Reducing noise in the data before using it to fine-tune the TTS model. WER increased to 23.821%.
  3. Upsampling the data to 22.05 kHz before using it to fine-tune the TTS model, since the NVIDIA FastPitch model was trained on speech sampled at 22.05 kHz. WER decreased to 21.646%.
  4. Removing silence at the start and end of audio files before using them to fine-tune the TTS model. Past literature has suggested the effectiveness of this method [9]. WER decreased to 19.610%.
  5. Reducing pitch peaks in the audio files using the TD-PSOLA algorithm before using them to fine-tune the TTS model. Past literature has suggested the effectiveness of this method. WER increased to 25.074%.
  6. Reducing noise in the generated synthetic speech before feeding it into the ASR model for transcription. The signal-to-noise ratio of the synthetic speech averaged 54.542 dB while that of natural speech averaged 81.800 dB, so one hypothesis for the comparatively higher WER on synthetic speech is the excess noise. WER increased to 23.160%.
  7. Downsampling the generated synthetic speech to 16 kHz before feeding it into the ASR model for transcription, since the NVIDIA QuartzNet model was trained on speech sampled at 16 kHz. WER decreased to 21.211%. (A sketch of the resampling and silence-trimming steps in items 3, 4 and 7 is shown after this list.)
  8. Using combined data from multiple speakers to fine-tune the TTS model, generating a single synthetic voice from multiple speakers. The larger dataset could improve the quality of the synthetic voice, but could also worsen it if the tone, pitch and intonation of the speakers vary excessively. The first experiment combined data from speakers of similar pitch and tone: using separate individual speech models, the WERs for speakers 91, 124 and 135 were 22.133%, 24.639% and 16.269% respectively, and when their speech data was combined to train a single model, the WER decreased to 15.452%, lower than that of any individual speaker. The second experiment combined data from speakers of diverse tone and pitch: using separate individual speech models, the WERs for speakers 135, 208 and 245 were 16.269%, 18.392% and 20.498% respectively, and, as hypothesised, when their speech data was combined to train a single model the WER increased to 21.472%, higher than that of any individual speaker.
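A minimal sketch of the resampling and silence-trimming preprocessing referenced above is shown below, using librosa and soundfile; the trim threshold of 30 dB and the file paths are illustrative choices, not values taken from the project.

```python
import librosa
import soundfile as sf

def preprocess_for_tts(in_path: str, out_path: str, target_sr: int = 22050) -> None:
    """Resample a clip to the 22.05 kHz FastPitch training rate and trim edge silence."""
    audio, _ = librosa.load(in_path, sr=target_sr)        # load and resample in one step
    trimmed, _ = librosa.effects.trim(audio, top_db=30)   # drop quiet leading/trailing frames
    sf.write(out_path, trimmed, samplerate=target_sr)

def downsample_for_asr(in_path: str, out_path: str, target_sr: int = 16000) -> None:
    """Downsample synthetic speech to the 16 kHz rate QuartzNet expects."""
    audio, _ = librosa.load(in_path, sr=target_sr)
    sf.write(out_path, audio, samplerate=target_sr)
```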

Speech-to-Text Experiments

Recognise Speech in Singaporean Accent

To test this, I tried fine-tuning the ASR model with synthetic speech generated by an individual speaker model, to check whether the trained model could then better recognise natural speech from that same speaker. Speaker 2002 from NSC Part 2 was chosen to fine-tune the TTS model, generating 4.43 hours of synthetic speech to train the ASR model. For testing, I used the natural speech of speaker 2002 in NSC Part 2 and observed the change in WER after training the ASR model. Although NSC Part 1 produced better quality synthetic speech, as shown in the previous section, I used the Part 2 corpus because the pre-trained ASR model had already been trained on NSC Part 1, so natural speech from the Part 1 corpus could not be used as test data.
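NeMo’s ASR training reads JSON-lines manifests that pair each audio file with its duration and transcript. A small sketch of building such a manifest from a folder of synthetic clips is shown below; the directory layout and the transcript mapping are illustrative assumptions.

```python
import json
from pathlib import Path

import soundfile as sf

def write_manifest(audio_dir: str, transcripts: dict, manifest_path: str) -> None:
    """Write a NeMo-style manifest: one JSON object per line with path, duration, text."""
    with open(manifest_path, "w", encoding="utf-8") as fout:
        for wav in sorted(Path(audio_dir).glob("*.wav")):
            entry = {
                "audio_filepath": str(wav),
                "duration": sf.info(str(wav)).duration,
                "text": transcripts[wav.stem],  # e.g. "speaker2002_0001" -> its script line
            }
            fout.write(json.dumps(entry) + "\n")
```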

The original WER was 36.290%, but it increased to 38.746%, 42.492% and 77.842% after training for 1, 2 and 5 epochs respectively. The increase in WER could possibly be attributed to overfitting. One likely contributor to the poorer performance after training is the lack of diversity: ASR models are usually trained on a group of speakers so that the model learns the pronunciation of specific phonemes rather than overfitting to an individual’s tone. Since only one speaker was used here, the model may have overfitted, so this may not have been the right approach for testing. Another contributor could be the low quality of the synthetic speech, particularly as it was generated using NSC Part 2, whose synthetic speech gave a higher WER.

Recognise Common C3 Vocabulary

To test this, I tried fine-tuning the ASR model with synthetic speech consisting of specific terms, to check whether the trained model could recognise these terms better. 10 speaker models were trained using NSC Part 1, generating 35 minutes of training data from a script of 50 random sentences. To test the ASR model after fine-tuning, I recorded myself saying the same 50 sentences and checked whether the WER decreased after training. The original WER on this test data was 5.41%. The WER decreased to 4.19% after training for 2 epochs, which suggests that training the ASR model with synthetic speech could allow us to achieve this goal.

Recognise Lingos not in English Corpus

To test this, 8 non-English terms, namely Bishan, Boon Lay, Bukit Batok, Bukit Gombak, Jurong, Sembawang, Tiong Bahru and Yishun, were chosen as the ‘new vocabulary’. The TTS model was fine-tuned separately with speech data from 10 speakers in each part of the NSC, producing 10 speaker models per part and 2.4 minutes of synthetic speech for training. To test how well the trained model recognises these 8 terms, I collected 9 minutes of test data from friends and family recording themselves saying the terms.

The initial WER was 126.93% (WER can exceed 100% because insertion errors are counted on top of substitutions and deletions), but it decreased to 106.931%, 110.89% and 95.643% after the ASR model was trained with synthetic speech generated by speaker models from NSC Part 1 for 1, 2 and 5 epochs respectively. The WER also decreased, to 91.881%, 70.297% and 56.436%, after the ASR model was trained with synthetic speech generated by speaker models from NSC Part 2 for 1, 2 and 5 epochs respectively. The decrease in WER may suggest that training the ASR model with synthetic speech could allow us to achieve this goal.

The better performance of synthetic speech generated by the NSC Part 2 speaker models may be attributed to the Singlish corpus, which did not contain the exact 8 non-English terms but did contain words with similar phonemes. For instance, one speaker was recorded saying Bah Kut Teh, which enabled her synthetic voice to pronounce ‘Bahru’ in Tiong Bahru accurately. Similarly, the same speaker was recorded saying the Chinese surname Wang, enabling her synthetic voice to pronounce ‘wang’ in Sembawang accurately.

Putting it Together

Speech-to-text Feature

A prototype speech-to-text feature was implemented as a Botpress custom composer module to demonstrate the output of the ASR model. The feature adds a microphone icon beside the composer to start and stop recording. When the microphone icon is clicked, the MediaStream Recording API records the audio and encodes it as a Base64 string, which is posted to a locally hosted API that converts the string back into a .wav file and transcribes it with the ASR model. The transcript is then updated in the composer.
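A rough sketch of such a locally hosted transcription endpoint is shown below. Flask is an assumed framework here (the project’s actual implementation details are not specified), and the checkpoint path is a placeholder; the endpoint decodes the Base64 payload to a .wav file and runs the fine-tuned ASR model over it.

```python
import base64
import tempfile

from flask import Flask, jsonify, request
import nemo.collections.asr as nemo_asr

app = Flask(__name__)

# Load the fine-tuned ASR model once at startup (placeholder path).
asr_model = nemo_asr.models.EncDecCTCModel.restore_from("quartznet_c3_finetuned.nemo")

@app.route("/transcribe", methods=["POST"])
def transcribe():
    # The composer posts the recording as a Base64-encoded string.
    audio_bytes = base64.b64decode(request.json["audio"])
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(audio_bytes)
        tmp.flush()
        transcript = asr_model.transcribe([tmp.name])[0]
    return jsonify({"transcript": transcript})
```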

Text-to-Speech Feature

A prototype text-to-speech feature was also implemented to generate synthetic voices from text input to the chatbot, providing additional training data for the ASR model to learn C3 vocabulary. When the chatbot receives a text event, the text is sent to a locally hosted API. All text payloads are collated in a .txt file, which is then used as the script for generating speech that is fed into the ASR model for training.
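The collection side can be as simple as an endpoint that appends each incoming text payload to the script file; a minimal sketch, again assuming Flask and a placeholder file path, is shown below.

```python
from flask import Flask, request

app = Flask(__name__)
SCRIPT_PATH = "tts_training_script.txt"  # placeholder location for the collated script

@app.route("/collect-text", methods=["POST"])
def collect_text():
    # Append each chatbot text event to the script used for synthetic speech generation.
    with open(SCRIPT_PATH, "a", encoding="utf-8") as fout:
        fout.write(request.json["text"].strip() + "\n")
    return {"status": "ok"}
```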

Conclusion

The experiments demonstrated that pre-trained ASR models can be adapted for use in the C3 AI Assistant context. Performance of the TTS and ASR models could be further improved with techniques such as hyperparameter tuning and training on more relevant data. Quantitative methods for clustering speakers of similar pitch and tone could also be explored.
