AI Scholar: Achieving End-to-End Emotional Speech Synthesis

Christopher Dossman
AI³ | Theory, Practice, Business
2 min read · Jul 2, 2019


This research summary is just one of many that are distributed weekly on the AI scholar newsletter. To start receiving the weekly newsletter, sign up here.

When it comes to developing robust human-machine interaction models, emotional speech is a crucial component. As a result, there have been many recent attempts to add emotive effects to synthesized speech.

Research in this area is ongoing, and several prototypes and systems based on different synthesis techniques have been developed. For instance, several deep learning approaches have been designed to improve the naturalness of synthetic speech.

End-to-End Emotional Speech Synthesis Using Style Tokens and Semi-Supervised Training

Chinese researchers have proposed a semi-supervised training method for emotional speech synthesis (ESS) that uses global style tokens (GSTs), designed for the setting where only a small portion of the training data has emotion labels.

The architecture of the encoder used in the baseline Tacotron model.

The proposed model is based on the GST-Tacotron framework. Style tokens are defined to represent emotion categories, and a cross-entropy loss is introduced between the token weights and the emotion labels to establish a one-to-one correspondence between tokens and emotions. Model parameters are then estimated by multi-task learning, using the cross-entropy term only on the training samples that have emotion labels.
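To make the idea concrete, here is a minimal NumPy sketch of the mechanism described above. All names, sizes, and the loss weight `alpha` are illustrative assumptions, not the authors' implementation: a bank of style tokens is attended over to produce a style embedding, and when a sample carries an emotion label, a cross-entropy term on the attention weights ties each token to one emotion.

```python
import numpy as np

# Hypothetical sizes: 4 style tokens, one per emotion category
# (e.g. neutral, happy, sad, angry), each of dimension 8.
rng = np.random.default_rng(0)
NUM_TOKENS, TOKEN_DIM = 4, 8
tokens = rng.standard_normal((NUM_TOKENS, TOKEN_DIM))  # global style tokens

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def style_embedding(ref_embedding):
    """Attention over the token bank: the weights act as a soft
    emotion assignment; their weighted sum is the style embedding."""
    weights = softmax(tokens @ ref_embedding)          # shape (NUM_TOKENS,)
    return weights, weights @ tokens                   # shape (TOKEN_DIM,)

def semi_supervised_loss(recon_loss, weights, emotion_label=None, alpha=1.0):
    """The reconstruction loss always applies; the cross-entropy
    between token weights and the emotion label is added only for
    the labeled fraction of the data (the semi-supervised part)."""
    if emotion_label is None:                          # unlabeled sample
        return recon_loss
    ce = -np.log(weights[emotion_label] + 1e-9)        # cross entropy
    return recon_loss + alpha * ce

# One labeled and one unlabeled example, with a stand-in
# reference-encoder output and a stand-in reconstruction loss of 0.5.
ref = rng.standard_normal(TOKEN_DIM)
w, emb = style_embedding(ref)
loss_unlabeled = semi_supervised_loss(0.5, w)
loss_labeled = semi_supervised_loss(0.5, w, emotion_label=2)
```

In training, minimizing the cross-entropy term pushes each labeled sample's attention mass onto the token matching its emotion, which is what yields the one-to-one token-to-emotion correspondence the paper reports.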

Potential Uses and Effects

Improved emotional speech synthesis goes a long way toward improving all kinds of human-machine interaction.

On evaluation, the newly proposed model outperforms the conventional Tacotron emotional speech synthesis model when only 5% of the training data has emotion labels. Using just that 5%, it achieved naturalness and emotion expressiveness comparable to the conventional model trained with all emotion labels.

Emotion recognition experiments confirm that this method can effectively achieve a one-to-one correspondence between style tokens and emotion categories.

Read more: https://arxiv.org/abs/1906.10859

Thanks for reading. Please comment, share and remember to subscribe to our weekly newsletter for the most recent and interesting research papers! You can also follow me on Twitter and LinkedIn. Remember to 👏 if you enjoyed this article. Cheers!
