Making AI Sing is Hard

Richard Nuckolls
3 min read · Oct 30, 2023

I made a bad karaoke singer with Azure Speech.

Can you sing?

Azure Speech voices are trained for speaking, not singing, so it’s hard to make them sound like they’re singing. One way to approach this is with Speech Synthesis Markup Language (SSML). I tried setting pitch and rate word by word across a string of text. I got a result, but it sounds like bad karaoke.

What is SSML?

SSML is an XML-based markup language. It controls how an Azure Speech Service voice delivers its speech. Like any XML document, it starts with a root node and a namespace.

<speak 
version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
</speak>

Speak is the root node for SSML. This node declares the spoken language to use for the document. Child nodes contain the text to speak, along with modifiers for voice, pitch, speed and volume.
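
To see what the root node carries, here is a minimal sketch that parses the snippet above with Python's standard library and reads back the namespace, language, and version. The variable names are mine, and this only checks that the markup is well-formed XML; it does not call Azure Speech.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the built-in xml: prefix (used by xml:lang).
XML_NS = "http://www.w3.org/XML/1998/namespace"

# The root node from the article, unchanged.
ssml = """<speak
version="1.0"
xmlns="http://www.w3.org/2001/10/synthesis"
xml:lang="en-US">
</speak>"""

# fromstring raises xml.etree.ElementTree.ParseError on malformed markup.
root = ET.fromstring(ssml)

# ElementTree reports qualified names as {namespace}local-name.
tag = root.tag
lang = root.get(f"{{{XML_NS}}}lang")
version = root.get("version")

print(tag, lang, version)
```

Parsing locally like this is a cheap way to catch an unclosed tag before sending the document to a speech service.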

Next comes the Voice node. This defines the Voice that speaks the enclosed text. You can switch between voices as often as needed. The list of all Voices is available at Microsoft’s site.

<voice name="en-US-GuyNeural">
The quick brown…
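
The word-by-word pitch and rate idea from the top of the post can be sketched in Python like this. It wraps each word in an SSML prosody element inside the voice node. The helper name sing_ssml and the pitch and rate values are illustrative assumptions, not the settings I used, and the word list only covers the words shown above.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# (word, pitch, rate) per word. Values are placeholders, not a tuned melody.
NOTES = [
    ("The", "-2st", "-20%"),
    ("quick", "+2st", "-20%"),
    ("brown", "+4st", "-40%"),
]


def sing_ssml(notes, voice="en-US-GuyNeural", lang="en-US"):
    """Build an SSML document with one prosody element per word."""
    words = "\n".join(
        f"    <prosody pitch={quoteattr(pitch)} rate={quoteattr(rate)}>"
        f"{escape(word)}</prosody>"
        for word, pitch, rate in notes
    )
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>\n"
        f"  <voice name={quoteattr(voice)}>\n"
        f"{words}\n"
        "  </voice>\n"
        "</speak>"
    )


ssml = sing_ssml(NOTES)

# Sanity check: raises ParseError if the generated markup is not well-formed.
parsed = ET.fromstring(ssml)

print(ssml)
```

Escaping the words and quoting the attributes keeps the output valid XML even if a lyric contains characters like & or <.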

Richard Nuckolls

Richard Nuckolls loves designing software, making things, and seeing new places. He wrote Azure Storage, Streaming, and Batch Analytics.