Making AI Sing is Hard

Oct 30, 2023

I made a bad karaoke singer with Azure Speech.

Can you sing?

Azure Speech voices are trained for speaking, not singing, so it’s hard to make them sound like they’re singing. One way to approach the problem is Speech Synthesis Markup Language (SSML). I tried setting pitch and rate word by word across a string of text. I got a result, but it sounds like bad karaoke.
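Here is a minimal sketch of the word-by-word idea in Python, not the exact script behind this post. It assumes SSML's prosody element with relative pitch values in semitones ("+2st") and relative rate values in percent ("-40%"). The MELODY table and the sing helper are made up for illustration, and the numbers are guesses rather than tuned values.

```python
from xml.sax.saxutils import escape

# Hypothetical melody: (word, pitch shift in semitones, rate change in percent).
# These values are illustrative guesses, not tuned settings.
MELODY = [
    ("Row", 0, -40),
    ("row", 0, -40),
    ("row", 0, -20),
    ("your", 2, -20),
    ("boat", 4, -50),
]


def sing(notes, voice="en-US-GuyNeural", lang="en-US"):
    """Wrap each word in its own <prosody> element and return one SSML document."""
    words = []
    for word, semitones, rate in notes:
        # The :+d format always writes the sign, giving "+2st" or "-40%".
        words.append(
            f'<prosody pitch="{semitones:+d}st" rate="{rate:+d}%">'
            f"{escape(word)}</prosody>"
        )
    body = " ".join(words)
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">'
        f'<voice name="{voice}">{body}</voice>'
        "</speak>"
    )


ssml = sing(MELODY)
print(ssml)
```

The resulting string can be pasted into any tool that accepts SSML. Keeping the melody in a table makes it easy to nudge one word at a time, which is most of the work.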

What is SSML?

SSML is a markup language based on Extensible Markup Language (XML). SSML controls how an Azure Speech service voice delivers its speech. Like any XML document, it starts with a root node and a namespace.

<speak 
    version="1.0" 
    xmlns="http://www.w3.org/2001/10/synthesis" 
    xml:lang="en-US">
</speak>

Speak is the root node for SSML. This node declares the spoken language to use for the document. Child nodes contain the text to speak, along with modifiers for voice, pitch, speed and volume.
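Because SSML is XML, a standard XML parser can catch mistakes in the root node before the document goes anywhere near the speech service. The sketch below uses Python's built-in ElementTree. The describe_root helper is a hypothetical name, and the check only covers well-formedness and the root node, not the full SSML schema.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"  # namespace behind the xml: prefix


def describe_root(ssml: str) -> dict:
    """Parse an SSML string and return the version and language of its root node."""
    root = ET.fromstring(ssml)  # raises ET.ParseError if the XML is not well-formed
    if root.tag != f"{{{SSML_NS}}}speak":
        raise ValueError(f"expected a <speak> root in the SSML namespace, got {root.tag}")
    return {
        "version": root.get("version"),
        "lang": root.get(f"{{{XML_NS}}}lang"),
    }


doc = (
    '<speak version="1.0" '
    'xmlns="http://www.w3.org/2001/10/synthesis" '
    'xml:lang="en-US"></speak>'
)
info = describe_root(doc)
print(info)
```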

Next comes the Voice node. This defines the Voice that speaks the enclosed text. You can switch between voices as often as needed. The list of all Voices is available at Microsoft’s site.

<voice name="en-US-GuyNeural">
    The quick brown…
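Switching voices comes down to placing one voice node after another inside the same speak node. The sketch below builds such a document in Python. The duet helper and the sample lines are made up for illustration, and en-US-JennyNeural is assumed as a second voice from Microsoft's list.

```python
from xml.sax.saxutils import escape, quoteattr


def duet(lines, lang="en-US"):
    """Build one SSML document with a <voice> node per (voice_name, text) pair."""
    parts = [
        f"<voice name={quoteattr(name)}>{escape(text)}</voice>"
        for name, text in lines
    ]
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xml:lang="{lang}">' + "".join(parts) + "</speak>"
    )


# Hypothetical script: alternate between two voices.
ssml = duet([
    ("en-US-GuyNeural", "Can you sing?"),
    ("en-US-JennyNeural", "Not very well."),
    ("en-US-GuyNeural", "Me neither."),
])
print(ssml)
```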

Written by Richard Nuckolls