Talk to Me Tutorial Part 2: Speech Synthesis with the Web Speech API

Lindsay Greene · Voice Tech Podcast · Oct 17, 2020 · 5 min read

Hi, everyone! My name is Lindsay. I’m a learner looking to share my programming journey with other learners, and even with folks who may have more experience.

Talk to Me, an application that demonstrates the Web Speech API.

A quick introduction

Here I’ll be discussing speech synthesis as a follow-up to my tutorial on speech recognition. I explored both of these APIs by building a simple website called Talk to Me, which recommends web resources to the user based on how they say they are feeling.

Because this website is intended as something of a therapist, there not only to listen to how the user is feeling but also to suggest a next step, I needed my program to speak. While the Web Speech API’s default voice is perhaps not the most comforting or human-sounding, it still helps create the illusion that someone is actually listening and responding.

I used the MDN tutorial on speech synthesis to become familiar with the API. That tutorial is a great way to learn the basic structure: how to declare the necessary variables and set up the functions that let you adapt the Web Speech API to almost any purpose.

Let’s get started. There are just three steps to this tutorial, but each plays an important role in making the speech synthesis come together. If you would like to follow along, the repo containing the code and the website itself are available as supplements to the tutorial.

The tutorial

Declare the synth variable

This step is a basic one, but good for readability. It is just a quick one-line declaration for the speech synthesis object. You can call it whatever you want, but I liked synth because it creates a clear mental link to the object’s purpose.

Declaring the “synth” variable for code readability.
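
In code, that declaration is a single line:

```javascript
// The browser's built-in speech synthesis controller.
const synth = window.speechSynthesis;
```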

Build the speak function

The speak function takes the utterance (what you intend for it to say) as its argument. Since I wanted to change what was spoken based on the result received from speech recognition, I created placeholder text in the HTML file. The placeholder is adjusted slightly depending on which feeling is received, and the utterance is set to the placeholder’s text.


To personalize your application even more, you can set the pitch and rate of the voice to your liking. The MDN tutorial on speech synthesis is a great place to play around with this, as it lets you easily experiment with different combinations of the two. I ended up keeping the pitch at its default but slightly speeding up the rate.

Setting up the speak function with an utterance and adjusting the pitch and rate.
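
A minimal sketch of that setup, assuming the text to speak is passed in as a string; the rate value here is illustrative, not necessarily the one I settled on:

```javascript
// Build an utterance from the given text and tune how it sounds.
function speak(text) {
  const utterance = new SpeechSynthesisUtterance(text);

  utterance.pitch = 1;   // keep the default pitch
  utterance.rate = 1.1;  // a touch faster than the default of 1

  // ...voice selection and the actual speak call are added below.
}
```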

Additionally, you can change the actual voice used by the API. This turned out to be trickier than changing the pitch or rate, but with the help of some folks on Stack Overflow, I was able to get it working. Basically, you write a function within your speak function to populate the list of possible voices, then run through that list to find and set the specific one you’re looking for.

Populating the voice list and selecting a specific voice from it to change the voice of the API.
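
A rough sketch of that approach, shown here as a helper rather than a nested function; the voice name is only an example, since the available voices vary by browser and operating system:

```javascript
// Run through the available voices and set the one we want by name.
function setVoice(utterance) {
  const voices = synth.getVoices();
  for (const voice of voices) {
    if (voice.name === 'Google UK English Female') { // example name only
      utterance.voice = voice;
      break;
    }
  }
}

// In some browsers getVoices() returns an empty list until this event
// has fired, which is part of what makes voice selection tricky.
synth.onvoiceschanged = () => synth.getVoices();
```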

Finally, once you’ve settled on what the voice should sound like, you can call the built-in speak method on your chosen utterance.

Calling the built-in method on the utterance.
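
Putting the pieces together, the whole function might look something like this:

```javascript
// The assembled speak function from the pieces above.
function speak(text) {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.pitch = 1;
  utterance.rate = 1.1;
  setVoice(utterance);     // pick a specific voice (previous step)
  synth.speak(utterance);  // the built-in method that does the talking
}
```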

Call the speak function

For this application, I wanted the utterance spoken once a result was received from speech recognition. To accomplish this, I called speak from the API’s built-in onresult handler. I then changed the placeholder text (which is what the API reads aloud) to indicate that a result was received and what it was, as well as to encourage the user to view the resources recommended to them. Once the placeholder is updated, it is ready to be spoken, so this is where the function call occurs.

An example of calling the speak function when a specific result is received from speech recognition.
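
Here is a sketch of that flow. The recognition setup, the placeholder id, the “sad” keyword, and the wording are all stand-ins for the project’s real logic:

```javascript
// Hook speech recognition up to speech synthesis.
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;
const recognition = new SpeechRecognition();

const placeholder = document.getElementById('placeholder'); // hypothetical id

recognition.onresult = (event) => {
  // The transcript of what the user said.
  const feeling = event.results[0][0].transcript.toLowerCase();

  if (feeling.includes('sad')) {
    placeholder.textContent =
      'It sounds like you are feeling sad. Take a look at the resources below.';
  }

  // The placeholder is now up to date, so speak it.
  speak(placeholder.textContent);
};
```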

If speech recognition does not find a match, the placeholder text changes to let the user know they should try again, and speak is called.

Calling the speak function to let the user know they should try again.
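
Continuing the sketch above, the no-match case can reuse the same pattern:

```javascript
// If recognition can't make sense of the audio, ask the user to retry.
recognition.onnomatch = () => {
  placeholder.textContent =
    "Sorry, I didn't catch that. Could you tell me how you're feeling again?";
  speak(placeholder.textContent);
};
```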

In closing

Speech synthesis is a simple but powerful tool. Coupled with speech recognition, it adds a human touch to any application, which can be an advantage in conversational design when used wisely.

One thing I would like to point out is that this code was written mostly to showcase the Web Speech API rather than for practical use. At some point, I would be interested in going back and crafting a more sympathetic, personalized experience that could make a long-term impact on the user. I’d also be really curious to see what anyone else interested in trying this out comes up with.

I hope this tutorial has been useful and that you’ve learned something new, whether you are a beginner programmer or have been doing it for years. If you liked this tutorial, I welcome you to check out my other posts or come back soon for more content!


Lindsay Greene

A recent CS convert also juggling linguistics and speech science at UNC Chapel Hill. I’m a reader, a runner, an artist, and now a programmer.