“Talk to Me” Beginner’s Tutorial: Using the Web Speech API

Lindsay Greene
Published in Voice Tech Podcast
6 min read · Sep 22, 2020

Hi all! Welcome to my blog, which I’ve created in the hopes of helping out anyone like me who might be new to this whole programming thing and need a hand, or even for the more experienced folks who need a refresher or are looking to try something new.

Talk to Me, a web application made using the MDN speech recognition tutorial.

A brief overview

While I am certainly no expert, I am someone who enjoys learning and is passionate about sharing that with others.

Today, I want to walk through a simple but illustrative application I made based on the MDN tutorial by Florian Schulz and Chris Mills. I made this tutorial as an exercise for my internship in computational linguistics at Fizz Studio. It incorporates both speech synthesis and speech recognition using the Web Speech API, but today I want to specifically address the speech recognition feature. You can see my speech synthesis tutorial here.

I created an application called Talk to Me, which borrows some aspects from MDN’s tutorial and adds some new functionality. The code can be found here, and you might find it useful to follow along as you read. The application is designed to recommend online resources to the user based on how they say they are feeling. I had a lot of fun designing the product and coming up with potential new functionalities to add.

So, without further ado, here are the steps I took (and you can too!) to create the speech recognition functionality of a new application.

The tutorial

HTML variables

I began by considering what parts of the HTML file I would need to access in my JS code in order to be able to interact with them. For this product, the results were sorted into three output categories, and in order for each of those to show up at a specific time, they each needed a variable. Additionally, I had two paragraph elements acting as bookmarks for where I needed certain text to be placed, so I created variables for those as well.


One thing that stood out to me about the tutorial, and which I had to ask for clarification on, was the use of the class and id attributes in the HTML. Many instances of the class attribute occur only once, and thus might be more appropriately described with an id. While this makes no difference in the actual functionality of the program, it did cause me to stop and wonder why that might have been done, or whether it was intentional. For my own program, I stuck with class for grouping things together, and id for unique elements.

Code for creating variables for HTML elements that need to be accessed in JavaScript.
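I haven’t pasted my exact markup here, but the declarations look roughly like this (the element names below are placeholders I’m using for illustration, not the real IDs and classes from Talk to Me):

```javascript
// Three output sections, one per category of results:
const happyResults = document.querySelector('#happy-results');
const sadResults = document.querySelector('#sad-results');
const stressedResults = document.querySelector('#stressed-results');

// Two paragraph elements acting as bookmarks for where text should appear:
const placeholderText = document.querySelector('.placeholder');
const diagnostic = document.querySelector('.diagnostic');
```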

Webkit variables

The next step was to declare variables that connect the program to WebKit, a browser engine that renders, or draws, HTML and CSS; in Chrome and other browsers descended from WebKit, the speech recognition interfaces are exposed under a webkit prefix, which is what these declarations fall back to. This step is just routine, so I used the same declarations as the MDN tutorial.

Code for creating the three webkit variables necessary for the program to function.
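These are essentially the MDN tutorial’s declarations, written here with explicit window lookups so the fallback is clear:

```javascript
// Use the standard interfaces if present, otherwise fall back to the
// webkit-prefixed versions used by Chrome and other WebKit/Blink browsers.
const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
const SpeechGrammarList = window.SpeechGrammarList || window.webkitSpeechGrammarList;
const SpeechRecognitionEvent = window.SpeechRecognitionEvent || window.webkitSpeechRecognitionEvent;
```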

Constrained vocabulary

Next, I went ahead and set up the constrained vocabulary for this program. I designed this slightly differently from the MDN tutorial, based on the application’s intended functionality. Rather than just having one long list of words to choose from, I wanted the user to have the ability to choose from three categories. I created a central array that combined the three category arrays, uniting all the words in one place, and used that to set up the constrained grammar.

The actual grammar declaration uses JSGF (JSpeech Grammar Format) notation, something I was not familiar with prior to this exercise. I used the same declaration as in the MDN tutorial here, only changing the name of the array wherever it was used.

Code for creating a constrained vocabulary using arrays and a JSGF declaration.
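A sketch of that setup follows; the category names and words here are made up for illustration, not the actual lists from Talk to Me:

```javascript
// Three category arrays (illustrative words only):
const happyWords = ['happy', 'excited', 'content'];
const sadWords = ['sad', 'lonely', 'down'];
const stressedWords = ['stressed', 'anxious', 'overwhelmed'];

// One central array uniting all the words in one place:
const feelings = happyWords.concat(sadWords, stressedWords);

// JSGF grammar string, as in the MDN tutorial, with the array name swapped:
const grammar = '#JSGF V1.0; grammar feelings; public <feeling> = ' + feelings.join(' | ') + ' ;';
```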

Connect API to grammar

The code for this section was again very simple, and I did not make any modifications to MDN’s tutorial code. These lines create a new speech recognition object and a grammar list, and then attach the grammar list to the recognition object.

Code for connecting the speech recognition API to the constrained vocabulary.
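The lines in question are essentially the MDN tutorial’s:

```javascript
// Create the recognition object and a grammar list, add the grammar
// string (with a weight of 1), and attach the list to the recognizer.
const recognition = new SpeechRecognition();
const speechRecognitionList = new SpeechGrammarList();
speechRecognitionList.addFromString(grammar, 1);
recognition.grammars = speechRecognitionList;
```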

API settings

Again, there is no variation here from the MDN code. These are settings for the API that instruct it to listen for a single English word and then stop, rather than listening continuously or returning interim results before the user has finished speaking the word.

Code for the proper settings of the speech recognition API.
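For reference, these are the settings as they appear in the MDN tutorial:

```javascript
recognition.continuous = false;     // stop after one utterance rather than listening continuously
recognition.lang = 'en-US';         // listen for English
recognition.interimResults = false; // only return final results
recognition.maxAlternatives = 1;    // only return the single best guess
```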

Function (.onclick)

The .onclick function is activated when the user clicks anywhere in the body of the document. The placeholder text appears to let the user know they can start speaking, and the speech recognition can begin once the user gives permission to the browser to access the microphone.

Code for the .onclick function, which begins the speech recognition.
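A minimal sketch of that handler; the prompt wording and the placeholderText element are the placeholders from the earlier snippet, not the app’s actual names:

```javascript
document.body.onclick = () => {
  // Let the user know they can start speaking, then start listening.
  // (The browser asks for microphone permission the first time.)
  placeholderText.textContent = 'Listening... tell me how you are feeling.';
  recognition.start();
};
```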

Interpret results

In order to access what was recognized by the API, I wrote a function using recognition.onresult, a built-in event handler. To retrieve the single word that should have been recognized, I used the line of code below, which comes from the MDN tutorial. It retrieves the transcript of the first result’s first alternative, which is the single word the API is most confident it heard. However, if you are interested in recognizing more than one word, this tutorial gives an excellent example of how to do so.
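In context, that line sits inside the onresult handler, roughly like this:

```javascript
recognition.onresult = (event) => {
  // First result, first (highest-confidence) alternative: a single word,
  // since continuous listening and interim results are switched off.
  const feeling = event.results[0][0].transcript;
  // ... decide what to show based on the word (see below) ...
};
```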

In this final step my code again diverges significantly from the MDN tutorial. Since I was unable to get .onnomatch to behave properly (and indeed found, when I went back to check the reference tutorial, that it didn’t do its job there either), I had to find a workaround.

Because the functionality of the application depends on the three different categories of feelings, and I had a separate array for each of them, I realized I could employ the .includes method on the arrays. If the word spoken was in one of the three arrays, then the results for that category could be made visible. If not, then I would change the placeholder text and ask the user to try again. While this does not use the built-in handler of the speech API, it is an alternate way to solve the problem I had, and it works well for this program.

Code for one possible outcome where the speech is recognized as part of the constrained vocabulary.
Code for the outcome in which the speech is not recognized as part of the constrained vocabulary.
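Putting the pieces together, a sketch of the workaround looks like this; it assumes the hypothetical array and element names from the earlier snippets, plus a CSS class called hidden that hides each results section:

```javascript
recognition.onresult = (event) => {
  const feeling = event.results[0][0].transcript.toLowerCase();

  if (happyWords.includes(feeling)) {
    happyResults.classList.remove('hidden');
  } else if (sadWords.includes(feeling)) {
    sadResults.classList.remove('hidden');
  } else if (stressedWords.includes(feeling)) {
    stressedResults.classList.remove('hidden');
  } else {
    // The word wasn't in any of the three arrays: change the placeholder
    // text and ask the user to try again.
    placeholderText.textContent = "Sorry, I didn't catch that. Please try again.";
  }
};
```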

In closing

I enjoyed building on the MDN tutorial to create my own product, and found it to be a great introductory resource for using both speech recognition and synthesis in your own code. While I did run into a few quirks, it was overall a very helpful tool. I welcome dialogue about those issues discussed above, and whether anyone else has encountered them or come up with a different fix than I did.

I hope my tutorial has also been of use and that by going through it step by step with me, you’ve added something to your own body of knowledge.

If you’ve enjoyed this post, please be sure to check back soon for new tutorials and posts.

A recent CS convert also juggling linguistics and speech science at UNC Chapel Hill. I’m a reader, a runner, an artist, and now a programmer.