Voice Recognition — The End of the Keyboard

Harry Turner
Published in Voice Tech Podcast
Mar 6, 2019 · 8 min read

For the last 40,800 years, humans have predominantly used their hands to put their thoughts and ideas on a canvas for the world to see.

Image credit: Pexels

From the oldest known cave painting in El Castillo, northern Spain, to the Last Supper painted by Leonardo da Vinci in the 15th century, we have made our marks by hand. Nowadays, though, the vast majority of us use the keyboard to quickly and conveniently take notes or create graphic illustrations. This way of life, however, is now under threat. The use of voice recognition software has drastically increased over the last two years, with words such as ‘Alexa’ and ‘Siri’ being given new meaning. Is it really so crazy to think that soon even the keyboard may become an antique? In this blog I’ll quickly take you through using JavaScript’s built-in speech recognition software, the Web Speech API. That means no libraries need to be imported or stylesheets linked to!

The Web Speech API allows fine control and flexibility over the speech recognition capabilities in Chrome and will not work on browsers such as Firefox, Safari and Internet Explorer. There are two components to this API:

  1. Speech Recognition → converts a speech input detected through the device’s microphone; when the word/phrase is successfully recognised by the speech recognition service, it is returned as a text string.
  2. Speech Synthesis → a text-to-speech component that allows programs to read out their text content (normally via the device’s default speech synthesiser).

Doesn’t sound too complicated, does it? I therefore decided to create a relatively simple app that would incorporate both of these components. The app would be a simple counter that begins counting on the command ‘Start’ and stops counting on the command ‘Stop’. The final count would then be spoken aloud by the device running the app. So let’s begin…

First, I sorted out the functionality of the counter. I needed a function that would take the current value of the counter, add one to it, and then replace the initial value on the screen with the new incremented value. This function would also need to be called every second. Luckily for me, a method exists in JavaScript called setInterval(func, delay), which takes in a function and a delay. The delay is the time, in milliseconds (thousandths of a second), that the timer should wait between executions of the specified function or code.

I set an initial value of 0 for the counter, which is stored within the constant variable counterContainer. This value is then converted into a number, incremented and written back into the container in place of the initial value. It does this every 1000 milliseconds, or 1 second, giving it the same behaviour as a counter.
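The original snippet isn’t shown here, but a minimal sketch of it might look like this, assuming the counter value lives in an element with the id counter (the counterContainer and startCounter names come from this post; the rest is illustrative):

```javascript
// Grab the element that displays the counter value (the id here is assumed)
const counterContainer = document.querySelector('#counter');

// Will hold the numeric ID returned by setInterval so the timer can be cleared later
let intervalId;

function startCounter() {
  intervalId = setInterval(() => {
    // Read the current value, convert it to a number, add one and write it back
    const currentValue = Number(counterContainer.innerText);
    counterContainer.innerText = currentValue + 1;
  }, 1000); // 1000 ms = 1 second between increments
}
```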

Next I needed a function that would stop this setInterval() timer, leaving the final time on the screen. JavaScript once again came to the rescue: it provides a function that does just this, clearInterval(intervalId). This function takes as its argument the ID of the setInterval() timer it intends to stop. When setInterval() is called, it returns a numeric value corresponding to the timer it created. That return value can be assigned to a variable, which is then passed to clearInterval() to stop the intended timer.
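Continuing the sketch above, the stopping function simply hands that stored ID back to clearInterval() (the name stopCounter is my own):

```javascript
function stopCounter() {
  // Cancel the timer created by the most recent call to startCounter()
  clearInterval(intervalId);
}
```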

The above code will, however, only stop the most recently set timer, as the variable intervalId is reassigned with the latest timer’s unique ID every time the startCounter() function is called. With all the functionality of a simple counter in place, it was time to introduce voice recognition.


I began by including a few lines to feed the right objects to Chrome (which prefixes them) and to non-prefixed browsers like Firefox. They would also alert users of unsupported browsers, such as Safari at the time, that this application would not work on the version they were using:
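The lines in question follow the common pattern for this API; a minimal sketch, with the alert wording assumed:

```javascript
// Chrome exposes the constructor under a webkit prefix; other browsers use the plain name
const SpeechRecognition =
  window.SpeechRecognition || window.webkitSpeechRecognition;

// Warn users whose browser provides no implementation at all
if (!SpeechRecognition) {
  alert('Sorry, your browser does not support the Web Speech API.');
}
```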

With this done I could then define a speech recognition instance, which would allow me to control the recognition for this application. I also set a few other properties here (a sketch of this setup follows the list below):

  • SpeechRecognition.continuous → determines whether the speech recognition software continuously listens for speech input. If set to false, the software stops listening after a certain period of silence.
  • SpeechRecognition.lang → sets the language of the recognition.
  • SpeechRecognition.interimResults → when set to true, the speech recognition software continuously outputs results while the user is still speaking. If set to false, the software waits until the user has stopped speaking for around 4 seconds and then returns the word/phrase spoken.
  • SpeechRecognition.maxAlternatives → sets the number of alternative potential matches that should be returned per result. This can sometimes be useful, say if a result is not completely clear and you want to display a list of alternatives for the user to choose from. In this application, however, only the one result needs to be assessed.
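Putting those properties together, the setup could look roughly like this; maxAlternatives = 1 matches what is used later in this post, while the continuous and lang values are assumptions on my part:

```javascript
// Create the recognition instance that controls recognition for this application
const recognition = new SpeechRecognition();

recognition.continuous = true;      // keep listening for both 'Start' and 'Stop' (assumed value)
recognition.lang = 'en-US';         // language of the recognition (assumed value)
recognition.interimResults = true;  // emit results while the user is still speaking
recognition.maxAlternatives = 1;    // only one alternative per result is needed

// Begin listening through the device's microphone
// (exactly when the original app triggers this is not shown here)
recognition.start();
```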

Once the speech recognition is started, there are many event handlers that can be used to retrieve results and other pieces of surrounding information. The most common one you'll probably use is SpeechRecognition.onresult, which is fired once some speech has been successfully converted into text. It is here that I insert the functions that control the functionality of my timer.

This process gave me the most difficulty, as the onresult event handler is passed a SpeechRecognitionEvent object. The SpeechRecognitionEvent.results property then returns a SpeechRecognitionResultList object containing SpeechRecognitionResult objects. This list of objects, however, is very temperamental.

In this example I am saying the word ‘start’. The voice recognition software immediately returns the first SpeechRecognitionResult object. However, I have only begun saying the word ‘Start’, so how can it possibly predict what I am about to say? ‘Start’, ‘state’, ‘statement’, etc. are all words I could possibly be saying. The software knows this, so it returns a key-value pair of confidence: 0.00999999… and waits. The next object returned has predicted the correct word, but its confidence value is still very low, so it still returns the key-value pair isFinal: false; the software is still trying to make grammatical sense of what I am saying, and it’s quite possible that I have only just started the phrase I am about to say. The final two objects returned have a considerably higher confidence value, as I have now stopped speaking and the software correctly assumes it is safe to say that I have spoken the single command word ‘Start’. It therefore returns the key-value pair isFinal: true and begins listening for a second word/phrase, increasing the length of the SpeechRecognitionResultList object by one.

The SpeechRecognitionResultList object has a getter, so it can be accessed like an array — [last] returns the SpeechRecognitionResult at the last position. Each SpeechRecognitionResult object contains SpeechRecognitionAlternative objects that contain individual recognised words. These also have getters, so they can be accessed like arrays — [0] therefore returns the SpeechRecognitionAlternative at position 0. Due to us previously setting:

SpeechRecognition.maxAlternatives = 1

This will return just a single SpeechRecognitionAlternative object, allowing us to easily target this first and only one using [0]. We then read its transcript property to get a string containing the individual recognised result.
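A sketch of the handler is shown below. It is an approximation of what this post describes: the helper names startCounter, stopCounter and outputTimerResult are taken from the surrounding text, while the resultSpoken flag is my own, there to illustrate the ternary mentioned next:

```javascript
let resultSpoken = false; // tracks whether the final count has already been spoken

recognition.onresult = (event) => {
  // Grab the latest result and the transcript of its single alternative
  const last = event.results.length - 1;
  const transcript = event.results[last][0].transcript.trim().toLowerCase();

  if (transcript === 'start') {
    resultSpoken = false;
    startCounter();
  } else if (transcript === 'stop') {
    stopCounter();
    // Only speak the final count once, even though onresult fires repeatedly
    resultSpoken ? null : outputTimerResult(counterContainer.innerText);
    resultSpoken = true;
  }
};
```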

Towards the end of this snippet of code I call a function, outputTimerResult(); this is the Speech Synthesis part of the application. The function is only called if the result hasn’t already been spoken by the device. I had to include this ternary because of how the onresult event handler outputs its results, as previously discussed: it stops the device saying the result multiple times.

This function is less complicated than the previous ones and just involves creating an instance of SpeechSynthesisUtterance with the text you want your device to say, then calling speechSynthesis.speak(outputText), passing the aforementioned instance through as the argument. The result is the text you inputted spoken back to you in a very computerised voice.
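A rough sketch of the function, with the spoken message wording being my own:

```javascript
function outputTimerResult(finalCount) {
  // Build an utterance containing the text to be spoken (the wording is assumed)
  const outputText = new SpeechSynthesisUtterance(
    `The final count is ${finalCount} seconds`
  );
  // Hand the utterance to the device's default speech synthesiser
  speechSynthesis.speak(outputText);
}
```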

Additional event handlers that I used include SpeechRecognition.onnomatch and SpeechRecognition.onerror. These handlers are there in case of an error, either because the speech recognition software is unable to identify the speech input or because there is an actual error with the recognition software itself. The latter case would fire if, for example, a browser problem is preventing your device’s microphone from picking up any speech input.
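Sketched out, those two handlers might look something like this (the messages are placeholders rather than the ones from my app):

```javascript
// Fired when speech was detected but no word/phrase could be recognised
recognition.onnomatch = () => {
  console.log("Sorry, I didn't recognise that. Please try again.");
};

// Fired when something goes wrong with the recognition itself,
// for example when the microphone cannot pick up any speech input
recognition.onerror = (event) => {
  console.log(`Speech recognition error: ${event.error}`);
};
```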

So there it is, a fully functional voice-controlled timer. I know this isn’t going to change the way people record time, as it’s far from perfect. For example, the process of saying the command ‘Start’ and the timer actually starting is not instantaneous. However, it is possible to see the benefits of such an application: we no longer need access to our thumbs and forefingers; it is as easy as saying ‘Start’.

I hope you have enjoyed reading this blog. Please feel free to try out the code for yourself on GitHub: https://github.com/harrygturner/my-super-simple-voice-recognition-counter. Also, let me know how I could improve on the code I have written; I’d love to hear any suggestions you may have.

Once again thank you for reading and happy coding!

< Harry />
