Web Speech API + React Experiments

How to create a voice controlled TODO list in React

Published in

Hackmamba

6 min read · Jan 31, 2018

The other day I was looking through some new and exciting browser APIs that might be worth trying out. Now I’d like to share my findings about one such API, which is interesting already or could at least become so in the future. You can find the link to the code for these experiments at the end of the article.

The first thing that caught my attention was the Web Bluetooth API. Unfortunately, my enthusiasm for it disappeared as quickly as it had appeared. I stumbled upon some weird and very unstable behaviour, which led to frustration rather than the excitement I was looking for.

Then I noticed the Web Speech API and decided to give it a shot. My hopes were not high, as the API is very experimental and so far has very limited support. I didn’t try Edge, and Firefox’s support is coming, but for now it only supports the speech synthesis part. That leaves us with Chrome. Ok, well, why not. It’s just an experiment…

What is it about?

The Web Speech API enables you to incorporate voice data into web apps. It has two parts: SpeechSynthesis (text-to-speech) and SpeechRecognition (asynchronous speech recognition).

That being said, for me the more interesting of the two is the recognition part. Simply put, the interface takes voice input and gives you a transcript of it as output. There are some customisation options, but a basic setup is very easy to instantiate and use:

this.recognition = new webkitSpeechRecognition();
this.recognition.continuous = true;
this.recognition.interimResults = true;
this.recognition.lang = 'en-US';
this.recognition.onresult = event => {
  // do something with event.results
}
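To make the "do something with event.results" step a bit more concrete, here is a minimal sketch of how the results list could be split into final and interim text. The interfaces and the `splitTranscript` name are hypothetical; they only mirror the shape of the browser's result objects so the function can also run outside the browser.

```typescript
// Minimal structural types mirroring the browser's result objects.
// These names are illustrative; in the browser you would pass event.results.
interface AlternativeLike {
  transcript: string;
  confidence: number;
}

interface ResultLike {
  isFinal: boolean;
  length: number;
  [index: number]: AlternativeLike;
}

interface ResultListLike {
  length: number;
  [index: number]: ResultLike;
}

// Walks the results from `startIndex` and concatenates the first alternative
// of each result, keeping final and interim text apart.
function splitTranscript(results: ResultListLike, startIndex: number = 0) {
  let finalText = '';
  let interimText = '';
  for (let i = startIndex; i < results.length; i++) {
    const result = results[i];
    const first = result[0]; // index 0 is conventionally treated as the best guess
    if (result.isFinal) {
      finalText += first.transcript;
    } else {
      interimText += first.transcript;
    }
  }
  return { finalText: finalText.trim(), interimText: interimText.trim() };
}
```

Inside the `onresult` handler this could be called as `splitTranscript(event.results, event.resultIndex)`, so that only the results that changed since the last event are processed.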

You can even define a specific SpeechGrammar in JSGF format.
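For illustration, a JSGF grammar is just a string with a header, a grammar name and one or more rules. The sketch below builds such a string for the command words used later in this article; `buildJsgf` and its parameters are hypothetical names, and whether such a grammar actually improves recognition is untested here.

```typescript
// Builds a one-rule JSGF grammar string from a list of words.
// `buildJsgf` is a hypothetical helper for this sketch, not part of the Web Speech API.
function buildJsgf(grammarName: string, ruleName: string, words: string[]): string {
  const alternatives = words.join(' | ');
  return `#JSGF V1.0; grammar ${grammarName}; public <${ruleName}> = ${alternatives} ;`;
}

const commandGrammar = buildJsgf('commands', 'command', ['new', 'another', 'complete', 'toggle']);
```

In Chrome, a string like this could then be passed to `addFromString` on a `webkitSpeechGrammarList` instance, which is assigned to the `grammars` property of the recognition object.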

Ok, so what are we going to do?

Let’s use this API to create a voice controlled TODO list in React.

Why a TODO list? Because it’s one of the most basic things you can build with any framework or library. It serves as a showcase for the framework, as shown here.

Why React? There’s a lot of hype around React! Seriously, it doesn’t add much value in our example. But if you like it, why not try it. And I tried it with TypeScript as well. Because, again, why not.

Let’s get started

If you’re not interested in the React part of presenting the TODOs, I advise you to skip to the Recognition section. If you already know React, or you don’t care about it, you will find nothing very interesting here.

Starting with the React + TypeScript combo is easy. All you need to do is the following:

npm install -g create-react-app
create-react-app my-app --scripts-version=react-scripts-ts

and boom! Everything is ready for you.

Let’s add some TODO logic now:

addTodo = (text: string) => {
  const todos = this.state.todos;
  const todo = { id: todoId++, text: text, completed: false };
  const newTodos = [...todos, todo];
  this.setState({ todos: newTodos });
  this.input!.value = '';
  this.input!.focus();
  return newTodos;
}

As you can see, we are doing it the “immutable” way, without manipulating the todos array directly.
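The same idea can be shown as a standalone pure function, which is easier to try outside a component. This is only a sketch: the `Todo` shape is inferred from the snippet above, and `withTodo` is a hypothetical name.

```typescript
// Shape of a todo item, inferred from the article's addTodo snippet.
interface Todo {
  id: number;
  text: string;
  completed: boolean;
}

// Returns a new array containing the extra todo; the input array is left untouched.
function withTodo(todos: ReadonlyArray<Todo>, text: string, id: number): Todo[] {
  return [...todos, { id, text, completed: false }];
}
```

Because the original array is never modified, React can detect the change with a cheap reference comparison between the old and the new `todos`.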

In a similar way, we can add the logic to toggle the completed state of the todo:

toggle = (updatedTodo: Todo) => {
  const updatedTodos = this.state.todos.map(todo => {
    return (todo !== updatedTodo) ? todo : {
      ...todo,
      completed: !todo.completed,
    };
  });
  this.setState({ todos: updatedTodos });
  return updatedTodos;
}
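Note that the function above compares todos by object reference, which works as long as the caller passes the very same object that lives in the state. A possible alternative, sketched here with the hypothetical names `TodoItem` and `toggleById`, is to match on the `id` instead:

```typescript
// Shape of a todo item, inferred from the article's snippets.
interface TodoItem {
  id: number;
  text: string;
  completed: boolean;
}

// Returns a new array in which the todo with the given id has its
// completed flag flipped; all other items are reused as they are.
function toggleById(todos: ReadonlyArray<TodoItem>, id: number): TodoItem[] {
  return todos.map(todo =>
    todo.id === id ? { ...todo, completed: !todo.completed } : todo
  );
}
```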

I will not go into too much detail about this; in the end, it is not the interesting part here.

We put it together in a view:

<form onSubmit={event => { this.addTodo(this.input!.value); event.preventDefault(); }}>
  <input ref={node => this.input = node} />
  <button type="submit">Add</button>
</form>
<TodoList todos={this.state.todos} onToggle={this.toggle} />

If you want more information about this part, there are plenty of resources on the topic. The TODO part of the code is not meant to be pretty; its purpose is just demonstration.

Recognition

We created a couple of components and services to make things a bit cleaner. Now we’ll have a look at SpeechProcessorService:

process(transcript: string) {
  if (this.state === States.LISTENING) {
    this.processListening(transcript);
  } else if (this.state === States.ADDING) {
    this.processAdding(transcript);
  }
}

We define two states: we either listen for a command, or listen for the text of the task being added.

processListening(transcript: string) {
  if ((transcript.includes('new') || transcript.includes('another')) && transcript.includes('task')) {
    this.state = States.ADDING;
  } else if ((transcript.includes('complete') || transcript.includes('toggle')) && transcript.includes('task')) {
    this.processToggling(transcript);
  } else {
    this.state = States.LISTENING;
  }
}

The transcript we receive from the interface is processed here. It’s a very naive and simple implementation, but it gets the job done, assuming of course that you don’t have a funny dialect and your command is recognised. You can see from the GIF that it didn’t always get the command fully right.
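One consequence of the substring checks is that words such as "news" or "multitask" also count as "new" and "task". A slightly stricter variant, sketched below, compares whole words instead. The names `Command`, `toWords` and `parseCommand` are hypothetical and not part of the original service.

```typescript
type Command = 'add' | 'toggle' | 'none';

// Splits a transcript into lowercase words, dropping punctuation and spaces.
function toWords(transcript: string): string[] {
  return transcript.toLowerCase().split(/[^a-z0-9]+/).filter(word => word.length > 0);
}

// Maps a transcript to a command using the same keywords as processListening,
// but matching whole words only.
function parseCommand(transcript: string): Command {
  const words = toWords(transcript);
  const has = (word: string) => words.indexOf(word) !== -1;
  if (!has('task')) { return 'none'; }
  if (has('new') || has('another')) { return 'add'; }
  if (has('complete') || has('toggle')) { return 'toggle'; }
  return 'none';
}
```

The trade-off is that whole-word matching is less forgiving: a transcript containing "tasks" instead of "task" would no longer trigger a command.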

The processAdding function just calls the above-mentioned addTodo function and changes the state back:

processAdding(transcript: string) {
  this.todos = this.addTodoHandler(transcript);
  this.state = States.LISTENING;
}
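To see how the two states play together, here is a stripped-down, self-contained sketch of the flow. The `States` enum definition is not shown in this article, so the one below is an assumption, and `VoiceTodoSketch` is a hypothetical stand-in for the real service that only collects task texts.

```typescript
// Assumed definition; the article only shows States.LISTENING and States.ADDING in use.
enum States { LISTENING, ADDING }

// Small helper so the sketch does not depend on newer string methods.
const contains = (text: string, word: string): boolean => text.indexOf(word) !== -1;

class VoiceTodoSketch {
  state: States = States.LISTENING;
  added: string[] = [];

  // Each transcript is either treated as a command or as the text of a new task.
  process(transcript: string): void {
    if (this.state === States.LISTENING) {
      if (contains(transcript, 'new') && contains(transcript, 'task')) {
        this.state = States.ADDING; // the next transcript becomes the task text
      }
    } else {
      this.added.push(transcript);
      this.state = States.LISTENING; // go back to waiting for a command
    }
  }
}
```

Feeding it "new task" followed by "buy milk" leaves `added` with a single entry, "buy milk", and the state back at `LISTENING`.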

What is more interesting is the toggling function:

processToggling(transcript: string) {
  const index = this.mapNumber(transcript);
  if (index === -1) { return; }
  const todo = this.todos[index];
  this.todos = this.toggleTodoHandler(todo);
}

private mapNumber(transcript: string) {
  // Each inner array lists the spoken forms of one position; its index is the todo index.
  const numbers = [
    ['one', 'first', '1'], ['two', 'second', '2'], ['three', 'third', '3'],
    ['four', 'fourth', '4'], ['five', 'fifth', '5'],
  ];
  return numbers.findIndex(numberSynonyms => numberSynonyms.some(synonym => transcript.includes(synonym)));
}

To be able to tell which todo item we want to toggle, we need to somehow map the spoken number to an actual index of the todo. And because there are multiple ways of expressing it, we define an array of synonyms, which are mapped to an index based on the transcript. After that we just call the toggle handler and let it do its job.
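Since the lookup relies on substring matching, a transcript containing "done" or "someone" also matches "one". The sketch below shows a whole-word variant of the same mapping; `ORDINALS` and `findTaskIndex` are hypothetical names, and the list is limited to five entries just like the original.

```typescript
// Spoken forms for each position; the position in this array is the todo index.
const ORDINALS: string[][] = [
  ['one', 'first', '1'],
  ['two', 'second', '2'],
  ['three', 'third', '3'],
  ['four', 'fourth', '4'],
  ['five', 'fifth', '5'],
];

// Returns the zero-based index for the first number word in the transcript, or -1.
function findTaskIndex(transcript: string): number {
  const words = transcript.toLowerCase().split(/[^a-z0-9]+/);
  for (const word of words) {
    for (let i = 0; i < ORDINALS.length; i++) {
      if (ORDINALS[i].indexOf(word) !== -1) { return i; }
    }
  }
  return -1;
}
```

Unlike the substring version, which returns the lowest-numbered match, this variant returns the number word that appears first in the transcript.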

That’s it! You can add and toggle your tasks with just your voice! Neat, right? … not really. Let’s take a retrospective look at what we did.

But how do I make it talk back?

The second part of the Web Speech API is the synthesis. We can make our TODO list talk to us in a few lines of code:

this.speaker = new SpeechSynthesisUtterance();
this.speaker.lang = 'en-US';
this.speaker.text = 'Your task was added';
speechSynthesis.speak(this.speaker);
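Because synthesis support also differs between environments, it can be worth wrapping these lines in a small guard. The following is only a sketch: `say` is a hypothetical helper, and it looks the API up on the global object so the code also loads where speech synthesis does not exist.

```typescript
// Speaks `text` if the synthesis API is available.
// Returns true when the utterance was queued, false when the API is missing.
function say(text: string, lang: string = 'en-US'): boolean {
  const scope = globalThis as any; // avoids depending on DOM typings
  if (!scope.speechSynthesis || !scope.SpeechSynthesisUtterance) {
    return false;
  }
  const utterance = new scope.SpeechSynthesisUtterance(text);
  utterance.lang = lang;
  scope.speechSynthesis.speak(utterance);
  return true;
}
```

A call such as `say('Your task was added')` could then replace the four lines above, and the return value tells the app whether it needs another way to give feedback.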

Other thoughts

As I mentioned earlier, there are some customisation options for the recognition interface. You can specify whether you want just one result before it stops listening, or continuous results. You can play around with the confidence and multiple suggestions. You can add grammar rules, which will probably make your recognition better (I haven’t tried it yet).
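As an example of playing with confidence: when `maxAlternatives` on the recognition object is set to more than one, each result can carry several alternatives, each with its own `confidence` value. The sketch below picks the most confident one above a threshold; `Alternative`, `pickConfident` and the threshold of 0.6 used in the usage note are assumptions for illustration.

```typescript
// Minimal shape of one recognition alternative.
interface Alternative {
  transcript: string;
  confidence: number;
}

// Returns the most confident alternative at or above `minConfidence`,
// or null when none of them is confident enough.
function pickConfident(alternatives: Alternative[], minConfidence: number): Alternative | null {
  let best: Alternative | null = null;
  for (const candidate of alternatives) {
    const confidentEnough = candidate.confidence >= minConfidence;
    if (confidentEnough && (best === null || candidate.confidence > best.confidence)) {
      best = candidate;
    }
  }
  return best;
}
```

With a threshold of, say, 0.6, the app could ignore a command whenever `pickConfident` returns null and ask the user to repeat it instead.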

I was quite surprised by the language support. Of course it works for English, but I also tried it in German and even in Slovak. The results are probably the best in English, but nevertheless it’s quite an impressive list of supported languages in Chrome:

var langs =
[['Afrikaans',       ['af-ZA']],
 ['Bahasa Indonesia',['id-ID']],
 ['Bahasa Melayu',   ['ms-MY']],
 ['Català',          ['ca-ES']],
 ['Čeština',         ['cs-CZ']],
 ['Deutsch',         ['de-DE']],
 ['English',         ['en-AU', 'Australia'],
                     ['en-CA', 'Canada'],
                     ['en-IN', 'India'],
                     ['en-NZ', 'New Zealand'],
                     ['en-ZA', 'South Africa'],
                     ['en-GB', 'United Kingdom'],
                     ['en-US', 'United States']],
 ['Español',         ['es-AR', 'Argentina'],
                     ['es-BO', 'Bolivia'],
                     ['es-CL', 'Chile'],
                     ['es-CO', 'Colombia'],
                     ['es-CR', 'Costa Rica'],
                     ['es-EC', 'Ecuador'],
                     ['es-SV', 'El Salvador'],
                     ['es-ES', 'España'],
                     ['es-US', 'Estados Unidos'],
                     ['es-GT', 'Guatemala'],
                     ['es-HN', 'Honduras'],
                     ['es-MX', 'México'],
                     ['es-NI', 'Nicaragua'],
                     ['es-PA', 'Panamá'],
                     ['es-PY', 'Paraguay'],
                     ['es-PE', 'Perú'],
                     ['es-PR', 'Puerto Rico'],
                     ['es-DO', 'República Dominicana'],
                     ['es-UY', 'Uruguay'],
                     ['es-VE', 'Venezuela']],
 ['Euskara',         ['eu-ES']],
 ['Français',        ['fr-FR']],
 ['Galego',          ['gl-ES']],
 ['Hrvatski',        ['hr_HR']],
 ['IsiZulu',         ['zu-ZA']],
 ['Íslenska',        ['is-IS']],
 ['Italiano',        ['it-IT', 'Italia'],
                     ['it-CH', 'Svizzera']],
 ['Magyar',          ['hu-HU']],
 ['Nederlands',      ['nl-NL']],
 ['Norsk bokmål',    ['nb-NO']],
 ['Polski',          ['pl-PL']],
 ['Português',       ['pt-BR', 'Brasil'],
                     ['pt-PT', 'Portugal']],
 ['Română',          ['ro-RO']],
 ['Slovenčina',      ['sk-SK']],
 ['Suomi',           ['fi-FI']],
 ['Svenska',         ['sv-SE']],
 ['Türkçe',          ['tr-TR']],
 ['български',       ['bg-BG']],
 ['Pусский',         ['ru-RU']],
 ['Српски',          ['sr-RS']],
 ['한국어',            ['ko-KR']],
 ['中文',             ['cmn-Hans-CN', '普通话 (中国大陆)'],
                     ['cmn-Hans-HK', '普通话 (香港)'],
                     ['cmn-Hant-TW', '中文 (台灣)'],
                     ['yue-Hant-HK', '粵語 (香港)']],
 ['日本語',           ['ja-JP']],
 ['Lingua latīna',   ['la']]];

Conclusion

We added some simple logic for processing the transcript and transforming it into commands for our todo app.

To conclude about our voice todo app:

Is it faster than just typing? No.
Does it work all the time? No.
Is the app useful? Absolutely not, at least in the current state.
But… Can it talk back? Yes!

Even though it’s still at an early stage, it kind of works already, and I can imagine that one day it will become a very interesting API, especially with accessibility in mind. For production use, however, some more love is still needed.