Build an AI that can understand and speak back to you

Quentin Renard
7 min read · Oct 7, 2019


Building an AI that could understand and speak back to me has always been a dream of mine. Ever since I saw Iron Man’s Jarvis in action, the urge to actually try it became more intense. That dream has now come true.

In this story I’ll go through how to build an AI that can understand and speak back to you using astibob and golang. It will be able to repeat what you’re saying.

For teaching purposes we’ll split abilities across different Workers, but bear in mind that all abilities can live on the same Worker if that makes more sense.

You can find the final code here. If you’re experiencing any problem, please create an issue here.

Overview

First off, we need to understand how astibob works, at least superficially. If you want to dig deeper, check out the project directly.

  • humans operate the AI through the Web UI
  • the Web UI interacts with the AI through the Index
  • the Index keeps an updated list of all Workers and forwards Web UI messages to Workers and vice versa
  • Workers have one or more Abilities and are usually located on different machines
  • Abilities run simple tasks such as reading an audio input (e.g. a microphone), executing speech-to-text analyses or performing speech synthesis
  • Abilities can communicate directly with each other even when they’re on different Workers
  • all communication is done via JSON messages exchanged through HTTP or Websocket (sketched below)
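
To give a rough idea of what travels on the wire, here is a sketch of such a message expressed as Go types. The field names are assumptions for illustration only; the authoritative type definitions live in the astibob repository.

```go
package sketch

// Illustrative only: an assumed shape for astibob messages. Check the
// astibob repository for the real type definitions.
type Identifier struct {
	Name   string `json:"name,omitempty"`   // e.g. "Text to Speech"
	Type   string `json:"type"`             // e.g. "index", "worker" or "ui"
	Worker string `json:"worker,omitempty"` // e.g. "Worker #1"
}

type Message struct {
	From    Identifier  `json:"from"`              // sender
	Name    string      `json:"name"`              // e.g. "cmd.runnable.start"
	Payload interface{} `json:"payload,omitempty"` // message-specific data
	To      *Identifier `json:"to,omitempty"`      // optional recipient
}
```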

Create the Index

The first step is to create the Index:
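
Below is a minimal sketch of what that program might look like. The option and method names are assumptions based on astibob’s examples; the final code linked above is authoritative.

```go
package main

import (
	"log"

	"github.com/asticode/go-astibob"
	"github.com/asticode/go-astibob/index"
)

func main() {
	// Create the Index, protected by basic auth
	i, err := index.New(index.Options{
		Server: astibob.ServerOptions{
			Addr:     "127.0.0.1:4000",
			Password: "admin",
			Username: "admin",
		},
	})
	if err != nil {
		log.Fatalf("main: creating index failed: %v", err)
	}
	defer i.Close()

	// Shut down cleanly on Ctrl+C
	i.HandleSignals()

	// Serve the Web UI and wait for Workers to register
	i.Serve()

	// Block until the Index is stopped
	i.Wait()
}
```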

Run it and visit http://127.0.0.1:4000 with username admin and password admin.

You should see something like this:

Congrats, the Index is now running and waiting for Workers to register!

Create the first Worker

The first Worker will run speech synthesis, which means the AI will be able to speak to you.
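
Its entry point could look roughly like this. The package paths, option names and the NewRunnable signature are assumptions (it may, for instance, require a platform-specific speaker), so refer to the final code linked above for the real wiring:

```go
package main

import (
	"github.com/asticode/go-astibob"
	"github.com/asticode/go-astibob/abilities/text_to_speech"
	"github.com/asticode/go-astibob/worker"
)

func main() {
	// Create the Worker and point it at the Index
	w := worker.New("Worker #1", worker.Options{
		Index: astibob.ServerOptions{
			Addr:     "127.0.0.1:4000",
			Password: "admin",
			Username: "admin",
		},
		Server: astibob.ServerOptions{Addr: "127.0.0.1:4001"},
	})
	defer w.Close()

	// Register the "Text to Speech" ability
	w.RegisterRunnables(worker.Runnable{
		Runnable: text_to_speech.NewRunnable("Text to Speech"),
	})

	// Shut down cleanly on Ctrl+C, serve and register to the Index
	w.HandleSignals()
	w.Serve()
	w.RegisterToIndex()

	// Block until the Worker is stopped
	w.Wait()
}
```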

After installing the proper dependencies, run it and refresh the Web UI.

You should now see something like this:

Congrats, the first Worker is now running and has registered to the Index!

Start the “Text to Speech” ability

The toggle in the menu on the left is red, which means the ability is stopped. Abilities are stopped by default.

If you want one of your abilities to start when the Worker starts, you can use the AutoStart attribute of worker.Runnable, as sketched below.
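
Continuing the Worker sketch above, that would look like this:

```go
	// Start the ability as soon as the Worker starts
	w.RegisterRunnables(worker.Runnable{
		AutoStart: true,
		Runnable:  text_to_speech.NewRunnable("Text to Speech"),
	})
```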

Start the ability manually by clicking the toggle next to its name in the menu: it should slide, turn green and you should hear “Hello World”:

You can turn it off/on anytime by clicking the toggle again.

Congrats, you’ve started the “Text to Speech” ability and the AI can now speak to you!

Create the second Worker

The second Worker will read an audio input, which means the AI will be able to listen to you through your microphone.
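
Here’s a sketch of its entry point. As before, the exact package paths and StreamOptions fields are assumptions based on the astibob layout, so refer to the final code:

```go
package main

import (
	"log"

	"github.com/asticode/go-astibob"
	"github.com/asticode/go-astibob/abilities/audio_input"
	"github.com/asticode/go-astibob/abilities/audio_input/portaudio"
	"github.com/asticode/go-astibob/worker"
)

func main() {
	// Create and initialize PortAudio
	p := portaudio.New()
	if err := p.Initialize(); err != nil {
		log.Fatalf("main: initializing portaudio failed: %v", err)
	}
	defer p.Close()

	// Create the default stream; tweak these values for your setup
	s, err := p.NewDefaultStream(portaudio.StreamOptions{
		BitDepth:             32,
		MaxSilenceAudioLevel: 5000000,
		NumChannels:          1,
		SampleRate:           16000,
	})
	if err != nil {
		log.Fatalf("main: creating stream failed: %v", err)
	}

	// Create the Worker and point it at the Index
	w := worker.New("Worker #2", worker.Options{
		Index: astibob.ServerOptions{
			Addr:     "127.0.0.1:4000",
			Password: "admin",
			Username: "admin",
		},
		Server: astibob.ServerOptions{Addr: "127.0.0.1:4002"},
	})
	defer w.Close()

	// Register the "Audio Input" ability and start it automatically
	w.RegisterRunnables(worker.Runnable{
		AutoStart: true,
		Runnable:  audio_input.NewRunnable("Audio Input", s),
	})

	// Shut down cleanly on Ctrl+C, serve and register to the Index
	w.HandleSignals()
	w.Serve()
	w.RegisterToIndex()
	w.Wait()
}
```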

After installing the proper dependencies, run it and refresh the Web UI.

NOTE: you may have to tweak StreamOptions values depending on your setup.

You should now see something like this:

Congrats, the second Worker is now running and has registered to the Index! Furthermore the “Audio Input” ability has been started automatically!

Calibrate the “Audio Input” ability

In order to detect spoken words, we need to detect silences.

In order to detect silences, we need to know the maximum audio level of a silence, which is specific to your audio input.
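
Conceptually, the detection boils down to something like this (an illustration of the idea, not astibob’s actual implementation):

```go
package main

import (
	"fmt"
	"math"
)

// audioLevel returns the average absolute amplitude of an audio buffer.
func audioLevel(samples []int32) float64 {
	var sum float64
	for _, s := range samples {
		sum += math.Abs(float64(s))
	}
	return sum / float64(len(samples))
}

// isSilence flags a buffer whose level stays below the calibrated
// maximum silence audio level.
func isSilence(samples []int32, maxSilenceAudioLevel float64) bool {
	return audioLevel(samples) < maxSilenceAudioLevel
}

func main() {
	quiet := []int32{100, -200, 150, -50}
	fmt.Println(isSilence(quiet, 5000000)) // true: well below the threshold
}
```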

Fortunately, the Web UI provides an easy way to find that value.

First off, make sure the ability is running.

Then in the menu click Audio Input, then Calibrate, say something and wait about 5 seconds.

You should now see something like this:

You can see that in my case the maximum audio level is 67477168 and the suggested maximum silence audio level is 20243150.

However, based on the chart and the words I’ve spoken, I’d rather set the maximum silence audio level to 5000000.

Based on your calibration results, determine the best value for your setup, update the proper option in the Worker and restart it.

Congrats, you’ve calibrated the “Audio Input” ability and the AI is ready to listen to you!

Create the third Worker

The third Worker will execute speech-to-text analyses, which means the AI will be able to understand spoken words. Detected words will then be sent to the first Worker so that it can repeat them out loud.
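
Below is a rough sketch of that Worker: it creates a DeepSpeech-backed parser and registers the “Speech to Text” ability. The package paths, option names and the wiring that forwards recognized text to Worker #1 are assumptions here, so rely on the final code for the authoritative version:

```go
package main

import (
	"github.com/asticode/go-astibob"
	"github.com/asticode/go-astibob/abilities/speech_to_text"
	"github.com/asticode/go-astibob/abilities/speech_to_text/deepspeech"
	"github.com/asticode/go-astibob/worker"
)

// Path to the directory where you've installed the DeepSpeech
// dependencies (the constant mentioned below)
const deepSpeechDir = "/path/to/deepspeech"

func main() {
	// Create the Worker and point it at the Index
	w := worker.New("Worker #3", worker.Options{
		Index: astibob.ServerOptions{
			Addr:     "127.0.0.1:4000",
			Password: "admin",
			Username: "admin",
		},
		Server: astibob.ServerOptions{Addr: "127.0.0.1:4003"},
	})
	defer w.Close()

	// Create the DeepSpeech-backed parser (field name assumed)
	d := deepspeech.New(deepspeech.Options{
		ModelPath: deepSpeechDir + "/model",
	})

	// Register the "Speech to Text" ability
	w.RegisterRunnables(worker.Runnable{
		Runnable: speech_to_text.NewRunnable("Speech to Text", d),
	})

	// The final code also listens to Worker #2's audio samples, feeds
	// them to the parser, and sends recognized text to Worker #1's
	// "Text to Speech" ability so it gets repeated out loud.

	// Shut down cleanly on Ctrl+C, serve and register to the Index
	w.HandleSignals()
	w.Serve()
	w.RegisterToIndex()
	w.Wait()
}
```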

After installing the proper dependencies and replacing this constant with the path to the directory you’ve installed the dependencies in, run it and refresh the Web UI.

You should now see something like this:

Congrats, the third Worker is now running and has registered to the Index!

Enjoy the power of astibob!

Say Hello World out loud.

If your English accent is not as bad as mine, the AI should have repeated it!

It means the AI has heard your voice, understood it and in turn said Hello World out loud!

You can try with any other words, the AI should repeat them.

Congrats, you’ve built an AI that can repeat what you’re saying!

Train the “Speech to Text” ability to understand your voice

If, like me, English is not your native tongue or you simply want the AI to be trained for your voice, you can train the ability to understand your voice.

Bear in mind that to achieve minimal effectiveness, you will need a lot (like A LOT) of data.

Build your dataset

Our first goal is to build a dataset specific to your voice.

For that, in the menu, click Speech to Text and Build your dataset.

Then make sure to enable the Store new speeches option:

Congrats, all audio samples that are not silences will now be stored locally!

You can now start saying words (you need to wait at least 1 second between words/sentences).

Soon you’ll see something like this:

Congrats, new speeches are available for validation!

Validate your dataset

NOTE 1: you may need to activate audio in your browser’s tab

NOTE 2: some browsers (e.g. Firefox) may not play .wav files with a bit depth above 16 bits. In my experience Chrome works fine.

Now you need to provide the exact transcript for each and every stored speech.

Also, chances are some samples of random noise were stored, so you’ll also need to remove the useless ones.

For that:

  • click the input of the first item
  • listen to the audio that should autoplay (if it doesn’t, click on the play icon)
  • write the transcript and press Enter
  • if you wish to remove it, press Ctrl + Enter

You should now see something like this:

Congrats, you’ve validated your dataset!

If you wish to correct validated speeches:

  • click the input of the item you want to edit
  • listen to the audio that should autoplay (if it doesn’t, click on the play icon)
  • edit the transcript and press Enter
  • if you wish to remove it, press Ctrl + Enter

Train your dataset

Now that your dataset has been validated, you can train a model on it.

For that, in the menu, click Speech to Text, Train your Dataset and Train.

You should see something like this:

Congrats, you’re training your dataset!

Once it’s done, the generated model will be located here.

Replace en with custom here and restart the worker.

Congrats, you’re now using a model trained for your voice!

Conclusion

There you have it: an AI that can repeat what you’re saying! It can listen, understand and speak back to you.

Of course, it’s not the end of the road for your AI. I’m pretty sure at this point you have tons of ideas about adding more abilities and making your AI awesome!

Check out the list of official abilities here, and if the ability you need is not listed there, check out how to create your own here! If you feel this ability could be of interest to the community, create a PR here.

Here’s a list of ideas I will be trying out in the near future:

  • Translation: speak and wait for the AI to translate out loud what you’ve just said
  • Drone: pilot your drone with your voice
  • Subtitle: feed an audio file containing human voices to the AI and get a subtitle file in return

Happy AI coding!
