How to write a modular Alexa clone in 40 lines of JavaScript
For a while, I had been looking around for a decent speech-to-text library that is easy to set up, easy to use, and free. After searching for a while, I gave up and started working on another project. That changed yesterday, when I stumbled upon this awesome tutorial by Wes Bos.
This article will help you write a basic virtual assistant that works in Chrome. For this example, you will have to serve your files from a server. I will use a basic preact app for this purpose, since it makes it easy to set up a server and to tweak the UI afterward without getting a headache. A demo of what will be created here can be found on robinjs.party.
Update: It is not necessary to use a framework like preact; I chose to do so only so I wouldn’t have to manually set up a project, bundler, and Babel. If you don’t want to use it, just skip the next paragraph and dive right into ‘Create the heart of the assistant’. Thanks for the feedback!
Create a new preact app
Let’s set up a basic preact app with the create-preact-app command, which makes it possible to scaffold a new app pretty quickly.
We will create a new file in the source folder called `assistant.js`, together with a new directory named `skills` in which we will place our skills. You can find all the code from this article in the robinjs-website repository.
Create the heart of the assistant
Let’s start out by defining the heart of our virtual assistant, which accepts a custom configuration. This makes it possible to create your own assistant with another name and speaks the language of your choice.
The virtual assistant should be able to convert an input to an answer. After the answer is obtained, the assistant presents it to the user. I call these two processes `process` and `say`. The assistant starts by processing the input and then says the output. For now, we will make our assistant log its replies and always provide the user with the default answer `reply`, which we have specified in our configuration.
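As a sketch under these assumptions (the config shape with `name`, `language`, `reply`, and `skills` is my own naming; the article’s actual code lives in the robinjs-website repository), the heart could look like this:

```javascript
// assistant.js — a minimal sketch of the assistant's heart.
// The config shape (name, language, reply, skills) is an assumption
// based on the text, not the article's exact code.
class Assistant {
  constructor(config) {
    this.config = config;
    this.skills = config.skills || [];
  }

  // Convert an input sentence to an answer. For now we ignore the
  // input and always return the default reply from the config.
  async process(input) {
    return this.config.reply;
  }

  // Present the answer to the user — just logging for now.
  say(answer) {
    console.log(answer);
  }
}
```

In a real module you would `export default Assistant` at the bottom of `assistant.js` and import it where needed.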
Awesome! We can now provide sentences to our assistant and it will output its answers to the console. However, our assistant is still useless. Let’s fix this by adding some skills that can generate dynamic replies.
Create some skills
A skill is a basic set of two functions: one that determines, based on the input, whether the skill should be triggered, and one that converts the actual input to the answer. These two functions, which I call `trigger` and `resolve`, are the only ones necessary to create a skill.
In this example, the trigger function will be synchronous while the resolve function will be asynchronous. One might choose to make the trigger function asynchronous as well; however, we won’t use that kind of trigger for now.
Let’s create a skills folder with two files, `time.js` and `whatsup.js`. These two basic skills will be used by our assistant.
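A sketch of what those two files might contain, following the `trigger`/`resolve` shape described above. The exact phrases the triggers match on, and the replies, are my own invention:

```javascript
// skills/time.js — replies with the current time.
// trigger is synchronous, resolve returns a Promise, matching the
// shape described above. The matched phrase is an assumption.
const time = {
  trigger: input => input.includes('time'),
  resolve: input =>
    Promise.resolve(`It is ${new Date().toLocaleTimeString()}.`),
};

// skills/whatsup.js — a canned answer to small talk.
// The regex tolerates both "what's up" and "whats up", since speech
// transcripts don't always include apostrophes.
const whatsup = {
  trigger: input => /what.?s up/i.test(input),
  resolve: input =>
    Promise.resolve('Not much, just hanging around in your browser.'),
};
```

Each file would `export default` its skill object so the assistant can import them.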
The final step of creating the processor is to find the correct skill and call its resolve function. We can do this by updating the implementation of the process function like the snippet below.
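A sketch of that update, with the rest of the class from before repeated so the snippet stands on its own (the class shape is still my assumed naming, not the article’s exact code):

```javascript
// The heart of the assistant, now with skill lookup in process().
class Assistant {
  constructor(config) {
    this.config = config;
    this.skills = config.skills || [];
  }

  // Find the first skill whose trigger fires for this input and let
  // it resolve the answer; otherwise fall back to the default reply.
  async process(input) {
    const skill = this.skills.find(skill => skill.trigger(input));
    return skill ? skill.resolve(input) : this.config.reply;
  }

  say(answer) {
    console.log(answer);
  }
}
```

Because `resolve` returns a Promise and `process` is async, callers can simply `await` the answer whether it came from a skill or from the default reply.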
Congratulations! You have finished the heart of your own (text-based) virtual assistant. You can now use the code above to create a new instance and load up your skills.
Listen and speak
Time for the last part of the implementation of the virtual assistant. This is where we make the assistant speak its answers to us and listen to what we have to say. Let’s start with the easy part: text-to-speech.
We will implement the `say` function and make it use the browser’s built-in SpeechSynthesis API to start speaking. From now on, our code will only run in the browser, since we will use variables attached to the `window` object. We use the regular expression `/[&\/\\#,+()$~%.'"*?<>{}]/g` to filter out any characters that the assistant should not pronounce.
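A sketch of `say` on top of the SpeechSynthesis API. This only runs in the browser; the `sanitize` helper is my own naming:

```javascript
// Strip characters the assistant should not try to pronounce.
const sanitize = text => text.replace(/[&\/\\#,+()$~%.'"*?<>{}]/g, '');

// say() hands the cleaned answer to the browser's speech synthesis.
// window.speechSynthesis and SpeechSynthesisUtterance only exist in
// the browser, so calling this function requires a browser context.
function say(answer) {
  const utterance = new SpeechSynthesisUtterance(sanitize(answer));
  utterance.lang = 'en-US'; // or take the language from your config
  window.speechSynthesis.speak(utterance);
}
```

Without the sanitize step, some voices spell out punctuation like `*` or `#` aloud, which is why the filtering happens before the utterance is created.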
Now that our assistant can speak, it still has to listen to what we have to say. To do so, we have to create a `webkitSpeechRecognition` instance, set its language to the one we specified in our config, make sure it restarts after it has heard a sentence, and connect it to our process function.
This might sound like a lot, but it’s not that much at all. We can plug the setup code right into the constructor when we create a new assistant. Note that we now have a callback that receives a recognition instance, which we should convert to a string and pass to our process function. I have also added a simple start function to start the recognizer.
If you’d like to know more about the recognition variable, you can print the instance to the console to explore its structure. For now, I have already written some code to convert the instance to a transcript: the sentence the recognizer has heard.
Once we have the sentence the recognizer has identified, we check whether the first word is the name of our assistant. Only if this is the case do we pass the rest of the sentence to the process function. If the sentence does not start with the name of our assistant, we simply ignore it.
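Sketching those steps under the same assumptions as before (the helper names are mine; `webkitSpeechRecognition` is Chrome’s prefixed SpeechRecognition and exists only in the browser):

```javascript
// Turn a SpeechRecognition result event into a lowercase transcript.
// This helper is pure, so it also works outside the browser.
const toTranscript = event =>
  Array.from(event.results)
    .map(result => result[0].transcript)
    .join('')
    .trim()
    .toLowerCase();

// Wire the recognizer to an assistant — roughly what could live in
// the constructor. Browser only.
function listen(assistant) {
  const recognition = new webkitSpeechRecognition();
  recognition.lang = assistant.config.language;
  recognition.addEventListener('result', async event => {
    const [name, ...rest] = toTranscript(event).split(' ');
    // Only react when the sentence starts with the assistant's name.
    if (name === assistant.config.name.toLowerCase()) {
      assistant.say(await assistant.process(rest.join(' ')));
    }
  });
  // Keep listening: restart after every recognized sentence.
  recognition.addEventListener('end', () => recognition.start());
  assistant.start = () => recognition.start();
}
```

The restart on the `end` event is what keeps the assistant listening continuously; without it, recognition stops after the first sentence.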
Instantiate a new assistant
That’s it! You can start using your own assistant and keep adding different skills over time. Since every skill has the same API, you would even be able to distribute and install skills via npm.
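Putting it together could look like the following. In the app these pieces would be imports (`import Assistant from './assistant'` and so on); here a minimal stand-in is inlined so the example runs on its own, and the name and replies are my own choices:

```javascript
// A minimal inlined stand-in for the Assistant class and a skill,
// so this wiring example is self-contained.
class Assistant {
  constructor(config) {
    this.config = config;
    this.skills = config.skills || [];
  }
  async process(input) {
    const skill = this.skills.find(s => s.trigger(input));
    return skill ? skill.resolve(input) : this.config.reply;
  }
  say(answer) {
    console.log(answer);
  }
}

const time = {
  trigger: input => input.includes('time'),
  resolve: () => Promise.resolve(`It is ${new Date().toLocaleTimeString()}.`),
};

const robin = new Assistant({
  name: 'Robin',
  language: 'en-US',
  reply: "Sorry, I don't know that one yet.",
  skills: [time],
});

// In the browser you would call robin.start() to begin listening;
// in this text-only demo we feed it a sentence directly.
robin.process('what time is it').then(answer => robin.say(answer));
```

Swapping in a new skill is just a matter of adding another `trigger`/`resolve` object to the `skills` array, which is what makes npm-distributed skills plausible.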
Source code
You can review all the code of this article in the robinjs-website repository. The example I have created here is also hosted on robinjs.party so you can run the demo right away.
Conclusion
This might be an extremely basic version of a virtual assistant, but because of the custom skills, there is really a lot you can do with it. I hope you liked the tutorial and have fun building your own version of Alexa!
If you have questions or you like this project, let me know!
Happy Coding 🎉