HTML5, NodeJS and Neural Networks: The tech behind MySam, an open source Siri

Recently I published the very first version of MySam, an open “intelligent” assistant for the web similar to Siri or Google Now. Unlike those, however, you can teach Sam yourself, it works in many modern browsers, and it is extensible with plugins written in HTML and JavaScript. Here is a video that shows what Sam can do:

It is a fun project that combines many of the open source projects I’ve recently been working on or interested in. In this short post I’d like to show how they all came together.

The Brain

The natural language understanding and learning process is probably the most interesting part. Sam uses natural language processing and machine learning to estimate, for each previously learned action, the probability that it is the one to perform.

The NodeJS server runs natural-brain, which combines node-natural, a natural language library, with BrainJS, a neural network library for JavaScript. That means that, given some training data and an input text, you get back probabilities for how likely the text matches each label (in Sam’s case, each action). Here is an example of how natural-brain works:
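As a rough, self-contained illustration of the idea (this is not natural-brain's code or API), the sketch below turns each sentence into a bag-of-words vector and trains a single-layer softmax model on it. That is a much simpler stand-in for the multi-layer network BrainJS provides. The `TinyTextClassifier` class, its method names, and most of the training sentences are made up for this example.

```javascript
// Split a sentence into lowercase word tokens.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// Hypothetical stand-in classifier: bag-of-words + single-layer softmax.
class TinyTextClassifier {
  constructor() {
    this.docs = [];         // training examples: { tokens, label }
    this.vocab = new Map(); // word -> feature index
    this.labels = [];       // known labels, in insertion order
    this.weights = [];      // weights[labelIndex][featureIndex]; last entry is the bias
  }

  addDocument(text, label) {
    const tokens = tokenize(text);
    for (const t of tokens) {
      if (!this.vocab.has(t)) this.vocab.set(t, this.vocab.size);
    }
    if (!this.labels.includes(label)) this.labels.push(label);
    this.docs.push({ tokens, label });
  }

  // Binary bag-of-words vector plus a constant bias input; unknown words are ignored.
  features(tokens) {
    const x = new Array(this.vocab.size + 1).fill(0);
    for (const t of tokens) {
      const i = this.vocab.get(t);
      if (i !== undefined) x[i] = 1;
    }
    x[this.vocab.size] = 1;
    return x;
  }

  // Softmax over the per-label scores, so the outputs sum to 1.
  probabilities(x) {
    const z = this.weights.map(w => w.reduce((sum, wi, i) => sum + wi * x[i], 0));
    const max = Math.max(...z);
    const exp = z.map(v => Math.exp(v - max));
    const total = exp.reduce((a, b) => a + b, 0);
    return exp.map(v => v / total);
  }

  // Plain stochastic gradient descent on the cross-entropy loss.
  train(epochs = 200, rate = 0.5) {
    const size = this.vocab.size + 1;
    this.weights = this.labels.map(() => new Array(size).fill(0));
    for (let e = 0; e < epochs; e++) {
      for (const doc of this.docs) {
        const x = this.features(doc.tokens);
        const p = this.probabilities(x);
        this.labels.forEach((label, k) => {
          const error = (label === doc.label ? 1 : 0) - p[k];
          for (let i = 0; i < size; i++) this.weights[k][i] += rate * error * x[i];
        });
      }
    }
  }

  // Every label with its probability, most likely first.
  getClassifications(text) {
    const p = this.probabilities(this.features(tokenize(text)));
    return this.labels
      .map((label, k) => ({ label, value: p[k] }))
      .sort((a, b) => b.value - a.value);
  }
}

const classifier = new TinyTextClassifier();
classifier.addDocument('what is the weather in Berlin', 'weather');
classifier.addDocument('is it cold in Sweden', 'weather');
classifier.addDocument('play some music', 'music');
classifier.addDocument('play my favourite song', 'music');
classifier.train();

const ranked = classifier.getClassifications('do you know if it is cold in Canada');
console.log(ranked); // the 'weather' label is ranked first
```

The important part is the shape of the result: one probability per label rather than a single answer, which lets the caller decide how confident is confident enough.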

node-natural already comes with two statistical models for language classification (Naive Bayes and logistic regression), but since the data format was also a perfect fit for a neural network, connecting it to BrainJS was an interesting experiment. I do not have any hard numbers for comparison yet, but at least in Sam’s case the neural network seemed to perform much better than the other two methods. Its prediction accuracy and learning ability, especially once it had more data available, were more than surprising at times.

Compared to the language classification, the tagging mechanism that extracts parts of a sentence is currently pretty primitive. Since we can assume that a classified sentence, given sufficient confidence, is quite similar to the original training sentence, it simply looks for the words around a tag that both sentences have in common. From the video: if the training sentence “Is it cold in Sweden”, with the location tag set to “Sweden”, is matched by the sentence “Do you know if it is cold in Canada”, it will tag the words after “cold in” as the location. While this is not the most clever algorithm, it works quite well, mainly thanks to the accuracy of the natural language classification.
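The matching described above could be sketched roughly as follows. This is an illustration of the idea, not Sam's actual implementation: the `extractTag` function, the `{ start, end }` tag format (word positions, with `-1` as the end of the sentence) and the context size of two words are all assumptions.

```javascript
// Split a sentence into words on whitespace.
function words(text) {
  return text.trim().split(/\s+/);
}

// Index at which the word sequence `needle` starts in `haystack`
// (case-insensitive), or -1 if it does not occur.
function indexOfSequence(haystack, needle) {
  const lower = haystack.map(w => w.toLowerCase());
  const target = needle.map(w => w.toLowerCase());
  for (let i = 0; i + target.length <= lower.length; i++) {
    if (target.every((w, j) => lower[i + j] === w)) return i;
  }
  return -1;
}

// Extract the words of `inputText` that sit in the same context as the
// tagged words of `trainingText`. Returns null when the context is missing.
function extractTag(trainingText, tag, inputText, contextSize = 2) {
  const training = words(trainingText);
  const input = words(inputText);
  // Exclusive end index of the tag; -1 means "up to the end of the sentence".
  const end = tag.end === -1 ? training.length : tag.end + 1;
  const before = training.slice(Math.max(0, tag.start - contextSize), tag.start);
  const after = training.slice(end, end + contextSize);

  let from = 0;
  if (before.length > 0) {
    const at = indexOfSequence(input, before);
    if (at === -1) return null;
    from = at + before.length;
  }

  let to = input.length;
  if (after.length > 0) {
    const at = indexOfSequence(input.slice(from), after);
    if (at === -1) return null;
    to = from + at;
  }

  const found = input.slice(from, to).join(' ');
  return found.length > 0 ? found : null;
}

const place = extractTag(
  'Is it cold in Sweden',
  { start: 4, end: -1 },
  'Do you know if it is cold in Canada'
);
console.log(place); // "Canada"
```

Because the sketch relies on the exact words “cold in” appearing in both sentences, it only works when the classifier has already matched a sentence that is close to the training text.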

The API

Now that Sam can learn to make sense of language, getting it to communicate with the world is just as important. This is where Feathers comes in. Feathers is a service-oriented REST and real-time API framework for NodeJS. This means that you can connect to Sam’s classifier both through a RESTful API and in real time via websockets.

The service layer is also where Sam stores its training and configuration data. To avoid having to set up your own database server, it uses the filesystem-based database NeDB through the feathers-nedb plugin. One really nice thing about Feathers is that the database can be swapped out for MongoDB, *SQL databases or even a remote API by changing just two lines of code. With MySam running, the API has two endpoints:

localhost:9090/actions is where actions are stored. An action contains a training text, the action type name (potentially with additional parameters) and the location of the words in the training text that belong to a tag (-1 means the end of the sentence).

localhost:9090/classify is where sentences are sent to be classified. A classification request currently only has an input sentence. For example, a JSON object like

{ "input": "Do you know the weather in Chicago" }

in a CURL request like this:

curl 'http://localhost:9090/classify/' -H 'Content-Type: application/json' --data-binary '{ "input": "Do you know the weather in Chicago" }'

Returns a classification similar to this:

It contains the same fields as the matched action (e.g. tags or text) and the input. In the classifications property you get a list of the confidence for each action id. As you can see, the action with id l0QZr5Ya52rHVk2B has a 35% confidence and is also the one that was matched. pos contains some part-of-speech information and extracted has all extracted tags (in our case Chicago for the location).
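The same request can also be sent from JavaScript. The following is a minimal sketch, assuming an environment with a global `fetch` (modern browsers, or Node 18 and later) and MySam listening on localhost:9090. The helper names `buildClassifyRequest` and `classify` are made up for this example.

```javascript
const API_BASE = 'http://localhost:9090';

// Build the URL and fetch options for a classification request.
// Mirrors the cURL call: a POST with a JSON body holding the input sentence.
function buildClassifyRequest(input, baseUrl = API_BASE) {
  return {
    url: baseUrl + '/classify/',
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ input })
    }
  };
}

// Send the sentence to the server and resolve with the parsed classification.
function classify(input, baseUrl = API_BASE) {
  const { url, options } = buildClassifyRequest(input, baseUrl);
  return fetch(url, options).then(response => {
    if (!response.ok) throw new Error('Classification failed: ' + response.status);
    return response.json();
  });
}

const request = buildClassifyRequest('Do you know the weather in Chicago');
console.log(request.url, request.options.body);

// With MySam running:
// classify('Do you know the weather in Chicago').then(result => console.log(result));
```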

Technically, the natural language classification could also run by itself in the browser. Having a NodeJS server that is accessible through an API, however, has many advantages. It makes it possible to provide different frontends such as a web page, an Electron desktop application or a mobile application, or even to connect a chat bot to it. It also allows you to create plugins for voice-controlling local programs like iTunes or hardware like an Arduino robot.

The Frontend

A not yet widely known part of the HTML5 platform is the speech recognition API (part of the Web Speech API), which is currently only supported in WebKit-based browsers. It makes it very easy to add voice recognition to any web application and works extremely well, even with my odd German-Canadian English accent. So all that needs to be done to get Sam to classify a spoken sentence is to start the voice recognition and, once it has completed, send the recognized text to the API.
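In code, that flow might look roughly like this. It is a sketch rather than the frontend's actual code: `listenOnce` is a made-up helper, and `FakeRecognition` is a stand-in that only exists so the example also runs outside a browser. In a WebKit-based browser you would pass `window.webkitSpeechRecognition` instead.

```javascript
// Start one recognition pass and hand the recognized text to `onText`.
// `Recognition` is the constructor to use, e.g. window.webkitSpeechRecognition.
function listenOnce(Recognition, onText, lang = 'en-US') {
  const recognition = new Recognition();
  recognition.lang = lang;
  recognition.interimResults = false; // only report final results
  recognition.onresult = event => {
    // The first alternative of the first result holds the best transcript.
    onText(event.results[0][0].transcript);
  };
  recognition.start();
  return recognition;
}

// Hypothetical stand-in that immediately "recognizes" a fixed sentence.
class FakeRecognition {
  start() {
    this.onresult({
      results: [[{ transcript: 'Is it cold in Sweden', confidence: 0.9 }]]
    });
  }
}

const heard = [];
listenOnce(FakeRecognition, text => heard.push(text));
console.log(heard); // [ 'Is it cold in Sweden' ]

// In the browser, the callback would send the text to the API instead:
// listenOnce(window.webkitSpeechRecognition, text => sendToApi(text));
```

Here `sendToApi` is a placeholder for whatever function posts the sentence to the classify endpoint.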

The web frontend itself is located in the mysam-frontend module. It is a DoneJS application which allows you to dynamically load globally installed plugins and register actions (callbacks that get the classification result and the main DOM element) and learners, which are the forms shown when learning something new. Once a classification comes back and is past a certain confidence level (I found 30% to be a good threshold), the frontend will look up the action and call it if it exists. There will be much more documentation around it soon, but in its basic form, creating your own plugin for Sam should be almost as easy as writing a jQuery plugin.
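The lookup-and-call step can be sketched as below. This is not the mysam-frontend API: `registerAction`, `dispatch`, the argument order of the callback and the exact shape of the classification object (an `id`, an `action.type`, a `classifications` map from action id to confidence, and `extracted` tags) are assumptions made for illustration. A plain object stands in for the main DOM element so the example runs outside a browser.

```javascript
const CONFIDENCE_THRESHOLD = 0.3; // the 30% threshold mentioned above
const actions = new Map();        // action type -> callback

// Register a callback for an action type (hypothetical plugin hook).
function registerAction(type, callback) {
  actions.set(type, callback);
}

// Call the registered action if the match is confident enough.
// Returns true when an action was run, false otherwise.
function dispatch(classification, element) {
  const confidence = classification.classifications[classification.id] || 0;
  if (!(confidence > CONFIDENCE_THRESHOLD)) return false;
  const callback = actions.get(classification.action.type);
  if (!callback) return false;
  callback(element, classification);
  return true;
}

registerAction('weather', (element, classification) => {
  element.innerHTML = 'Weather for ' + classification.extracted.location;
});

// Assumed response shape; 'otherActionId' is a placeholder id.
const result = {
  id: 'l0QZr5Ya52rHVk2B',
  action: { type: 'weather' },
  classifications: { l0QZr5Ya52rHVk2B: 0.35, otherActionId: 0.1 },
  extracted: { location: 'Chicago' }
};

const mainElement = { innerHTML: '' }; // stand-in for the main DOM element
const handled = dispatch(result, mainElement);
console.log(handled, mainElement.innerHTML); // true 'Weather for Chicago'
```

Keeping the threshold in one place makes it easy to tune: too low and Sam acts on guesses, too high and it asks to be taught sentences it already understands.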

The Future

This article is only a very brief and not very in-depth overview of MySam’s technology. There is still much more to be said about the reasons why it exists at all and the possibilities of a truly open AI assistant. In the meantime, I invite you all to try it out, teach it new things, write plugins, share your thoughts and hopefully be part of the beginning of a journey exploring different ways to interact with our computers.