Understanding the Rasa TensorFlow intent classifier

Tatiana Parshina
3 min read · Apr 29, 2019


This post describes how the Rasa AI chatbot framework uses the StarSpace idea from the Facebook AI Research team for intent classification with supervised embeddings.

StarSpace overview

StarSpace is a general-purpose neural model for efficiently learning entity embeddings to solve a wide variety of problems; one such problem is intent classification for an AI chatbot.

StarSpace embeds entities of different types into one vectorial embedding space, hence the “star” (“*”, meaning all types) and “space” in the name, and compares them against each other in that common space.

For a chatbot, the embedding intent classifier embeds user inputs and intent labels into the same space. Embeddings are vector representations of words or documents, and user inputs can be described as bags of words.
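As an illustration, here is a minimal sketch of turning messages into bag-of-words vectors with scikit-learn’s CountVectorizer (which is also what Rasa’s count vectors featurizer builds on); the example utterances are made up:

    from sklearn.feature_extraction.text import CountVectorizer

    # Toy training utterances (made-up examples).
    messages = [
        "show me a mexican restaurant",
        "show me a chinese restaurant",
        "hello there",
    ]

    vectorizer = CountVectorizer()
    bow = vectorizer.fit_transform(messages)  # sparse bag-of-words matrix
    print(bow.toarray())  # each row is the bag-of-words vector of one message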

During training, user inputs are compared against intent labels, and the following loss function, as given in the StarSpace paper, is minimized:
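$$\sum_{(a,b) \in E^+,\; b^- \in E^-} L^{batch}\big(sim(a,b),\; sim(a,b_1^-),\, \dots,\, sim(a,b_k^-)\big)$$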

  • a are documents (bags of words), i.e. user inputs
  • b are labels (intents) from the training set
  • negative entities b⁻ are sampled from the set of possible labels
  • positive pairs (a, b) come directly from the training set (nlu_data) of labeled data
  • sim(·, ·) is the similarity function; by default Rasa uses cosine similarity, the other possible value being “inner” (dot product)
  • L^batch is the loss function that compares the positive pair (a, b) with the negative pairs

Create a neural network

First of all, a neural network (NN) is created. Its input is the vector representation of a user request. The NN has the following layers (a code sketch follows the layer details below):

  1. A hidden densely-connected layer that produces output of dimension 256
  2. A dropout layer with rate 0.2, which drops 20% of the input units during training
  3. A hidden densely-connected layer that produces output of dimension 128
  4. A dropout layer with rate 0.2, which drops 20% of the input units during training
  5. The output (embedding) layer of dimension 20

The hidden densely-connected layers use:

  • the Rectified Linear Unit (ReLU) activation function, which computes max(features, 0)
  • an L2 kernel_regularizer for the weight matrix, with regularization scale C2 = 0.002
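Here is a minimal sketch of this network in the TF1 graph style of the 2019 Rasa code. The input dimension (300) and the placeholder names are my own assumptions for illustration; the layer sizes (256, 128), dropout rate (0.2) and embedding dimension (20) are the values listed above:

    import tensorflow as tf

    # Sketch of the feed-forward embedding network (TF1 graph style).
    is_training = tf.placeholder_with_default(False, shape=())
    a_in = tf.placeholder(tf.float32, shape=[None, 300])  # bag-of-words input

    reg = tf.contrib.layers.l2_regularizer(scale=0.002)  # C2 = 0.002

    x = tf.layers.dense(a_in, 256, activation=tf.nn.relu, kernel_regularizer=reg)
    x = tf.layers.dropout(x, rate=0.2, training=is_training)  # drops 20% of units
    x = tf.layers.dense(x, 128, activation=tf.nn.relu, kernel_regularizer=reg)
    x = tf.layers.dropout(x, rate=0.2, training=is_training)
    emb_a = tf.layers.dense(x, 20, kernel_regularizer=reg)  # embedding, no activation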

Cosine similarity

Cosine similarity is a measure of similarity between two non-zero vectors: the cosine of the angle between them:
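$$sim(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$$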

Cosine similarity is defined between the embedded user input and the embedded intent labels.
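A minimal TensorFlow sketch of this similarity; emb_a and emb_b stand for the embedded user input and intent label (the names are mine, not Rasa’s):

    import tensorflow as tf

    def cosine_similarity(emb_a, emb_b):
        # L2-normalize both embeddings so that their dot product
        # equals the cosine of the angle between them.
        emb_a = tf.nn.l2_normalize(emb_a, axis=-1)
        emb_b = tf.nn.l2_normalize(emb_b, axis=-1)
        return tf.reduce_sum(emb_a * emb_b, axis=-1)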

Loss function

To optimize the model, you need to define the loss. The classifier uses a margin-style loss: the similarity with the correct intent label is pushed above mu_pos, while the similarity with incorrect labels is pushed down toward mu_neg; a sketch follows the default values below.

Default values:

  • mu_pos: 0.8 (should satisfy 0.0 < mu_pos < 1.0 for “cosine”), how similar the algorithm should try to make embedding vectors for correct intent labels
  • mu_neg: -0.4 (should satisfy -1.0 < mu_neg < 1.0 for “cosine”), the maximum negative similarity allowed for incorrect intent labels
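A simplified sketch of such a margin loss, following the semantics of these defaults (the actual Rasa implementation differs in details, e.g. it can also penalize similarity between different intent embeddings); sim_pos and sim_neg are assumed tensor names:

    import tensorflow as tf

    mu_pos, mu_neg = 0.8, -0.4  # Rasa defaults quoted above

    def embedding_loss(sim_pos, sim_neg):
        # sim_pos: similarity to the correct intent, shape [batch]
        # sim_neg: similarities to sampled incorrect intents, shape [batch, k]
        # Push the similarity to the correct intent above mu_pos ...
        loss = tf.maximum(0.0, mu_pos - sim_pos)
        # ... and push the highest similarity to an incorrect intent below mu_neg.
        max_sim_neg = tf.reduce_max(sim_neg, axis=-1)
        loss += tf.maximum(0.0, max_sim_neg - mu_neg)
        return tf.reduce_mean(loss)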

Train

Optimizers incrementally change each variable in order to minimize the loss.

AdamOptimizer is a TensorFlow optimizer that implements the Adam algorithm.

The code below builds the graph components necessary for the optimization with AdamOptimizer and runs the TensorFlow training operation:
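This is a self-contained TF1-style sketch in which a toy quadratic loss stands in for the embedding loss defined above; 300 epochs is the classifier’s default:

    import numpy as np
    import tensorflow as tf

    # Toy loss: fit weights so that x_in @ w approaches 1.0.
    x_in = tf.placeholder(tf.float32, shape=[None, 2])
    w = tf.Variable(tf.zeros([2, 1]))
    loss = tf.reduce_mean(tf.square(tf.matmul(x_in, w) - 1.0))

    # Build the graph components for the optimization with AdamOptimizer.
    train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for epoch in range(300):  # Rasa's default number of epochs
            _, current_loss = sess.run(
                [train_op, loss],
                feed_dict={x_in: np.random.rand(32, 2)})
        print("final loss:", current_loss)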

Useful resources:

  • StarSpace: Embed All The Things! (Wu et al., Facebook AI Research): https://arxiv.org/abs/1709.03856
  • Rasa documentation: https://rasa.com/docs/
