Creating Your Own Intent Classifier

Het Pandya · Analytics Vidhya · Oct 20, 2020

Being a fanboy of NLP, I always used to wonder how Google Assistant or Alexa understands me when I ask it to do something. The question that followed was: could I make my machine understand me too? The answer was intent classification.

Intent classification is a part of Natural Language Understanding (NLU), where a machine learning or deep learning algorithm learns to classify a given phrase based on the phrases it has been trained on.

Let’s take a fun example: suppose I’m making an assistant like Alexa.

For simplicity, we’ll take 3 tasks: turn on the lights, turn them off, and tell us what the weather is. Let’s name these 3 tasks TurnOnLights, TurnOffLights and Weather. In NLU, such tasks are called ‘intents’. In other words, an intent is a group of similar phrases grouped under a common name, which makes it easier for the deep learning algorithm to understand what the user is saying. Each intent is given a certain number of training phrases so that the model can learn to classify real-time phrases.
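
For instance, the toy assistant above could be described as a simple mapping from each intent name to its training phrases; a hypothetical, made-up sketch:

intents = {
    "TurnOnLights":  ["turn on the lights", "switch the lights on", "lights on please"],
    "TurnOffLights": ["turn off the lights", "switch the lights off", "kill the lights"],
    "Weather":       ["what's the weather like", "will it rain today", "how hot is it outside"],
}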

Now that we know what intent classification is, let’s begin the cool stuff! I have written a notebook in case you want to follow along, which you can find in my GitHub repo here.

For ease, let’s follow the following directory structure:

Your directory
├───models
├───utils
└───intent_classification.ipynb

Installing the dependencies

Install the required dependencies using the following command:

pip install wget tensorflow==1.5 pandas numpy keras

Dataset

We shall be using the publicly available CLINC150 dataset. It is a collection of phrases for 150 different intents across 10 domains. You can read more about the dataset here.

We shall download the dataset using:

import wget
url = 'https://raw.githubusercontent.com/clinc/oos-eval/master/data/data_full.json'
wget.download(url)
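
If you would like to peek at the file before moving on, the JSON holds one list per split (‘train’, ‘val’, ‘test’ plus the ‘oos_*’ out-of-scope variants), and each entry is a [phrase, intent] pair:

import json

with open('data_full.json') as file:
    raw = json.load(file)

print(raw.keys())       # the six split names we use below
print(raw['train'][0])  # a single [phrase, intent] pair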

Preparing the dataset

The dataset has already been split into ‘train’, ‘test’ and ‘validation’ sets, but we shall create our own train and validation sets, since we do not need a test set. We shall do this by merging all the sets and then splitting them with scikit-learn into ‘train’ and ‘validation’ sets. This also gives us more training data.

import numpy as np
import json
# Loading json data
with open('data_full.json') as file:
    data = json.loads(file.read())

# Loading out-of-scope intent data
val_oos = np.array(data['oos_val'])
train_oos = np.array(data['oos_train'])
test_oos = np.array(data['oos_test'])

# Loading other intents data
val_others = np.array(data['val'])
train_others = np.array(data['train'])
test_others = np.array(data['test'])

# Merging out-of-scope and other intent data
val = np.concatenate([val_oos,val_others])
train = np.concatenate([train_oos,train_others])
test = np.concatenate([test_oos,test_others])
data = np.concatenate([train,test,val])
data = data.T

text = data[0]
labels = data[1]

Next, we shall create the train and validation splits using:

from sklearn.model_selection import train_test_split

train_txt, test_txt, train_label, test_labels = train_test_split(text, labels, test_size=0.3)

Dataset preprocessing

Since deep learning is a game of numbers, it expects our data to be in numerical form. We shall tokenize our dataset; that is, break each sentence into individual words and convert those words into numerical representations. We shall use the Keras Tokenizer to tokenize our phrases with the following code:

from tensorflow.python.keras.preprocessing.text import Tokenizer
from tensorflow.python.keras.preprocessing.sequence import pad_sequences
max_num_words = 40000
classes = np.unique(labels)

tokenizer = Tokenizer(num_words=max_num_words)
tokenizer.fit_on_texts(train_txt)
word_index = tokenizer.word_index
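
To get a feel for what the tokenizer learned, you can convert a single phrase into its integer sequence (the exact integers depend on the vocabulary fitted on your training split):

sample = ["what is the weather like today"]
print(tokenizer.texts_to_sequences(sample))
# Prints a nested list with one integer index per word found in the vocabulary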

To feed our data to the deep learning model, all our phrases must be of the same length. We shall pad all our training phrases with 0s so that they all have the same length.

ls = []
for c in train_txt:
    ls.append(len(c.split()))
# Use the 98th percentile of the phrase lengths as the padded sequence length
maxLen = int(np.percentile(ls, 98))
train_sequences = tokenizer.texts_to_sequences(train_txt)
train_sequences = pad_sequences(train_sequences, maxlen=maxLen, padding='post')
test_sequences = tokenizer.texts_to_sequences(test_txt)
test_sequences = pad_sequences(test_sequences, maxlen=maxLen, padding='post')
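
As a quick sanity check (the exact numbers depend on your split and the computed maxLen), every padded sequence should now have the same length:

print(train_sequences.shape)  # (number of training phrases, maxLen)
print(test_sequences.shape)   # (number of validation phrases, maxLen)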

Next, we need to convert our labels into one-hot encoded form. You can read more about one-hot encoding here.
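
As a tiny illustration with the three toy intents from earlier (hypothetical labels, purely to show what the vectors look like), each class becomes a binary vector with a single 1:

import numpy as np

toy_classes = ['TurnOffLights', 'TurnOnLights', 'Weather']
for cls, vec in zip(toy_classes, np.eye(len(toy_classes))):
    print(cls, vec)
# TurnOffLights [1. 0. 0.]
# TurnOnLights [0. 1. 0.]
# Weather [0. 0. 1.]

Now let’s encode our actual labels: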

from sklearn.preprocessing import OneHotEncoder,LabelEncoder

label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(classes)

onehot_encoder = OneHotEncoder(sparse=False)
integer_encoded = integer_encoded.reshape(len(integer_encoded), 1)
onehot_encoder.fit(integer_encoded)
train_label_encoded = label_encoder.transform(train_label)
train_label_encoded = train_label_encoded.reshape(len(train_label_encoded), 1)
train_label = onehot_encoder.transform(train_label_encoded)
test_labels_encoded = label_encoder.transform(test_labels)
test_labels_encoded = test_labels_encoded.reshape(len(test_labels_encoded), 1)
test_labels = onehot_encoder.transform(test_labels_encoded)

Before we create our model…

Before we begin training our model, we shall use GloVe (Global Vectors). GloVe provides N-dimensional vector representations of words, trained on a large corpus at Stanford University. Since these vectors were trained on a large corpus, they will help the model learn our phrases even better.

We’ll download GloVe using:

import wget
url ='https://www.dropbox.com/s/a247ju2qsczh0be/glove.6B.100d.txt?dl=1'
wget.download(url)

Once the download is complete, we’ll store it in a Python Dictionary:

embeddings_index = {}
with open('glove.6B.100d.txt', encoding='utf8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs
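
Each entry should be a 100-dimensional vector, since we downloaded the 100d variant of glove.6B; a quick check:

print(len(embeddings_index))          # number of words in the GloVe file
print(embeddings_index['the'].shape)  # (100,)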

Since GloVe contains vector representations of all the words in a large corpus, we only need the vectors for the words that appear in our own corpus. We shall therefore create an embedding matrix that contains the vector representations of just the words present in our dataset. Since our dataset has already been tokenized, each token has been assigned a unique number by the Keras Tokenizer. This unique number can be treated as the index of that word’s vector in the embedding matrix; in other words, the nth word in the tokenizer’s vocabulary is represented by the vector at the nth position of the embedding matrix.

all_embs = np.stack(list(embeddings_index.values()))
emb_mean, emb_std = all_embs.mean(), all_embs.std()
num_words = min(max_num_words, len(word_index)) + 1
embedding_dim = len(embeddings_index['the'])
# Initialize the matrix with random values drawn from the GloVe distribution,
# then overwrite the rows of words that have a GloVe vector
embedding_matrix = np.random.normal(emb_mean, emb_std, (num_words, embedding_dim))
for word, i in word_index.items():
    if i >= max_num_words:
        break
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector

Model Preparation

Let’s define our model’s architecture and see the model in action.

from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense, Input, Dropout, LSTM, Activation, Bidirectional,Embedding
model = Sequential()

model.add(Embedding(num_words, 100, trainable=False,input_length=train_sequences.shape[1], weights=[embedding_matrix]))
model.add(Bidirectional(LSTM(256, return_sequences=True, recurrent_dropout=0.1, dropout=0.1), 'concat'))
model.add(Dropout(0.3))
model.add(LSTM(256, return_sequences=False, recurrent_dropout=0.1, dropout=0.1))
model.add(Dropout(0.3))
model.add(Dense(50, activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(classes.shape[0], activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

We pass the embedding matrix to the Embedding layer as its weights and freeze it with trainable=False, so the GloVe vectors are not updated during training.
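
A quick way to confirm that the embedding layer is frozen, and to eyeball the layer shapes:

print(model.layers[0].trainable)  # False -- the embedding weights will not be updated
model.summary()                   # the embedding parameters appear as non-trainable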

Model Training

Finally, the time to train the model.

history = model.fit(train_sequences, train_label, epochs=20,
                    batch_size=64, shuffle=True,
                    validation_data=[test_sequences, test_labels])

This should take about an hour or so, depending on your machine. When training completes, we can visualize the metrics as follows:

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.title('Model Accuracy')
plt.ylabel('Accuracy')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()
Model Accuracy curve

Woohoo! We get 92.45% training accuracy and 88.86% validation accuracy, which is pretty decent.

Here is the loss curve:

import matplotlib.pyplot as plt
%matplotlib inline
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title('Model Loss')
plt.ylabel('Loss')
plt.xlabel('Epoch')
plt.legend(['Train', 'Validation'], loc='upper left')
plt.show()

The training loss settles around 0.2 and the validation loss around 0.5. You can play with the model architecture and see if the loss goes down even further 😉
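
If you do experiment, one option (not used in the run above; the checkpoint path is just illustrative) is to add early stopping and checkpointing callbacks, so that longer runs stop once the validation loss plateaus and only the best weights are kept:

from tensorflow.python.keras.callbacks import EarlyStopping, ModelCheckpoint

callbacks = [
    # Stop training when the validation loss has not improved for 3 epochs
    EarlyStopping(monitor='val_loss', patience=3),
    # Keep the best model seen so far (hypothetical file name)
    ModelCheckpoint('models/intents_best.h5', monitor='val_loss', save_best_only=True),
]

history = model.fit(train_sequences, train_label, epochs=50,
                    batch_size=64, shuffle=True,
                    validation_data=[test_sequences, test_labels],
                    callbacks=callbacks)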

Saving Model, Tokenizer, Label Encoder and Labels

Let’s save the trained model, the tokenizer, the label encoder and the labels so we can use them later.

import pickle
import json
model.save('models/intents.h5')

with open('utils/classes.pkl', 'wb') as file:
    pickle.dump(classes, file)

with open('utils/tokenizer.pkl', 'wb') as file:
    pickle.dump(tokenizer, file)

with open('utils/label_encoder.pkl', 'wb') as file:
    pickle.dump(label_encoder, file)

Time to see everything in action

We’ve come a long way; let’s see what the final destination looks like.

I’ve created the following class to use our model:

import numpy as np
from tensorflow.python.keras.preprocessing.sequence import pad_sequences

class IntentClassifier:
    def __init__(self, classes, model, tokenizer, label_encoder):
        self.classes = classes
        self.classifier = model
        self.tokenizer = tokenizer
        self.label_encoder = label_encoder

    def get_intent(self, text):
        self.text = [text]
        self.test_keras = self.tokenizer.texts_to_sequences(self.text)
        # maxlen must match the maxLen used when training the model
        self.test_keras_sequence = pad_sequences(self.test_keras, maxlen=16, padding='post')
        self.pred = self.classifier.predict(self.test_keras_sequence)
        return self.label_encoder.inverse_transform(np.argmax(self.pred, 1))[0]

To use the class, we shall load our saved files first:

import pickle

from tensorflow.python.keras.models import load_model
model = load_model('models/intents.h5')

with open('utils/classes.pkl', 'rb') as file:
    classes = pickle.load(file)

with open('utils/tokenizer.pkl', 'rb') as file:
    tokenizer = pickle.load(file)

with open('utils/label_encoder.pkl', 'rb') as file:
    label_encoder = pickle.load(file)

Time for the test!😋

nlu = IntentClassifier(classes,model,tokenizer,label_encoder)
print(nlu.get_intent("is it cold in India right now"))
# Prints 'weather'

That’s it folks! Thank you for reading😃. Happy learning!
