Slot Filling using Sequence Models

Deepak Pandita · Holler Developers
Published in AI Research at Holler Technologies
Jan 27, 2022


Introduction

Slot filling typically follows intent detection and is a critical task in Natural Language Understanding. If you are unfamiliar with intent detection, I suggest reading my previous article, Intent Detection using Sequence Models, for an overview. Intent detection aims to recognize the intention behind a user query, whereas in slot filling we identify the slots in the query. We can think of slots as the parameters of a user query, e.g. “I want to watch The Matrix”. Here, the intent is to watch a movie, and “The Matrix” is the name of the movie that we would like to identify. Slot filling is a key component of task-oriented dialog systems and can also be used to better understand search queries.

In this article, we first go through the slot filling problem setup and the dataset used. Next, we use sequence models to solve the problem. We develop our solution in Python using the pandas, TensorFlow Keras, and scikit-learn libraries.

Problem

Slot filling is traditionally treated as a sequence labeling problem. Given an utterance, the task is to assign a label to each token in the utterance. The tokens are labeled using the IOB (Inside, Outside, Beginning) encoding scheme. For example, “The Matrix” will be labeled as B-movie_name I-movie_name.
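For the full example utterance, the token-level labels look like this:

I want to watch The Matrix
O O O O B-movie_name I-movie_name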

Dataset

To solve this problem, we need a dataset of utterances labeled with slots. For this article, we use the Snips dataset, a widely used benchmark for intent detection and slot filling. We take a subset of 7,874 utterances containing 35 slot types across 4 intent types (AddToPlaylist, BookRestaurant, PlayMusic, and SearchScreeningEvent). The data is stored in plain text files. Let’s go ahead and load the dataset (Figure 1).
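The exact loading code depends on how the files are laid out. As a minimal sketch, assuming two parallel plain-text files with hypothetical names seq.in (one utterance per line) and seq.out (the corresponding space-separated slot labels), it could look like this:

import pandas as pd

# Hypothetical file names: one utterance per line in seq.in and the
# corresponding space-separated slot labels per line in seq.out.
with open('seq.in') as f_in, open('seq.out') as f_out:
    utterances = [line.strip() for line in f_in]
    slots = [line.strip() for line in f_out]

df = pd.DataFrame({'utterance': utterances, 'slots': slots})
df.head()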

Figure 1: Sample Data

Data Preparation

After we load our dataset, we split it into train and test sets: 80% train and 20% test, using the scikit-learn library. The RANDOM_STATE variable makes sure we can replicate the experiments.
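A minimal sketch of the split, where the value of RANDOM_STATE is an arbitrary choice:

from sklearn.model_selection import train_test_split

RANDOM_STATE = 42  # any fixed seed makes the split reproducible

X_train, X_test, y_train, y_test = train_test_split(
    df['utterance'], df['slots'], test_size=0.2, random_state=RANDOM_STATE)

print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)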

(6299,) (1575,) (6299,) (1575,)

After splitting the dataset, we need to put it into a format that our model can accept. First, we initialize the tokenizers using the TensorFlow Keras library. We then apply the tokenizers to our train and test datasets to get integer sequences. We pad the sequences so that the length is the same across samples and datasets. We limit the size of the vocabulary to NUM_WORDS, which lets us handle out-of-vocabulary words seen during the test phase via the OOV_TOKEN. We also use one-hot vectors for the slot labels.
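A sketch of this preprocessing; the values of NUM_WORDS, OOV_TOKEN, and the padded length MAX_LEN are assumptions, not the exact settings used in the notebook:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

NUM_WORDS = 10000   # assumed vocabulary size
OOV_TOKEN = '<OOV>'
MAX_LEN = 35        # assumed maximum utterance length in tokens

# Tokenizer for the utterances; unseen words map to OOV_TOKEN.
word_tokenizer = Tokenizer(num_words=NUM_WORDS, oov_token=OOV_TOKEN)
word_tokenizer.fit_on_texts(X_train)

# Tokenizer for the slot labels; keep case and '-' characters intact.
label_tokenizer = Tokenizer(filters='', lower=False)
label_tokenizer.fit_on_texts(y_train)

def encode(texts, tokenizer):
    # Convert text to integer sequences and pad them to a common length.
    return pad_sequences(tokenizer.texts_to_sequences(texts),
                         maxlen=MAX_LEN, padding='post')

X_train_seq = encode(X_train, word_tokenizer)
X_test_seq = encode(X_test, word_tokenizer)

# One-hot encode the slot label at every timestep.
num_labels = len(label_tokenizer.word_index) + 1
y_train_seq = to_categorical(encode(y_train, label_tokenizer), num_classes=num_labels)
y_test_seq = to_categorical(encode(y_test, label_tokenizer), num_classes=num_labels)

print(X_train_seq.shape, X_test_seq.shape)
print(y_train_seq.shape, y_test_seq.shape)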

(6299, 35, 58) (1575, 35, 58)
(1575, 35, 1) (1575, 35, 1)

Training the Model

Now that the data is prepared, we are ready for the training phase. In this article, we train a bidirectional long short-term memory (BiLSTM) model by using the Sequential class from TensorFlow Keras and adding layers to it. We start with an Embedding layer to represent each word with a vector of length EMBEDDING_DIM. Then, we add a BiLSTM layer by wrapping an LSTM layer of dimension NUM_UNITS with ReLU activation in the Bidirectional wrapper class. To get the output, we wrap a Dense layer in a TimeDistributed layer, since we want the model to produce a slot label at each timestep, i.e. for each word. The Dense layer's output size is the number of slot labels, with a softmax activation. We then train the BiLSTM model using categorical cross-entropy loss and the Adam optimizer, holding out 10% of the data as a validation set.
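A sketch of the model; EMBEDDING_DIM and NUM_UNITS are assumed values:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense, TimeDistributed
from tensorflow.keras.metrics import Precision, Recall

EMBEDDING_DIM = 100  # assumed embedding size
NUM_UNITS = 128      # assumed LSTM dimension

model = Sequential([
    # Represent each word index with a vector of length EMBEDDING_DIM.
    Embedding(NUM_WORDS, EMBEDDING_DIM, input_length=MAX_LEN),
    # BiLSTM over the embeddings; return_sequences=True keeps one output per timestep.
    Bidirectional(LSTM(NUM_UNITS, activation='relu', return_sequences=True)),
    # Predict a distribution over slot labels at every timestep.
    TimeDistributed(Dense(num_labels, activation='softmax')),
])

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=[Precision(name='precision'), Recall(name='recall'), 'accuracy'])

history = model.fit(X_train_seq, y_train_seq, epochs=7, validation_split=0.1)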

We can see the loss on the training and validation sets decreasing as training progresses. We stop after 7 epochs here, but you can experiment with a greater number of epochs too.

Plot Learning Curves

After training, we plot the learning curves, i.e. the loss on the training and validation sets over the epochs.
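A minimal sketch of the plot, using the History object returned by model.fit:

import matplotlib.pyplot as plt

# Loss per epoch on the training and validation sets.
plt.plot(history.history['loss'], label='train')
plt.plot(history.history['val_loss'], label='validation')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()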

Figure 2: Learning Curves

As shown in Figure 2, the loss on the train and validation sets keeps decreasing and starts to stabilize with only a small gap between the two. Therefore, we can say the model fits the data well.

Evaluation

Now, let’s evaluate the performance of the model on the test set.
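With Keras, this is a single call (a sketch, reusing the test arrays prepared earlier):

# Evaluate loss, precision, recall, and accuracy on the test set.
model.evaluate(X_test_seq, y_test_seq)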

50/50 [==============================] - 0s 5ms/step - loss: 0.1515 - precision: 0.9787 - recall: 0.9531 - accuracy: 0.9624

As we can see, the accuracy of the model is 96.24% on the test data, but don’t let that mislead you. This metric is calculated over all tokens, and most of them are Outside (O) tokens. What we are really interested in is the performance at identifying the slots. A better alternative is to compute precision, recall, and F1-score over labeled spans; however, that is out of scope for this article.

Prediction using the model

Let’s run our model on an example.
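A sketch of the prediction step, reusing the tokenizers from the preprocessing step to encode the sentence and map the predicted indices back to label names:

import numpy as np

sentence = 'I want to watch The Matrix'

# Encode and pad the sentence exactly like the training data.
seq = pad_sequences(word_tokenizer.texts_to_sequences([sentence]),
                    maxlen=MAX_LEN, padding='post')

# Take the most likely label at each timestep and map indices back to label names.
pred = np.argmax(model.predict(seq), axis=-1)[0]
index_to_label = {i: label for label, i in label_tokenizer.word_index.items()}

print(sentence)
print([index_to_label.get(i, 'O') for i in pred[:len(sentence.split())]])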

I want to watch The Matrix
['O', 'O', 'O', 'O', 'B-movie_name', 'I-movie_name']

The model correctly predicts the movie name in the input sentence as “The Matrix”.

Summary

In this article, we discussed the problem of slot filling, a key component of Natural Language Understanding. We prepared the data and implemented a BiLSTM model for labeling slots in a sentence. We briefly discussed the model's performance and evaluation. We also demonstrated how the trained model can be used to predict slot labels for an input sentence.

A Jupyter Notebook containing the code can be found here.


Thanks for reading! If you have questions or if you would like us to write on anything, please drop a comment or reach out on my website.
