How to Create a Chatbot in Python II

Developing an NLP-based chatbot using an efficient Transformer, PART-2

SARVESH KUMAR SHARMA
Analytics Vidhya
16 min read · Apr 29, 2021


Let's first recall the things we discussed in the first part, How to Create a Chatbot in Python.

Dataset and source code: https://github.com/shsarv/ChatBot

We learned that a chatbot is an artificial agent that interacts with humans or with other bots. Chatbots use natural language processing (NLP) and machine learning (ML) algorithms to learn from data, and NLP is the computer's ability to understand and process human speech and respond in a language that is understandable for humans. This makes the interaction feel like communication between two people. We also covered the advantages of chatbots, such as 24*7 support, instant answers, and ordering without human help. Finally, we started to create our own chatbot using the Reformer, or efficient Transformer, and explored different aspects of the MultiWoz dataset.

Now we will start the second phase of our chatbot creation, where we will process the data to feed it into the model, train our model, and generate a dialogue by feeding a question to the model. These are the steps we are going to perform to complete these tasks:

  • Processing the data for Reformer inputs — Tokenizing, batching with bucketing
  • Reversible layers
  • Reversible layers and randomness
  • ReformerLM Training
  • Decode from a pretrained model.

Part 2: Processing the data for Reformer inputs

We will now use the get_conversation() function to process the data. The Reformer expects inputs of this form:

Person 1: Why am I so happy? Person 2: Because you are learning how to create a chatbot. Person 1: … Person 2: …

The conversation continues in this fashion. As you can see, ‘Person 1’ and ‘Person 2’ act as delimiters, so the model can recognize who is talking and come up with the corresponding response for each person. Let's proceed to process the text in this fashion for the Reformer. First, here is a quick sketch of get_conversation() from Part 1 as a refresher; after that, we grab all the conversation strings from the dialogue files and put them in a list.
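The sketch below assumes, as in the MultiWoz files, that each entry of DIALOGUE_DB has a 'log' list whose turns alternate between the two speakers and store each utterance in a 'text' field; the actual function in the repository may differ in details.

def get_conversation(file, data_db):
    # concatenate the turns of one dialogue into a single string delimited by speaker tags
    result = ''
    for i, turn in enumerate(data_db[file]['log']):
        # even turns belong to Person 1 (the user), odd turns to Person 2 (the agent)
        delimiter = ' Person 1: ' if i % 2 == 0 else ' Person 2: '
        result += delimiter + turn['text']
    return result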

# the keys are the file names
all_files = DIALOGUE_DB.keys()

# initialize empty list
untokenized_data = []

# loop over all files
for file in all_files:
    # get_conversation returns a string delimited by Person 1 and Person 2
    result = get_conversation(file, DIALOGUE_DB)

    # append to the list
    untokenized_data.append(result)

# print the first element to check if it's the same as the one we got before
print(untokenized_data[0])

output-

Person 1: am looking for a place to to stay that has cheap price range it should be in a type of hotel Person 2: Okay, do you have a specific area you want to stay in? Person 1: no, i just need to make sure it's cheap. oh, and i need parking Person 2: I found 1 cheap hotel for you that includes parking. Do you like me to book it? Person 1: Yes, please. 6 people 3 nights starting on tuesday. Person 2: I am sorry but I wasn't able to book that for you for Tuesday. Is there another day you would like to stay or perhaps a shorter stay? Person 1: how about only 2 nights. Person 2: Booking was successful.
Reference number is : 7GAWK763. Anything else I can do for you? Person 1: No, that will be all. Good bye. Person 2: Thank you for using our services.

Now let us split the list into train and eval datasets.

# shuffle the list we generated above
random.shuffle(untokenized_data)

cut_off = int(len(untokenized_data) * .05)

# slice the list. the last elements after the cut_off value will be the eval set. the rest is for training.
train_data, eval_data = untokenized_data[:-cut_off], untokenized_data[-cut_off:]

print(f'number of conversations in the data set: {len(untokenized_data)}')
print(f'number of conversations in train set: {len(train_data)}')
print(f'number of conversations in eval set: {len(eval_data)}')

output -

number of conversations in the data set: 10438
number of conversations in train set: 9917
number of conversations in eval set: 521

So our training set contains 9,917 conversations and our evaluation set contains 521.

2.1 Tokenizing, batching with bucketing

We can now proceed in generating tokenized batches of our data. Let’s first define a utility generator function to yield elements from our data sets:

def stream(data):
    # loop over the entire data
    while True:
        # get a random element
        d = random.choice(data)

        # yield a tuple pair of identical values
        # (i.e. our inputs to the model will also be our targets during training)
        yield (d, d)
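As a quick optional check (not part of the original pipeline), we can pull one pair from the generator and confirm that the input and target are indeed identical:

# peek at one (input, target) pair from the stream; both should be the same conversation string
sample_input, sample_target = next(stream(train_data))
assert sample_input == sample_target
print(sample_input[:100])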

Now let's define our data pipeline for tokenizing and batching the data. We will bucket by sequence length and put an upper bound on the token length. We will use Trax, whose combinators let us build the data pipeline: shuffle, tokenize, filter out overly long sequences, bucket by length, and add loss weights. Finally, we apply the pipeline to our train and eval sets.

data_pipeline = trax.data.Serial(
    # randomize the stream
    trax.data.Shuffle(),

    # tokenize the data
    trax.data.Tokenize(vocab_dir=VOCAB_DIR,
                       vocab_file=VOCAB_FILE),

    # filter too long sequences
    trax.data.FilterByLength(2048),

    # bucket by length
    trax.data.BucketByLength(boundaries=[128, 256, 512, 1024],
                             batch_sizes=[16, 8, 4, 2, 1]),

    # add loss weights but do not add them to the padding tokens (i.e. 0)
    trax.data.AddLossWeights(id_to_mask=0)
)

train_stream = data_pipeline(stream(train_data))
eval_stream = data_pipeline(stream(eval_data))

Let's peek into the train stream. The stream generators yield (input, target, weights) tuples; let's grab just the input for inspection.

inp, _, _ = next(train_stream)

# print the shape. format is (batch size, token length)
print("input shape: ", inp.shape)

# detokenize the first element
print(trax.data.detokenize(inp[0], vocab_dir=VOCAB_DIR, vocab_file=VOCAB_FILE))

output

input shape:  (2, 1024)
Person 1: Well, I am planning a trip and need some help with a train. Person 2: Of course, do you know your departure location and time? Person 1: I weill departing on thursday from cambridge and need to arrive by 10:30 in stevenage. Person 2: I have three, leaving between 5:21 and 9:21. Do you have a preference? Person 1: Not really. I need to know how much a ticket costs and how long it travels. Person 2: Train TR0552 arrives by 10:10 the ticket price is 12.80 pounds and the travel time is 49 minutes. Person 1: Perfect. I am also looking for a place to stay with free parking Person 2: No problem, how many nights will you be staying? Person 1: I'm not sure of that yet. It does need to be in the north. Person 2: Would you prefer a guest house or hotel? Person 1: I would like a hotel in the north, the star of the hotel and free internet. Person 2: how about acorn guest house? it's 4 stars. Person 1: Thank you, I'll take it. Can you book me for that hotel? Person 2: I would be happy to- I just need to know for which nights and for how many people. Person 1: Just for myself. And say, 2 nights ought to do it. Person 2: 2 nights starting on Thursday? Person 1: Actually I am just calling for information not a booking. I need a hotel, not guesthouse, in the north with free parking. Can you recommend a hotel? Person 2: I don't have any hotels in the north that meet your criteria. Would you like me to look in a different area? Person 1: I'm really needing something in the north. Please try again. Person 2: I just double checked. Still no hotels in the north that meet your criteria. What about a guesthouse? Person 1: No, I need a hotel in the north with free parking, no other criteria. I don't need free internet. Person 2: I'm sorry but I don't have anything meeting that criteria. Person 1: Well okay then let's just go with whatever's available in the north. Person 2: Does the number of stars matter? Person 1: Not really, can you give me the number of stars and whether or not they have internet? Person 2: Ashley hotel, it has two stars and yes they have internet Person 1: Sweet. That's all I needed then. Person 2: Thank you for calling today. Please call again if you have anything else that you need. Goodbye.

Part 3: Reversible layers

When running large deep models, we often run out of memory because each layer allocates memory to store activations for use in backpropagation. To save this resource, we need to be able to recompute these activations during the backward pass without storing them during the forward pass. Let's first look at how residual connections work in the standard Transformer.

This is how the residual connections are implemented in the standard Transformer. Given that F() is attention and G() is the feed-forward (FF) layer, the forward pass computes:

y_a = x + F(x)
y_b = y_a + G(y_a)

As you can see, this requires that x and y_a be saved so they can be used during backpropagation. We want to avoid this to conserve memory, and this is where reversible residual connections come in. The key idea is that we start with two copies of the input to the model, and at each layer we only update one of them; the activations that we don't update are the ones used to compute the residuals. In this reversible setup we get the following instead:

y1 = x1 + F(x2)
y2 = x2 + G(y1)

To recover (x1, x2) from (y1, y2), we compute:

x2 = y2 - G(y1)
x1 = y1 - F(x2)

With this configuration, we're now able to run the network fully in reverse. Notice that during the backward pass, x1 and x2 can be recomputed based solely on the values of y1 and y2, so there is no need to save them during the forward pass.

Now we will implement the reversible_layer_forward function using the equations above. This function takes the input vector x and the functions f and g and returns the concatenation of y1 and y2. To do this, we split x before going through the reversible residual steps; we can then use those two vectors in the reversible_layer_reverse function. We use np.concatenate() to form the output, being careful to match the axis of np.split().

def reversible_layer_forward(x, f, g):
    """
    Args:
        x (np.array): an input vector or matrix
        f (function): a function which operates on a vector/matrix
        g (function): a function which operates on a vector/matrix
    Returns:
        y (np.array): an output vector or matrix whose form is determined by 'x', f and g
    """
    # split the input vector into two (along the last axis because it is the depth dimension)
    x1, x2 = np.split(x, 2, axis=-1)
    y1 = x1 + f(x2)
    y2 = x2 + g(y1)

    # concatenate y1 and y2 along the depth dimension. be sure output is of type np.ndarray
    y = np.concatenate([y1, y2], axis=-1)
    return y

We will now implement the reversible_layer_reverse function. This is possible because at every step we have y1 and y2 along with the functions f and g, where f is the attention and g is the feed-forward layer.

def reversible_layer_reverse(y, f, g):
    """
    Args:
        y (np.array): an input vector or matrix
        f (function): a function which operates on a vector/matrix of the form of 'y'
        g (function): a function which operates on a vector/matrix of the form of 'y'
    Returns:
        x (np.array): the recovered input vector or matrix corresponding to 'y', f and g
    """
    # split the input vector into two (along the last axis because it is the depth dimension)
    y1, y2 = np.split(y, 2, axis=-1)
    x2 = y2 - g(y1)
    x1 = y1 - f(x2)

    # concatenate x1 and x2 along the depth dimension
    x = np.concatenate([x1, x2], axis=-1)
    return x
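Before adding randomness, we can sanity-check the pair of functions with simple deterministic choices for f and g. The specific g and input_vector below are illustrative assumptions (not from the MultiWoz data); the same names are reused by the snippet in the next section.

# simple deterministic functions standing in for attention and feed-forward
f = lambda x: 2 * x
g = lambda x: x ** 3

# a small random input with an even depth dimension so np.split works
input_vector = np.random.uniform(size=(32,))

output_vector = reversible_layer_forward(input_vector, f, g)
reversed_vector = reversible_layer_reverse(output_vector, f, g)

# reversing the output should recover the original input
assert np.allclose(reversed_vector, input_vector)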

3.1 Reversible layers and randomness

We will use fastmath's random functions and keys. Given the same key, trax.fastmath.random.uniform() returns the same values. This is required for the backward pass to recover the correct layer inputs when random noise is introduced in the layer.

# Layers like dropout have noise, so let's simulate it here:
f = lambda x: x + np.random.uniform(size=x.shape)

# See that the above doesn't work any more:
output_vector = reversible_layer_forward(input_vector, f, g)
reversed_vector = reversible_layer_reverse(output_vector, f, g)

assert not np.allclose(reversed_vector, input_vector)  # reversal fails to recover the input!

# It failed because the noise added during the reverse pass came from a different random draw
# than the one used in the forward pass.

random_seed = 27686
rng = trax.fastmath.random.get_prng(random_seed)
f = lambda x: x + trax.fastmath.random.uniform(key=rng, shape=x.shape)

# See that it works now as the same rng is used on forward and reverse.
output_vector = reversible_layer_forward(input_vector, f, g)
reversed_vector = reversible_layer_reverse(output_vector, f, g)

assert np.allclose(reversed_vector, input_vector, atol=1e-07)
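The reason this works is that trax.fastmath.random.uniform() is deterministic for a given key: feeding it the same rng twice produces identical values. A small check (not in the original notebook):

# the same key always yields the same pseudo-random values
a = trax.fastmath.random.uniform(key=rng, shape=(3,))
b = trax.fastmath.random.uniform(key=rng, shape=(3,))
assert np.allclose(a, b)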

Part 4: ReformerLM Training

We will now proceed to training our model. Since we already know the two main components that differentiate the Reformer from the standard Transformer, LSH attention and the reversible layers covered above, we can simply use the pre-built ReformerLM model implemented in Trax.

Similar to the Transformer, we want to apply an attention and a feed-forward layer to our inputs. For the Reformer, we improve memory efficiency by using reversible decoder blocks, and you can picture their implementation in Trax like below:
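To make the stack mechanics concrete, here is a plain-NumPy sketch of one reversible decoder block. It is not the actual Trax implementation (Trax builds this out of its tl.ReversibleHalfResidual and tl.ReversibleSwap layers); it only mirrors the order of operations.

def half_residual(stack, layer_fn):
    # add a residual computed from the second stack element to the top element
    top, second = stack
    return (top + layer_fn(second), second)

def swap(stack):
    # exchange the two elements on the stack
    top, second = stack
    return (second, top)

def reversible_decoder_block_sketch(x1, x2, f, g):
    stack = (x1, x2)
    stack = half_residual(stack, f)  # (y1, x2)  where y1 = x1 + f(x2)
    stack = swap(stack)              # (x2, y1)
    stack = half_residual(stack, g)  # (y2, y1)  where y2 = x2 + g(y1)
    stack = swap(stack)              # (y1, y2)  ready for the next block
    return stack

Calling this with the toy f and g from the sanity check above produces exactly the same (y1, y2) that reversible_layer_forward computes.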

You can see that it takes the initial inputs x1 and x2 and performs the first equation of the reversible networks from Part 3. As you have also seen, the reversible residual has two equations for the forward pass, so doing just one of them constitutes only half of the reversible decoder block. Before doing the second equation (i.e. the second half of the reversible residual), it first needs to swap the elements to account for the stack semantics in Trax: it simply puts x2 on top of the stack so it can be fed to the add block of the half-residual layer. It then swaps the two outputs again so they can be fed to the next layer of the network. All of this arrives at the two equations from Part 3, and they can be used to recompute the activations during the backward pass.

Now we will implement a wrapper function that returns a Reformer language model. We can use Trax's ReformerLM to do this quickly; it will have the architecture described above.

def ReformerLM(vocab_size=33000, n_layers=2, mode='train', attention_type=tl.SelfAttention):
    """
    Args:
        vocab_size (int): size of the vocabulary
        n_layers (int): number of decoder layers
        mode (string): setting of the model which can be 'train', 'eval', or 'predict'
        attention_type (class): attention class to use
    Returns:
        model (ReformerLM): a reformer language model implemented in Trax
    """
    # initialize an instance of Trax's ReformerLM class
    model = trax.models.reformer.ReformerLM(
        # set vocab size
        vocab_size=vocab_size,
        # set number of layers
        n_layers=n_layers,
        # set mode
        mode=mode,
        # set attention type
        attention_type=attention_type
    )
    return model

# display the model
temp_model = ReformerLM(mode='train')
print(str(temp_model))

# free memory
del temp_model

We will now write a function that takes in our model and trains it, implementing training_loop for the network above. Here is a list of things it should do:

  • Create TrainTask and EvalTask
  • Create the training loop trax.supervised.training.Loop

Pass in the following to train_task:

  • labeled_data=train_gen
  • loss_layer=tl.CrossEntropyLoss()
  • optimizer=trax.optimizers.Adam(0.01)
  • lr_schedule=lr_schedule
  • n_steps_per_checkpoint=10

We will be using the CrossEntropyLoss loss function with the Adam optimizer. Please read the Trax documentation to get a full understanding.

Pass in the following to eval_task:

  • labeled_data=eval_gen
  • metrics=[tl.CrossEntropyLoss(), tl.Accuracy()]

This function should return a training.Loop object. To read more about this check the docs.

def training_loop(ReformerLM, train_gen, eval_gen, output_dir="./model/"):
    """
    Args:
        ReformerLM: the Reformer language model you are building
        train_gen (generator): train data generator.
        eval_gen (generator): validation data generator.
        output_dir (string): path to save the model output. Defaults to './model/'.

    Returns:
        trax.supervised.training.Loop: training loop for the model.
    """

    # use the warmup_and_rsqrt_decay learning rate schedule
    lr_schedule = trax.lr.warmup_and_rsqrt_decay(
        n_warmup_steps=1000, max_value=0.01)

    # define the train task
    train_task = training.TrainTask(
        # labeled data
        labeled_data=train_gen,
        # loss layer
        loss_layer=tl.CrossEntropyLoss(),
        # optimizer
        optimizer=trax.optimizers.Adam(0.01),
        # lr_schedule
        lr_schedule=lr_schedule,
        # n_steps
        n_steps_per_checkpoint=10
    )

    # define the eval task
    eval_task = training.EvalTask(
        # labeled data
        labeled_data=eval_gen,
        # metrics
        metrics=[tl.CrossEntropyLoss(), tl.Accuracy()]
    )

    loop = training.Loop(ReformerLM(mode='train'),
                         train_task,
                         eval_tasks=[eval_task],
                         output_dir=output_dir)
    return loop
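Assuming the train_stream and eval_stream defined earlier, a minimal (hypothetical) way to use this function is to build the loop and run it for a handful of steps as a smoke test; training the real model well takes many more steps and an accelerator.

# build the training loop and run a few steps
loop = training_loop(ReformerLM, train_stream, eval_stream)
loop.run(10)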

Part 5: Decode from a pretrained model

We will now proceed to decoding using the model architecture we just implemented. We will use the autoregressive_sample_stream() decoding method from Trax to do fast inference. Let's define a few parameters to initialize our model.

# define the `predict_mem_len` and `predict_drop_len` of tl.SelfAttention
def attention(*args, **kwargs):
    # number of input positions to remember in a cache when doing fast inference
    kwargs['predict_mem_len'] = 120
    # number of input elements to drop once the fast inference input cache fills up
    kwargs['predict_drop_len'] = 120
    # return the attention layer with the parameters defined above
    return tl.SelfAttention(*args, **kwargs)

# define the model using the ReformerLM function we implemented earlier
model = ReformerLM(
    vocab_size=33000,
    n_layers=6,
    mode='predict',
    attention_type=attention,
)

# define an input signature so we can initialize our model. shape will be (1, 1) and the data type is int32.
shape11 = trax.shapes.ShapeDtype((1, 1), dtype=np.int32)

We can now initialize our model from a file containing the pretrained weights. We will save this starting state so we can reset the model state when we generate a new conversation. This will become clearer in the generate_dialogue() function later.

# initialize from file
model.init_from_file('chatbot_model1.pkl.gz',
                     weights_only=True, input_signature=shape11)

# save the starting state
STARTING_STATE = model.state

Let’s define a few utility functions as well to help us tokenize and detokenize. We can use the tokenize() and detokenize() from trax.data.tf_inputs to do this.

def tokenize(sentence, vocab_file, vocab_dir):
    return list(trax.data.tokenize(iter([sentence]), vocab_file=vocab_file, vocab_dir=vocab_dir))[0]

def detokenize(tokens, vocab_file, vocab_dir):
    return trax.data.detokenize(tokens, vocab_file=vocab_file, vocab_dir=vocab_dir)
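A quick (hypothetical) round trip through these helpers shows what the model actually sees, namely an array of subword ids:

sample = 'Person 1: Are there theatres in town?'
tokens = tokenize(sample, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR)
print(tokens)                                                            # array of subword ids
print(detokenize(tokens, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR))    # back to the original text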

We are now ready to define our decoding function. It returns a generator that yields the next symbol output by the model, which lets us predict the next words given just a starting sentence.

def ReformerLM_output_gen(ReformerLM, start_sentence, vocab_file, vocab_dir, temperature):
    """
    Args:
        ReformerLM: the Reformer language model you just trained
        start_sentence (string): starting sentence of the conversation
        vocab_file (string): vocabulary filename
        vocab_dir (string): directory of the vocabulary file
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)

    Returns:
        generator: yields the next symbol generated by the model
    """

    # create input tokens using the tokenize function defined above
    input_tokens = tokenize(start_sentence, vocab_file=vocab_file, vocab_dir=vocab_dir)

    # add a batch dimension to the array, converting it from shape (n,) to (1, n)
    input_tokens_with_batch = np.array(input_tokens)[None, :]

    # call the autoregressive_sample_stream function from trax
    output_gen = trax.supervised.decoding.autoregressive_sample_stream(
        # model
        ReformerLM,
        # inputs will be the tokens with batch dimension
        inputs=input_tokens_with_batch,
        # temperature
        temperature=temperature
    )

    return output_gen

unit testing -

import pickle

WEIGHTS_FROM_FILE = ()

with open('weights', 'rb') as file:
    WEIGHTS_FROM_FILE = pickle.load(file)

shape11 = trax.shapes.ShapeDtype((1, 1), dtype=np.int32)

def attention(*args, **kwargs):
    kwargs['predict_mem_len'] = 120
    kwargs['predict_drop_len'] = 120
    return tl.SelfAttention(*args, **kwargs)

test_model = ReformerLM(vocab_size=5, n_layers=1, mode='predict', attention_type=attention)

test_output_gen = ReformerLM_output_gen(test_model, "test", vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR, temperature=0)

test_model.init_weights_and_state(shape11)

test_model.weights = WEIGHTS_FROM_FILE

output = []

for i in range(6):
    output.append(next(test_output_gen)[0])

print(output)

# free memory
del test_model
del WEIGHTS_FROM_FILE
del test_output_gen

Great! Now we can see the model in action. The utility function below calls the generator we just implemented and formats the output to make it easier to read. It uses the model in 'predict' mode that we initialized from the pretrained weights above, together with the saved STARTING_STATE.


def generate_dialogue(ReformerLM, model_state, start_sentence, vocab_file, vocab_dir, max_len, temperature):
    """
    Args:
        ReformerLM: the Reformer language model you just trained
        model_state (np.array): initial state of the model before decoding
        start_sentence (string): starting sentence of the conversation
        vocab_file (string): vocabulary filename
        vocab_dir (string): directory of the vocabulary file
        max_len (int): maximum number of tokens to generate
        temperature (float): parameter for sampling ranging from 0.0 to 1.0.
            0.0: same as argmax, always pick the most probable token
            1.0: sampling from the distribution (can sometimes say random things)

    Returns:
        None: prints the generated dialogue
    """

    # define the delimiters we used during training
    delimiter_1 = 'Person 1: '
    delimiter_2 = 'Person 2: '

    # initialize detokenized output
    sentence = ''

    # token counter
    counter = 0

    # output tokens. we insert a ': ' for formatting
    result = [tokenize(': ', vocab_file=vocab_file, vocab_dir=vocab_dir)]

    # reset the model state when starting a new dialogue
    ReformerLM.state = model_state

    # call the output generator implemented earlier
    output = ReformerLM_output_gen(ReformerLM, start_sentence, vocab_file=vocab_file, vocab_dir=vocab_dir, temperature=temperature)

    # print the starting sentence
    print(start_sentence.split(delimiter_2)[0].strip())

    # the loop below yields the next tokens until max_len is reached. the if-elif is just for prettifying the output.
    for o in output:

        result.append(o)

        sentence = detokenize(np.concatenate(result, axis=0), vocab_file=vocab_file, vocab_dir=vocab_dir)

        if sentence.endswith(delimiter_1):
            sentence = sentence.split(delimiter_1)[0]
            print(f'{delimiter_2}{sentence}')
            sentence = ''
            result.clear()

        elif sentence.endswith(delimiter_2):
            sentence = sentence.split(delimiter_2)[0]
            print(f'{delimiter_1}{sentence}')
            sentence = ''
            result.clear()

        counter += 1

        if counter > max_len:
            break

We can now feed in different starting sentences and see how the model generates the dialogue. You can even input your own starting sentence. Just remember to ask a question that covers the topics in the Multiwoz dataset so you can generate a meaningful conversation.

sample_sentence = ' Person 1: Are there theatres in town? Person 2: '
generate_dialogue(ReformerLM=model, model_state=STARTING_STATE, start_sentence=sample_sentence, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR, max_len=120, temperature=0.2)

output -

Person 1: Are there theatres in town?
Person 2: : There are 4 theatres in town. Do you have a preference?
Person 1: Not really, can you recommend one and give me the address and postcode?
Person 2: How about the ADC Theatre located at Park Street?
Person 1: That sounds great. Can I get the postcode and phone number?
Person 2: The phone number is 01223300085. The postcode is cb58as.
Person 1: I also need a train to Cambridge on Thursday the week I will be traveling alone.

next,

sample_sentence = ' Person 1: Is there a hospital nearby? Person 2: '
generate_dialogue(ReformerLM=model, model_state=STARTING_STATE, start_sentence=sample_sentence, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR, max_len=120, temperature=0.2)

output -

Person 1: Is there a hospital nearby?
Person 2: : Addensbrookes Hospital is located at Hills Rd, Cambridge, postcode CB20QQ. Do you need the phone number?
Person 1: No, but I do need the main phone number, please.
Person 2: The phone number is 01223245151.
Person 1: Thank you for your help.
Person 2: You're welcome. Have a nice day.
Person 1: Thank you for your help.
Person 1: You're welcome 43, Fensounds good!

and last,

sample_sentence = ' Person 1: Can you book a taxi? Person 2: '
generate_dialogue(ReformerLM=model, model_state=STARTING_STATE, start_sentence=sample_sentence, vocab_file=VOCAB_FILE, vocab_dir=VOCAB_DIR, max_len=120, temperature=0.2)

output -

Person 1: Can you book a taxi?
Person 2: : I sure can. Where would you like to be picked up?
Person 1: I'm going to be picked up from aylesbray lodge guest house.
Person 2: I'd be happy to help you. What time would you like to arrive?
Person 1: I need to leave after 11:00.
Person 2: Booking completed! Booked car type : grey ford
Contact number : 07262372

Person 1: I'm looking for a train to Cambridge on Saturday.

Congratulations, you have just created an automated chatbot! I hope you have enjoyed the journey.

This story became lengthy, but dividing it further would not have been useful since all the sections are closely related to each other.
