Retelling tales as they are!

Sashank Kumar
16 min read · May 1, 2019


Often we come across someone who has been through a situation that was life-changing, or substantial enough to leave a mark. They narrate their tales in TEDx talks and guest lecture series around the world. They share because they do not want us to end up in a similar situation. And more often than not, these speakers hold the audience engrossed, with endless curiosity.

Why? We asked. The obvious answer was that the audience feels a connection listening to someone who has actually been through it, rather than to a narrator or a speaker who is totally unrelated. With the advancement of technology, there might come a day when mankind can recreate virtual avatars of influential people for the next generations to talk to and interact with. But such avatars will be far more effective when they can actually talk just like the original person would.

Speeches and presentations in person. With technology racing ahead, a thing of the past?

On this note, we attempt to solve an end-to-end, open-domain dialogue system problem. Our goal is to introduce a new approach that generates a response for a multi-turn context and then uses it to select the best matching response from a set of candidate utterances. Though similar to traditional response-ranking objectives, our motivation differs in that the model will be quoting a person, and hence has to be exact and loyal to the speaker’s responses.

Who’s mapping whom? What if they can start talking like you and me?

In such a scenario, generating a response based on the input context won’t be helpful on its own; we would rather rank the candidate responses that were gathered from the speaker earlier. We believe that using the generated response for ranking should substantially enhance the quality of the ranking.

Ultimately, our goal comes down to designing a domain-specific, end-to-end dialogue system for the task of selecting the next best response. The same approach can be used to tackle many other problems in this line: chatbots that interact better with a given context, automated voice-call assistants, health-support devices for disabled people, and so on.

Applications of end-to-end dialog in everyday use

Data Set

To achieve the problem statement outlined above, we seek a large data set for research in dialogue systems that offers —

  • Two-way conversations, as opposed to multi-participant, preferably human-human, keeping in mind the original application that drove our research objective
  • A large number of conversations, typically in the order of 1 million
  • Several turn exchanges in the conversation, between speakers (ideally more than 3)
  • Task-specific domain, as opposed to chatbot conversations, which do not carry a lot of context for the model to learn from

We use the Ubuntu Dialogue Corpus for our study. The data set consists of one million multi-turn dialogues, where conversations average 8 turns each, with a minimum of 3 turns. It has both the multi-turn property of the conversations in the Dialog State Tracking Challenge data sets and the unstructured nature of interactions from microblog services such as Twitter.

A typical training example in our dataset would look like this:
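(The sample below is invented purely for illustration; the __eou__ and __eot__ markers are the end-of-utterance and end-of-turn tokens used in the released corpus.)

```
Context:  anyone know how to mount a usb drive ? __eou__ __eot__
          try sudo mount /dev/sdb1 /mnt __eou__ __eot__
          it says the mount point does not exist __eou__ __eot__
Response: you need to create it first with sudo mkdir /mnt __eou__
Label:    1
```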

The transformer then outputs a generated response, which is used to select from a list of candidate responses.

But this dataset poses a few challenges, which we list below -

  • The conversations in this data set are often about technical issues, so jargon and unknown words are frequent. Moreover, domain knowledge is required in order to respond to them.
  • As on many other forums and blogs, the conversations in this dataset are casual, do not follow grammatical rules, and contain spelling errors.

Problem Formulation

  • Input: a context C and a set of fixed responses R = { r1, …, rn }. The context consists of multiple people conversing, with at least 3 turn changes.
  • Task: Select the most appropriate candidate response as an output.

This is illustrated further in the figure below.

Fig : Illustration of our problem definition

We feed the context from the dataset to our model, which generates a response. We then use this generated response to rank and select from a list of candidate responses. The Ubuntu Dialogue Corpus is tailor-made for such an application. Each training sample pairs a context with the next response in the conversation, with a label of 0 or 1 indicating whether that response is actually relevant. We use only the samples whose response is in line with the context (label 1). The model we chose for this specific application is the Transformer.
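As a minimal sketch of this filtering step (assuming the train.csv layout with Context, Utterance and Label columns produced by the public corpus-generation scripts; the file path is a placeholder):

```python
import pandas as pd

# Load the training split; column names follow the publicly released
# Ubuntu Dialogue Corpus CSVs (Context, Utterance, Label).
train = pd.read_csv("train.csv")

# Keep only the pairs whose response actually follows the context (Label == 1);
# these positive pairs are what the generator (the Transformer) is trained on.
positives = train[train["Label"] == 1][["Context", "Utterance"]]

print(f"{len(positives)} context-response pairs retained for training")
```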

Now that we have a generated response and a set of candidate responses, we need a method to evaluate the similarity between them and pick the closest response. Ideally, we would use another transformer for this purpose, making the system completely end-to-end and trainable on our dataset, but owing to time and computational constraints, we decided to use pre-trained embeddings that we hoped would serve the purpose.

Transformer

(Just attention is all you need!)

The Transformer is an architecture designed to tackle the problem of sequence transduction. Sequence transduction: what are we looking at?

A simple visualization of different routines possible in a transduction architecture is shown here.

Ref: https://www.clsp.jhu.edu/workshops/18-workshop/grounded-sequence-sequence-transduction/

It is the task of transforming input sequences into output sequences using some mapping. Speech recognition, text-to-speech, and machine translation are a few fields where this problem arises.

Let’s discuss the salient features of this architecture and what makes the design unique and useful. Until recently, complex RNN- and CNN-based encoder-decoder schemes dominated transduction models, heavily influencing research approaches in language modeling. But despite their widespread usage, they had one inherent problem: parallelization.

Recurrent models, owing to their sequential nature (computation tied to the position of each symbol in the input and output), make parallelization difficult. Architectures such as WaveNet, ByteNet and ConvS2S side-step sequential computation to a certain extent, but the number of operations needed to relate two arbitrary input or output positions still grows with the distance between them (linearly for ConvS2S and logarithmically for ByteNet).

In short, previous models:

  • are hard to parallelize efficiently
  • require backpropagation through the entire sequence
  • squeeze local and global information through one bottleneck (the hidden state)

These problems, set against the ever-increasing computational speed and efficiency of the machines around us, pushed for exactly the kind of architecture that the Transformer offers through attention.

What does attention solve?

It eliminates the bottleneck problem of an encoder-decoder model and provides context for each decoding step. In addition, the Transformer reduces the number of sequential operations needed to relate two symbols from the input/output sequences to a constant number, using multi-head attention. The novelty of this approach lies in eliminating recurrence and replacing it with attention to handle the dependencies between input and output.

Transformer — Architecture:

The Transformer retains the two major components of previous architectures in its network: an Encoder and a Decoder. The encoder maps the input sequence to a sequence of continuous representations z, which the decoder then uses to generate the output, one symbol at a time. Note that, unlike classic RNN encoder-decoders, which squeeze the entire source sentence and its meaning into a single fixed-size vector (a “sentence embedding”), the Transformer’s encoder keeps one representation per input position.

The decoder is stacked directly on top of the encoder. The encoder and decoder are each built from a stack of identical layers. Each encoder layer contains two sub-layers, namely multi-head self-attention and a position-wise feed-forward network, while each decoder layer contains three: masked multi-head self-attention, encoder-decoder attention, and a position-wise feed-forward network.

Fig 1: Architecture of the transformer

The specific architecture outlined in the original paper calls for 6 such layers in the encoder stack and 6 in the decoder stack.
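For a sense of scale, the base configuration (6 encoder layers, 6 decoder layers, model width 512, 8 heads, feed-forward width 2048) can be sketched with PyTorch’s built-in module; this is only an illustration of the hyper-parameters, not the exact training code used here:

```python
import torch
import torch.nn as nn

# Hyper-parameters of the "base" Transformer from Vaswani et al. (2017).
model = nn.Transformer(
    d_model=512,            # embedding / hidden size
    nhead=8,                # attention heads per layer
    num_encoder_layers=6,   # encoder stack depth
    num_decoder_layers=6,   # decoder stack depth
    dim_feedforward=2048,   # inner size of the position-wise FFN
    dropout=0.1,
    batch_first=True,
)

# Dummy (already embedded) source and target sequences: batch of 2,
# source length 10, target length 7, each position a 512-d vector.
src = torch.randn(2, 10, 512)
tgt = torch.randn(2, 7, 512)
out = model(src, tgt)        # -> shape (2, 7, 512)
print(out.shape)
```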

Fig 2: Self-attention

An encoder layer receives a list of vectors as input. This list of vectors is passed through the self-attention layer and then to a feed-forward neural network. The output is then passed upwards to the next encoder layer.

Within the self-attention layer :

  • For every input vector in the list, we create three vectors, namely a Query, a Key and a Value vector.
  • These vectors are created by multiplying the embedding by three weight matrices that are learned during training.
  • Calculate the dot product of a particular word’s query vector with every key vector of the input words, then scale and apply a softmax.
  • Multiply each value vector by its softmax score and sum them up. This produces the output of the self-attention layer for that particular word.
  • The output of this self-attention is then fed to a feed-forward NN.

In figure 2, we can see the operation where a matrix of Query vectors is multiplied by a matrix of Key vectors; this multiplication computes the dot products discussed above. We apply a softmax to obtain weights and compute, in effect, a weighted sum of the value vectors.
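This matrix form is compact enough to write out; here is a minimal NumPy sketch of scaled dot-product attention (Q, K and V are assumed to be the already-projected query, key and value matrices):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # dot products, scaled by sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of value vectors

# Toy example: 4 "words", key/query size 8, value size 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)        # (4, 8)
```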

Does the Transformer consider the position at which a particular word occurs?

Yes, the Transformer adds a Positional Encoding to account for the relative position of words in the input sequence. Since the model processes all positions of the input and output in parallel, without it there would be no notion of the order and place of words in the sequence. Adding the Positional Encoding to the embeddings provides meaningful distances between the embedding vectors once they are projected into (Q, K, V) vectors and used in dot-product attention.

Positional Encoding

To inject knowledge about positions, the authors use a simple sinusoidal (sine- and cosine-based) encoding method. The idea is to give each position a vector of values between -1 and +1, so each positional vector behaves like a giant array of switches which are either off, on, or somewhere in the middle.

Think of it in the following way: we have a large number of sinusoids of varying wavelengths/frequencies, and for each position we read off the column corresponding to that index.

Ref: https://medium.com/datadriveninvestor/lets-build-attention-is-all-you-need-1-2-de377cebe22
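The exact formula from the original paper, for position pos and dimension index i, is:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$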

The advantage of using this formula is the ability to scale to unseen sequence lengths, longer than any of those in our training set.

Multi-head attention:

This extends the self-attention layer: instead of a single representation subspace, we have several. For each word embedding, we have a different set of query, key and value vectors per head. For each set of vectors (Q, K, V), we compute a weighted value vector representing a single attention head. Depending on our choice of the number of heads, we concatenate all the representations and project them onto another subspace before feeding the result to the feed-forward NN.

From each attention head we obtain a latent representation Zi. We concatenate all of these representations and project them with an output weight matrix W to obtain the output of the multi-head attention layer.
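PyTorch packages this entire operation as one module; a minimal self-attention sketch with 8 heads over 512-dimensional embeddings (matching the base configuration above) looks like:

```python
import torch
import torch.nn as nn

# 8 attention heads over 512-d embeddings; internally each head works in a
# 512 / 8 = 64-dimensional subspace, and the heads are concatenated and
# projected back to 512 dimensions by an output weight matrix.
mha = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

x = torch.randn(2, 10, 512)           # batch of 2 sequences, 10 tokens each
out, attn_weights = mha(x, x, x)      # self-attention: queries = keys = values = x
print(out.shape, attn_weights.shape)  # (2, 10, 512) and (2, 10, 10)
```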

Fig : Multi-head attention

Decoder:

The decoder also has 6 layers, each containing a (masked) self-attention layer, an encoder-decoder attention layer, and a feed-forward NN.

Fig : Transformer Architecture, decoder (right)

In the decoder, the self-attention layer is only allowed to attend to earlier positions in the output sequence. This is done by masking future positions (setting them to -inf) before the softmax step in the self-attention calculation. The “Encoder-Decoder Attention” layer works just like multi-headed self-attention, except that it creates its Queries matrix from the layer below it and takes the Keys and Values matrices from the output of the encoder stack.
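The mask itself is just an upper-triangular matrix of -inf values added to the attention scores before the softmax; a minimal sketch:

```python
import torch

def causal_mask(size: int) -> torch.Tensor:
    # Positions above the diagonal (i.e. future tokens) are set to -inf so that,
    # after the softmax, they receive zero attention weight.
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

print(causal_mask(4))
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
```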

In the figure, the block on the right shows the Masked Multi-Head attention layer, which captures representations of the output symbols, followed by the encoder-decoder attention layer, which draws its Query vectors from the output of the Masked Multi-Head attention layer and its Key and Value vectors from the encoder stack.

The transformer thus generates a response based on the context it has learned from the input data. We now move on to the next stage of our problem: formulating an approach to select the best response from the available set of candidates.

There have been many approaches to this problem in the past. Some use an explicit transformer-based model architecture that learns from the dataset and directly produces a ranking. Others suggest using embeddings, ranging from word level to sentence level, such as ELMo or BERT. Another variant of this approach is to fine-tune pre-trained BERT sentence embeddings on our dataset.

But as discussed earlier, owing to the constraints on time and computational resources at the point of writing this blog, we restrict ourselves to a pre-trained BERT model to generate sentence embeddings.

Sentence Similarity

After training the attention network with around 300k context-response pairs, we move ahead to evaluate our model. The whole idea of generating a context-based response is to select effectively from the list of available responses. The Ubuntu data set has a well-defined evaluation set, which suits our purpose like bread and butter.

Every sample in the test set has a context c’, an actual ground-truth response g’ and randomly sampled distractors d1’, d2’, d3’, …, d8’. We use the context-generated response from the generator to select from this set of available responses, and thereby evaluate performance.

But how do we compare two sentences? They need not match word for word to convey the same meaning, nor do they have to contain the exact same set of words. Such is the brilliance, or perhaps the complexity, of English that there are many ways to convey what we feel.

In fact, I could’ve written this entire blog in a completely different tense, although that’s a thing for another night!

How do we take all these possibilities into account and still pick the right response? Fortunately for us, we didn’t have to look farther than BERT (Bidirectional Encoder Representations from Transformers). BERT is conceptually simple and empirically powerful.

BERT input representation. A cumulative representation using Token Embeddings, Segment Embeddings and Position Embeddings

BERT’s model architecture is a multi-layer bidirectional Transformer encoder based on the original implementation described in Vaswani et al. Unlike traditional left-to-right or right-to-left models, BERT is pre-trained bidirectionally. This makes sense: intuitively, a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model.

Now that we have a little insight into how BERT works, let’s move on to the issue at hand. Once we have a generated response from the transformer, we create BERT embeddings for each utterance in the candidate set and compute a similarity score for every generated-candidate utterance pair. The candidate utterance with the maximum similarity score is then selected as the output.

Similarity Measure: there are multiple measures that can be used to check the similarity between two non-zero vectors. Owing to its widespread usage and simplicity, we use cosine similarity. It measures the cosine of the angle between two non-zero vectors of an inner product space; in other words, it looks at the orientation of the two vectors in space, not their magnitude.
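Explicitly, for two vectors A and B:

$$\text{similarity}(A, B) = \cos\theta = \frac{A \cdot B}{\lVert A \rVert\,\lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\ \sqrt{\sum_{i=1}^{n} B_i^2}}$$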

Cosine similarity measure for two non-zero vectors, A & B

One advantage of cosine similarity is its low complexity, especially for sparse vectors: only the non-zero dimensions need to be considered.

Fig 5 : BERT Embeddings and Cosine Similarity to find the right candidate response
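As a rough sketch of this selection step (using the Hugging Face transformers library and mean-pooled last-layer BERT states as sentence vectors; our actual pipeline may differ in the BERT wrapper, pooling strategy and model checkpoint):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentences):
    """Mean-pool BERT's last hidden layer into one vector per sentence."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state         # (batch, seq, 768)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding tokens
    return (hidden * mask).sum(1) / mask.sum(1)

def pick_response(generated, candidates):
    """Return the candidate whose embedding is closest (cosine) to the generated response."""
    vecs = embed([generated] + candidates)
    sims = torch.nn.functional.cosine_similarity(vecs[0:1], vecs[1:])
    return candidates[int(sims.argmax())]

# Hypothetical usage with made-up utterances:
print(pick_response("try reinstalling the driver",
                    ["what version of ubuntu are you on ?",
                     "reinstall the nvidia driver and reboot",
                     "thanks , that worked !"]))
```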

Previous approaches

With recent developments in neural architectures, more and more approaches have been proposed to exploit deep learning for end-to-end dialogue systems.

For data gathered from Twitter, Ritter et al. (2011) proposed a response generation model built on phrase-based statistical machine translation, which was superior to the previous retrieval-based approaches. They were among the first to treat response generation as a machine translation task.

Lowe et al. (2015) presented a deep learning model (the Dual Encoder) for the task of choosing the best next response on the Ubuntu Dialogue Corpus, a large dataset for research in unstructured multi-turn dialogue systems. Since we propose a new neural architecture for the same task, we use this model as our baseline in this project.

Another recent breakthrough in the field of sentence embedding is BERT (Devlin et al., 2018). With the aid of Transformers (Vaswani et al., 2017), it has achieved strong results across many natural language processing tasks. We use both the Transformer and BERT underneath our deep neural architecture.

One of the latest models for the task of choosing the next best response was introduced by Zhou et al. (2018); it matches a response with its multi-turn context using dependency information retrieved via attention. Olabiyi et al. (2018) suggested an adversarial learning model for response generation; their model generates longer and more informative responses using generative adversarial networks.

Unlike all the previous approaches, we are interested in examining what happens if we generate an utterance close to an acceptable response and then try to choose the next best response based on the generated one. The generated utterance may or may not be an appropriate response itself; what we care about is that it is close enough to the acceptable response among the available options.

Fig : Overall System Pipeline

Results

Training

We trained the transformer on 300k context-response pairs from the Ubuntu Dialogue Corpus and kept 100k context-response pairs for evaluation. Each epoch of the training phase took around 10 hours to run.

The figure below shows the training and evaluation error during the training of the transformer:

The performance of the model on a typical test sample looks like this.

Evaluations

We evaluated the first phase of the pipeline separately, since the performance of the whole system is heavily dependent on the transformer. Then we evaluated the whole pipeline.

To evaluate the results of the transformer (the generator), we used the BLEU score, measuring the unigram overlap between the generated response and the reference response. The BLEU score of the transformer after 9 epochs of training reached 0.07.
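As an illustration of how such a unigram-level BLEU score can be computed with NLTK (our evaluation script may differ in tokenization and smoothing; the sentences below are made up):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "try reinstalling the nvidia driver".split()
generated = "i would reinstall the driver".split()

# weights=(1, 0, 0, 0) restricts BLEU to unigram overlap (BLEU-1).
score = sentence_bleu([reference], generated,
                      weights=(1, 0, 0, 0),
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU-1: {score:.3f}")
```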

Recall@k in this application is the percentage of the times that the expected response is retrieved in the top-k responses. Recall@1 is equivalent to accuracy.
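Concretely, a minimal sketch of the metric (the rankings and ground-truth indices below are hypothetical; in our pipeline the ranking comes from the cosine-similarity scores):

```python
def recall_at_k(ranked_lists, truth_indices, k):
    """Fraction of test samples whose ground-truth response appears in the top-k.

    ranked_lists: for each sample, candidate indices sorted by decreasing score.
    truth_indices: index of the ground-truth candidate for each sample.
    """
    hits = sum(truth in ranking[:k]
               for ranking, truth in zip(ranked_lists, truth_indices))
    return hits / len(ranked_lists)

# Hypothetical example: 3 samples, ground truth always at index 0.
rankings = [[0, 4, 2], [3, 0, 7], [5, 6, 1]]   # top-3 candidates per sample
print(recall_at_k(rankings, [0, 0, 0], k=1))   # 1/3 of samples correct at k=1
print(recall_at_k(rankings, [0, 0, 0], k=2))   # 2/3 at k=2
```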

In addition to being a good indicator of pipeline performance, Recall@k has been used in most previous work for evaluation. To be consistent with that work, we evaluate the overall performance of the system with it. The table below shows the results of our proposed pipeline and a baseline system on 5k contexts.

Discussion and Future Directions

The weak performance of the pipeline is mainly rooted in the inappropriate responses produced by the generator. Therefore any improvement in the generator will have a great impact on the system’s performance.

Among the responses generated by the transformer, some phrases were particularly frequent, such as “I don’t know”, “I am not sure” and “There is a bug”. One reason for such results is the jargon frequently used in the dataset; these words are all treated as unknown words, leading to loss of information and poor results. Note that the conversations in the Ubuntu corpus are casual and contain many grammatical and spelling mistakes, which may be another factor behind such responses. We plan to use a dataset containing more formal conversations in a general domain to eliminate such effects.

Due to limited resources, the transformer was trained for only 10 epochs. Perhaps increasing the training time would lead to better results. We should also consider the possibility that generating a relevant response in a conversation is inherently more challenging than machine translation, especially in conversations that require domain knowledge.

Another modification would be to create an end-to-end system by replacing the second phase of the pipeline with a differentiable layer. In this approach, we connect the last layer of the Transformer to this layer directly, without decoding. That way the model is completely end-to-end and we can train the whole system at once.

This project was done as a part of the course CSCI 599 Deep Learning at the University of Southern California. We extend our warm appreciation to Prof. Joseph Lim and our Teaching Assistant Te-Lin Wu for their constant support and encouragement. We’d also like to thank everyone who supported us on this journey and made it a wholesome experience.

References:

  • Ritter, A., Cherry, C., and Dolan, W. B. (2011). Data-Driven Response Generation in Social Media. EMNLP 2011.
  • Lowe, R., Pow, N., Serban, I., and Pineau, J. (2015). The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems. SIGDIAL 2015.
  • Vaswani, A., et al. (2017). Attention Is All You Need. NeurIPS 2017.
  • Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv:1810.04805.
  • Zhou, X., et al. (2018). Multi-Turn Response Selection for Chatbots with Deep Attention Matching Network. ACL 2018.
  • Olabiyi, O., et al. (2018). Multi-turn Dialogue Response Generation in an Adversarial Learning Framework.