Email Smart Compose: Assist in Sentence Completion

Nitish Sawant
Published in Analytics Vidhya
13 min read · Jun 25, 2020

Table of Contents:

  1. Business Problem
  2. Problem Statement
  3. Sentence Extraction
  4. Data Preparation
  5. Modelling
  6. Model Predictions
  7. Summary
  8. Conclusion
  9. Future Work
  10. References

1. Business Problem

Email has become one of the most popular means of communication around the world. It can be used to share messages and files with a group of people without much hassle. With growing internet penetration, the number of people using email will continue to rise. Improving the user experience by simplifying the writing process remains a top priority for email service providers such as Gmail, Yahoo Mail and Outlook. Sentence completion while composing an email is one task that can be extremely helpful: the suggestions offered while composing can assist the user by cutting back on repetitive writing.

2. Problem Statement

The task is to develop a model that can predict the next words in a sequence given the initial few words. This is essentially a language modelling task: a language model predicts the probability of the next word from a vocabulary of words, so the problem can also be interpreted as a multi-class classification task. Language models can operate at the word level or the character level. Neural networks have become the preferred approach to language modelling; popular architectures include Encoder-Decoder models and Transformer based models.

The dataset for building the email smart compose models is obtained from the Enron Email Dataset, which is available on Kaggle. The dataset contains approximately 500,000 emails generated by the senior management of the Enron Corporation. The data can be downloaded from this link.

The methodology adopted in designing the system can broadly be divided into three steps, namely sentence extraction, data preparation and modelling. For this study, Encoder-Decoder models with and without Attention are developed, both word based and character based. In addition to these models, a pretrained GPT2 (Generative Pretrained Transformer-2) model is fine-tuned on the above dataset. The metric used for evaluation is the BLEU Score. BLEU, or Bilingual Evaluation Understudy, is a metric for evaluating a generated sentence against one or more reference sentences. The BLEU Score varies from 0 to 1: a perfect match results in a score of 1.0, whereas a complete mismatch results in a score of 0.
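
As an aside, a BLEU score can be computed with NLTK's sentence_bleu; the sketch below uses made-up reference and candidate sentences purely for illustration.

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Made-up example: reference (ground truth) and candidate (model prediction) tokens.
reference = [["please", "call", "me", "when", "you", "get", "a", "chance"]]
candidate = ["please", "call", "me", "when", "you", "are", "free"]

# Smoothing avoids a zero score when higher-order n-grams have no overlap.
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU score: {score:.3f}")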

3. Sentence Extraction

A subset of the data is shown in the images below.

import pandas as pd

data = pd.read_csv('emails.csv')
data.head()

The entire email (To, Subject, Body etc.) is contained in the message feature. Let’s look at a message from the data.

We are interested in the text (highlighted in the red box) included in the body, i.e. the content of the email.
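
A minimal sketch of how the body could be pulled out of the raw message using Python's built-in email module (the extract_body helper below is illustrative, not the article's exact code):

import email
import pandas as pd

def extract_body(raw_message):
    # Parse the raw message and return only its payload, i.e. the body text.
    return email.message_from_string(raw_message).get_payload()

data = pd.read_csv('emails.csv')
data['body'] = data['message'].apply(extract_body)
print(data['body'].iloc[0])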

Once the texts are extracted, the sentences are chosen such that they satisfy all the following criteria.

  1. Sentences with between 3 and 21 words are chosen. (Generally, sentences contain fewer than 20 words; additional information at this link.)
  2. Sentences must begin with a capital letter and end with a full stop or question mark. All characters except the first are lowercase.
  3. Sentences must not contain numbers or special characters.
  4. All punctuation in the sentence is removed.

Such strict criteria for choosing the sentences ensure that the chosen sentences contain fewer errors and are grammatically correct. They also ensure that sentences containing names are omitted. Below are samples of the extracted sentences.
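
The exact filtering code is not shown here; a rough equivalent of the above criteria, using simple regular expressions, could look like this:

import re

def is_valid_sentence(sentence):
    # Criterion 1: between 3 and 21 words.
    if not 3 <= len(sentence.split()) <= 21:
        return False
    # Criteria 2 and 3: starts with a capital letter, ends with '.' or '?', and
    # everything in between is lowercase letters, spaces or basic punctuation
    # (no digits or special characters, so sentences with names are dropped too).
    return re.match(r"^[A-Z][a-z ,;:']*[.?]$", sentence) is not None

def strip_punctuation(sentence):
    # Criterion 4: remove all punctuation from a chosen sentence.
    return re.sub(r"[^\w\s]", "", sentence)

print(is_valid_sentence("Thanks so much for your patience."))  # True
print(strip_punctuation("Thanks so much for your patience."))  # Thanks so much for your patience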

After extracting the sentences, the word counts across all the extracted sentences are analyzed. The image below contains a bar graph of word counts, with the words sorted in descending order of count.

From the above graph, it can be seen that there are approximately 33,000 unique words. The higher the vocabulary size, the more sentences are needed to train the model; a larger vocabulary also means more trainable parameters for a word based model. Reducing the vocabulary size therefore helps overcome these shortcomings and keeps sentences that use frequently occurring words. Consequently, sentences that contain words occurring fewer than 120 times are dropped. The vocabulary size obtained after dropping such sentences is 1467.
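
One way this filtering could be implemented (the sentences list is assumed to hold the extracted sentences; the threshold of 120 is taken from the article):

from collections import Counter

# Count word occurrences across all extracted sentences.
word_counts = Counter(word for sentence in sentences for word in sentence.split())

# Keep only sentences in which every word occurs at least 120 times in the corpus.
frequent_sentences = [s for s in sentences
                      if all(word_counts[w] >= 120 for w in s.split())]

vocab = {w for s in frequent_sentences for w in s.split()}
print(len(frequent_sentences), len(vocab))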

After filtering out the sentences with rarely occurring words, the data is split into train and validation.

4. Data Preparation:

In this study, word based and character based models are developed. For training the model, the data is featurized in a specific way which is discussed below.

4.1. Word Featurization:

Consider the sentence ‘’thanks so much for your patience’’. When the model is given ‘’thanks so much’’ as input, it must predict ‘’for your patience’’. When it is given ‘’thanks so much for’’, it must predict ‘’your patience’’. The sentences must therefore be split in such a way that the model can learn this relation. In addition, the model must know the start and end of the sequence, so <start> and <end> tokens are added. The image below shows the featurized data for this sentence.

For modelling, the featurized data must now be converted into sequences of integers: each unique word is assigned a number. To ensure that all sequences are of the same length, zero padding is applied to every sentence. The maximum sequence length for X and y is 23. The image below shows the integer encoded sequences corresponding to the word sequences at index 0 and 1 of the above image.
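
A sketch of this step using Keras preprocessing utilities (X_text and y_text are assumed to be the lists of featurized input and output strings; the maximum length of 23 mirrors the article):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Empty filters keep the <start> and <end> tokens intact.
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts(X_text + y_text)

# Integer encode and zero pad to the maximum sequence length of 23.
X_seq = pad_sequences(tokenizer.texts_to_sequences(X_text), maxlen=23, padding='post')
y_seq = pad_sequences(tokenizer.texts_to_sequences(y_text), maxlen=23, padding='post')
print(X_seq[0])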

Pretrained Global Vectors (GloVe) are used for obtaining the vector representation of words. Each vector has 300 dimensions. GloVe can be downloaded from this link.
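
A common way to build the embedding matrix from the downloaded GloVe file (the file name below assumes the 300-dimensional variant; the tokenizer comes from the previous sketch):

import numpy as np

embedding_dim = 300
embeddings_index = {}
with open('glove.6B.300d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')

# Rows of the matrix line up with the tokenizer's word indices; unknown words stay zero.
vocab_size = len(tokenizer.word_index) + 1
embedding_matrix = np.zeros((vocab_size, embedding_dim))
for word, i in tokenizer.word_index.items():
    vector = embeddings_index.get(word)
    if vector is not None:
        embedding_matrix[i] = vector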

For training the word based models, 40,000 sentences are used, each of which undergoes the above featurization.

4.2. Character Featurization:

Character featurization is similar to word featurization, except that the sentences are broken down into characters instead of words. Here the start and end of the sequence are denoted by ‘’#’’ and ‘’*’’ respectively.

The image below shows the character featurization for the sentence ‘’send the file’’.

The character sequences are encoded into integer sequences. In addition, all sequences are zero padded to a maximum sequence length of 132 for both X and y.

For training the character based models, 5,000 sentences are used, each of which undergoes the above featurization (character based models are slow to train, hence fewer sentences are used).

5. Modelling:

5.1. Encoder-Decoder Model:

The model consists of an encoder, decoder and dense layer.

The Encoder is defined as follows:

  1. The encoder consists of an embedding layer followed by an LSTM layer.
  2. The embedding layer maps each word/character to its vector representation. The input to the embedding layer is the input sequence (X) of shape (batch size, sequence length). The embedding layer outputs a tensor of shape (batch size, sequence length, embedding dimension). The embedding dimension is 300.
  3. The LSTM layer returns the outputs of all timesteps [Shape: (batch size, sequence length, units)] and the hidden states of the last timestep. The hidden states consist of the encoder output [Shape: (batch size, units)] and the cell state [Shape: (batch size, units)] at the last timestep. The average of the encoder outputs over all timesteps, together with the cell state at the last timestep, is passed to the decoder (see the sketch after this list).
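
A minimal TensorFlow/Keras sketch of such an encoder (class and variable names are illustrative, not the article's exact code):

import tensorflow as tf

class Encoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units, embedding_matrix):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(
            vocab_size, embedding_dim,
            embeddings_initializer=tf.keras.initializers.Constant(embedding_matrix))
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)

    def call(self, x):
        x = self.embedding(x)                         # (batch, seq_len, embedding_dim)
        outputs, state_h, state_c = self.lstm(x)      # outputs: (batch, seq_len, units)
        # As described above: pass on the average of the outputs over all timesteps
        # together with the cell state at the last timestep.
        avg_output = tf.reduce_mean(outputs, axis=1)  # (batch, units)
        return avg_output, state_c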

The Decoder is defined as follows:

  1. The decoder consists of an embedding layer, an LSTM layer and a batch normalization layer.
  2. During training, the input to the decoder is the output sequence (y) [Shape: (batch size, sequence length)].
  3. The embedding layer outputs a tensor of shape (batch size, sequence length, embedding dimension).
  4. The LSTM layer has two inputs, namely the outputs from the embedding layer and the hidden states. In the encoder LSTM layer, the hidden states for the first timestep were initialized to zeros; in the decoder LSTM layer, the hidden state for the first timestep is set to the outputs from the encoder.
  5. The LSTM layer output for all timesteps [Shape: (batch size, sequence length, units)] passes through a batch normalization layer.
  6. The decoder returns the output from the batch normalization layer and the hidden states of the LSTM layer (see the sketch after this list).
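
A matching decoder sketch (again illustrative; it assumes the Encoder above supplies the initial states):

class Decoder(tf.keras.Model):
    def __init__(self, vocab_size, embedding_dim, units):
        super().__init__()
        self.embedding = tf.keras.layers.Embedding(vocab_size, embedding_dim)
        self.lstm = tf.keras.layers.LSTM(units, return_sequences=True, return_state=True)
        self.batch_norm = tf.keras.layers.BatchNormalization()

    def call(self, y, initial_state):
        # For the first timestep, initial_state is [encoder average output, encoder cell state].
        y = self.embedding(y)                                   # (batch, seq_len, embedding_dim)
        outputs, state_h, state_c = self.lstm(y, initial_state=initial_state)
        outputs = self.batch_norm(outputs)                      # (batch, seq_len, units)
        return outputs, state_h, state_c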

The Encoder-Decoder Model for training is defined as follows:

The output from the decoder passes through a dropout layer followed by a dense layer whose number of units equals the output (y) vocabulary size. The dense layer outputs a tensor of shape (batch size, sequence length, vocab size).

While training, the model receives the ground truth output from the previous timestep as input; this method is known as teacher forcing. The model takes two inputs, namely X and y (all tokens except the last token), and its target output is y from the 2nd token onwards. Thus the decoder generates predictions from the 2nd token onwards (the start token is never predicted by the model).

Categorical Cross Entropy is used as the loss function for the model.

The zeros in the target tensor are masked while computing the loss. This ensures that tokens encoded as 0, which denote padding, are not included in the loss computation.
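
Since the targets are integer encoded, a sparse variant of categorical cross entropy with the padding mask could be written as follows (assuming the dense layer ends in a softmax):

import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(reduction='none')

def masked_loss(y_true, y_pred):
    # Padding positions are encoded as 0; exclude them from the loss.
    mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
    loss = loss_object(y_true, y_pred) * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)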

The image below shows the encoder decoder model (considering word based embeddings).

During inference, the input sequence is fed to the encoder and the encoder outputs are passed to the decoder. The decoder takes the start token as input and predicts a word/character; this prediction is then fed back as the decoder input for the next timestep.
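
A greedy decoding loop along these lines could look as follows (tokenizer, pad_sequences, encoder, decoder and a dense_layer with softmax output are assumed from the earlier sketches):

def predict_completion(prefix, max_len=23):
    # Integer encode and zero pad the input prefix.
    seq = tokenizer.texts_to_sequences(['<start> ' + prefix + ' <end>'])
    seq = pad_sequences(seq, maxlen=max_len, padding='post')
    enc_avg, enc_cell = encoder(seq)

    # Start from the <start> token and feed each prediction back in.
    dec_input = tf.constant([[tokenizer.word_index['<start>']]])
    state = [enc_avg, enc_cell]
    predicted_words = []
    for _ in range(max_len):
        dec_out, state_h, state_c = decoder(dec_input, state)
        probs = dense_layer(dec_out)                   # (1, 1, vocab_size)
        predicted_id = int(tf.argmax(probs[0, -1]))
        word = tokenizer.index_word.get(predicted_id, '')
        if word == '<end>':
            break
        predicted_words.append(word)
        dec_input = tf.constant([[predicted_id]])
        state = [state_h, state_c]
    return ' '.join(predicted_words)

print(predict_completion('thanks so much'))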

5.2. Encoder Decoder Model with Bahdanau Attention:

The model consists of an encoder, an attention mechanism, a one step decoder and the decoder (which repeats the one step decoder). The attention mechanism is housed inside the one step decoder.

The Encoder is defined as follows:

  1. The encoder consists of an embedding layer and an LSTM layer.
  2. The input to the embedding layer is the input sequence (X) of shape (batch size, sequence length). The embedding layer outputs a tensor of shape (batch size, sequence length, embedding dimension).
  3. The LSTM layer returns the outputs at all timesteps [Shape: (batch size, sequence length, units)] and the hidden states at the last timestep [Shape: (batch size, units)]. The encoder returns these outputs.

The Bahdanau Attention is defined as follows:

  1. The inputs to the attention mechanism are the encoder outputs of all timesteps (hs) and the decoder hidden state from the previous timestep (ht) [Shape: (batch size, units)].
  2. Scores for each timestep of the decoder are computed in the following manner.
  • ht is reshaped to include the time axis. The updated shape is (batch size, 1, units).
  • Scores are obtained as score = V · tanh(W₁·hs + W₂·ht), where W₁, W₂ and V are weights learned by the model. The shape of the scores tensor is (batch size, sequence length, 1).
  • The scores are softmaxed along the sequence length axis to get the attention weights [Shape: (batch size, sequence length, 1)]. The attention weights quantify the weight given to each token of the input sequence when predicting the decoder output for that timestep.

3. The attention weights tensor is multiplied by hs, followed by summation along the sequence length axis, to obtain the context vector [Shape: (batch size, units)]. A sketch of such an attention layer is given after this list.
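
A sketch of this attention layer (standard Bahdanau-style attention, as in the TensorFlow tutorials; names are illustrative):

import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)
        self.W2 = tf.keras.layers.Dense(units)
        self.V = tf.keras.layers.Dense(1)

    def call(self, ht, hs):
        # ht: decoder hidden state (batch, units); hs: encoder outputs (batch, seq_len, units).
        ht = tf.expand_dims(ht, 1)                               # (batch, 1, units)
        score = self.V(tf.nn.tanh(self.W1(hs) + self.W2(ht)))    # (batch, seq_len, 1)
        attention_weights = tf.nn.softmax(score, axis=1)         # (batch, seq_len, 1)
        context_vector = tf.reduce_sum(attention_weights * hs, axis=1)  # (batch, units)
        return context_vector, attention_weights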

The one step decoder is defined as follows:

  1. It consists of the attention mechanism, an embedding layer, an LSTM layer and a dense layer.
  2. The attention mechanism computes the context vector from the encoder outputs at all timesteps and the decoder hidden state from the previous timestep.
  3. The tensor obtained by passing the decoder input through the embedding layer is concatenated with the context vector, and this concatenation serves as the input to the LSTM layer.
  4. The initial states of the LSTM layer are the decoder hidden states from the previous timestep.
  5. The output of the LSTM layer is passed to the dense layer, which predicts the word/character for that timestep.

The decoder is defined as follows:

  1. The decoder runs the one step decoder for each timestep of decoder input (after the start token).
  2. For the first timestep the hidden states of decoder are initialized with the hidden states of last timestep of encoder.

The Encoder Decoder Model with Attention is defined as follows:

Similar to the plain encoder decoder model, this model takes two inputs, namely X and y (all tokens except the last token), and its target output is y from the 2nd token onwards. Thus the decoder generates predictions from the 2nd token onwards (the start token is never predicted by the model).

The same loss function as in the encoder decoder models is used in this case.

The image below shows the encoder decoder model with the attention mechanism (considering word based embeddings).

During inference, the input sequence is fed to the encoder, and the encoder hidden states are passed to the one step decoder for the first timestep. For the first timestep, the decoder takes the start token as input and predicts a word/character. The context vector is computed by the attention mechanism from the decoder hidden state of the previous timestep and all the encoder outputs. The context vector, concatenated with the decoder output of the previous timestep, is used to obtain the decoder output for the current timestep.

5.3. GPT2 Model:

The GPT2 model is a text generation model based on the transformer architecture, released by OpenAI. OpenAI has released three variants of GPT2, namely small (124 million parameters), medium (355 million parameters) and large (774 million parameters). For this study, the small variant of GPT2 is used.

GPT2 uses transformer decoder blocks; GPT2 small consists of 12 decoder blocks. Each decoder block consists of a masked self-attention layer and a feed forward neural network. Both layers have a residual skip connection around them and are followed by layer normalization. In masked self-attention, only the tokens to the left of the current position are considered while computing the attention weights. More details about the inner workings of a decoder block can be found here.

The input and output tokens of the GPT2 model are byte pair encodings, which sit between word based and character based encodings. Byte pair encodings compress the text into the shortest combination of bytes, including case/formatting.

During evaluation, GPT2 outputs one token at a time (only one path is active). Each layer of GPT2 retains its own interpretation of the previous tokens and uses it in processing the current token. Once a token is produced, it is added to the sequence of inputs used to generate the output for the next timestep.

The GPT2 model is fine-tuned on the Enron dataset using the gpt-2-simple library developed by Max Woolf. The input data for fine tuning the model is a csv file of sentences, where each sentence is enclosed within start and end tokens; in GPT2 these are denoted by <|startoftext|> and <|endoftext|> respectively. A sketch of this fine tuning workflow is shown below.
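
The gpt-2-simple workflow roughly follows the pattern below (the file name and number of steps are illustrative, not the article's exact settings):

import gpt_2_simple as gpt2

# Download the pretrained small (124M) model and fine-tune it on the sentence file.
gpt2.download_gpt2(model_name='124M')
sess = gpt2.start_tf_sess()
gpt2.finetune(sess, 'enron_sentences.csv', model_name='124M', steps=1000)

# Generate a completion for a given prefix, truncating at the end-of-text token.
gpt2.generate(sess, prefix='<|startoftext|> please call me',
              length=30, truncate='<|endoftext|>')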

6. Model Predictions:

6.1. Encoder Decoder Models:

The image below shows a sample of predictions on the validation data using word based encoding.

The image below shows a sample of predictions on the validation data using character based encoding.

6.2. Encoder Decoder Models with Bahdanau Attention:

The image below shows a sample of predictions on the validation data using word based encoding.

The image below shows a sample of predictions on the validation data using character based encoding.

6.3. GPT2 Model:

The image below shows a sample of predictions on the validation data.

7. Summary:

The loss vs. number of epochs for the different models is shown in the image below.

The table below summarizes the evaluation metrics for all the models. For the GPT2 model, the BLEU Score is computed on 100 random samples; for the remaining models, it is computed on 1000 random samples.

8. Conclusion:

  1. Encoder decoder models without attention perform much better than those with the attention mechanism. The performance metrics on the validation data are superior for the non-attention models, and the quality of their predictions is significantly better. The additional context vector seems to add no value in helping the model predict the next word.
  2. The training and validation loss curves are closer to each other for the character based models.
  3. The GPT2 model is designed for generating longer sentences and/or paragraphs. Even though its BLEU Score on the validation data is lower than that of the other models, its predicted sentences are more grammatically correct than those of the attention based models.
  4. The performance metrics of the word based and character based encoder decoder models are quite similar. The quality of predictions on the validation data by the word based model is slightly richer than that of the character based model (the character based models are trained on less data owing to computational limitations). Thus the word based encoder decoder model is chosen for deployment in production.

9. Future Work:

The possible improvements that can be made to the solution are as follows:

  1. Increasing the size of training data and using better computational resources (All the models in this study were trained using Google Colaboratory).
  2. Using pretrained BERT embeddings instead of GloVe embeddings for word based models.
  3. Modifying the architecture of encoder decoder models by incorporating multi-layer LSTM and skip connections.
  4. Using Luong based attention for encoder decoder models.

10. References:

  1. https://arxiv.org/pdf/1906.00080.pdf
  2. https://www.kaggle.com/zichen/explore-enron
  3. https://blog.floydhub.com/attention-mechanism/
  4. https://medium.com/@jubergandharv/email-smart-compose-real-time-assisted-writing-b232191d0681
  5. http://jalammar.github.io/illustrated-gpt2/
  6. https://minimaxir.com/2019/09/howto-gpt2/
  7. https://www.appliedaicourse.com/

The application, developed using Streamlit, is deployed on AWS. The video below shows a few predictions made by the model.

To go to Streamlit app, click here.

The full code written in Jupyter Notebook is available here. Any feedback/criticism regarding the article can be submitted to menitishsawant@gmail.com or on LinkedIn.
