Quora Smart Compose of Sincere Questions

Shivam Akhauri
Published in ETHER Labs · 7 min read · Dec 24, 2018

Aim: To develop a predictive feature, called Smart Compose, which tries to understand typed text so that AI can suggest words and phrases to finish your Quora questions/answers with sincere words. The words recommended by the AI discourage Quora users from adopting a non-neutral or exaggerated tone to underscore a point about a group of people, or from making rhetorical, disparaging, discriminatory or inflammatory statements on the Quora platform.

What you will learn in this post:

  • Transfer learning implementation in NLP tasks: Transfer learning is a machine learning method where a deep learning model developed for one task is reused as the starting point for a model on a second task. In computer vision it has been in practice for some time now, with people using models trained to learn features from the huge ImageNet dataset and then training them further on smaller datasets for different tasks. Recent advances in NLP show that transfer learning leads to better accuracy, faster convergence and lower training-data requirements than a conventional word-embeddings approach.
  • Bare PyTorch implementation of the pipeline without the use of TorchText for data pre-processing and batch processing: This allows complete control of the pipeline and flexibility in building the language model. TorchText is still under active development and has some issues that you might need to solve yourself.
  • Language model implementation: The implementation can be extended to generate encoded representations of the Quora ‘sincere text’ dataset (general embeddings of the sincere-text dataset) and determine whether a new question/statement made on the Quora portal is insincere. This can be achieved by computing the cosine distance between the new user input (question/statement) and the general encoded representations, which can be obtained from my language model described below; a minimal sketch of this check follows the list.
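As a minimal sketch of that check, assuming a sentence embedding extracted from the LM’s hidden state and a matrix of reference embeddings built from sincere questions (all names and the threshold here are illustrative, not part of the original code):

```python
import torch
import torch.nn.functional as F

def looks_sincere(question_emb, sincere_embs, threshold=0.4):
    """Flag a new question as sincere if it is close enough to the sincere reference embeddings."""
    # cosine similarity between the new question and every reference embedding
    sims = F.cosine_similarity(question_emb.unsqueeze(0), sincere_embs, dim=1)
    return sims.max().item() >= threshold
```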

What is a language model and what is my dataset:

  • An LM models the probability of word sequences: “Go. See the doctor.” is more probable than “walk. pair REd”.
  • The dataset consists of 1,225,332 lines of non-spam data (sincere questions/statements) from Quora. The dataset can be obtained from the Kaggle competition. You can write a simple script to extract the sincere text from the classification dataset shared on the competition portal; a small extraction sketch follows this list. In the implementation described below, I take a subset of this data. You might want to apply some data filtering and noise removal if you take the whole dataset for your experimentation.
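A small extraction sketch, assuming the train.csv file from the Kaggle “Quora Insincere Questions Classification” competition with its question_text and target columns (target = 0 marking sincere questions); the output file name is illustrative:

```python
import pandas as pd

df = pd.read_csv('train.csv')                       # Kaggle competition training file
sincere = df[df['target'] == 0]['question_text']    # keep only the sincere questions
sincere.to_csv('sincere_questions.txt', index=False, header=False)
```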

Logic:

By keying in on common speech patterns, my AI model predicts the next probable words. Since the LM inherits its features from a dataset of sincere questions/statements, the words recommended by the AI model will be drawn from a non-inflammatory, non-discriminatory vocabulary.

The model is seeded via transfer learning, inheriting the encoder representations of the wiki-103 pre-trained model shared by fast.ai NLP [here]. This helps achieve better convergence, because wiki-103 has already learned general language characteristics from a much larger dataset; my aim here is to fine-tune it to learn features related to sincere comments.

Keep reading for more details.

Implementation:

The code can be accessed here.

Data Preprocessing, Training Data Preparation and Training Batch Formation:

Necessary imports for my implementation

A Variable wraps a Tensor and provides a backward method to perform backpropagation. torch.optim provides the optimizers used to minimize our loss. The flatten function is for word tokenization; tokenization later facilitates the conversion of word tokens to indices, and the word tokens enter the training model in the form of these numerals (indices).
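A minimal sketch of the imports and the flatten helper, assuming lines are tokenized into nested lists as described above (the USE_CUDA flag is illustrative):

```python
import pickle
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable  # wraps a Tensor; provides .backward()

USE_CUDA = torch.cuda.is_available()

# flatten a list of tokenized lines into a single list of word tokens
flatten = lambda l: [item for sublist in l for item in sublist]
```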

Loading dataset, building vocabulary

The above function reads lines from the dataset, tokenizes the words using the flatten function, and creates a vocabulary of each unique word in the dataset. Each unique word has a unique index associated with it, and these word-index pairs are stored in the word2index dictionary. Note that index 0 is assigned to <unk>; any word that appears during testing/inference but never appeared in the training set is mapped to <unk>. The </s> token is appended to the end of each line to mark the end of the sentence for the model.
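A minimal sketch of this step, assuming whitespace tokenization, the flatten helper from the imports sketch above, and a plain-text file of sincere questions (the path and function name are illustrative):

```python
def load_dataset(path, max_lines=None):
    """Read lines, tokenize, and build word2index with <unk> at index 0."""
    corpus = []
    with open(path, encoding='utf-8') as f:
        for i, line in enumerate(f):
            if max_lines is not None and i >= max_lines:
                break
            # append the end-of-sentence marker to every line
            corpus.append(line.strip().split() + ['</s>'])

    vocab = set(flatten(corpus))
    word2index = {'<unk>': 0}          # index 0 reserved for unknown words
    for word in vocab:
        if word not in word2index:
            word2index[word] = len(word2index)
    return corpus, word2index
```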

Conversion of word tokens to indexes and storing it into Tensors for training data preparation

The above function converts word tokens to indices and stores them in tensors for training data preparation. Any word token whose index is not found in the word2index dictionary is assigned index 0 (<unk>).
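A sketch of this conversion, assuming the corpus and word2index produced above (the function name is illustrative):

```python
def prepare_dataset(corpus, word2index):
    """Convert the tokenized corpus into one long LongTensor of word indices."""
    indices = [word2index.get(token, 0)   # 0 -> <unk> for out-of-vocabulary tokens
               for line in corpus for token in line]
    return torch.LongTensor(indices)
```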

vocab_size and index2word

For my implementation and the sincere-text dataset, I get a vocabulary size of 45,462 words. The LM predicts the next word in the form of indices, so we need to trace back the words associated with those indices; the index2word mapping facilitates this task.
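A one-line inversion of the vocabulary, assuming word2index from the previous step:

```python
vocab_size = len(word2index)   # 45,462 for the sincere-text subset used here
index2word = {idx: word for word, idx in word2index.items()}
```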

Storing the word2index and index2word

It's always safe to store the word2index. You may put in the time to train your model and then lose the word2index; even though the model might be giving correct predictions, you will not be able to map them back to words if you have to regenerate the vocabulary, because the new indices will no longer match.
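A minimal way to persist both mappings with pickle (imported in the sketch above; file names are illustrative):

```python
# persist the vocabulary mappings alongside the model checkpoint
with open('word2index.pkl', 'wb') as f:
    pickle.dump(word2index, f)
with open('index2word.pkl', 'wb') as f:
    pickle.dump(index2word, f)
```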

Batch formation

Starting from sequential data, batchify arranges the dataset into columns. For instance, with the alphabet as the sequence and a batch size of 4, we’d get the following:
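Reproducing the layout from the PyTorch word-language-model example cited in the references (26 letters, batch size 4, so the trailing y and z are dropped):

```
a g m s
b h n t
c i o u
d j p v
e k q w
f l r x
```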

These columns are treated as independent by the model, which means that the dependence of, e.g., ‘g’ on ‘f’ cannot be learned, but it allows more efficient batch processing.

In the batchify function above, suppose my batch_size = 4 and my word tokens form the alphabet sequence shown above. Then nbatch is the length of each column needed to maintain a batch size of 4. data.narrow trims off the leftover tokens that do not fit evenly into nbatch rows, and data.cuda() lets the training batches run on the GPU. The training set prepared by the prepare_dataset function is passed through batchify for batch processing.
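A sketch of batchify along the lines of the PyTorch word-language-model example in the references (USE_CUDA is the flag defined in the imports sketch above):

```python
def batchify(data, batch_size):
    """Arrange a 1-D tensor of word indices into batch_size columns."""
    nbatch = data.size(0) // batch_size            # length of each column
    data = data.narrow(0, 0, nbatch * batch_size)  # drop tokens that don't fit
    data = data.view(batch_size, -1).t().contiguous()
    return data.cuda() if USE_CUDA else data
```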

Define my training labels

The above function defines the labels for the training data: the label is the input sequence shifted one position to the right, so that for every position the model is asked to predict the next word. For example, if my training input is “I have a question”, the label would be “have a question </s>”.
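A sketch of getBatch in the spirit of the PyTorch example, where the target is the input shifted by one position (the seq_length value is illustrative):

```python
seq_length = 30  # back-propagation-through-time window (illustrative value)

def getBatch(source, i):
    """Return an input chunk and its next-word targets (input shifted by one)."""
    seq_len = min(seq_length, len(source) - 1 - i)
    inputs = source[i:i + seq_len]
    targets = source[i + 1:i + 1 + seq_len].view(-1)
    return inputs, targets
```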

Transfer Learning Implementation:

Transfer learning from wiki-103

The wiki-103 model is a language model with encoder (embedding) weights and decoder weights, pre-trained on a large Wikipedia corpus and shared by fast.ai. The pre-trained LM weights have an embedding size of 400, 1150 hidden units and 3 layers. We need to match the embedding size of the target Quora LM so that the weights can be loaded, and torch.load loads the pre-trained model; so the embedding size for the LM model is 400. The rest of the parameters can be tuned according to the available GPU and the training data size.
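A sketch of loading the pre-trained weights, following the conventions of the fast.ai lesson 10 notebook in the references (the file name and dictionary keys are assumptions based on that notebook, not on this post’s repository):

```python
# load the fast.ai wiki-103 pre-trained weights onto the CPU
wgts = torch.load('fwd_wt103.h5', map_location=lambda storage, loc: storage)
enc_wgts = wgts['0.encoder.weight'].numpy()   # 400-dimensional embedding matrix
row_m = enc_wgts.mean(0)                      # mean embedding, used later for unseen words
```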

Initializing Quora LM model with pre-trained weights

The pre-trained model comes with the index2word mapping for the dataset on which it was trained (itos2). We use that to generate the corresponding word2index (stoi2). For the words in the Quora dataset vocabulary built earlier that are present in stoi2, I initialize the Quora LM model with the corresponding pre-trained weights; the words that are not in stoi2 are initialized with the mean weight values of the pre-trained model. These combined weights are stored in new_w.
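A sketch of building new_w, assuming itos2 has been loaded from the files accompanying the pre-trained model and that enc_wgts and row_m come from the loading sketch above:

```python
import collections
import numpy as np

# stoi2: word -> index for the pre-trained vocabulary, -1 for words it never saw
stoi2 = collections.defaultdict(lambda: -1,
                                {w: i for i, w in enumerate(itos2)})

new_w = np.zeros((vocab_size, 400), dtype=np.float32)  # 400 = pre-trained embedding size
for word, idx in word2index.items():
    r = stoi2[word]
    # copy the pre-trained row if the word exists there, otherwise use the mean row
    new_w[idx] = enc_wgts[r] if r >= 0 else row_m
```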

Network Architecture:

I have implemented a fairly simple model to analyze the power of transfer learning.

LSTM model

The super function inherits from nn.Module, the PyTorch base class used for the LSTM implementation. init_weight is where I initialize my Quora LM model with the pre-trained weights stored in ‘new_w’; the linear layer weights are initialized with xavier_uniform. The shape of the hidden state for the LSTM model is [num_layers, batch_size, hidden_size]. detach_hidden detaches the hidden state from the previous batch’s graph so that gradients are not backpropagated through the entire history, which avoids the exploding-gradients problem. The LSTM returns an output value and the hidden state. The hidden state encompasses the encoded representations of the entire Quora sincere dataset and can be extracted later to generate text embeddings, which can be used to find the cosine distance from any new input on Quora to predict whether the statement is insincere or sincere.

The output returned from the LSTM is reshaped to [sequence_length * batch_size, hidden_size] and passed through a linear layer to produce the logits used for calculating the loss.
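A minimal sketch of such a model, assuming new_w from the transfer-learning step above (layer names and defaults are illustrative, not the exact code from the repository):

```python
class LanguageModel(nn.Module):
    """LSTM language model seeded with the wiki-103 embedding weights (sketch)."""

    def __init__(self, vocab_size, embed_size, hidden_size, num_layers):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, num_layers)
        self.linear = nn.Linear(hidden_size, vocab_size)
        self.num_layers, self.hidden_size = num_layers, hidden_size
        self.init_weight()

    def init_weight(self):
        # seed the embedding with the transferred weights; xavier for the output layer
        self.embed.weight.data.copy_(torch.from_numpy(new_w))
        nn.init.xavier_uniform_(self.linear.weight)

    def init_hidden(self, batch_size):
        # hidden state shape: [num_layers, batch_size, hidden_size]
        weight = next(self.parameters())
        return (weight.new_zeros(self.num_layers, batch_size, self.hidden_size),
                weight.new_zeros(self.num_layers, batch_size, self.hidden_size))

    def detach_hidden(self, hidden):
        # cut the graph so gradients do not flow back through previous batches
        return tuple(h.detach() for h in hidden)

    def forward(self, inputs, hidden):
        emb = self.embed(inputs)                        # [seq_len, batch, embed_size]
        output, hidden = self.lstm(emb, hidden)         # [seq_len, batch, hidden_size]
        logits = self.linear(output.view(-1, self.hidden_size))
        return logits, hidden
```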

My Hyper-parameters

The embed_size remains equal to the embedding size of the pre-trained model. My experiments yielded better results with hidden_size values less than or equal to 200 for the data subset I took for training. num_layers in my LSTM model was 2, but I want to test with 3 as well.
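For reference, the values described above:

```python
embed_size = 400    # fixed by the wiki-103 pre-trained embeddings
hidden_size = 200   # values <= 200 worked best on my data subset
num_layers = 2      # 3 is worth trying as well
```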

Model Training

This is the standard PyTorch pipeline for training the model. The getBatch function returns the input and target labels. Since the backward() function accumulates gradients, and we don’t want to mix gradients between mini-batches, we have to zero them out at the start of each new mini-batch; this is taken care of by the zero_grad function. The loss function used in the implementation is nn.CrossEntropyLoss().
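A sketch of the training loop tying the earlier sketches together (the optimizer choice, learning rate, batch size, gradient-clipping value and epoch count are illustrative):

```python
batch_size = 32                                          # illustrative value
train_data = batchify(prepare_dataset(corpus, word2index), batch_size)

model = LanguageModel(vocab_size, embed_size, hidden_size, num_layers)
if USE_CUDA:
    model = model.cuda()

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    hidden = model.init_hidden(batch_size)
    for i in range(0, train_data.size(0) - 1, seq_length):
        inputs, targets = getBatch(train_data, i)
        hidden = model.detach_hidden(hidden)             # truncated BPTT
        model.zero_grad()                                # clear accumulated gradients
        logits, hidden = model(inputs, hidden)
        loss = criterion(logits, targets)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.5)  # guard against exploding gradients
        optimizer.step()
```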

The testing and inference code can be accessed from the github link.

If you liked this article, feel free to give it a clap. If you are interested in the AI behind humanoid robotics and reinforcement learning, follow me on Medium, as I will soon be publishing content and implementations on those topics.

References:

  1. https://course.fast.ai/lessons/lesson10.html
  2. https://github.com/pytorch/examples/blob/master/word_language_model/main.py
