Google QUEST Q&A Labeling (Improving automated understanding of complex question answer content) : kaggle competition

Akshay Vispute
15 min read · Sep 12, 2020



INDEX : Step by step approach

SECTION 1 : Understanding the problem & data.
1. Detailed overview
2. The business problem
3. About the dataset
4. Exploratory data analysis and pre-processing
-----------------------------------------------------
SECTION 2 : The action plan.
5. Evaluation metric
6. Loss function
7. Baseline model
7.1. K-Fold cross validation
7.2. Post-processing : binning
7.3. Error Analysis
7.3.1. Why are these features not performing well?
7.3.2. Possible workarounds
7.3.3. Limitations with current LSTM model
8. Model with SOTA pretrained embeddings
8.1. BERT
8.2. USE
8.3. XLNet
8.4. RoBERTa
-----------------------------------------------------
SECTION 3 : Inferences and analysis.
9. Final results
9.1. Difference between baseline model and final_model.
9.2. Findings and summary based on the results.

1. Detailed overview :

  • Let’s take a quick look at the competition details before we dive into the best possible solutions.
  • These days, computers are really good at answering questions with single, verifiable answers. But humans are often still better at answering questions about opinions, recommendations, or personal experiences. Humans are better at addressing subjective questions that require a deeper understanding of context — something computers aren’t trained to do well…yet. — (Source : kaggle_competition_page)
  • Let’s understand the human understanding mechanism first : questions can take many forms — some have multi-sentence elaborations, others may be simple curiosity or a fully developed problem. They can have multiple intents, or seek advice and opinions. Some may be helpful and others interesting. Some are simply right or wrong. A human can analyze this kind of information fluently as part of their daily workflow.

The motive of the competition boils down to identifying whether a StackExchange question is interesting, whether the answer is suitable for the question’s context, whether it helps fellow users, whether it is spam, and so on. In short, participants are expected to build machinery that behaves more like the human understanding mechanism.

2. The business problem :

  • For any platform, it is important to know how the posted questions and their corresponding answers are being received by fellow users.
  • This will serve 3 purposes for a business:
  1. The platform will be able to evaluate the stronger and weaker areas of an author based on their present contributions. These evaluations can be shared with the author as suggestions for improving further contributions and, ultimately, the results. Platforms can also distribute their guidelines to new authors and contributors based on previously evaluated results.
  2. A better question-answering system means more people benefit from it; helping users usually translates into attracting more customers and, ultimately, better business.
  3. Another possible benefit: top solutions of the competition can act as a better evaluation system for complex automated question-answering (chatbot-like) systems.

The problem Statement :

Hence, this case study is dedicated to solving one very crucial problem, i.e., ‘building a predictive algorithm for different subjective aspects of question-answering.’

3. About the dataset :

  • The question-answer pairs were gathered from nearly 70 different websites, in a “common-sense” fashion. Raters received minimal guidance and training, and relied largely on their subjective interpretation of the prompts. As such, each prompt was crafted in the most intuitive fashion so that raters could simply use their common sense to complete the task. — (Source : kaggle_competition_page)
  • Each row contains a single question and a single answer to that question, along with additional features. The training data contains rows with some duplicated questions (but with different answers).
pandas dataframe : actual dataset
  • The dataset has questions and answers from various StackExchange properties. The task is to predict 30 target features for each question-answer pair. Basically, it’s a ‘multi-label prediction problem’.
  • Target labels are aggregated from multiple raters, and can have continuous values in the range [0,1]. This is called the ‘multi-target regression problem’.

4. Exploratory data analysis and pre-processing :

Let’s take a look at all the available features and check for missing values.

This dataset has three prominent features on which we have to build the solution : ‘Question-Title’, ‘Question-Body’, ‘Answer’. Each question can have multiple answers, along with the question title and its descriptive text. It is important to note that the given dataset is very small, i.e., only 6079 rows in the training set.

Now we move to the next stage, which is nothing but data cleaning and further data processing steps…

# data processing steps
1. remove html tags, html urls, replace html comparison operators
2. remove latex
3. all lowercase
4. decontraction
5. remove non-english-characters
6. remove all special-characters
7. Stop_word removal
8. remove all white-space

All the question-answer pairs reside on the StackExchange network, which is nothing but internet web pages, so it is obvious to find ‘jQuery’ or ‘HTML’ language-specific tags and keywords in our dataset. These tags and special symbols do not contribute any information to the end task, hence I decided to remove them. If you observe closely, some data points contain non-English (Hebrew) characters; either we have to remove those characters or translate them into English, and for the current task I preferred to remove non-English characters. For this experiment I chose to remove LaTeX and formulae too. Previous state-of-the-art NLP experiments have shown empirically that applying ‘decontractions’ tends to give better results.
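To make these steps concrete, here is a minimal sketch of such a cleaning pipeline using plain regular expressions; the clean_text function name, the tiny contraction map, and the toy stop-word set are illustrative assumptions, not the exact code used in the project.

```python
# Hypothetical sketch of the cleaning pipeline; the contraction map and
# stop-word set are tiny illustrative stand-ins for the full lists.
import re

CONTRACTIONS = {"won't": "will not", "can't": "can not", "n't": " not",
                "'re": " are", "'ll": " will", "'ve": " have", "'m": " am"}
STOP_WORDS = {"the", "a", "an", "and", "or", "is", "are", "to", "of"}

def clean_text(text: str) -> str:
    text = re.sub(r"<[^>]+>", " ", text)              # 1. strip HTML tags
    text = re.sub(r"http\S+|www\.\S+", " ", text)     # 1. strip URLs
    text = re.sub(r"\$[^$]*\$", " ", text)            # 2. drop inline LaTeX / formulae
    text = text.lower()                               # 3. lowercase
    for pattern, repl in CONTRACTIONS.items():        # 4. decontraction
        text = text.replace(pattern, repl)
    text = re.sub(r"[^a-z\s]", " ", text)             # 5, 6. drop non-English / special chars
    tokens = [w for w in text.split() if w not in STOP_WORDS]  # 7. stop-word removal
    return " ".join(tokens)                           # 8. collapse whitespace

print(clean_text("<p>Can't parse the $x^2$ tag? See https://example.com</p>"))
```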

Wordcloud : Title
Wordcloud : Body
Wordcloud : Answer

Question titles, question bodies, and answers mostly contain technical words like ‘file’, ‘error’, ‘function’, ‘server’, etc. This observation is very helpful for our task, since state-of-the-art sentence embeddings like BERT and XLNet are partially pre-trained on some technical Wikipedia articles too; if this were not the case, transfer learning would not have worked. We can also observe ‘jQuery’- and ‘HTML’-specific keywords in the word clouds, hence we need to clean up the text features very carefully.

Q. Which subject categories does the provided data come from?

  • The nature of questions and answers in the ‘Technology’, ‘StackOverflow’, and ‘Science’ categories would be of a similar kind; at the same time, datapoints in the ‘Culture’ and ‘Life_arts’ categories can be similar to each other.
  • The category-wise distribution is not very unbalanced, so it is usable for training the ML model.

The full EDA with inferences can be found here on my GitHub profile.

5. Evaluation metric :

Given two random variables (y_true) and (y_pred), we need an evaluation metric which can tell how these arrays are correlated with each other. ‘spearman_rank_correlation’ is a robust metric which can capture complex relationships between arrays to a moderate extent. Because ‘ranking’, or ‘the order of values’, is the basis for computing the covariance between the variables, ‘spearman_rank_correlation’ can capture relationships robustly.

Also, it is important to note that this is a ‘multi-target regression’ problem, i.e., the target features are real-valued. Hence, one of the best-suited evaluation metrics for the problem at hand is ‘Spearman’s rank correlation coefficient (rho)’. The Spearman’s rank correlation is computed for each target column, and the mean of these values is taken as the submission score.

For more details, please refer : Wikipedia

The Spearman correlation coefficient ranges between [-1, 1]. A negative coefficient implies the two given random variables (X and Y in the figure above) are ‘inversely proportional’ to each other. A positive coefficient implies the random variables are ‘directly proportional’ to each other. Values near zero mean there is no relationship between X and Y.
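As an illustration, the competition metric (column-wise Spearman’s rho averaged over the 30 targets) can be computed with scipy as in the sketch below; the helper name and array shapes are assumptions for the example.

```python
# Hypothetical sketch of the submission metric: column-wise Spearman's rho
# averaged over all 30 target columns.
import numpy as np
from scipy.stats import spearmanr

def mean_spearman(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # y_true, y_pred: shape (n_samples, n_targets), values in [0, 1]
    rhos = [spearmanr(y_true[:, col], y_pred[:, col]).correlation
            for col in range(y_true.shape[1])]
    return float(np.nanmean(rhos))

rng = np.random.default_rng(0)
y_true = rng.random((100, 30))
y_pred = np.clip(y_true + 0.1 * rng.standard_normal((100, 30)), 0.0, 1.0)
print(mean_spearman(y_true, y_pred))   # close to 1.0 for well-correlated predictions
```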

6. Loss function :

Since we’re working on a ‘multi-target regression’ problem, it is critical to figure out which loss function suits our business problem best. Let’s not forget the fact that ultimately we want a business problem to be solved using machine learning techniques.

Typically, for any regression problem, ‘MSE’ (mean squared error) or ‘RMSE’ (root mean squared error) are the best-suited loss functions, but NOT for this problem.

mean_squared_error

Why? Hmm…let’s dig deeper and understand through the example below:

Please note that the target variables to be predicted range from 0 to 1. If we take the squared error of 0.51 and 0.65 : MSE = (0.51 − 0.65)² = 0.0196.

Squared-error values this small mean the gradients will also be very small. Therefore, the model will not be able to make much progress on the training data.

‘Binary_crossentropy’ is the ideal loss function for this task. Unlike the MSE loss function, BCE is designed for values ranging between 0 and 1.

binary_crossentropy i.e., log_loss

Q. Why ‘binary_crossentropy’ but not ‘categorical_crossentropy’?

  • Because this is a multi-target regression problem, there are many ways to handle the multi-label target features in deep learning modelling. The best-suited option for the current task is to use ‘sigmoid activation units at the output layer’ together with the ‘binary_crossentropy’ loss function, as sketched below.
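A minimal Keras sketch of such an output head, assuming a pooled input representation of size 512 (all layer sizes are illustrative):

```python
# Hypothetical sketch: 30 sigmoid output units trained with binary_crossentropy.
import tensorflow as tf

inputs = tf.keras.Input(shape=(512,))                               # e.g. a pooled sentence embedding
hidden = tf.keras.layers.Dense(256, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(30, activation="sigmoid")(hidden)   # one unit per target in [0, 1]

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```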

7. Baseline model :

Sometimes, because of time or space complexity, compute, and cost constraints, it is not feasible to deploy state-of-the-art models like BERT in a production environment. Therefore, it is better to try out lightweight models first and then analyze the ‘performance-cost’ trade-off between SOTA models and baseline models. Accordingly, one should choose which model to deploy in production.

Basically, I combined two LSTM-based models to predict all 30 target features at once. The first model is trained on ‘question-title’ and ‘question-body’ combined, in order to predict the first 21 target features, which are question-related. The second model is trained solely on the ‘Answer’ feature and predicts the remaining 9 of the 30 target features. Lastly, a simple (Keras) ‘Concatenate’ layer combines the two outputs; a rough sketch follows the architecture figure below.

baseline_model_LSTM
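For illustration, a rough Keras sketch of this two-branch architecture is given below; the vocabulary size, sequence lengths, and layer widths are assumptions, not the exact configuration of the baseline.

```python
# Hypothetical sketch of the two-branch LSTM baseline described above.
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE, MAXLEN_Q, MAXLEN_A = 50_000, 300, 300   # assumed tokenizer settings

q_in = tf.keras.Input(shape=(MAXLEN_Q,), name="question_title_body")
a_in = tf.keras.Input(shape=(MAXLEN_A,), name="answer")

# Branch 1: question title + body -> 21 question-related targets
q = layers.Embedding(VOCAB_SIZE, 128)(q_in)
q = layers.LSTM(64)(q)
q_out = layers.Dense(21, activation="sigmoid", name="question_targets")(q)

# Branch 2: answer -> 9 answer-related targets
a = layers.Embedding(VOCAB_SIZE, 128)(a_in)
a = layers.LSTM(64)(a)
a_out = layers.Dense(9, activation="sigmoid", name="answer_targets")(a)

# Concatenate both heads into the final 30-dimensional prediction
outputs = layers.Concatenate(name="all_30_targets")([q_out, a_out])

model = tf.keras.Model([q_in, a_in], outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```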

7.1. K-Fold cross validation :

The provided dataset has very few question-answer pairs, i.e., only 6079, so it is important to verify that the resulting score or loss values are not an artifact of randomization such as the ‘train_cv_split’. Employing the k-fold CV technique with deep learning models is more cumbersome than with simple machine learning models using the scikit-learn library. Please feel free to go through the well-documented code on my github_repository.
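A hedged sketch of what such a k-fold loop can look like around a Keras model is shown below; build_model, X_q, X_a, and y are assumed placeholders for a model factory and the tokenized inputs and targets.

```python
# Hypothetical sketch: k-fold CV around a Keras model; build_model returns a
# freshly compiled model (e.g. the baseline above), X_q/X_a are tokenized
# inputs and y holds the 30 target columns.
import numpy as np
from scipy.stats import spearmanr
from sklearn.model_selection import KFold

def run_kfold(build_model, X_q, X_a, y, n_splits=5, epochs=5):
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=42)
    for fold, (tr, va) in enumerate(kf.split(X_q)):
        model = build_model()                                # fresh weights every fold
        model.fit([X_q[tr], X_a[tr]], y[tr],
                  validation_data=([X_q[va], X_a[va]], y[va]),
                  epochs=epochs, batch_size=32, verbose=0)
        preds = model.predict([X_q[va], X_a[va]])
        # mean column-wise Spearman's rho, as in the evaluation metric
        rho = np.nanmean([spearmanr(y[va][:, c], preds[:, c]).correlation
                          for c in range(y.shape[1])])
        scores.append(rho)
        print(f"fold {fold}: val spearman = {rho:.4f}")
    return float(np.mean(scores))
```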

7.2. Post-processing : binning

The evaluation metric, ‘spearman_rank_corr’, takes the order of the arrays into consideration. To get good results, it becomes important that the prediction results (y_pred) contain values similar to those in (y_true).

Let’s assume target_feature_1 has the unique values [0.25, 0.50, 0.75, 1.0] throughout the dataset. Then, for any random data-point with y_true = 0.25, it is very important that the model predicts exactly 0.25 and not a nearby value like 0.22, because the evaluation metric is ‘spearman_rank_corr’.

Hence, I decided to apply a ‘feature binning’ technique to the predicted target features. It has shown a small, if not very significant, improvement in our experimental case study.
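One simple way to implement this binning, sketched below, is to snap every predicted value to the closest value observed for that target in the training labels (the function name and toy arrays are illustrative).

```python
# Hypothetical sketch: snap each prediction to the nearest value observed in training.
import numpy as np

def bin_predictions(y_pred: np.ndarray, y_train: np.ndarray) -> np.ndarray:
    binned = np.empty_like(y_pred)
    for col in range(y_pred.shape[1]):
        levels = np.unique(y_train[:, col])        # e.g. [0.25, 0.5, 0.75, 1.0]
        nearest = np.abs(y_pred[:, col][:, None] - levels[None, :]).argmin(axis=1)
        binned[:, col] = levels[nearest]
    return binned

y_train = np.array([[0.25], [0.50], [0.75], [1.00]])   # toy single-target labels
y_pred = np.array([[0.22], [0.48], [0.80], [0.97]])    # raw model predictions
print(bin_predictions(y_pred, y_train).ravel())        # snaps to 0.25, 0.50, 0.75, 1.00
```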

7.3. Error Analysis :

‘Analyzing and quantifying the errors’ is a very critical procedure in machine learning; it is always important to understand ‘where’ and ‘why’ a model is failing. Accordingly, one can improve the current solution or build new ones.

(X : Target features) vs (Y : spearman_rank_corr) | (Green : train_spearman) (Red : val_spearman)

Looking at the plot above, it is easy to infer that some features are doing really well with the current model and dataset, but some of them are not working out as expected. These low-performing features drag the final score down too.

best and worst performing features.

7.3.1. Q. Why are these features not performing well?

  • It has been observed that all of the worst-performing target features are highly imbalanced. Hence, some of them are NOT learnt by the model, or the model has OVERFIT them on the training data.
  • Also, 4 out of 9 (44%) of the worst-learnt target features are ‘Answer’-related, which implies the ‘Answer’ feature is not really well understood by the model.

7.3.2. Possible workarounds :

  1. Use (over)sampling techniques.
  2. Train 7 separate models, one for each of the worst target features.
  3. The answer text is not well understood by the model, so it is better to use an advanced pretrained NLP model for the ‘Answer’ feature.

7.3.3. Limitations with current LSTM model :

  • For a regression task with imbalanced data, SMOGN (Synthetic Minority Over-Sampling Technique for Regression with Gaussian Noise) is the perfect way out. But we cannot use it with the current LSTM model architecture, because we have text as input data and with the current LSTM architecture we have to tokenize the text input.
  • SMOGN needs numerical training features (SMOGN is essentially SMOTER combined with Gaussian noise); it cannot work with textual input features. SMOGN can work with pretrained embeddings; otherwise, one has to do manual feature engineering over the textual data.
  • In addition, we have 30 target features, and SMOGN can handle only one target feature at a time. To use SMOGN we would have to train 30 models, which is a very expensive solution in a practical scenario.

8. Model with SOTA pretrained embeddings:

First things first : let’s understand what pretrained sentence embeddings are…

8.1. BERT : (Bidirectional Encoder Representations from Transformers)

BERT (Bidirectional Encoder Representations from Transformers) is a deep-neural-network-based language representation technique. The biggest advantage of BERT over most previously available embedding techniques (except w2v, which is a word embedding) is ‘transfer learning’ with ‘sentence embeddings’. In 2014, VGG16 made its place in the computer vision community because of this transfer-learning advantage, which was not possible in the natural language processing space in the early stages. There are two existing strategies for applying pre-trained language representations: ‘feature-based’ and ‘fine-tuning’. Since the dataset has StackOverflow-specific documents, ‘fine-tuning’ would most likely help, but because of computational expenses I am limiting myself to ‘feature-based’ embeddings only. This model has been pre-trained for English on the Wikipedia and BooksCorpus datasets. The BERT pretrained model architecture comes in two options:

I. BASE : (L=12, H=768, A=12, Total Parameters=110M)

II. LARGE : (L=24, H=1024, A=16, Total Parameters=340M)

where L = number of transformer blocks, H = hidden layer size, A = number of self-attention heads.
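As a hedged illustration of the ‘feature-based’ strategy, the snippet below extracts fixed BERT sentence vectors with the Hugging Face transformers library; [CLS]-token pooling is one common choice, and the original work may have used a different toolkit (e.g., TF Hub) or pooling.

```python
# Hypothetical sketch: feature-based BERT embeddings via Hugging Face transformers.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

texts = ["How do I undo the most recent local commits in Git?"]
enc = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    out = bert(**enc)

# [CLS]-token vector as a fixed sentence embedding: shape (batch, 768) for BERT-Base
sentence_vectors = out.last_hidden_state[:, 0, :]
print(sentence_vectors.shape)
```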

8.2. USE : (Universal sentence encoder)

As the name (Universal Sentence Encoder) suggests, ‘USE’ takes sentences as input and gives high-dimensional vector representations of the input text. The input can be variable-length English text and the output is a 512-dimensional vector. The universal-sentence-encoder model has two variants: one trained with a deep averaging network (DAN) encoder, and another trained with a Transformer. In the paper, the authors mention that the Transformer-based USE tends to give better results than the DAN one, but it comes at a price in terms of computational resources and run-time complexity. The USE model is trained on sources including Wikipedia, web news, the Stanford Natural Language Inference (SNLI) corpus, web question-answer pages, and discussion forums. The authors explain that the USE model was created with unsupervised NLP tasks in mind, such as transfer learning using sentence embeddings; hence, it makes complete sense to use ‘USE’ in our Q&A labeling task.
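For illustration, obtaining USE embeddings from TensorFlow Hub can look like the sketch below; the module URL points to the public DAN-based variant, while the Transformer variant is published as universal-sentence-encoder-large.

```python
# Hypothetical sketch: sentence embeddings from the Universal Sentence Encoder.
import tensorflow_hub as hub

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")  # DAN-based variant

embeddings = use([
    "How do I revert a Git commit?",
    "What is the capital of France?",
])
print(embeddings.shape)  # (2, 512): one 512-dimensional vector per input sentence
```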

8.3. XLNet :

XLNet is a transformer-based auto-regressive language model that integrates ideas from Transformer-XL, the state-of-the-art auto-regressive model. BERT, being an auto-encoding approach, achieves better performance than pre-training approaches based on auto-regressive language modeling, yet it has its own drawbacks.

Disadvantages of BERT :

  1. However, BERT neglects dependency between the masked positions and suffers from a pretrain-finetune discrepancy.
  • E.g. Sentence : “[New, York, is, a, city]”
  • Masked input : [MASK] [MASK] is a city.
  • BERT = log p(New | is a city) + log p(York | is a city),
  • XLNet = log p(New | is a city) + log p(York | New, is a city).
  • Here BERT neglects the dependency between the masked words, whereas XLNet is able to capture the dependency between the pair (New, York).

2. BERT corrupts the input with masks and suffers from a pretrain-finetune discrepancy: in real-life applications we do not have inputs that are masked, so how BERT handles this in reality remains ambiguous.

XLNet is a transformer-based model, and it basically revolves around ‘attention weights’. All the hidden states from the encoder are made available to the decoder, and the decoder decides which states to focus on by weighting each of the encoder’s hidden states with respect to the provided targets.

Permutation Language Modeling (PLM): PLM is the idea of capturing bidirectional context by training an autoregressive model on all possible permutations of words in a sentence.

Source : XLNet original_paper
  1. The input sequence is shuffled (a permutation is sampled).
  2. Only the tokens preceding x4 in the permutation order are considered; hence no [MASK] token is needed and the input data need not be corrupted.
  • PLM with a transformer : to implement XLNet, the transformer is tweaked to look only at the hidden representations of the tokens preceding the token to be predicted.
  • XLNet is pretrained on the same data as BERT (‘Wikipedia and BookCorpus’), with a 12-layer architecture and the same model hyper-parameters as BERT-Base.

8.4. RoBERTa : (Robustly optimized BERT approach)

RoBERTa stands for Robustly optimized BERT approach, in which a Facebook AI and University of Washington (computer science) team improved the original BERT architecture with several tweaks to the network. The RoBERTa base architecture remains the same as BERT’s (L = 12, H = 768, A = 12, 110M params).

  1. Dynamic Masking : The original BERT implementation performed masking once during data pre-processing, resulting in a single static mask. To avoid using the same mask for each training instance in every epoch, the training data was duplicated 10 times so that each sequence is masked in 10 different ways over the 40 epochs of training. RoBERTa instead uses dynamic masking, where the masking pattern is generated every time a sequence is fed to the model.
  2. Model Input Format and Next Sentence Prediction : Original BERT is pretrained on two tasks — Masked language modelling (MLM) and Next sentence prediction (NSP). RoBERTa authors found NSP is downgrading the performance of the model on downstream tasks and hence they decided to remove NSP from pretraining of the RoBERTa.
  3. Training with large batches : Past work in neural machine translation has shown that training with very large mini-batches can improve both optimization speed and end-task performance when the learning rate is increased appropriately. BERT(BASE) was originally trained for 1M steps with a batch size of 256 sequences; RoBERTa is instead pre-trained with large batches of 8K sequences for fewer steps, which is equivalent in computational cost.
  4. Text Encoding : BPE (Byte-Pair Encoding) is a hybrid between character- and word-level representations that allows handling large vocabularies. Instead of full words, BPE relies on subword units, which are extracted by performing statistical analysis of the training corpus. The original BERT implementation uses a character-level BPE vocabulary of size 30K, which is learned after preprocessing the input with heuristic tokenization rules. RoBERTa is trained with a larger byte-level BPE vocabulary containing 50K subword units, without any additional preprocessing or tokenization of the input.
  5. Training data : BOOKCORPUS + English WIKIPEDIA = 16 GB; this is the original data used to train BERT. Additionally, RoBERTa is trained with CC-NEWS (76 GB), OPENWEBTEXT (38 GB), and STORIES (31 GB), i.e., 160 GB of text data in total.

For this section, I tried 2 approaches as follow:

1. Individual modelling on each sentence embedding.

2. Combined modelling on all the sentence embeddings.
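As a hedged sketch of the ‘combined’ approach, precomputed sentence embeddings from the four encoders can be concatenated and fed to a small dense head with 30 sigmoid outputs; the embedding dimensions and layer sizes below are assumptions.

```python
# Hypothetical sketch: a dense head over concatenated, precomputed sentence
# embeddings (assumed dims: 768 for BERT/XLNet/RoBERTa, 512 for USE).
import tensorflow as tf
from tensorflow.keras import layers

bert_in = tf.keras.Input(shape=(768,), name="bert_emb")
use_in = tf.keras.Input(shape=(512,), name="use_emb")
xlnet_in = tf.keras.Input(shape=(768,), name="xlnet_emb")
roberta_in = tf.keras.Input(shape=(768,), name="roberta_emb")

x = layers.Concatenate()([bert_in, use_in, xlnet_in, roberta_in])
x = layers.Dropout(0.2)(x)
x = layers.Dense(512, activation="relu")(x)
outputs = layers.Dense(30, activation="sigmoid")(x)

model = tf.keras.Model([bert_in, use_in, xlnet_in, roberta_in], outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```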

9. Final results :

Final result table

9.1. Difference between baseline model and final_model :

  1. It has been observed that the ‘answer_well_written’ target feature is learnt better than in the baseline_model; hence, we found an improvement in the val_score.
  2. For each ‘target_feature’, the ‘spearman_score’ is slightly or moderately improved, and so is the final score.

9.2. Findings and summary based on the results :

  • The combination of ‘BERT, USE, RoBERTa, XLNet’ has given the winning results, but not by a significant margin. Hence ‘USE’ is preferable.
  • So far, ‘USE’ embeddings are far more preferable in a production environment for our task. The reasons are as follows :
  1. Comparatively, a forward pass through ‘USE’ is not very expensive in terms of time and computational power.
  2. ‘USE’ was actually built to capture semantic similarity; USE embeddings are potential embeddings which can work well with further tweaks.
  • The number of datapoints actually matters in NLP problems. We dealt with very few datapoints, hence the model failed to show further improvements.
  • Pretrained sentence embeddings are more powerful when :
  1. They are trained on ‘similar data’ or a ‘similar task’ to the current problem.
  2. They are fine-tuned on the current dataset.

(NOTE : Both tasks are computationally very expensive, hence they were not introduced in the current work.)

  • Imbalanced data : A very basic and key learning from this experiment is that no model can fully deal with imbalanced data. Complex models like BERT can handle the data-imbalance problem to some extent, but not completely. Hence, it is better to have balanced and sufficient data.
  • Post-processing (target feature binning) : since the metric is ‘spearman_rank_corr’, which takes order into consideration, it becomes important that the prediction results (y_pred) contain values similar to those in (y_true). This has shown small, if not very significant, improvements.

If learning to build a ‘SEARCH ENGINE from scratch’ interests you, please go through: https://medium.com/@vispute.ak/build-a-search-engine-based-on-stack-overflow-questions-88a4bc0c195c | where I explain each concept in detail and provide well-documented code.

Please find the complete code with documentation here : https://github.com/vispute/Google-QUEST-Q-A-Labeling-kaggle-competition

Feel free to connect with me on LinkedIn : https://www.linkedin.com/in/akshay-vispute-a34bb5136/

References :

1. BERT : Pre-training of Deep Bidirectional Transformers for Language Understanding — https://arxiv.org/pdf/1810.04805.pdf

2. USE : UNIVERSAL SENTENCE ENCODER https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46808.pdf

3. XLNet : https://arxiv.org/pdf/1906.08237.pdf | XLNET explained in simple terms : https://towardsdatascience.com/xlnet-explained-in-simple-terms-255b9fb2c97c

4. RoBERTa : https://arxiv.org/pdf/1907.11692.pdf

