Dear BERT, if A is a Duplicate of B, Then B is a Duplicate of A

Aditya Joshi
Published in SEEK blog · 4 min read · May 5, 2022

In this blog post, Aditya Joshi, a data scientist in Artificial Intelligence & Platform Services (AIPS) at SEEK, Melbourne, writes about his upcoming ACL 2022 paper on improving consistency of symmetric classification tasks.


Machine learning models that process more than one input entity are common in several areas of AI. An example would be a model that takes two photographs as input and determines if the person in these photographs is the same. This may be done in the case of document verification on a platform such as Certsy. Another example would be the prediction of a candidate’s relevance to a job ad. This would involve predicting a relevance score based on a candidate’s CV and a job ad as the two inputs. This score may then be used by a platform such as SEEK to nudge candidates to apply for a job if their skills and experiences are predicted to be highly relevant.

Some of these tasks that take more than one entity as input are inherently symmetric, i.e., their output is not expected to change if the order of the inputs is swapped. As a running example of a symmetric classification task, consider 'job ad duplicate detection': a model that predicts whether two job ads are duplicates of each other. A BERT-based architecture is potentially valuable for this task. A typical classifier for this task receives the two inputs separated by a special separator ([SEP]) token. Symmetry means that 'Job Ad 1 [SEP] Job Ad 2' should produce the same output as 'Job Ad 2 [SEP] Job Ad 1'. But do BERT-based models produce consistently symmetric output?
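To make this concrete, here is a small sketch using the Hugging Face transformers library that encodes a pair of (toy) job ads in both orders and compares the two predictions. The checkpoint name, the helper function and the example job ads are placeholders for whatever fine-tuned pair classifier and data you actually use:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint; in practice this would be a BERT-style
# classifier fine-tuned for duplicate detection.
MODEL_NAME = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)
model.eval()

def predict(first: str, second: str) -> torch.Tensor:
    """Return the softmax vector for 'first [SEP] second'."""
    inputs = tokenizer(first, second, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1).squeeze(0)

job_ad_1 = "Senior data engineer, Melbourne. Spark and Python required."
job_ad_2 = "Data engineer (senior), Melbourne CBD. Must know Python and Spark."

p_12 = predict(job_ad_1, job_ad_2)   # Job Ad 1 [SEP] Job Ad 2
p_21 = predict(job_ad_2, job_ad_1)   # Job Ad 2 [SEP] Job Ad 1

# A perfectly consistent symmetric classifier would produce identical outputs
# for both orders; in practice the two vectors (and even labels) can differ.
print(p_12, p_21)
print("Same predicted label:", p_12.argmax().item() == p_21.argmax().item())
```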

It turns out that if you swap the order of inputs, BERT may produce a different output label or a lower confidence score of prediction. In our Findings of ACL 2022 paper, we show how this consistency can be ensured. In essence, we modify the loss function of the classifier to penalise inconsistency. Before we look at our loss function, let’s see what a typical BERT-based classifier loss function is:
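For a binary task such as duplicate detection, one standard formulation is the cross-entropy loss over N training examples:

$$\mathcal{L}_{CE} = -\sum_{i=1}^{N}\Big[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \,\Big]$$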

Here, y_i is the true label of the i-th training example, ŷ_i is the predicted probability, and N is the total number of data points. The objective of classifier training is to minimise this function, i.e., to learn weights such that the predicted labels match the true labels as often as possible.

So what are the requirements of a loss function which ensures consistency? Intuitively, a consistent symmetric classifier that takes two input job ads J1 and J2 must ensure two constraints:

1. The output must be correct even if the order of the inputs is swapped. This means:
   a. The classifier must correctly predict the label when J1 is followed by J2 in the input; AND
   b. The classifier must correctly predict the label when J2 is followed by J1 in the input.

2. The output must be the same even if the order of the inputs is swapped. One way to ensure this is to require that the vector learned in the last layer of the neural model is similar for the two permutations of inputs. This means that the final representation in the model (i.e., the softmax vector) must be similar, irrespective of the order of the job ads J1 and J2.

The two requirements above are reflected in the loss function of our proposed model (which we call BERT-with-consistency-loss):
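Schematically, it combines the cross-entropy for both input orders with a weighted penalty on the gap between the two softmax vectors (the exact formulation is in the paper):

$$\mathcal{L} = \mathcal{L}_{CE}(J_1, J_2) + \mathcal{L}_{CE}(J_2, J_1) + \lambda \, d\big(\mathbf{s}(J_1, J_2),\ \mathbf{s}(J_2, J_1)\big)$$

where s(·,·) denotes the softmax vector produced for a given input order and d(·,·) measures how far apart the two softmax vectors are.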

The first two terms optimise for requirement 1, while the last term optimises for requirement 2. The λ in the last term lets us tune how strongly consistency is enforced.
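As an illustration, here is a minimal PyTorch sketch of this objective. It assumes the consistency penalty d is a symmetrised KL divergence between the two softmax vectors; that is an illustrative choice on our part, and the exact term used is specified in the paper:

```python
import torch
import torch.nn.functional as F

def bert_with_consistency_loss(logits_12, logits_21, labels, lam=1.0):
    """Loss for one batch of job-ad pairs fed to the model in both orders.

    logits_12: model logits for 'Job Ad 1 [SEP] Job Ad 2'
    logits_21: model logits for 'Job Ad 2 [SEP] Job Ad 1'
    labels:    gold duplicate / non-duplicate labels
    lam:       weight on the consistency penalty (the lambda in the text)
    """
    # Requirement 1: be correct for both input orders.
    ce_12 = F.cross_entropy(logits_12, labels)
    ce_21 = F.cross_entropy(logits_21, labels)

    # Requirement 2: the two softmax vectors should be similar.
    # Illustrative penalty: symmetrised KL divergence between the two
    # predicted distributions (not necessarily the exact term in the paper).
    log_p_12 = F.log_softmax(logits_12, dim=-1)
    log_p_21 = F.log_softmax(logits_21, dim=-1)
    consistency = (
        F.kl_div(log_p_12, log_p_21.exp(), reduction="batchmean")
        + F.kl_div(log_p_21, log_p_12.exp(), reduction="batchmean")
    )

    return ce_12 + ce_21 + lam * consistency
```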

So, what happens when we train BERT-with-consistency-loss? The figure above shows an actual example output for the task of paraphrase detection. The objective is to determine if X and Y are paraphrases of each other. The ‘each other’ indicates that it is a symmetric classification task. As seen in the first column, BERT produces opposite outputs if the order of inputs is reversed. In contrast, BERT-with-consistency-loss (shown in the second column with the bespectacled BERT) does not!

In the paper (link at the bottom), our empirical analysis shows that the model does better than BERT on several standard NLP datasets from the GLUE benchmark, in terms of both consistency of the output label and deviation in confidence scores, for symmetric classification tasks.

There’s much more to symmetric classification. However, in this blog post (and the paper), we show that modifying the loss function is an effective way to improve the consistency of models for symmetric classification tasks. Since many business problems with multiple inputs are inherently symmetric, our consistency loss can be applied across diverse domains.

So the next time you wish to impose an additional constraint on your classifier learning algorithm, ask yourself: can I integrate the constraint into the objective function?

The pre-print of our paper is here: https://arxiv.org/abs/2203.13491
