Hard Negative Mining in Natural Language Processing (How to Select Negative Examples in Classification and Ranking Tasks)

infgrad
11 min read · Sep 14, 2020


1 Introduction

This section first introduces the basic definitions and common practices of classification tasks and ranking tasks in NLP, and then explains what negative examples mean in these two tasks.

1.1 Classification Task

The input is a text and the output is its class label. This setting is used in many tasks: intent recognition, semantic matching and sentiment analysis all belong to it. Before deep learning, the main approach was manual features plus XGBoost (or logistic regression), and performance was mostly determined by the engineer's understanding of the business and the quality of the manual features. With deep learning, the characters in the text are mapped to dense vectors, semantic understanding is done with a CNN or LSTM, and the classification result is obtained through a fully connected layer. After the advent of BERT, classification became even easier: take the CLS vector and pass it to a fully connected layer. Of course, this is only the basic approach; we need to adapt the algorithm and training method to the specific business.
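
As a concrete illustration, here is a minimal sketch of the "CLS vector plus fully connected layer" classifier described above, using the Hugging Face transformers library; the model name and the number of labels are placeholders, not part of the original article.

import torch
from transformers import BertModel, BertTokenizer

class BertClassifier(torch.nn.Module):
    def __init__(self, model_name="bert-base-chinese", num_labels=2):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        # simple classification head on top of the [CLS] vector
        self.fc = torch.nn.Linear(self.bert.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0]  # representation of the [CLS] token
        return self.fc(cls_vector)

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertClassifier()
batch = tokenizer(["an example sentence"], return_tensors="pt", padding=True)
logits = model(batch["input_ids"], batch["attention_mask"])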

1.2 Ranking Task

The input is a set of samples, and the output is a ranking of those samples. Search engines mainly rely on this kind of technology. The main technique used in ranking tasks is Learning to Rank (LTR). With the development of deep learning in the image field, a number of related research directions have appeared, such as Metric Learning, Contrastive Learning and Representation Learning. The nature of these directions is quite similar: let the model learn to measure the distance between examples. The famous Triplet Loss corresponds to the pair-wise setting in LTR. There are many ways to approach this kind of task and all kinds of tricks; if you are interested, you can read the related papers. Here I will just share some of my experience with LTR:

  1. Don’t think of rank tasks as classification, these two are essentially different
  2. No matter how fancy the sorting task is, the goal of training is to get a scoring model, and the score only needs to be used to compare the size (relative score) and does not need to be meaningful (absolute score, this depends on the specific algorithm used). So I prefer Triplet Loss, in my opinion It directly shows the essence of the rank task.
  3. The two examples are mapped to vectors first and then the score is calculated (DSSM), the speed is fast and the performance is poor; the two samples are directly sent to the model to calculate the score, the performance is good, and the speed is slow.
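
As a minimal sketch of the pair-wise idea mentioned in point 2, here is a Triplet Loss computed on sentence vectors; the margin value and the cosine-distance choice are placeholders, not prescribed by the original article.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.3):
    # anchor / positive / negative: [batch, dim] sentence vectors from any encoder
    d_pos = 1 - F.cosine_similarity(anchor, positive)  # distance to the positive example
    d_neg = 1 - F.cosine_similarity(anchor, negative)  # distance to the negative example
    # push the positive closer to the anchor than the negative, by at least `margin`
    return F.relu(d_pos - d_neg + margin).mean()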

1.3 The Importance of Negative Examples

Both of the above tasks inevitably use negative samples. For example, in short text similarity matching (a classification task), semantically unrelated text pairs are negative samples; in ranking tasks, the samples ranked lower can serve as negative examples. Both positive and negative samples fall into two categories. Easy examples: the model can judge them correctly with ease. Hard examples: the model struggles to judge them correctly, and they often contribute large losses during training. Compared with easy samples, hard examples are more valuable: they help the model learn the decision boundary quickly and they also speed up convergence. In practice, the positive samples are basically fixed and not easy to expand or modify, but the negative samples offer a very large selection space. For example, in a search task, the document clicked by the user is treated as a positive sample, and the rest of the documents on that page are all negative samples. When training the model, these negative samples obviously cannot all be used, so hard negative example mining becomes very important!

2 How to Select Negative Examples

2.1 Negative Example Selection Based on Statistical Measures

Compute some statistical measures over candidate negative samples and select negative samples according to those criteria. For example, in a short text matching task, select samples with a higher similarity to the target sample as negative examples. This type of negative example forces the model to learn the semantic information carried by the text rather than just its surface form. Imagine a semantic matching task in which all negative samples are randomly generated combinations of characters with no structure: the model will converge quickly but be useless in practical applications. In search tasks, you can use TF-IDF, BM25 and similar methods to retrieve the top-k results as negative examples. Pay attention to keeping the distribution of the training data consistent with that of the test data: if your model needs to score and rank all documents in the whole search pipeline, then in addition to the top-k negatives you should also randomly sample some documents as negative examples.
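
A minimal sketch of this idea using TF-IDF and cosine similarity with scikit-learn; the corpus, query and k are placeholders. It combines top-k retrieved "lookalike" negatives with a random negative, as suggested above.

import random
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["document one ...", "document two ...", "document three ..."]  # candidate documents
query = "some user query"
k = 2

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(corpus)
query_vec = vectorizer.transform([query])

scores = cosine_similarity(query_vec, doc_vecs)[0]
ranked = scores.argsort()[::-1]                       # most similar documents first
# in practice, filter out the true positive (e.g. the clicked document) before this step
hard_negatives = [corpus[i] for i in ranked[:k]]      # top-k lookalikes as hard negatives
random_negatives = random.sample(corpus, 1)           # random negatives keep the distribution realistic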

2.2 Negative Example Selection Based on Models

The method in Section 2.1 is quite simple, and the negative samples it selects may not be the ones that best determine the direction and magnitude of the model's gradient updates. A straightforward alternative is to use the trained model itself to predict all candidate negative samples, take the ones it predicts incorrectly as high-quality negative samples, and then continue training and optimizing the model. The overall process is as follows:
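
A minimal sketch of this loop; train and predict_scores stand in for your own training and inference code and are not part of any specific library.

import random

def mine_hard_negatives(model, positives, candidate_negatives, rounds=3, top_k=1000):
    # start from a random subset of negatives
    negatives = random.sample(candidate_negatives, min(top_k, len(candidate_negatives)))
    for _ in range(rounds):
        model = train(model, positives, negatives)            # train on the current negatives
        scores = predict_scores(model, candidate_negatives)   # score every candidate negative
        ranked = sorted(zip(candidate_negatives, scores), key=lambda x: -x[1])
        # negatives the model scores highest are the ones it gets wrong -> hard negatives
        negatives = [doc for doc, _ in ranked[:top_k]]
    return model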

This method is logically sound and works well, but its biggest problem is that it is extremely time-consuming: after every few rounds of training, you have to run prediction over all negative samples to find the most valuable ones, which is very expensive with deep models. I see two ways to mitigate this. The first is not to score all negative samples but only a subset of them; this is a compromise. The second is similar in spirit and is called OHEM (Online Hard Example Mining): before each gradient update, select the samples with larger loss values and update on those. In other words, it selects negative samples from within a batch to update the model. OHEM was proposed for object detection; whether it works in NLP still needs to be verified experimentally.
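
A minimal sketch of the OHEM-style selection described above, in PyTorch; the keep ratio is a placeholder.

import torch
import torch.nn.functional as F

def ohem_cross_entropy(logits, labels, keep_ratio=0.5):
    # per-sample losses, no reduction yet
    losses = F.cross_entropy(logits, labels, reduction="none")
    keep = max(1, int(losses.numel() * keep_ratio))
    # keep only the hardest examples (largest losses) in this batch
    hard_losses, _ = losses.topk(keep)
    return hard_losses.mean()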

3 Loss-based Improvements

In addition to selecting high-quality negative samples, you can also improve the loss function so that the model automatically increases the weight of hard samples.

3.1 Focal Loss

Focal Loss improves on the cross-entropy loss function: it reduces the weight of easy-to-classify samples so that the model can focus more on hard-to-classify samples during training. First, look at the cross-entropy loss for binary classification:
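
CE(p, y) = -y \log(p) - (1 - y) \log(1 - p)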

After simplification:
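
Writing p_t = p when y = 1 and p_t = 1 - p when y = 0, this becomes

CE(p_t) = -\log(p_t)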

This formula is classic, so I will not explain it further.

Focal Loss makes a simple modification to this loss function; the specific formula is as follows:
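
FL(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)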

The Focal Loss authors recommend alpha = 0.25 and gamma = 2. Comparing it with the cross-entropy loss, you can see that Focal Loss mainly rescales the loss value so that hard-to-separate samples receive a larger loss; in the end, the magnitude and direction of the model's gradients are mainly determined by the hard samples. The figure below shows the evaluation results reported for Focal Loss, which are satisfactory.
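
A minimal PyTorch sketch of binary Focal Loss with the recommended alpha and gamma; sigmoid outputs and 0/1 targets are assumed.

import torch

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    probs = torch.sigmoid(logits)
    # p_t is the predicted probability of the true class
    p_t = torch.where(targets == 1, probs, 1 - probs)
    alpha_t = torch.where(targets == 1,
                          torch.full_like(probs, alpha),
                          torch.full_like(probs, 1 - alpha))
    # (1 - p_t)^gamma down-weights easy examples, so hard ones dominate the loss
    loss = -alpha_t * (1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-8))
    return loss.mean()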

3.2 Gradient Harmonizing Mechanism (GHM)

Focal Loss can pay too much attention to hard-to-classify samples, and real data sets often contain a lot of noise, which can easily lead to overfitting. Similar to Focal Loss, GHM also rescales the loss, but the rescaling is based on sample counts: if a huge number of samples all produce similarly small gradients, each of them is multiplied by a small coefficient. This coefficient is not tuned by hand; it is determined by the gradient distribution of the samples. For the exact formula please refer to the original paper; it is not reproduced here. So far I have not seen GHM used in NLP, and how effective it would be there still needs to be verified experimentally.
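
Since the paper's formula is not reproduced here, the following is only a rough sketch of the idea rather than the paper's exact method: samples whose gradient norms fall into densely populated bins get down-weighted. Binary labels, sigmoid outputs and the bin count are assumptions.

import torch

def ghm_style_weights(probs, targets, bins=10):
    # for binary cross entropy, the gradient norm is |p - y|
    g = (probs - targets.float()).abs()
    edges = torch.linspace(0, 1, bins + 1)
    weights = torch.zeros_like(g)
    n = g.numel()
    for i in range(bins):
        lo, hi = edges[i], edges[i + 1]
        if i == bins - 1:
            in_bin = (g >= lo) & (g <= hi)
        else:
            in_bin = (g >= lo) & (g < hi)
        count = in_bin.sum().item()
        if count > 0:
            # bins with many samples (high gradient density) get smaller weights
            weights[in_bin] = n / (count * bins)
    return weights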

4 Improvement of training methods

Sections 2 and 3 both explore how to find high-quality samples. You can also think about the problem from a different angle and simply let the model see enough negative examples. If the hardware is strong enough, the model can in principle learn from all the negative examples, which would be quite powerful, almost miraculous. So another option is to change the training method so that the model sees more negative samples. This kind of method is very similar to the list-wise setting in LTR.

4.1 All Other Examples in a Batch Are Regarded as Negative Samples

Take the short text matching task as an example. Suppose a batch consists of the pairs (a1,b1), (a2,b2), (a3,b3), (a4,b4), where each pair is similar and all remaining samples are treated as negatives. For a1, for instance, b2, b3 and b4 are all negative samples, so the model learns from more negatives without increasing the batch_size. The loss can be binary cross-entropy on each pair, or a multi-class loss: the similarity scores of (a1,b1), (a1,b2), (a1,b3), (a1,b4) are used as the logits of a four-way classification whose label is 0, since the similarity of a1 and b1 should be the largest (see the sketch below). If you want to use this method, make sure that the other samples in each batch really can serve as negatives: if a1 and b4 happen to be similar too, the training data is noisy. This type of method has been used in NLP with decent results, and interested readers can try it on their own tasks. SimBERT (a general sentence vector encoder) is trained this way, and MeiTuan's first-place solution in the Microsoft reading comprehension competition also uses a similar loss function. "Let the model learn from more negative samples without increasing the batch_size" is the core of Section 4.
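
A minimal sketch of the in-batch negative loss described above, in PyTorch; the temperature value is a placeholder and the encoder producing the vectors is assumed to exist elsewhere.

import torch
import torch.nn.functional as F

def in_batch_negative_loss(a_vecs, b_vecs, temperature=0.05):
    # a_vecs, b_vecs: [batch, dim] vectors for the two sides of each similar pair
    a_vecs = F.normalize(a_vecs, dim=-1)
    b_vecs = F.normalize(b_vecs, dim=-1)
    # similarity of every a_i against every b_j; the diagonal entries are the positives
    logits = a_vecs @ b_vecs.t() / temperature
    labels = torch.arange(a_vecs.size(0), device=a_vecs.device)
    # multi-class loss: for each a_i, the matching b_i should score highest
    return F.cross_entropy(logits, labels)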

4.2 MOCO

The MOCO method can be regarded as a classic image pre-training method: it is self-supervised and based on representation learning, playing a role similar to BERT in NLP. One of MOCO's major innovations is letting the model see a large number of negative examples at once (on the order of tens of thousands). The problem is the computational cost: every step would require tens of thousands of encoder passes, as if the batch_size in 4.1 were raised to 10,000. To solve this, MOCO maintains a queue of negative examples: instead of recomputing the representations of all negatives at every step, it only replaces part of the queue each step, first in, first out, which greatly reduces computation. However, some of the negative representations in the queue then become stale, so MOCO updates the key encoder's parameters with momentum. The specific process can be seen in the following pseudocode (from the MOCO paper):

# f_q, f_k: encoder networks for query and key
# queue: dictionary as a queue of K keys (CxK)
# m: momentum
# t: temperature

f_k.params = f_q.params # initialize
for x in loader: # load a minibatch x with N samples
    x_q = aug(x) # a randomly augmented version
    x_k = aug(x) # another randomly augmented version
    q = f_q.forward(x_q) # queries: NxC
    k = f_k.forward(x_k) # keys: NxC
    k = k.detach() # no gradient to keys
    # positive logits: Nx1
    l_pos = bmm(q.view(N,1,C), k.view(N,C,1))
    # negative logits: NxK
    l_neg = mm(q.view(N,C), queue.view(C,K))
    # logits: Nx(1+K)
    logits = cat([l_pos, l_neg], dim=1)
    # contrastive loss, Eqn.(1)
    labels = zeros(N) # positives are the 0-th
    loss = CrossEntropyLoss(logits/t, labels)
    # SGD update: query network
    loss.backward()
    update(f_q.params)
    # momentum update: key network
    f_k.params = m*f_k.params+(1-m)*f_q.params
    # update dictionary
    enqueue(queue, k) # enqueue the current minibatch
    dequeue(queue) # dequeue the earliest minibatch

The line logits = cat([l_pos, l_neg], dim=1) concatenates the dot products with the positive sample and the dot products with the negative samples along the last dimension. The line f_k.params = m*f_k.params+(1-m)*f_q.params updates the key encoder's parameters with momentum. If you do not have some PyTorch background this may still be hard to follow, so I suggest reading the paper to deepen your understanding. The following figure shows the performance evaluation of MOCO.

It can be seen that MOCO outperforms other similar methods, and the larger the negative-example queue, the better the performance. I have tried this method on my own tasks with good results; you can try it on yours as well.

5 Summary

There is no free lunch. The value of many algorithms is unclear until they are tested on your own tasks, so I recommend looking at the test results and thinking about the optimization direction based on analysis of real data. The methods described in this article are simple, and I hope you can contribute more. NLP is known as the jewel in the crown of artificial intelligence, yet I have not seen this jewel shine: the methods introduced here come almost entirely from image recognition, which is disappointing. I hope to see more innovation from NLP itself in the future.

Finally, thank you for reading; I hope this article helps you.

The article can be reprinted, but please indicate the source:

6 References

  1. OHEM: Training Region-based Object Detectors with Online Hard Example Mining
  2. S-OHEM: Stratified Online Hard Example Mining for Object Detection
  3. A-Fast-RCNN: Hard positive generation via adversary for object detection
  4. Focal Loss: Focal Loss for Dense Object Detection
  5. GHM: Gradient Harmonized Single-stage Detector
  6. MOCO: Momentum Contrast for Unsupervised Visual Representation Learning
  7. https://zhuanlan.zhihu.com/p/60612064
  8. https://www.cnblogs.com/rookiechenv587/p/11973078.html

