Learning To Differentiate using Deep Metric Learning
Recently, computer vision algorithms built on Convolutional Neural Networks (CNNs) have contributed greatly to very efficient visual search workflows. As data volumes have grown, object recognition models have become able to recognize objects and generalize image features at scale. However, in a challenging classification setting where the number of classes is huge, there are several constraints one needs to address to design effective visual search workflows.
- An increasing number of object categories also increases the number of weights in the final classification layer of the CNN. The resulting growth in model size makes such models hard to deploy on device.
- When there are only a few images per class, it is difficult to achieve good convergence, which makes it hard to perform well under variations in illumination, object scale, background, occlusion, etc.
- In practice, visual search workflows often need to adapt to a product ecosystem that is non-stationary, changing with seasonal trends or geographic locations. Such circumstances make it tricky to train/finetune the model in a recurring (training after certain time intervals) or online (training on realtime data) fashion.
Deep Metric Learning
To alleviate these issues, deep learning and metric learning collectively form the concept of Deep Metric Learning (DML), also known as Distance Metric Learning. The idea is to train a CNN-based nonlinear feature extraction module (or encoder) that maps semantically similar images onto nearby locations in an embedding space while pushing dissimilar image features apart, using an appropriate distance metric, e.g. Euclidean or cosine distance. Together with discriminative classification algorithms such as k-nearest neighbors, support vector machines, and Naïve Bayes, we can then perform object recognition on the extracted image features (also called embeddings) without being constrained by the number of classes. Note that the discriminative power of such a trained CNN yields features with both compact intra-class variations and separable inter-class differences. These features also generalize well enough to distinguish new, unseen classes. In the following sections, we formalize the procedure to train and evaluate a CNN for DML using a pair-based training paradigm.
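To make the recognition step concrete, here is a minimal sketch of classifying images by a k-nearest-neighbor search over embeddings produced by such an encoder. The embeddings, labels, and dimensions below are random placeholders (in practice they come from the trained CNN), and scikit-learn's KNeighborsClassifier is used purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Random placeholders standing in for embeddings f(x) from a trained encoder;
# the dimensions (128-d features, 200 classes) are illustrative only.
rng = np.random.default_rng(0)
gallery_embeddings = rng.normal(size=(1000, 128))   # features of catalog/reference images
gallery_labels = rng.integers(0, 200, size=1000)    # their known class labels
query_embeddings = rng.normal(size=(10, 128))       # features of new images to recognize

# KNN in the embedding space: adding product classes only grows the gallery we
# search against, not the encoder or any classification layer.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(gallery_embeddings, gallery_labels)
predicted_labels = knn.predict(query_embeddings)
```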
Formalism
Let X = {(xᵢ, yᵢ)}, i ∈ [1, 2,… n] be a dataset of n images, where (xᵢ, yᵢ) denotes the iᵗʰ image and its corresponding class label. The total number of classes present in the dataset is C, i.e., yᵢ ∈ [1, 2,… C]. Let f(xᵢ) be the feature vector (or embedding) corresponding to an image xᵢ ∈ Rᴰ, where f: Rᴰ→Rᵈ is a differentiable deep network with parameters θ. Here, D and d refer to the original image dimensions and the feature dimensions respectively. Formally, we define the Euclidean distance between two image features as Dᵢⱼ = ||f(xᵢ) − f(xⱼ)||₂, that is, the distance between the deep features f(xᵢ) and f(xⱼ) corresponding to the images xᵢ and xⱼ respectively. Note that although we are concerned with the Euclidean distance here, the literature uses several other metrics to optimize the embedding space. We will discuss these in future posts.
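As a small illustration of this definition, a NumPy sketch that computes the full matrix of distances Dᵢⱼ for a batch of embeddings might look as follows (the function name and the batched formulation are my own).

```python
import numpy as np

def pairwise_euclidean(features):
    """Compute D_ij = ||f(x_i) - f(x_j)||_2 for an (n, d) array of embeddings."""
    diff = features[:, None, :] - features[None, :, :]   # (n, n, d) broadcasted differences
    return np.sqrt((diff ** 2).sum(axis=-1))             # (n, n) distance matrix
```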
To learn the metric function f, the majority of DML algorithms impose relative or absolute similarity constraints using pair- or triplet-based approaches, as suggested in Fig. 2 and Fig. 3. A triplet of images can be defined as (f(xᵢ), f(xⱼ), f(xₖ)), where f(xᵢ), f(xⱼ), and f(xₖ) are the feature vectors of an anchor xᵢ, a positive xⱼ, and a negative image xₖ respectively. xᵢ and xⱼ share the same class label, whereas xₖ has a class label different from that of the anchor and positive images. A pair of image features corresponding to an image pair (xᵢ, xⱼ) is defined as (f(xᵢ), f(xⱼ)). It is referred to as a positive pair if both images share the same label, and a negative pair otherwise.
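For concreteness, a small hypothetical helper that enumerates such anchor/positive/negative triplets from the labels of a mini-batch could look like this; pair-based losses use the same label comparison to mark positive and negative pairs.

```python
import numpy as np

def build_triplets(labels, rng):
    """Enumerate (anchor, positive, negative) index triplets within a mini-batch.

    labels: (m,) integer class labels of the mini-batch images.
    For every anchor/positive pair sharing a label, one negative with a
    different label is sampled at random.
    """
    labels = np.asarray(labels)
    triplets = []
    for a in range(len(labels)):
        positives = np.where((labels == labels[a]) & (np.arange(len(labels)) != a))[0]
        negatives = np.where(labels != labels[a])[0]
        if len(negatives) == 0:
            continue
        for p in positives:
            triplets.append((a, int(p), int(rng.choice(negatives))))
    return triplets

# Example: build_triplets([0, 0, 1, 1, 2], np.random.default_rng(0))
```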
The whole procedure of training the end-to-end DML model can be summarized as shown in Fig. 4. Initially, to get an idea of cluster inhomogeneity, a batch of images is sampled. Each batch contains P object classes with Q images per class. We use this batch to form one or more mini-batches using a sampling strategy discussed below. These mini-batches are used to compute the loss and perform training via backpropagation. Let's summarize the training procedure to train a deep learning model using DML loss functions. Later, we will discuss two important components of this framework: sampling and the loss function.
Training Procedure
1. Batch sampling: Sample a batch of size B containing P object classes with Q images per class.
2. Inputs: An embedding function f (an ImageNet-pretrained CNN), learning rate b, batch size B, and number of image classes P, where the total number of images in a batch is B = PQ.
3. Feature Extraction: Given the parameter state θₜ, feed forward all batch images through the CNN to obtain image embeddings f(xᵢ).
4. Sampling: Form mini-batches from the batch. Depending on the batch size, one can form one or more mini-batches of feature vectors corresponding to the images sampled in step 1.
5. Loss Computation and Training: For each mini-batch, compute the loss and its gradients and backpropagate to update the parameter state from θₜ to θₜ₊₁ (a minimal sketch of this loop follows below).
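The following is a compressed sketch of steps 1 and 3–5 under the assumption of a PyTorch encoder; `encoder`, `dml_loss`, and the two helper names are placeholders for illustration, not a specific library API. Any pair-based loss, such as the lifted structure loss discussed in the next section, can be plugged in as `dml_loss`.

```python
import numpy as np
import torch
import torch.nn.functional as F

def sample_pq_batch(labels, p, q, rng):
    """Step 1: sample indices for a batch of P classes with Q images per class."""
    chosen = rng.choice(np.unique(labels), size=p, replace=False)
    idx = []
    for c in chosen:
        candidates = np.where(labels == c)[0]
        idx.extend(rng.choice(candidates, size=q, replace=len(candidates) < q))
    return np.array(idx)

def train_step(encoder, dml_loss, optimizer, images, labels):
    """Steps 3-5: extract features, compute a DML loss, update theta_t -> theta_{t+1}."""
    optimizer.zero_grad()
    embeddings = F.normalize(encoder(images), dim=1)   # f(x_i), L2-normalized
    loss = dml_loss(embeddings, labels)                # any pair-based DML loss
    loss.backward()
    optimizer.step()
    return loss.item()
```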
Metric Learning Loss Function
When we aim to recognize objects using a convolutional neural network, the Softmax Cross-Entropy (CE) loss function is the most common choice. However, when plugging this loss function in to learn a DML model, there are a few considerations one must take into account.
- Softmax Cross-Entropy (CE) loss can be seen as a soft version of the max operator. Scaling a logit vector by a constant factor s does not affect the class assignment of a given image. As a result, well-separated features tend to have larger magnitudes, which promotes class separability and causes the feature distribution to be ‘radial’, as shown in the figure below. However, to classify features with discriminative feature learning algorithms such as KNN classifiers, it is crucial to have an embedding space that is not only separable but also discriminative. Metric learning loss functions are designed to learn such a discriminative feature space.
- CE loss does not directly leverage the structured relations among the samples in a mini-batch, as each (randomly chosen) image contributes to the loss individually. A training paradigm that distinguishes a pair or a triplet of images, by introducing a penalty that addresses the semantic differences among them, effectively exploits the relationships among images.
Acknowledging these aspects, the research community has proposed a variety of loss functions to learn a discriminative feature space using DML. The lifted structure loss is one of them.
The lifted structure loss makes full use of the mini-batch and improves mini-batch stochastic gradient descent training. Whereas triplet loss and contrastive loss use a triplet or a pair of images respectively to compute each loss term, the lifted structure loss employs the full pairwise distance matrix (O(m²)) by lifting all the image pairs available in a mini-batch of m images. Moreover, in contrast to triplet or contrastive loss, where a negative sample is defined with respect to an anchor image only, the lifted structure loss uses both images in a given positive pair, the anchor and the positive, to find their negative examples within the mini-batch. Please refer to the blog post above for additional information about a few of the loss functions used in DML. The equation of the lifted structure loss is as follows.
Here, (i, j) denotes a positive image pair corresponding to images (xᵢ, xⱼ) that share the same label. Dᵢⱼ is the distance between the pair of images. Dᵢₖ and Dⱼₗ are the distances from the anchor and the positive to the negative images in the mini-batch. α is the distance margin. P and N are the sets of all positive and negative pairs available in a mini-batch respectively.
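Written out (following Oh Song et al., 2016, cited below), the per-pair term is Jᵢⱼ = log( Σ₍ᵢ,ₖ₎∈N exp(α − Dᵢₖ) + Σ₍ⱼ,ₗ₎∈N exp(α − Dⱼₗ) ) + Dᵢⱼ, and the mini-batch loss is J = (1/(2|P|)) Σ₍ᵢ,ⱼ₎∈P max(0, Jᵢⱼ)². A direct, unoptimized PyTorch sketch of this formula is given below; it is written for clarity rather than as a reference implementation, and vectorized versions are used in practice.

```python
import torch

def lifted_structure_loss(embeddings, labels, margin=1.0):
    """Naive sketch of the lifted structure loss; `margin` is the alpha above."""
    dist = torch.cdist(embeddings, embeddings, p=2)     # pairwise D_ij
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # True where labels match

    loss, num_pos = embeddings.new_zeros(()), 0
    m = len(labels)
    for i in range(m):
        for j in range(i + 1, m):
            if not same[i, j]:
                continue                                # only positive pairs (i, j)
            neg_i = dist[i][~same[i]]                   # D_ik: negatives of the anchor
            neg_j = dist[j][~same[j]]                   # D_jl: negatives of the positive
            j_ij = torch.log(torch.exp(margin - neg_i).sum()
                             + torch.exp(margin - neg_j).sum()) + dist[i, j]
            loss = loss + torch.clamp(j_ij, min=0) ** 2
            num_pos += 1
    return loss / (2 * max(num_pos, 1))
```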
Sampling
Admittedly, directly acting on the distances between pairs of features intuitively leads us towards the goal of learning meaningful image embeddings. For this reason, the standard cross-entropy loss has been largely overlooked by the DML community.
In a DML training mechanism where we use absolute or relative similarity with pairs or triplets of images respectively, it is imperative to sample image batches meaningfully while feeding images to the CNN. For a dataset where the mini-batch size used to feed forward the images is considerably larger than the total number of classes, randomly sampled images will almost certainly have other images with the same class label in the mini-batch. But when we consider a dataset with a large number of image classes, for instance the Stanford Online Products dataset, where the number of image classes is nearly 22,000, a randomly sampled mini-batch of images does not necessarily contain image pairs that share the same label. In this case, although every batch captures inter-class variations (as there exist images with different class labels), it fails to address intra-class variations (as there may not be two images with the same class label), eventually failing to achieve good convergence.
Moreover, although training with DML requires sampling image pairs or triplets, such sampling grows the effective dataset size to roughly O(m²) or O(m³) respectively. Additionally, if image pairs or triplets are sampled randomly, the majority of them contribute only marginally as training advances, because not all of them violate the margin α (for instance, in the case of triplet loss). This makes it difficult to compute a meaningful loss and inevitably results in slow convergence.
To overcome these issues, there are various sampling strategies we can use for faster and better convergence of the training parameters.
Hard negative data mining strategies are commonplace in distance-based metric learning algorithms. They involve finding hard negative or positive feature instances to form negative or positive pairs for a given anchor example. However, when a large number of classes are involved in a given batch, this procedure is computationally challenging. In such a scenario, it is convenient to perform negative “class” mining instead of negative “instance” mining. Follow the procedure below to perform negative “class” mining for a pair-based loss function.
- To have a meaningful representation of the embedding space for the given parameter state θₜ, sample a large batch of images containing a few hundred classes (that is, P in the training procedure above). For any pair-based loss function, it is imperative to have at least two example images for every class in the batch. For instance, in the case of the Stanford Online Products dataset, it is imperative to sample at least 2 images (Q = 2) per class at random.
- For the given parameter state θₜ of the CNN, extract feature vectors for each image sampled in step 1. Obtain class representation vectors, or class proxies (mean embedding vectors), from these image features.
- For every randomly chosen class from the P classes sampled in step 1, sample the nearest class and re-rank the corresponding images as shown in Fig. 5 above (see the sketch after this list). This step can also be performed using margin-based class selection, where only a class that violates a distance margin is chosen as the nearest class for a given anchor class.
- Depending on the computational capacity, we can form one or more mini-batches of feature vectors as suggested in Fig. 5 to compute the loss and gradients.
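A minimal NumPy sketch of steps 2 and 3 of this procedure, assuming the batch features have already been extracted (the helper name is hypothetical), is shown below. Images of an anchor class and its returned nearest class can then be packed into the same mini-batch, as in Fig. 5.

```python
import numpy as np

def nearest_negative_classes(embeddings, labels):
    """Find, for each class in the batch, its closest other class via class proxies.

    embeddings: (B, d) feature vectors extracted for the sampled batch
    labels:     (B,)   class labels of those images
    Returns a dict mapping each class to its nearest negative class.
    """
    classes = np.unique(labels)
    # Class proxies: the mean embedding of each class (step 2).
    proxies = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    # Distances between proxies; a class is never its own negative.
    d = np.linalg.norm(proxies[:, None] - proxies[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return {int(c): int(classes[np.argmin(d[i])]) for i, c in enumerate(classes)}
```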
Evaluation and Inference
In contrast to a traditional object recognition model, where images are fed to generate object class probabilities, in DML images are fed to extract image features. These image features are assessed for their clustering quality and retrieval performance. The F1 and Normalized Mutual Information (NMI) scores are the standard evaluation metrics we use to estimate clustering quality. For retrieval, Recall at K is the benchmark evaluation measure used in DML training. Here we summarize these evaluation metrics for the completeness of this article.
- Recall at K: For each query image (from the test dataset), we retrieve its K nearest neighbors using an appropriate distance metric (Euclidean or cosine) from the same test set. A query image receives a score of 1 if there exists an image from the same class among the K nearest neighbors retrieved. Recall at K averages these scores over all the images in the test dataset (see the sketch after this list).
- F1: The F1 score is defined as the harmonic mean of precision and recall, i.e. F1 = 2PR/(P+R).
- Normalized Mutual Information (NMI): For a set of input cluster assignments Ω and ground truth classes ℂ, the NMI score is the ratio of the mutual information to the average entropy of the clusters and the labels. That is, NMI = 2·I(Ω; ℂ) / (H(Ω) + H(ℂ)). Here, Ω = {ω₁, ω₂, … ωₙ} is the set of input clusters and ℂ = {c₁, c₂, … cₙ} are the ground truth classes. The set of examples with cluster assignment i is given as ωᵢ, whereas the set of examples with ground truth class label j is given as cⱼ.
- Accuracy at K: This metric serves our goal of using the trained model as an object recognition model in a nonparametric, nearest-neighbor manner. For a query image, the K nearest neighbors are obtained using an appropriate metric, and the image is assigned to the class that appears the maximum number of times among them. Accuracy at K averages these scores over all query images in the test dataset.
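A compact NumPy sketch of the Recall at K and Accuracy at K computations described above is given below; it assumes integer labels and a test set small enough to hold the full distance matrix in memory, and the function name is my own. F1 and NMI are typically computed with an off-the-shelf clustering library, so they are omitted here.

```python
import numpy as np

def recall_and_accuracy_at_k(embeddings, labels, k=5):
    """Compute Recall@K and Accuracy@K over a test set of extracted features."""
    d = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # a query must not retrieve itself
    knn = np.argsort(d, axis=1)[:, :k]             # indices of the K nearest neighbors
    neighbor_labels = labels[knn]                  # (n, k) labels of those neighbors

    # Recall@K: 1 if any neighbor shares the query's label, then averaged.
    recall_at_k = (neighbor_labels == labels[:, None]).any(axis=1).mean()

    # Accuracy@K: majority vote among the K neighbors, then averaged.
    votes = np.array([np.bincount(row).argmax() for row in neighbor_labels])
    accuracy_at_k = (votes == labels).mean()
    return recall_at_k, accuracy_at_k
```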
Conclusion
We described a deep metric learning paradigm to solve the object recognition problem. Training a model this way offers several advantages:
- It does not increase the model size, as we can always use an embedding layer of the same dimension to train the model.
- Since we are learning to differentiate rather than to recognize objects, such a paradigm can be trained with fewer images per class using various sampling strategies and can recognize even unseen classes at inference time.
- Fine-tuning on a newer set of product classes in a recurring or online fashion is simply a matter of propagating the gradients.
These properties enable us to design visual search workflows that are flexible and scalable for a given product ecosystem. We also described the importance of sampling when training a CNN with a DML loss function. Please follow this article for additional loss functions. With the recent advancement of this field, there are also other training procedures we can practice effectively to achieve the same objective. We will discuss some of them in future posts.
References
- Wang, X., Hua, Y., Kodirov, E., Hu, G., Garnier, R., & Robertson, N. M. (2019). Ranked list loss for deep metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5207–5216).
- Movshovitz-Attias, Y., Toshev, A., Leung, T. K., Ioffe, S., & Singh, S. (2017). No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision (pp. 360–368).
- Ranjan, R., Castillo, C. D., & Chellappa, R. (2017). L2-constrained softmax loss for discriminative face verification. arXiv preprint arXiv:1703.09507.
- Wang, F., Xiang, X., Cheng, J., & Yuille, A. L. (2017, October). Normface: L2 hypersphere embedding for face verification. In Proceedings of the 25th ACM international conference on Multimedia (pp. 1041–1049).
- Wu, C. Y., Manmatha, R., Smola, A. J., & Krahenbuhl, P. (2017). Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2840–2848).
- Wen, Y., Zhang, K., Li, Z., & Qiao, Y. (2016, October). A discriminative feature learning approach for deep face recognition. In European conference on computer vision (pp. 499–515). Springer, Cham.
- Liu, Z., Luo, P., Qiu, S., Wang, X., & Tang, X. (2016). Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1096–1104).
- Oh Song, H., Xiang, Y., Jegelka, S., & Savarese, S. (2016). Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4004–4012).
- Sohn, K. (2016). Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems (pp. 1857–1865).
- Schroff, F., Kalenichenko, D., & Philbin, J. (2015). Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 815–823).
- Bellet, A., Habrard, A., & Sebban, M. (2013). A survey on metric learning for feature vectors and structured data. arXiv:1306.6709.
- Hadsell, R., Chopra, S., & LeCun, Y. (2006, June). Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06) (Vol. 2, pp. 1735–1742). IEEE.