Hybrid Triplet Mining for Siamese Neural Networks

Published in Ixor · Jul 23, 2018

At IxorThink we are currently working on a visual fashion recognition system using deep learning. We are developing this model for a Belgian company that links online and television content to brands and e-shops. At the moment, the clothing of celebrities and actors is tagged manually. Automating this process would ease the work of the people who tag new content.

The problem consists of linking a picture of someone wearing a specific shirt to the matching product in a catalog, using the catalog's product images. This challenge has already been tackled in various papers using siamese networks ([1], [2], [3]), but reaching an acceptable accuracy remains difficult. We believe the creation and selection of triplets (triplet mining) is key to a properly trained model.

Apart from our client's data, we used the Street2Shop dataset to test our model. The implementation was done using Keras and TensorFlow.

Street2Shop is a free-to-use dataset that contains product catalog pictures paired with so-called ‘street images’. The dataset contains 11 broad categories and was created by M. Hadi Kiapour et al. (Where to Buy It: Matching Street Clothing Photos in Online Shops).

Hold on, what’s Triplet Learning?

Before drowning in loss functions and triplet mining strategies, let’s take a look at the basics of triplet learning. To find the matching product for a street image, we want to learn an embedding in N-dimensional space. In this space, images of the same product should lie close to each other, so the matching product can be retrieved by looking for the nearest catalog images.
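To make the retrieval idea concrete, here is a minimal sketch in plain Python. The 2-D embeddings, the product ids and the helper name `retrieve` are made up for illustration; in practice the embeddings would come from the trained network and would have many more dimensions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, catalog, k=1):
    """Return the ids of the k catalog products closest to the query embedding."""
    ranked = sorted(catalog, key=lambda pid: euclidean(query, catalog[pid]))
    return ranked[:k]

# Hypothetical embeddings of three catalog images (assumption, not real data).
catalog = {
    "shirt_a": [0.1, 0.9],
    "shirt_b": [0.8, 0.2],
    "shoe_c": [-0.7, -0.5],
}
# Hypothetical embedding of a cropped street image.
street_embedding = [0.75, 0.25]

print(retrieve(street_embedding, catalog, k=2))
```

The street image lands closest to `shirt_b`, so that product is returned first.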

A triplet network is inspired by a siamese network (but yes, like a siamese triplet), i.e. a network where the same weights are reused to compute the results for various samples. Specifically, a triplet consists of an anchor, a positive sample and a negative sample. In this case, the anchor is a cropped image of someone, while the positive is the catalog image of the same product. The negative sample is an image showing a different product. Each triplet propagates through the network, and a loss function is defined based on the three resulting embeddings. Minimizing this loss function should minimize the distance between the anchor and the positive sample, while maximizing the distance to the negative one.
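As an illustration, one common way to write such a loss is a hinge with a margin. This is a sketch under assumptions: the exact loss we trained with is not spelled out here, and the `margin` value and the toy embeddings below are chosen only for the example.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin-based triplet loss (assumed formulation).

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    d_pos = euclidean(anchor, positive)
    d_neg = euclidean(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = [0.0, 0.0]
positive = [0.3, 0.4]        # distance 0.5 from the anchor
easy_negative = [3.0, 4.0]   # far away: contributes no loss
hard_negative = [0.0, 0.4]   # closer to the anchor than the positive

print(triplet_loss(anchor, positive, easy_negative))
print(triplet_loss(anchor, positive, hard_negative))
```

The easy negative yields a loss of zero, so it produces no gradient, while the hard negative yields a positive loss. This is why the choice of triplets matters so much in the sections that follow.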

Triplet Learning with data from the Street2Shop dataset.

Selecting Triplets

Offline Mining

The easiest and most intuitive method is offline triplet mining. All triplets can simply be created beforehand by taking a street image, a matching product picture and sampling a negative from all other products.

The big plus of this strategy is that it is easy to understand, supervised and straightforward to implement. However, performance is highly dependent on the creation of these triplets: if triplets are too easy, there is nothing to learn, and if they are too difficult, the embedding might collapse into a single point. Moreover, extra processing is required to create difficult triplets (triplets that produce a high loss). These triplets can be selected based on image features like colour histograms or, even better, based on the model itself.
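The simplest variant of offline mining, with purely random negatives, could look like the sketch below. The function name, the dictionary layout and the file names are hypothetical and only serve the example.

```python
import random

def build_offline_triplets(street_to_product, catalog_images, seed=0):
    """Create one (anchor, positive, negative) triplet per street image.

    street_to_product: maps a street image id to its matching product id.
    catalog_images: maps a product id to its catalog image id.
    """
    rng = random.Random(seed)
    product_ids = sorted(catalog_images)
    triplets = []
    for street_image, product_id in street_to_product.items():
        # Random negative: any product other than the matching one.
        candidates = [p for p in product_ids if p != product_id]
        negative_id = rng.choice(candidates)
        triplets.append(
            (street_image, catalog_images[product_id], catalog_images[negative_id])
        )
    return triplets

# Hypothetical toy data.
street_to_product = {"street_1": "p1", "street_2": "p2", "street_3": "p1"}
catalog_images = {"p1": "cat_p1.jpg", "p2": "cat_p2.jpg", "p3": "cat_p3.jpg"}

triplets = build_offline_triplets(street_to_product, catalog_images)
for triplet in triplets:
    print(triplet)
```

Random negatives like these tend to be easy. Replacing `rng.choice` with a selection based on image features or model embeddings is where the extra processing mentioned above comes in.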

Online Mining

When using the offline strategy, we need 3*N images to have N triplets. Maybe we can do better? If we only send positive pairs through the network, we can sample negatives from the pairs of other products. Suppose we have a cropped image of a shoe and its correct product image. This pair can be combined with any other image in the same batch to create a triplet, as long as it is an image of a different product. This way, we can use 2*N images to create up to roughly 2*N² triplets. This is not only faster: because the loss is computed over many more triplets per batch, the gradients used for backpropagation are also of better quality, which improves training overall.
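The counting argument can be checked with a small sketch. It assumes the street image is always the anchor and uses the same margin-based loss as an illustration; the batch layout and the function name are hypothetical. With N pairs, each pair can be combined with the 2*(N-1) images of the other products, giving 2*N*(N-1) triplets, which approaches 2*N² for large batches.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def online_triplet_losses(batch, margin=0.2):
    """Loss of every valid in-batch triplet.

    batch: list of (product_id, street_embedding, catalog_embedding).
    """
    losses = []
    for pid, anchor, positive in batch:
        d_pos = euclidean(anchor, positive)
        for other_pid, other_street, other_catalog in batch:
            if other_pid == pid:
                continue  # same product: not a valid negative
            # Both images of the other product can serve as negatives.
            for negative in (other_street, other_catalog):
                d_neg = euclidean(anchor, negative)
                losses.append(max(0.0, d_pos - d_neg + margin))
    return losses

# Hypothetical batch of N = 3 positive pairs (6 images in total).
batch = [
    ("shoe", [0.0, 0.0], [0.1, 0.0]),
    ("shirt", [1.0, 1.0], [1.0, 0.9]),
    ("dress", [0.2, 0.0], [0.3, 0.1]),
]

losses = online_triplet_losses(batch)
print(len(losses))                    # 3 * 2 * (3 - 1) = 12 triplets
print(sum(1 for l in losses if l > 0))  # triplets that still contribute
```

In a real training loop the same computation would be done on tensors inside the loss function, so that no extra forward passes are needed for the negatives.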

The hybrid approach

When using online triplet mining, results are significantly better. However, after some training all created triplets will be too easy to learn from, and as a result learning will stop. For example, your model might learn to recognise striped T-shirts but not the length of their sleeves. This is simply because the chance is very small that a batch contains two striped T-shirts with different sleeve lengths.

You guessed it, this brings us back to offline triplet creation. When the model's learning process is almost saturated, we should make batches more difficult by adding similar products. This means freezing the model and selecting similar products based on embedding distance in an offline fashion. By combining this with the online loss calculation, a batch will contain all kinds of triplets.
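The offline selection step might look roughly like the sketch below. It is an assumption-laden illustration, not our production code: the embeddings are invented, and `similar_products` and `build_hard_batch` are hypothetical helper names. The idea is to fill a batch with a seed product and its nearest neighbours according to the frozen model, and then let the online loss form the triplets.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def similar_products(embeddings, product_id, k):
    """Ids of the k products whose embeddings lie closest to product_id."""
    others = [p for p in embeddings if p != product_id]
    others.sort(key=lambda p: euclidean(embeddings[product_id], embeddings[p]))
    return others[:k]

def build_hard_batch(embeddings, seed_product, batch_size):
    """Product ids for one batch: a seed product plus its nearest neighbours."""
    neighbours = similar_products(embeddings, seed_product, batch_size - 1)
    return [seed_product] + neighbours

# Hypothetical catalog embeddings computed once with the frozen model.
embeddings = {
    "striped_tee_short": [0.90, 0.10],
    "striped_tee_long": [0.88, 0.14],
    "plain_tee": [0.60, 0.40],
    "sneaker": [-0.80, 0.30],
}

print(build_hard_batch(embeddings, "striped_tee_short", batch_size=3))
```

Here the two striped T-shirts end up in the same batch, so the online loss now sees triplets that differ only in sleeve length.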

Our approach: combining offline and online triplet mining

Conclusions

Triplet learning is a powerful tool to learn image embeddings that can be used as a similarity measure. It also makes it possible to train with a large number of classes, as in this case, where every product is in theory a different class (we want pictures of the same product to lie close together). However, the selection of triplets and the formulation of the loss function have an enormous influence on the properties and performance of a triplet-based model. Mining hard triplets is necessary to make proper training possible, which is why we used a combination of offline and online triplet mining. Our approach ensures the presence of hard triplets inside a batch (offline selection) while keeping training efficient (online selection).

At IxorThink we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable and fully developed solutions. Feel free to contact us for more information.