Item Retrieval by Image
Image search (visual search) technology can bring new and different item search experiences to users of e-commerce sites.
Introduction
Last year, we released a feature for the Japanese Mercari app that automatically fills in the item name, category, and brand using image recognition technology. Earlier this year, we also released a similar feature on our US app.
In addition to improving this recognition engine, we are also exploring other features based on image-related technologies, one of which is item retrieval by image.
In general, to retrieve images with a query image, each image is represented as a feature vector; images whose feature vectors are most similar to the query image's feature vector are returned as similar images. For the feature extraction step, hand-crafted descriptors such as SIFT and HOG were used for a long time, but recently deep learning-based features (deep features) are often used for this purpose.
Deep features are extracted from the intermediate layers of a deep neural network (DNN). Since a DNN has many intermediate layers, let's look at the image retrieval results obtained from a few of them.
Image Feature Extraction
For our image recognition feature, we built a model based on Inception-v3, a convolutional neural network (CNN) architecture proposed by Google. We trained the model on images of items listed on Mercari so that it can classify item category and brand. In this experiment, the same model is used for image feature extraction.
Figure 2 shows the structure of the Inception-v3 model. Its characteristic feature is the branching and joining in the network; each part from a branch to a join is called an inception module.
Here we use the following three layers for image feature extraction.
(A) After the first inception module: the 35 x 35 x 256 feature map is converted to a 256-dimensional feature vector with global average pooling.
(B) After the eighth inception module: the 17 x 17 x 768 feature map is converted to a 768-dimensional feature vector with global average pooling.
(C) After the last inception module: the 8 x 8 x 2048 feature map is converted to a 2048-dimensional feature vector with global average pooling.
Global average pooling averages each channel over its spatial dimensions; here it is used simply for dimensionality reduction.
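As a rough illustration, the sketch below shows how such multi-layer features could be extracted with the stock Keras Inception-v3. Our production model is trained on Mercari data, so the ImageNet weights here are a stand-in; the layer names "mixed0", "mixed7", and "mixed10" are the stock Keras names for the outputs after the first, eighth, and last inception modules.

```python
# Sketch: deep feature extraction from intermediate Inception-v3 layers.
# Uses the stock Keras model for illustration; our production model is
# trained on Mercari item images, so weights here are a stand-in.
import numpy as np
import tensorflow as tf

base = tf.keras.applications.InceptionV3(weights="imagenet", include_top=False)

# "mixed0", "mixed7", "mixed10" = outputs after the first, eighth, and last
# inception modules, i.e. layers A, B, and C above.
extractor = tf.keras.Model(
    inputs=base.input,
    outputs=[base.get_layer(n).output for n in ("mixed0", "mixed7", "mixed10")],
)

def extract_features(images):
    """images: (n, 299, 299, 3) RGB batch with values in [0, 255]."""
    x = tf.keras.applications.inception_v3.preprocess_input(
        images.astype("float32"))
    maps = extractor(x)  # (n,35,35,256), (n,17,17,768), (n,8,8,2048)
    # Global average pooling: average each channel over its spatial grid,
    # yielding 256-D, 768-D, and 2048-D vectors respectively.
    return [tf.reduce_mean(m, axis=[1, 2]).numpy() for m in maps]
```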
In general, layers close to the input represent low-level visual features, while layers close to the output represent semantic features. Since our model was trained to classify categories and brands, feature vectors from layers close to the output encode category and brand information rather than visual similarity.
Dataset
We use 1 million images randomly sampled from Mercari item images. The three types of image features (from layers A, B, and C) are extracted from all of these images with the Inception-v3 model.
Similar Item Retrieval
We use cosine similarity as the measure of similarity between the query image's feature vector and the 1 million item feature vectors. The top 10 similar items for layers A, B, and C are listed below.
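As a minimal sketch of this retrieval step (plain NumPy brute force; the post does not describe the actual serving implementation):

```python
# Minimal sketch: cosine similarity between one query vector and the stored
# item vectors, returning the indices of the top-10 most similar items.
import numpy as np

def top_k_similar(query, features, k=10):
    """query: (d,) feature vector; features: (n, d) matrix of item vectors."""
    # Cosine similarity is the dot product of L2-normalized vectors.
    q = query / np.linalg.norm(query)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sims = f @ q                  # (n,) similarity scores in [-1, 1]
    return np.argsort(-sims)[:k]  # indices of the k most similar items
```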
Example 1
The query image is a red knit top.
With feature vectors from layer A, the retrieved items look visually similar, but they are often not knit tops; skirts and T-shirts are also returned, for instance. With layers B and C, mostly knit tops are retrieved, and with layer C the results are exclusively knit tops rather than merely visually similar items.
Example 2
The query image is a striped knit top.
The layer C result contains what looks like a plain white knit top, but it is in fact also a striped knit top, so it is not a mistake. With layer A, striped items are retrieved, but some of them are not knit tops.
Example 3
The query image is a blue down vest.
With layer A feature vectors, down jackets are retrieved, but most of them are not down vests. Layers B and C correctly return down vests. However, the layer C results show diminished sensitivity to color.
Example 4
The query image is a striped Nike knit beanie.
With layer A feature vectors, a few knit beanies are retrieved, but the results also include items that merely have a stripe pattern, so it seems difficult to use layer A features directly. With layers B and C, only knit beanies are retrieved. Interestingly, the layer C results contain more Nike products.
Conclusion
We experimented with item retrieval by image, based on a deep neural network model which has been trained to classify item category and brand. Three types of image feature vectors, extracted from different layers in the model, were used.
Each feature type has different characteristics, as can be seen in the retrieval results. Features from layers close to the input represent image texture patterns rather than item category. On the other hand, features from layers close to the output are more sensitive to item category and brand than to visual appearance.
Feature vectors from different layers can also be combined to produce yet other kinds of retrieval results, as in the sketch below.
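One simple combination scheme (an assumption for illustration, not necessarily what we use) is to L2-normalize each layer's vector, weight it, and concatenate:

```python
# Hypothetical combination of per-layer features: normalize, weight, concat.
import numpy as np

def combine_features(vectors, weights):
    """vectors: e.g. [feat_a, feat_c]; weights: relative importance of each."""
    parts = [w * v / np.linalg.norm(v) for v, w in zip(vectors, weights)]
    return np.concatenate(parts)

# e.g. emphasize texture (layer A) while keeping some category signal (layer C):
# combined = combine_features([feat_a, feat_c], [1.0, 0.5])
```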
To put item retrieval by image into production, there are further issues beyond accuracy, such as computation time and real-time indexing. These are challenges we need to tackle one by one.
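To make the computation-time challenge concrete: a brute-force scan over 1 million 2048-dimensional vectors is expensive per query, and a common remedy is a dedicated similarity-search index. The sketch below uses Faiss purely as an example library; this is an assumption for illustration, not a description of Mercari's actual infrastructure.

```python
# Hedged sketch: indexing 1M feature vectors for fast cosine-similarity search.
# Faiss is an example choice, not a statement about Mercari's stack.
import faiss
import numpy as np

d = 2048                                    # e.g. layer C dimensionality
features = np.random.rand(1_000_000, d).astype("float32")  # placeholder data
faiss.normalize_L2(features)   # after L2 normalization, inner product
                               # equals cosine similarity
index = faiss.IndexFlatIP(d)   # exact search; approximate indexes such as
index.add(features)            # faiss.IndexIVFFlat trade accuracy for speed

query = np.random.rand(1, d).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 10)   # top-10 most similar item indices
```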
We are actively hiring machine learning and natural language processing engineers, so if you are interested in working at Mercari, contact us!