Image Representation Technique for Visual Search on a C2C marketplace

A key challenge for visual search (image search) on a C2C marketplace is handling the visual gap between query images and listing images. Our solution closes this gap and returns better search results.

Takuma Yamaguchi (Kumon)
Making Mercari
Nov 11, 2019


Figure 1: Visual Search Feature at Mercari Japan

Introduction

The explosive increase of online photos, driven by social networking and e-commerce sites, has focused researchers’ attention on visual search, also called content-based image retrieval [1–3]. Many newly posted photos are listed on consumer-to-consumer (C2C) e-commerce sites, where most sellers are not professional photographers or retailers; therefore, buyers are often stymied by the limited quality and quantity of item information and keywords. Moreover, buyers might not even know the correct keywords to use to find their desired items.

In such a situation, image-based item search may improve the user experience. Image features extracted with a deep convolutional neural network (CNN) [4], combined with approximate nearest neighbor (ANN) search [5], can be used to build a simple visual search system.

However, even though these simple systems can retrieve visually similar images, their results can be suboptimal. Search algorithms on C2C e-commerce sites tend to surface items listed by professional sellers even when more relevant items are listed by nonprofessional sellers, because query images are often more visually similar to photos taken by professionals than to those provided by nonprofessionals, especially in apparel categories.

Figure 2: Query image and its visual search results among 100 million items.

Specifically, fitted apparel images (Fig. 2b) are likely to be retrieved in response to a fitted apparel query image (Fig. 2a). Professional and nonprofessional sellers tend to upload fitted and flat apparel images, respectively. Searches that return many items listed by professional sellers can cause problems for C2C e-commerce sites, for example, by hurting buyer experience and discouraging nonprofessional sellers from listing items [6].

We call apparel “fitted” if it is pictured being worn by a model and “flat” if it is instead laid flat on a surface; “professional sellers” denotes undesirable sellers who run full-time businesses and/or have the ability to bulk-list items on the site.

To manage these issues so as to retrieve more flat apparel items, we developed an image representation technique that closes the visual gap between fitted apparel query images and flat apparel images in a database. The technique consists of extracting features using a lightweight deep CNN and transforming query features; it enables the retrieval of flat apparel images (Fig. 2c) from a fitted apparel query image (Fig. 2a).

Moreover, the feature transformation step can be applied to any query vector because it causes no significant side effects for flat apparel or non-apparel query vectors. Thus, we do not need to know in advance whether the query image contains fitted apparel before applying the transformation.

Visual Search Architecture

Figure 3: Visual Search Architecture

Our visual search architecture simply consists of image feature extraction, query feature transformation, and a nearest neighbor vector search (Fig. 3). For C2C e-commerce sites specifically, this feature transformation closes the distance between a fitted apparel query vector and flat apparel database vectors. An approximate nearest neighbor (ANN) algorithm accomplishes the nearest neighbor search in a large database within a practical runtime. In our experiments, we used IVFADC [5] to retrieve visually similar images from among 100 million images.
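
As a concrete illustration, the sketch below builds an IVFADC index with the Faiss library and runs a nearest-100 search. The index parameters (nlist, m, nbits, nprobe) and the random placeholder vectors are illustrative assumptions only, not the production configuration.

```python
# Minimal sketch of the ANN search stage using Faiss's IVFADC index (IndexIVFPQ).
# Parameters and data below are placeholders, not the production setup.
import numpy as np
import faiss

d = 1792        # feature dimensionality (MobileNetV2, width multiplier 1.4)
nlist = 256     # number of coarse (IVF) clusters -- illustrative value
m = 64          # number of product-quantizer sub-vectors (must divide d)
nbits = 8       # bits per sub-vector code

quantizer = faiss.IndexFlatL2(d)                      # coarse quantizer
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

db_vectors = np.random.rand(20_000, d).astype("float32")  # stand-in for listing features
index.train(db_vectors)   # learn IVF centroids and PQ codebooks
index.add(db_vectors)     # encode and store database vectors

index.nprobe = 16                                     # clusters visited per query
query = np.random.rand(1, d).astype("float32")        # transformed query feature
distances, ids = index.search(query, 100)             # nearest 100 listings
```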

Feature Extraction

For feature extraction, we adopted MobileNetV2 [7], a lightweight CNN model. Such a lightweight extraction model runs efficiently even on an edge device and consumes only a few megabytes of memory.

We prepared a dataset consisting of images and their metadata collected from an online C2C marketplace with over one billion listings. The dataset has 9 million images belonging to 14,000 classes, which are combinations of item brands, textures, and categories; e.g. Nike striped men’s golf polo. Images from non-apparel categories (such as laptops, bikes, and toys) are included in the dataset.

The model, with a width multiplier [7] of 1.4, was trained on this dataset as a classification problem. The output of the global average pooling layer was used as the image feature vector, which has 1,792 (1,280 × 1.4) dimensions. The feature vectors of the query and database images were then extracted using the same feature extractor.
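
For illustration, a minimal extractor along these lines can be built with the Keras MobileNetV2 application. Note that the ImageNet weights and 224×224 input size here are stand-ins; the model described above was trained on the marketplace's own 14,000-class dataset.

```python
# Sketch of the feature-extraction step with Keras' MobileNetV2
# (width multiplier alpha=1.4, global-average-pooled output, 1,792 dims).
# ImageNet weights are used here only as a stand-in for the in-house model.
import numpy as np
import tensorflow as tf

extractor = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3),
    alpha=1.4,              # width multiplier
    include_top=False,      # drop the classification head
    pooling="avg",          # global average pooling -> 1,792-dim vector
    weights="imagenet",
)

def extract_feature(image_path: str) -> np.ndarray:
    img = tf.keras.utils.load_img(image_path, target_size=(224, 224))
    x = tf.keras.utils.img_to_array(img)[np.newaxis, ...]
    x = tf.keras.applications.mobilenet_v2.preprocess_input(x)
    return extractor.predict(x, verbose=0)[0]   # shape: (1792,)
```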

Feature Transformation

Only the query feature vectors were calibrated, using a feature transformation vector that intuitively represents the “human” component of a fitted apparel image, to close the gap between fitted apparel query feature vectors and flat apparel database feature vectors. The feature transformation vector was trained with Algorithm 1 on 80,040 images belonging to 15 apparel categories, such as tops, jackets, pants, and hats.

In the training step, a gap vector, representing the difference between fitted and flat apparel feature vectors, was calculated for each category, and the feature transformation vector was computed by averaging these gap vectors. For a query, the transformation simply subtracts the feature transformation vector from the query image feature vector (Algorithm 2); its computation time is negligible.

The feature vector extracted from MobileNetV2 contains no negative elements, owing to the rectified linear unit (ReLU) activation function. Negative elements arising in this feature space can therefore be treated as unnecessary; they are replaced with zero in Algorithms 1 and 2, a step that is key to preventing side effects in query feature transformation. Even if the feature transformation, designed for fitted apparel query vectors, is applied to a flat apparel or non-apparel query vector, the essential features are preserved because the negative elements are removed.
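
The full procedures are shown in the algorithm figures; a minimal sketch based on the description above might look as follows. The exact point at which negative elements are zeroed in Algorithm 1 is an assumption on our part.

```python
# Sketch of the feature transformation described above (not the verbatim algorithms).
import numpy as np

def train_transformation_vector(fitted_feats: dict, flat_feats: dict) -> np.ndarray:
    """Algorithm 1 (sketch): for each apparel category, compute the gap between the
    mean fitted-apparel feature and the mean flat-apparel feature, then average the
    gaps. Inputs map category name -> array of shape (n_images, d)."""
    gaps = []
    for category in fitted_feats:
        gap = fitted_feats[category].mean(axis=0) - flat_feats[category].mean(axis=0)
        gaps.append(np.maximum(gap, 0.0))   # drop negative elements (assumed placement)
    return np.mean(gaps, axis=0)            # feature transformation vector

def transform_query(query_feat: np.ndarray, t: np.ndarray) -> np.ndarray:
    """Algorithm 2 (sketch): subtract the transformation vector and replace negative
    elements with zero, preserving the query's essential (non-negative) features."""
    return np.maximum(query_feat - t, 0.0)
```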

Figure 4: The visual representation of Algorithm 1, where f() represents the feature vector extractor.

Experiments

Table 1: Visual Search Results for Apparel Categories (mAP@100)

We conducted experiments to evaluate the proposed method. We collected 20,000 images from a C2C marketplace: half of them showed flat apparel and the rest showed fitted apparel. The flat apparel images belong to the ten categories shown in the first column of Table 1. Fitted apparel images not belonging to these ten categories were also included. For fitted apparel queries, cropped images of the query objects were prepared manually from the original images to reduce the influence of the background. The mean average precision at 100 (mAP@100) was used as the evaluation measure for each category.
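
For reference, mAP@100 can be computed roughly as sketched below. The normalization by min(|relevant|, 100) is one common convention and may differ from the exact definition used in our evaluation.

```python
# Rough sketch of mAP@100 over a set of queries (one convention among several).
import numpy as np

def average_precision_at_k(relevant: set, retrieved_ids: list, k: int = 100) -> float:
    """relevant: ids relevant to the query; retrieved_ids: ranked retrieval results."""
    hits, precision_sum = 0, 0.0
    for rank, item_id in enumerate(retrieved_ids[:k], start=1):
        if item_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k) if relevant else 0.0

def mean_average_precision_at_k(queries, k: int = 100) -> float:
    """queries: iterable of (relevant_set, retrieved_ids) pairs for one category."""
    return float(np.mean([average_precision_at_k(r, ids, k) for r, ids in queries]))
```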

For the baseline image representation, a vector from the global average pooling layer of MobileNetV2 was used for both query and database images. The results demonstrate a significant improvement for fitted apparel queries in every category. Although query feature transformation was designed to close the gap between fitted and flat apparel vectors, it also positively influenced flat apparel queries. These results imply that our proposed method extracts more essential features from query images.

Figure 5: The first column shows query images and the next seven columns show the results with the baseline image representation. The remaining columns show the results obtained using query feature transformation.

We also collected 100 million images belonging to over 1,000 item categories, including non-apparel images. Fig. 5 presents the visual search results from the 100 million images for fitted apparel and non-apparel queries. To demonstrate the versatility of the proposed method, the fitted apparel queries also contain images from a different dataset, ATR [8], which was originally used for a human parsing task.

For fitted apparel queries, our proposed method retrieved a greater number of visually similar flat apparel items. In addition, no serious negative impact was observed for non-apparel queries. The runtimes of the image feature extraction method and the nearest 100 vector search were approximately 40 and 70 ms, respectively, using an 8-core 2.3 GHz CPU.

Conclusion

We proposed an image representation technique for visual search on C2C e-commerce sites. The proposed method, comprising deep CNN-based feature extraction and query feature transformation, significantly improves on conventional visual search methods when matching fitted apparel images to flat apparel images. Additionally, the proposed method did not seriously degrade results for either flat apparel or non-apparel queries. The performance and total runtime of our visual search system were practical in the experiments described, indicating that it can be deployed successfully to a major online C2C marketplace.

REFERENCES
[1] F. Yang, et al. Visual Search at eBay. ACM SIGKDD, 2017.
[2] A. Zhai, et al. Visual Discovery at Pinterest. WWW Companion, 2017.
[3] Y. Zhang, et al. Visual Search at Alibaba. ACM SIGKDD, 2018.
[4] A. Babenko, et al. Aggregating Local Deep Features for Image Retrieval. ICCV, 2015.
[5] H. Jégou, et al. Product Quantization for Nearest Neighbor Search. TPAMI, 33(1), 2011.
[6] A. Hagiu, et al. Network Effects Aren’t Enough. Harvard Business Review, 94(4), 2016.
[7] M. Sandler, et al. MobileNetV2: Inverted Residuals and Linear Bottlenecks. CVPR, 2018.
[8] X. Liang, et al. Deep Human Parsing with Active Template Regression. TPAMI, 37(12), 2015.
