Large Scale Image Recognition for Online Trading Platforms
OLX Group, a division of Prosus, operates a network of trading platforms in more than 30 countries and over 50 languages. OLX’s platforms make it easy for people to connect to buy, sell or exchange used goods and services. Every month, more than 300 million people use OLX to buy and sell furniture, musical instruments, cars, houses and more.
Sellers are mostly non-professional individuals, but a portion of the user base consists of professional sellers such as car dealers and real-estate agencies. Sellers post listings or ads, typically composed of a title, textual description and images of the item being sold. An example is shown in Figure 1 below.
Buyers search for items by submitting textual queries to the platform, which returns search results. A buyer clicks on the items of interest to see further details and can contact the seller using the messaging facilities provided by the platform.
Increasing conversion rates
We would like to provide a seamless and smooth user experience and to make selling and buying items as effortless as possible. The search engine is crucial in maintaining a great customer experience, as it returns the most relevant listings for a search query. This is essential to drive high click-through rates and conversion (or conversation) rates. Similarly, the number of search queries with zero results should be reduced as much as possible.
To improve our search function, we introduced automatic listing classification, aiming to automatically predict categories of listings so that they become easier to search.
But categorising a high volume of unique ads across thousands of listing categories is a challenge. Unlike on platforms such as Amazon, most users don’t adhere to any listing standard. Plus, aside from the sheer amount of data, the variance across millions, or even billions, of images makes automatic listing classification difficult.
High-level overview of the solution
We took a machine learning approach to tackle this problem, as seen in Figure 2 below.
We start by training a deep neural network to classify images of listings in a supervised learning setup. We use the trained model to predict the categories of listings in the database, which effectively creates a listing index database. By combining this index with signals from users, such as query strings, clicks on listings or messages, we can create a mapping between user queries and listing categories. The visual information can thus be incorporated into the search engine as an auxiliary category filter applied to user queries. The second way to enhance search relevance is to learn embeddings for search queries, listings and listing categories by using listing indexes together with user behaviour. In this blog post, we will focus on image recognition.
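The query-to-category mapping described above can be sketched in a few lines. This is only an illustration under stated assumptions: the click-log format, the `min_share` threshold and the function name `build_query_category_map` are hypothetical and are not taken from our production system.

```python
from collections import Counter, defaultdict


def build_query_category_map(click_log, listing_index, min_share=0.2):
    """Map each query to the predicted categories of the listings clicked for it.

    click_log: iterable of (query, listing_id) pairs (hypothetical log format).
    listing_index: dict of listing_id -> category predicted by the image model.
    min_share: keep a category only if it gets at least this share of the query's clicks.
    """
    counts = defaultdict(Counter)
    for query, listing_id in click_log:
        category = listing_index.get(listing_id)
        if category is not None:  # skip listings that are not in the index
            counts[query.strip().lower()][category] += 1

    mapping = {}
    for query, counter in counts.items():
        total = sum(counter.values())
        mapping[query] = [c for c, n in counter.most_common() if n / total >= min_share]
    return mapping


# Toy data: three indexed listings and four clicks.
listing_index = {1: "Keyboards and Mouse", 2: "Keyboards and Mouse", 3: "Tablets"}
click_log = [("gaming keyboard", 1), ("Gaming Keyboard", 2), ("gaming keyboard", 3), ("ipad", 3)]
query_to_categories = build_query_category_map(click_log, listing_index, min_share=0.5)
```

In this toy data, two of the three clicks for "gaming keyboard" land on listings indexed as "Keyboards and Mouse", so only that category passes the 0.5 threshold and could serve as an auxiliary filter for the query.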
Large Scale Image Recognition
Dataset: Ontology and Distribution
Our labeled image dataset is provided by one of the OLX Group companies, Avito. The dataset contains more than five million images labeled across almost 3000 categories. This dataset is five times bigger than ImageNet and contains almost three times more categories compared to ImageNet. The categories in the dataset are hierarchically organised, with more than five levels of depth. Figure 3 shows the first and a part of the second levels of the hierarchy.
Similar to most other datasets, images are not equally distributed over categories, which in turn presents data imbalance challenges. The distribution of the number of images per category is illustrated in Figure 4.
We adopted the Inception-ResNet-v2 convolutional neural network architecture for the classification problem, as it provided a good trade-off between prediction accuracy and model complexity. The network architecture can be seen in Figure 5. The network marries the ideas of stacking inception blocks and residual connections. Inception blocks can be considered mini networks within the larger network. The basic idea behind them is to combine convolution layers with different kernel sizes and pooling layers in parallel and concatenate their outputs, which lets the network learn the best combination of convolutions and pooling during training. Residual connections were introduced by the ResNet architecture and quickly became a de facto standard in network architectures for many applications. Their essence is to provide identity skip connections from earlier layers to later layers, which effectively combats the vanishing gradient problem and allows much deeper networks to be trained with increasing gains. A simple guide to different versions of Inception network architectures can be found in this blog post.
The first phase of training starts by initialising the weights from a model pre-trained on ImageNet. The fully connected layer needs to be replaced, since the number of categories differs from ImageNet’s. The weights of all layers but the last are then frozen, and the last layer is trained for several epochs. Once the training accuracy is sufficiently high and starts to plateau, we unfreeze all the layers of the network and let the gradients flow backward all the way to the first layer. This approach minimises the risk of destroying pre-trained weights and presumably shortens the training time. Keras with the TensorFlow backend is used as the training environment.
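To make the two ideas concrete, here is a deliberately tiny one-dimensional toy in plain Python. It is not the real Inception-ResNet block: activations, batch normalisation and multiple channels are left out, and all function names are made up for illustration. It only shows parallel branches with different kernel sizes plus pooling, a channel-wise mix standing in for the 1x1 convolution, and the identity skip connection.

```python
def conv1d_same(signal, kernel):
    """1-D convolution with zero padding; the kernel length is assumed to be odd."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [sum(k * padded[i + j] for j, k in enumerate(kernel)) for i in range(len(signal))]


def max_pool_same(signal, size=3):
    """Max pooling with stride 1 that keeps the output as long as the input."""
    pad = size // 2
    padded = [float("-inf")] * pad + list(signal) + [float("-inf")] * pad
    return [max(padded[i:i + size]) for i in range(len(signal))]


def inception_residual_block(signal, kernels, mix_weights):
    """Run parallel branches, mix them channel-wise, and add the input back."""
    # Inception idea: several convolutions and a pooling branch see the same input.
    branches = [conv1d_same(signal, k) for k in kernels]
    branches.append(max_pool_same(signal))
    # "Concatenate" the branches: every position now holds one value per branch.
    stacked = list(zip(*branches))
    # Stand-in for a 1x1 convolution: a weighted sum across branches per position.
    mixed = [sum(w * v for w, v in zip(mix_weights, channels)) for channels in stacked]
    # Residual idea: the identity skip connection adds the input to the block output.
    return [x + m for x, m in zip(signal, mixed)]


signal = [1.0, 2.0, 4.0, 2.0, 1.0]
kernels = [[1.0], [0.25, 0.5, 0.25]]  # kernel sizes 1 and 3
identity_out = inception_residual_block(signal, kernels, mix_weights=[0.0, 0.0, 0.0])
doubled_out = inception_residual_block(signal, kernels, mix_weights=[1.0, 0.0, 0.0])
```

With all mixing weights at zero, the block reduces to the identity, which is what makes very deep stacks of such blocks easy to optimise: each block only has to learn a correction on top of its input.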
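The two training phases could look roughly like the sketch below in Keras. It is a sketch under assumptions rather than our exact training script: the epoch counts, learning rates, optimiser, input size and the names `build_classifier`, `set_phase` and `train_two_phase` are illustrative, the fixed epoch counts stand in for "train until accuracy plateaus", and the labels are assumed to be integer class ids.

```python
def build_classifier(num_classes):
    """Inception-ResNet-v2 backbone with ImageNet weights and a new softmax head."""
    from tensorflow import keras  # lazy import: set_phase below needs no TensorFlow

    base = keras.applications.InceptionResNetV2(
        include_top=False, weights="imagenet", pooling="avg", input_shape=(299, 299, 3)
    )
    # Replace the original 1000-way ImageNet classifier with one sized for our categories.
    outputs = keras.layers.Dense(num_classes, activation="softmax", name="category_head")(base.output)
    return keras.Model(inputs=base.input, outputs=outputs)


def set_phase(model, phase):
    """Phase 1 trains only the last layer; phase 2 unfreezes every layer."""
    if phase not in (1, 2):
        raise ValueError("phase must be 1 or 2")
    for layer in model.layers[:-1]:
        layer.trainable = (phase == 2)
    model.layers[-1].trainable = True


def train_two_phase(model, train_data, val_data, head_epochs=3, full_epochs=10):
    """Train the head first, then fine-tune the whole network with a smaller learning rate."""
    from tensorflow import keras

    # (phase, epochs, learning rate): the values are assumptions, not tuned settings.
    for phase, epochs, learning_rate in ((1, head_epochs, 1e-3), (2, full_epochs, 1e-5)):
        set_phase(model, phase)
        # Recompile after changing `trainable` so that the change takes effect.
        model.compile(
            optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )
        model.fit(train_data, validation_data=val_data, epochs=epochs)
    return model
```

A smaller learning rate in the second phase is a common way to reduce the risk of destroying the pre-trained weights once all layers are trainable.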
We evaluate the classification performance on a test set which is a split of the dataset, and achieve 62.5%, 80.4% and 85.6% for Top-1, Top-3 and Top-5 mean prediction accuracies, respectively.
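For readers unfamiliar with the metric, Top-k accuracy counts a prediction as correct when the true category is among the k highest-scoring categories. A minimal sketch on toy scores (the numbers below are made up and unrelated to our results):

```python
def top_k_accuracy(scores, true_labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, true_labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += label in ranked[:k]
    return hits / len(true_labels)


# Each row holds the model scores for three classes.
scores = [
    [0.7, 0.2, 0.1],  # true class 0: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.6, 0.3, 0.1],  # true class 2: ranked third
    [0.1, 0.1, 0.8],  # true class 2: ranked first
]
true_labels = [0, 1, 2, 2]

top1 = top_k_accuracy(scores, true_labels, k=1)
top2 = top_k_accuracy(scores, true_labels, k=2)
top3 = top_k_accuracy(scores, true_labels, k=3)
```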
As mentioned in the dataset section, the categories are organised in a hierarchical fashion. For example:
- Consumer electronics/Computer products/Keyboards and Mouse/Gaming keyboard
- Consumer electronics/Computer products/Keyboards and Mouse/Gaming Mouse
- Consumer electronics/Computer products/Keyboards and Mouse/Apple Keyboard
- Consumer electronics/Computer products/Keyboards and Mouse/Wireless Keyboard
Considering the fine-grained nature of categories (such as Gaming Keyboard, Wireless Keyboard and Apple Keyboard) and the hierarchical organisation of the categories, it is interesting to assess the classification accuracy when we ‘collapse’ categories to their parents in the higher levels of the hierarchy. For instance, if an image is classified as ‘Consumer electronics/Computer products/Keyboards and Mouse/Gaming keyboard’, we can ‘collapse’ the predicted category to ‘Consumer electronics/Computer products/Keyboards and Mouse’. Accordingly, the classification accuracy is expected to be higher, since predicting the parent category is arguably an easier task. For example, if an image of ‘Gaming Keyboard’ is classified by the model as ‘Wireless Keyboard’, the prediction counts as correct, since the parent category is ‘Keyboards and Mouse’ for both the predicted label and the true label.
Figure 6 below illustrates the confusion matrix if we ‘collapse’ the categories to the highest category level in the hierarchy, i.e. Level1 (L1). Here we report two slightly different accuracy metrics: micro-average and macro-average. Micro-average accuracy simply measures the ratio of correctly classified samples to the number of all samples, so the number of samples per class acts as a weight. Macro-average accuracy is the plain average of the per-class accuracies (the number of samples per category has no effect).
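Collapsing can be expressed as truncating the category path before comparing labels. The sketch below assumes categories are stored as '/'-separated paths like the examples above; the helper names and the 'Consumer electronics/Tablets/iPad' path are hypothetical.

```python
def collapse(category_path, level):
    """Keep only the first `level` components of a '/'-separated category path."""
    return "/".join(category_path.split("/")[:level])


def accuracy_at_level(true_paths, predicted_paths, level):
    """Share of predictions that match the true label after collapsing both to `level`."""
    matches = sum(
        collapse(t, level) == collapse(p, level) for t, p in zip(true_paths, predicted_paths)
    )
    return matches / len(true_paths)


prefix = "Consumer electronics/Computer products/Keyboards and Mouse/"
true_paths = [prefix + "Gaming keyboard", prefix + "Gaming Mouse", prefix + "Apple Keyboard"]
predicted_paths = [
    prefix + "Wireless Keyboard",         # wrong leaf, correct parent
    prefix + "Gaming Mouse",              # fully correct
    "Consumer electronics/Tablets/iPad",  # hypothetical path, correct only at L1
]

acc_l4 = accuracy_at_level(true_paths, predicted_paths, level=4)
acc_l3 = accuracy_at_level(true_paths, predicted_paths, level=3)
acc_l1 = accuracy_at_level(true_paths, predicted_paths, level=1)
```

On this toy data the accuracy rises from one in three at the leaf level to two in three at L3 and to all three at L1, mirroring the expectation that coarser levels are easier to predict.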
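The difference between the two averages shows up as soon as classes are imbalanced. A small sketch with made-up labels (the function name is illustrative):

```python
from collections import defaultdict


def micro_macro_accuracy(true_labels, predicted_labels):
    """Return (micro-average, macro-average) accuracy."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(true_labels, predicted_labels):
        total[t] += 1
        correct[t] += (t == p)
    # Micro: pool all samples, so large classes dominate.
    micro = sum(correct.values()) / sum(total.values())
    # Macro: average the per-class accuracies, so every class counts equally.
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Imbalanced toy data: 8 'Transport' samples and 2 'Animals' samples.
true_labels = ["Transport"] * 8 + ["Animals"] * 2
predicted_labels = ["Transport"] * 8 + ["Transport", "Animals"]
micro, macro = micro_macro_accuracy(true_labels, predicted_labels)
```

Here nine of ten samples are correct, giving a micro-average of 0.9, while the per-class accuracies of 1.0 and 0.5 give a macro-average of 0.75.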
If we ‘collapse’ categories to the second highest level in the hierarchy, we achieve micro-average accuracy of 85% whereas macro-average accuracy is 84%. Collapsing to higher levels in the category hierarchy increases accuracy, as expected. We illustrate the trend of increasing accuracy with increasing aggregation in Figure 7. For instance, if we ‘collapse’ all categories to L3 (third highest level in the hierarchy), the prediction accuracy is 76%. The size of the markers represents the number of categories we end up with after ‘collapsing’.
Here is a look at some correct and incorrect classifications generated by the model after training. Figure 8 below illustrates correctly predicted images from a variety of categories.
Figure 9 below shows some misclassifications. Considering the huge number of categories and the visual variance introduced by non-professional users, it can become quite difficult to distinguish certain categories from each other, such as a GPS device and an iPad. However, it would be incorrect to attribute all errors to the difficulty of the problem. Our model also makes genuine mistakes, such as confusing a massage bed with an Xbox One.
The model is deployed in AWS SageMaker. We created endpoints using AWS SageMaker and TensorFlow Serving, which support auto-scaling and can be invoked by sending HTTP POST requests. Figure 10 shows the prediction results from the endpoint for an image.
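As a rough illustration of what such a request and response might look like, the sketch below builds a JSON body in the TensorFlow Serving REST "instances" format and reads the top categories from a "predictions" response. Whether an endpoint accepts base64-encoded image bytes depends on how the model was exported, so treat the payload shape, the helper names and the label list as assumptions.

```python
import base64
import json


def build_predict_request(image_bytes):
    """JSON body in the TensorFlow Serving 'instances' format with a base64-encoded image."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return json.dumps({"instances": [{"b64": encoded}]})


def top_categories(response_body, labels, k=3):
    """Return the k most probable (label, probability) pairs from a 'predictions' response."""
    probabilities = json.loads(response_body)["predictions"][0]
    ranked = sorted(zip(labels, probabilities), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Placeholder bytes and a hand-written response stand in for a real image and endpoint.
body = build_predict_request(b"\xff\xd8 placeholder image bytes")
fake_response = json.dumps({"predictions": [[0.1, 0.7, 0.2]]})
labels = ["Gaming keyboard", "Gaming Mouse", "Apple Keyboard"]
best = top_categories(fake_response, labels, k=2)
```

With a SageMaker endpoint, a body like this would typically be sent through the `invoke_endpoint` call of the boto3 `sagemaker-runtime` client with the content type `application/json`.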
Chaining models for attribute selection
Predicting the category of a listing from an image is only the first step towards automatic listing classification. The next step is listing attribute extraction. Read more about attribute extraction in our other blog. To support this, we trained general and specialised models, where the general model acts as a filter that invokes the specialised ones. For instance, if the general model classifies an image as ‘Watch’, the ‘Watch-Specific’ model is triggered for watch attribute extraction (see Figure 11 for an example of attributes of interest). We effectively build model chains for better modularity and maintainability.
We trained and deployed a state-of-the-art image classification model on a large-scale dataset specific to the classifieds domain by following best practices in machine learning. The classifieds domain presents unique challenges due to its extremely dynamic nature and its diverse set of products and users. As a first step in exploiting visual information to enhance the user search experience, we combined the predictions of the model with user behaviour in search sessions.
One of the biggest takeaways from this application is that it might be more accurate to split the category ontology into smaller groups, e.g. consumer electronics, animals, transport etc., and train models specifically on these groups. Considering the large amount of data and the size of the neural network architectures, a distributed multi-GPU training environment is a must-have. Another interesting direction is to train image embeddings directly using signals from user behaviour, and to experiment with visual search and recommendation.
As a final note, we are currently migrating our image recognition solution to the EfficientNet architecture, which is the current state of the art on the ImageNet benchmark. We are going to create a family of EfficientNet models addressing different sets of requirements, such as high accuracy or low latency with acceptable accuracy.
I’d like to thank my colleagues from the Prosus AI team for their suggestions and help in editing. Please feel free to ask questions / provide suggestions in the comments section or reach out to us at firstname.lastname@example.org.
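The chaining logic itself is a simple dispatch: the general model's prediction selects which specialised model, if any, runs next. In the sketch below the models are plain callables used as stand-ins, and the attribute name `strap_material` is hypothetical rather than one of our actual attributes.

```python
def classify_with_chain(image, general_model, specialised_models):
    """Run the general model, then the matching specialised model if one is registered."""
    category = general_model(image)
    result = {"category": category, "attributes": {}}
    specialist = specialised_models.get(category)
    if specialist is not None:
        result["attributes"] = specialist(image)
    return result


# Stand-in models: real ones would wrap calls to trained networks.
def general_model(image):
    return image["kind"]


def watch_model(image):
    return {"strap_material": image.get("strap_material", "unknown")}


specialised_models = {"Watch": watch_model}

watch_result = classify_with_chain(
    {"kind": "Watch", "strap_material": "leather"}, general_model, specialised_models
)
sofa_result = classify_with_chain({"kind": "Sofa"}, general_model, specialised_models)
```

Keeping the specialised models in a registry keyed by category means a new attribute model can be added or retrained without touching the general classifier, which is the modularity benefit mentioned above.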