Deep image understanding at Carousell

In the past five years, Carousell has led the way in mobile classifieds and is one of the fastest-growing mobile marketplaces in Southeast Asia. We’re in 19 cities across 7 countries, and we’re looking at ways to leverage machine learning to enhance the user experience.

At Carousell, my team develops machine learning features that help our users list, sell, and buy items more easily. We train our models on Carousell’s sizeable internal datasets of items for sale and user interactions. Our first feature powered by machine learning suggests titles and categories for your listing, based on the images that you upload. This is available in Singapore on the Android app, and is in the process of rolling out on iOS and in other countries.

Suggested categories and titles based on the image. The network’s third suggestion “Yamaha Keyboard” is correct.

We train deep convolutional neural networks on our database of tens of millions of listings, to classify images into their categories. This classifier is used to provide category suggestions in the app.
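At its simplest, the category suggestion step is a classifier head over the image representation. The sketch below is illustrative only: random weights stand in for the trained convolutional layers, and the category names are made up for the example rather than taken from Carousell's taxonomy.

```python
import numpy as np

# Hypothetical category labels; the real taxonomy is Carousell's own.
CATEGORIES = ["Electronics", "Fashion", "Music Instruments", "Furniture", "Toys"]

rng = np.random.default_rng(0)

# In production the feature vector would come from deep convolutional
# layers; here a random vector stands in for the image representation.
image_features = rng.normal(size=128)

# A linear classification head: weights (n_categories x feature_dim) plus bias.
W = rng.normal(size=(len(CATEGORIES), 128))
b = np.zeros(len(CATEGORIES))

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

probs = softmax(W @ image_features + b)

# Top-3 category suggestions, most probable first.
top3 = [CATEGORIES[i] for i in np.argsort(probs)[::-1][:3]]
print(top3)
```

In the app, the top few categories by probability become the suggestions shown to the user.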

However, treating title prediction as a classification task like this would not work well, because our data contains far too many distinct titles to treat each one as a class. Instead, we trained a ranking model that takes an image and attempts to select the correct title out of a pool of candidate titles.

The neural network for ranking titles has two halves. One half looks at the image using deep convolutional layers; the other looks at potential titles, processing the words and phrases using embeddings and a deep neural structure.

The two halves map images and titles to a shared high-dimensional vector space, and vector similarity is then used for ranking.
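The two halves can be pictured as two encoders projecting into one space. The following is a minimal sketch under stated assumptions: toy word embeddings, random projection weights in place of trained ones, and mean-pooling over words as the title encoder (the post does not specify the exact title architecture).

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 64  # dimensionality of the shared space (illustrative)

# --- Title half: embed words, average them, project into the shared space. ---
VOCAB = {w: rng.normal(size=32) for w in
         ["yamaha", "keyboard", "iphone", "case", "ikea", "cushion"]}
W_title = rng.normal(size=(DIM, 32))

def embed_title(title):
    words = [VOCAB[w] for w in title.lower().split() if w in VOCAB]
    v = W_title @ np.mean(words, axis=0)
    return v / np.linalg.norm(v)  # unit-normalise so dot product = cosine

# --- Image half: project convolutional features into the same space. ---
W_image = rng.normal(size=(DIM, 128))

def embed_image(conv_features):
    v = W_image @ conv_features
    return v / np.linalg.norm(v)

# Ranking = cosine similarity between the image and each candidate title.
image_vec = embed_image(rng.normal(size=128))
candidates = ["Yamaha Keyboard", "iPhone Case", "IKEA Cushion"]
scores = {t: float(embed_title(t) @ image_vec) for t in candidates}
ranked = sorted(candidates, key=scores.get, reverse=True)
print(ranked)
```

Because both halves land in the same space, ranking reduces to comparing vectors, which is what makes the pre-computation described below possible.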

Our network is learned jointly from scratch with a single ranking loss function. This structure allows for a lot of pre-computation in training and inference.
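The post does not name the exact ranking loss, but a common choice for two-tower models like this is cross-entropy over in-batch negatives: each image's matching title is the positive, and every other title in the batch serves as a negative. A numpy sketch of that loss, as one plausible instance:

```python
import numpy as np

def ranking_loss(image_vecs, title_vecs):
    """Cross-entropy ranking loss with in-batch negatives.

    Row i of image_vecs matches row i of title_vecs; all other titles in
    the batch act as negatives. This is an assumed loss, illustrating the
    single-objective joint training described in the text.
    """
    sims = image_vecs @ title_vecs.T              # sims[i, j] = image i vs title j
    sims = sims - sims.max(axis=1, keepdims=True)  # stabilise the softmax
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # NLL of the true pairs

rng = np.random.default_rng(2)
imgs = rng.normal(size=(4, 16))
# Titles embedded near their images should give a lower loss than random ones.
matched = imgs + 0.01 * rng.normal(size=(4, 16))
random_titles = rng.normal(size=(4, 16))
print(ranking_loss(imgs, matched), ranking_loss(imgs, random_titles))
```

Minimising a single loss like this trains both halves jointly, end to end.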

When a new image is uploaded to Carousell, the model ranks a list of titles derived from hundreds of thousands of listings to find good suggestions in under 100 milliseconds.
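The speed comes from pre-computation: the title embeddings never change between uploads, so they can be stacked into one matrix offline, and ranking a new image becomes a single matrix-vector product plus a partial sort. A sketch with random vectors standing in for the precomputed title embeddings:

```python
import numpy as np

rng = np.random.default_rng(3)
DIM = 64

# Precomputed once, offline: unit-normalised embeddings for a large pool
# of candidate titles (200,000 random vectors stand in for them here).
title_matrix = rng.normal(size=(200_000, DIM))
title_matrix /= np.linalg.norm(title_matrix, axis=1, keepdims=True)

def top_k_titles(image_vec, k=5):
    """Score every candidate title against one image embedding.

    One matrix-vector product scores all candidates at once; argpartition
    then extracts the top k without fully sorting the whole pool.
    """
    scores = title_matrix @ image_vec
    top = np.argpartition(scores, -k)[-k:]
    return top[np.argsort(scores[top])[::-1]]  # candidate indices, best first

image_vec = rng.normal(size=DIM)
image_vec /= np.linalg.norm(image_vec)
best = top_k_titles(image_vec, k=5)
print(best)
```

On modern hardware this kind of dense scoring over hundreds of thousands of candidates comfortably fits in a sub-100ms budget.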

The shared image and title space learned by the deep neural network. The network has learned to put images and their corresponding titles nearby. It has learned implicit clusters like clothes, games and electronics. It still makes some mistakes, for example it put the title “IKEA cushion” too close to the image of the Hermes handbag, and it did not learn to identify the “Sketch Drawing” with high confidence. The high-dimensional space is projected down to 2 dimensions for the visualisation.
The difference between the deep vector representation of the red phone case and the grey one gives a semantic ‘red’ direction in the vector space. Adding the red vector to other images allows us to ‘turn them red’.
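The arithmetic behind this is simple vector addition in the learned space. In the toy sketch below, a shared "red" offset is baked into made-up embeddings so the arithmetic has something to recover; the real directions emerge from training rather than being constructed this way.

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 32

# Toy embeddings standing in for the network's learned vectors; a shared
# "red" offset is added by construction so the example is exact.
red_offset = rng.normal(size=DIM)
base_phone_case = rng.normal(size=DIM)
base_shoe = rng.normal(size=DIM)

emb = {
    "grey phone case": base_phone_case,
    "red phone case": base_phone_case + red_offset,
    "grey shoe": base_shoe,
    "red shoe": base_shoe + red_offset,
}

# The semantic direction is the difference between a paired red/grey item.
red_direction = emb["red phone case"] - emb["grey phone case"]

# 'Turning an item red': add the direction, then find the nearest item.
shifted = emb["grey shoe"] + red_direction

def nearest(vec, exclude):
    return min((k for k in emb if k != exclude),
               key=lambda k: np.linalg.norm(emb[k] - vec))

print(nearest(shifted, exclude="grey shoe"))  # prints "red shoe"
```

This is the same word-analogy arithmetic familiar from word embeddings, applied to a joint image-title space.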

We train our models across multiple GPU machines in parallel for hundreds of millions of steps (but keep training time down to a couple of days to allow for quick development).
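The post does not detail the parallel training setup, but a standard data-parallel scheme has each GPU compute gradients on its own shard of the batch, then averages them before every replica applies the same update. A single-process numpy simulation of that pattern, on a toy least-squares model:

```python
import numpy as np

rng = np.random.default_rng(5)

# A toy linear model, with the data split across 4 "workers" to mimic
# data-parallel training (one worker per GPU in a real setup).
N_WORKERS = 4
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true

w = np.zeros(2)
lr = 0.1
shards = np.array_split(np.arange(len(X)), N_WORKERS)

for step in range(200):
    # Each worker computes a gradient on its own shard of the data...
    grads = []
    for idx in shards:
        err = X[idx] @ w - y[idx]
        grads.append(2 * X[idx].T @ err / len(idx))
    # ...then gradients are averaged (an all-reduce, in practice) and
    # every replica applies the identical update, staying in sync.
    w -= lr * np.mean(grads, axis=0)

print(w)  # converges toward w_true = [2, -1]
```

Because averaged shard gradients equal the full-batch gradient here, the parallel run matches single-worker training step for step while splitting the compute.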

Our best network is a joint model that predicts the category and ranks titles using a shared deep representation of the image.
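Structurally, a joint model like this is one shared trunk feeding two heads: a category classifier and a title-ranking embedding. A forward-pass sketch with random weights standing in for the trained network (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
FEAT, SHARED, N_CATS, DIM = 128, 64, 5, 32

# Shared trunk: stands in for the deep convolutional representation.
W_trunk = rng.normal(size=(SHARED, FEAT))

# Two heads on top of the same shared representation:
W_cat = rng.normal(size=(N_CATS, SHARED))  # category classifier head
W_emb = rng.normal(size=(DIM, SHARED))     # title-ranking embedding head

def forward(image_features):
    shared = np.tanh(W_trunk @ image_features)  # shared image representation
    logits = W_cat @ shared                     # category scores
    z = logits - logits.max()
    cat_probs = np.exp(z) / np.exp(z).sum()
    title_vec = W_emb @ shared                  # vector for title ranking
    title_vec = title_vec / np.linalg.norm(title_vec)
    return cat_probs, title_vec

cat_probs, title_vec = forward(rng.normal(size=FEAT))
print(cat_probs.argmax(), title_vec.shape)
```

Sharing the trunk means the expensive image computation is done once per upload, and both tasks can benefit from what the other teaches the representation.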

A larger sample of the vector space learned by the network, showing only images. Some well-defined clusters include women’s shoes at the top, clothes at the bottom, and mobile phones to the left.

If you’re already using these new machine-learning powered features on our marketplace, thank you for trying them. If you haven’t yet, we hope you’ll give them a try soon.

We expect the real learning to happen from your interactions with these features, such as which suggestions you click on, and as more and more people use them.

We are currently hiring data scientists and machine learning engineers to join us in building more features like this.

Written by Matt Henderson.