Deep image understanding at Carousell
In the past five years, Carousell has led the way in mobile classifieds and is one of the fastest growing mobile marketplaces in Southeast Asia. We’re in 19 cities across 7 countries. We’re looking at ways to leverage machine learning to enhance the user experience.
At Carousell, my team develops machine learning features that help our users list, sell, and buy items more easily. We train our models on Carousell’s sizeable internal datasets of items for sale and user interactions. Our first feature powered by machine learning suggests titles and categories for your listing, based on the images that you upload. This is available in Singapore on the Android app, and is in the process of rolling out on iOS and in other countries.
We train deep convolutional neural networks on our database of tens of millions of listings, to classify images into their categories. This classifier is used to provide category suggestions in the app.
However, treating title prediction as a categorisation task like this would not work well, as there are so many different titles in our data. Instead, we trained a ranking model that takes an image and attempts to select the correct title out of a pool of candidate titles.
The neural network for ranking titles has two halves. One half looks at the image using deep convolutional layers; the other looks at potential titles, processing the words and phrases using embeddings and a deep neural structure.
The two halves map images and titles to a shared high-dimensional vector space, and vector similarity is then used for ranking.
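As a rough illustration of ranking by vector similarity, the sketch below scores a handful of candidate titles against one image vector. It assumes cosine similarity as the measure (the post only says "vector similarity") and uses tiny made-up 3-dimensional vectors in place of the real network outputs. The names `cosine_similarity` and `rank_titles` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_titles(image_vec, title_vecs):
    """Return (title, score) pairs sorted from most to least similar."""
    scored = [(title, cosine_similarity(image_vec, vec))
              for title, vec in title_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for the outputs of the two halves (assumed values).
image_vec = [0.9, 0.1, 0.0]
title_vecs = {
    "red leather handbag": [0.8, 0.2, 0.1],
    "mountain bike": [0.0, 0.1, 0.9],
    "wooden dining table": [0.1, 0.9, 0.2],
}

ranking = rank_titles(image_vec, title_vecs)
for title, score in ranking:
    print(f"{score:.3f}  {title}")
```

In a real system the vectors would have hundreds of dimensions and come from the trained image and title halves. The ranking step itself stays this simple.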
Our network is trained jointly from scratch with a single ranking loss function. Because the two halves only meet at the final similarity step, this structure allows for a lot of pre-computation in training and inference.
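The post does not say which ranking loss is used. One common choice for this kind of setup is a margin-based (hinge) ranking loss, sketched below purely as an assumption. The function name `hinge_ranking_loss` and the margin value are illustrative. The loss is zero once the correct title outscores every incorrect candidate by at least the margin.

```python
def hinge_ranking_loss(score_correct, scores_incorrect, margin=0.2):
    """Sum of margin violations of incorrect titles against the correct one.

    score_correct: similarity between the image and its true title.
    scores_incorrect: similarities between the image and other candidate titles.
    margin: how far ahead the correct title must be (assumed value).
    """
    return sum(max(0.0, margin - score_correct + s) for s in scores_incorrect)

# The second incorrect title (0.8) is within the margin of the correct one (0.9),
# so it is the only candidate contributing to the loss.
loss = hinge_ranking_loss(0.9, [0.1, 0.8])
print(loss)
```

Since the same loss depends on both the image vector and the title vectors, minimising it updates both halves of the network together.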
When a new image is uploaded to Carousell, the model ranks a list of titles derived from hundreds of thousands of listings to find good suggestions in under 100 milliseconds.
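One way to keep ranking fast is to embed and normalise every candidate title ahead of time, so that serving a new image only needs one dot product per title. The sketch below is a minimal pure-Python version of that idea. The `TitleIndex` class, its method names, and the toy vectors are all hypothetical, and a production system would more likely use batched matrix operations or an approximate nearest-neighbour index.

```python
import heapq
import math

def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

class TitleIndex:
    """Holds unit-length title vectors computed once, before any image arrives."""

    def __init__(self, title_vecs):
        self.titles = list(title_vecs)
        self.unit_vecs = [normalize(title_vecs[t]) for t in self.titles]

    def suggest(self, image_vec, k=2):
        """Return the k best (score, title) pairs for one image vector."""
        query = normalize(image_vec)
        scores = (sum(q * t for q, t in zip(query, vec))
                  for vec in self.unit_vecs)
        return heapq.nlargest(k, zip(scores, self.titles))

# Built once, offline (toy values standing in for real title embeddings).
index = TitleIndex({
    "red leather handbag": [0.8, 0.2, 0.1],
    "mountain bike": [0.0, 0.1, 0.9],
    "wooden dining table": [0.1, 0.9, 0.2],
})

# At upload time only the image vector is new.
top = index.suggest([0.9, 0.1, 0.0], k=2)
for score, title in top:
    print(f"{score:.3f}  {title}")
```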
We train our models across multiple GPU machines in parallel for hundreds of millions of steps, while keeping training time down to a couple of days to allow for quick development.
Our best network is a joint model that predicts the category and ranks titles using a shared deep representation of the image.
If you’re already using these new machine-learning-powered features on our marketplace, thank you for trying them. If you haven’t yet, we hope you’ll give them a try soon.
We expect the real learning to come from your interactions with the features, i.e. which suggestions you click on, as more and more people use them.
We are currently hiring data scientists and machine learning engineers to join us in building more features like this.