The Challenging Dimensions of Image Recognition (Part 1)

David Lourenço Mestre
Published in Empathy.co · Mar 2, 2018

Consistent product data greases the wheels of eCommerce, but the inconsistency that comes with handling a large number of fashion items raises its own challenges. How do you organise and standardise product data received from different retailers? How do you fill in missing information and ensure accuracy across distinct catalogues?

For example, the colour of the same jacket may be classified as “salmon” by Retailer A and “light-red” by Retailer B, while Retailer C might use an abbreviation such as “RD”. The descriptive category or name for the same jacket might also vary across suppliers.

One way to avoid the human effort required to curate and clean this data is to automate the process: build robust tools that standardise the information received from different sources and extract attributes and categories from fashion items.

This method, however, comes with problems of its own. Given a random image that contains clothing, how do you predict the clothing type through multi-class classification? How do you predict the colours, or find attributes such as stripes, types of sleeves and collars?

To the human eye these are easy tasks: we identify a huge range of objects, detect features such as colour, and distinguish a coat from a shirt without effort. Recognising patterns and similarities comes naturally to us, but for a computer, image recognition is still a significant challenge.

And although convolutional neural networks now dominate the field, it's important to cover the classical techniques first, to understand how far they can take us.

Image Segmentation

Building a solution for fashion classification means tackling a fundamental vision problem. There are multiple techniques available, but a good first step is image segmentation: analysing an image based on abrupt changes in the homogeneity of its pixels. Segmentation should be the first stage of image analysis and recognition.

When processing images, it's key to focus on the features that represent the dominant object. This is not always as simple as it sounds: each image may well have a background of some sort, whether a plain white backdrop or a complex setting such as a street scene.

Many different methods are available for image segmentation. If you are looking for a lightweight solution, Watershed is a good choice for background removal and one of the most common segmentation algorithms. It extracts information from an image by grouping pixels into regions of similarity.

Starting from user-defined markers, the Watershed algorithm treats the grayscale image as a surface composed of “high” and “low” areas. The algorithm floods the “low” areas from the values set on the markers. The regions rising above the flooded markers are then used to extract the dominant object from the image, producing an alpha channel that separates foreground from background.
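
As a sketch of how this looks in practice, here is a minimal Watershed pass with OpenCV. The file name, the border margin used as a background seed and the centre box used as a foreground seed are all hypothetical simplifications, standing in for proper user-defined markers:

```python
import cv2
import numpy as np

# Load the product image; "jacket.jpg" is a hypothetical file name.
img = cv2.imread("jacket.jpg")

# Marker image: 0 = unknown, 1 = known background, 2 = known foreground.
markers = np.zeros(img.shape[:2], dtype=np.int32)
h, w = markers.shape

# Assume the border of the image is background...
markers[:10, :] = 1
markers[-10:, :] = 1
markers[:, :10] = 1
markers[:, -10:] = 1

# ...and that the product sits roughly in the centre.
markers[h // 3 : 2 * h // 3, w // 3 : 2 * w // 3] = 2

# Flood the image "topography" from the markers; unknown pixels are
# claimed by whichever basin (background or foreground) reaches them
# first, and boundary pixels are labelled -1.
markers = cv2.watershed(img, markers)

# Build the alpha channel: foreground pixels opaque, the rest transparent.
alpha = np.where(markers == 2, 255, 0).astype(np.uint8)
rgba = cv2.merge([*cv2.split(img), alpha])
cv2.imwrite("jacket_rgba.png", rgba)
```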

[Figure: original image and the extracted alpha channel]

However, Watershed doesn't always work well on images with fairly complex backgrounds.

Another solution is GrabCut. Albeit slower than Watershed, it performs well on complex backgrounds. GrabCut tends to be used as an interactive foreground-extraction tool, but it can be tweaked to run autonomously. When applying it this way, it's fair to assume that the product is centred within the image.

By setting a box over the image, the algorithm defines everything outside the box as known background, while the data inside it is classified as unknown. The machine makes an initial classification from these entry values, estimating which class the unknown pixels belong to. GrabCut then iterates, applying a probabilistic model to label the probable foreground and probable background, and repeats the process until the classification converges.
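
A minimal sketch of this with OpenCV's grabCut, again assuming a centred product and a hypothetical input file; the box covering the central 80% of the image is an arbitrary choice:

```python
import cv2
import numpy as np

img = cv2.imread("jacket.jpg")  # hypothetical input file
h, w = img.shape[:2]

# Assume the product is centred: everything outside this box is
# treated as known background, everything inside as unknown.
rect = (w // 10, h // 10, w * 8 // 10, h * 8 // 10)

mask = np.zeros((h, w), np.uint8)
bgd_model = np.zeros((1, 65), np.float64)  # internal model state
fgd_model = np.zeros((1, 65), np.float64)

# Five rounds of the iterative refinement loop.
cv2.grabCut(img, mask, rect, bgd_model, fgd_model, 5,
            cv2.GC_INIT_WITH_RECT)

# Keep the definite and probable foreground pixels.
alpha = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD),
                 255, 0).astype(np.uint8)
cv2.imwrite("jacket_grabcut.png", cv2.merge([*cv2.split(img), alpha]))
```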

Clustering

An image can be viewed as a large array of discrete pixels. In an image encoded with three channels (Red, Green and Blue), each pixel represents a colour. Clustering, which divides and groups similar pixels, is therefore a good method for extracting the colours of a fashion article.

After all, when we as people label something Red or Blue, we are also clustering specific wavelengths into similar groups, and categorising those groups with a symbolic name.

To cluster image data, one technique is to use an unsupervised machine learning algorithm such as k-means. K-means relocates centroids to minimise the sum of squared distances between each centroid and the points assigned to it.
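
Formally, for k clusters C₁, …, Cₖ with centroids μ₁, …, μₖ, the objective being minimised is the within-cluster sum of squares, J = Σᵢ₌₁ᵏ Σ_{x∈Cᵢ} ‖x − μᵢ‖², where each pixel x is a point in RGB space and μᵢ is the mean of the pixels in cluster Cᵢ.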

The process runs k-means with the number of centroids equal to the number of colours wanted in the palette, each centroid seeding one cluster. For every pixel in the image, the algorithm then computes the Euclidean distance to each cluster mean and assigns the pixel to the nearest one.

After some iterations, each centroid moves from its random starting position to a local optimum (the centre of its cluster), with the centres of all clusters recalculated on every iteration. The final centroid positions can then be used as a proxy for the colours that will define the palette.
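
A sketch of palette extraction with scikit-learn, assuming the segmented RGBA image produced by the earlier step and an arbitrary choice of five palette colours:

```python
import cv2
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical output of the segmentation step above.
img = cv2.imread("jacket_rgba.png", cv2.IMREAD_UNCHANGED)
alpha = img[:, :, 3]

# Cluster only the foreground pixels, using the alpha channel as a mask.
pixels = img[:, :, :3][alpha > 0].astype(np.float64)

# k must be chosen up front; here we assume a 5-colour palette.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(pixels)

# The final centroid positions are the palette colours (BGR order).
palette = kmeans.cluster_centers_.astype(np.uint8)
print(palette)
```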

The challenge with k-means is that the number of clusters must be specified before running the algorithm, and estimating it from an array of RGB values can be computationally expensive and slow. Even though there are methods to assess the number of clusters, such as the Elbow or Silhouette methods, none offers a wholly accurate estimation of the ideal number.
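
For completeness, here is roughly what a Silhouette-based estimate might look like, reusing the pixels array from the sketch above. The subsampling and the candidate range for k are arbitrary choices made to keep the search fast, since the silhouette score is quadratic in the number of points:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Score candidate k values on a random subsample of the pixels.
rng = np.random.default_rng(0)
idx = rng.choice(len(pixels), size=min(2000, len(pixels)), replace=False)
sample = pixels[idx]

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10,
                    random_state=0).fit_predict(sample)
    scores[k] = silhouette_score(sample, labels)

# Pick the k with the highest average silhouette score.
best_k = max(scores, key=scores.get)
print(best_k, scores)
```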

Conclusions

While they attain respectable results, GrabCut and clustering are slow: GrabCut because of its iterative refinement stage, and clustering because the ideal number of clusters has to be determined beforehand, which is a problem in its own right. This makes both methods problematic for processing large catalogues. In the next post, we'll continue this exploration by looking at deep learning, and at how we can use it to automate some of our tools.
