The Trials and Tribulations of Automating Colour Classification

David Lourenço Mestre
Empathy.co
May 29, 2019

Introduction

One of the growing trends in the search industry is the use of machine learning techniques to fill gaps in manually curated merchant catalogues by incorporating information derived from, for example, computer vision or natural language processing algorithms.

Within the search space, fashion catalogues are often large, with thousands or even tens of thousands of products. These catalogues have the natural gaps and inconsistencies that result from manually curating such a large number of items. At the same time, fashion is in essence deeply visual, so it is an excellent playground for computer vision algorithms.

One of the limitations often encountered is colour classification: catalogue colour labels are often inconsistent or missing because of the different copy systems and responsibilities involved in building the catalogue. Here at Empathy.co, to overcome this issue, we have devised an automatic process that assigns consistent colour labels to complement merchant catalogues.

In this article, we’ll consider the different challenges of building an automatic colour extraction tool: the options available for building such a tool, how to keep per-image processing time within an acceptable range, how subjective and nebulous colour naming can be, and so on.

Where to start

Our first attempt to automate colour extraction was to design a full-fledged neural network. While it was successful with single-colour clothing articles, we quickly found that articles with multiple colours led to ambiguous results. Improving the neural network would have meant going down the rabbit hole of creating and curating a new training dataset. Easily months of work!

We decided instead to opt for what initially looked like a less fancy strategy, but one informed by the different steps involved in deriving the desired colour classification: (1) we’re not interested in the whole image, just in the foreground pixels that make up the main object; (2) we’re not interested in the complete list of colours in the image, just a broad range of palettes; and (3) we want a name for each of those broad RGB sections.

Finally, we were looking for a method that would work out-of-the-box in a variety of circumstances: a fairly complex background or a neutral one, a white foreground on a darker background, or a foreground composed of black and white segments over different types of backgrounds.

Background rejection: GrabCut

Initially we started working with lighter methods: thresholding, Otsu’s method, gradients. And since we’re pretty nerdy when it comes to deep learning, we also tried Mask R-CNN. Some of these methods didn’t fare well when certain conditions on the relationship between background and foreground weren’t met, while others, such as Mask R-CNN, were too slow. We ended up working with a method based on graph cuts: GrabCut.

Why GrabCut? Isn’t it an interactive algorithm? Yes, but we assume the target object sits roughly in the middle of the image, so a first pass can identify the most likely position of the object and produce a rectangle that contains it.

Once we have that box, we can classify every pixel outside it as background, just as an interactive GrabCut user would set hard constraints for the segmentation by marking certain pixels as background. In this way we seed the algorithm with what is background and what may be foreground.

With this done, GrabCut makes an initial prediction for the labels of the remaining pixels: it goes through each unlabelled pixel and estimates whether it belongs to the foreground or the background.

The idea behind GrabCut is to treat image segmentation as energy minimisation: the minimum of the energy function should correspond to a good segmentation.

The segmentation is defined by a cost function that sums two terms. The first is the cost of assigning each pixel to foreground or background, based on how well the pixel fits the Gaussian mixture model for each label. The second encourages similar, neighbouring pixels to share the same region.
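
For reference, the energy in question can be sketched as follows. This roughly follows the original GrabCut paper (Rother, Kolmogorov and Blake, 2004); the exact notation here is our own summary, not a quote from the paper.

```latex
% Gibbs energy minimised by GrabCut: alpha_n is the background/foreground label
% of pixel n, k_n its GMM component, theta the GMM parameters, z_n its colour.
E(\alpha, k, \theta, z) = U(\alpha, k, \theta, z) + V(\alpha, z)

% Data term: how poorly each pixel fits the GMM of its assigned label.
U(\alpha, k, \theta, z) = \sum_n \big( -\log p(z_n \mid \alpha_n, k_n, \theta) - \log \pi(\alpha_n, k_n) \big)

% Smoothness term: neighbouring pixel pairs (the set C) with similar colours
% are penalised for taking different labels.
V(\alpha, z) = \gamma \sum_{(m,n) \in C} [\alpha_n \neq \alpha_m]\, \exp\!\big(-\beta \lVert z_m - z_n \rVert^2\big)
```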

The minimisation works iteratively: on every cycle the Gaussian mixture models are re-estimated so that they reflect the refined background and foreground distributions, and the process runs until it converges.

Once the process stops, we have a binary mask with the same shape as the image passed to GrabCut.
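
As a minimal sketch of this stage, the snippet below uses OpenCV’s grabCut seeded with a rectangle. For brevity the rectangle is a fixed centred margin rather than the first-pass estimate described above, and the margin and iteration count are illustrative defaults, not our production settings.

```python
import cv2
import numpy as np

def foreground_pixels(image_bgr, margin=0.1, iterations=5):
    """Run GrabCut seeded with a centred rectangle and return the
    foreground pixels as an (N, 3) array of RGB values."""
    h, w = image_bgr.shape[:2]
    # Assumption: the garment sits roughly in the centre of the image,
    # so everything outside this rectangle is seeded as definite background.
    rect = (int(w * margin), int(h * margin),
            int(w * (1 - 2 * margin)), int(h * (1 - 2 * margin)))

    mask = np.zeros((h, w), np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)  # internal GMM state (background)
    fgd_model = np.zeros((1, 65), np.float64)  # internal GMM state (foreground)
    cv2.grabCut(image_bgr, mask, rect, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)

    # Binary mask: keep pixels labelled as definite or probable foreground.
    foreground = (mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    return rgb[foreground]  # shape (N, 3)
```

The returned (N, 3) array of foreground RGBs is exactly the input to the clustering step described next.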

Clustering

At this point the image is represented as an array of shape (N, 3), where N is the number of foreground pixels and each pixel is a 3-channel RGB vector. Since there might be thousands of distinct colours, we need a method to reduce them and find the dominant RGBs. Consider, for example, a red and green shirt: to the human eye it shows two clear, unambiguous colours. At the pixel level, however, we do not have just two RGBs but two clouds with high variance, large sets of shades of red and green. Since we have no ground-truth classes, we cluster pixels together based on how similar they are.

We opted for k-means, one of the most popular and efficient clustering algorithms. In k-means, a cluster is characterised by a centre that is the arithmetic mean of all the points in the cluster, and each point is closer to its own centre than to any other cluster’s centre. K-means requires a notion of distance between data points; for that we used the Euclidean distance between RGBs.

The main limitation of k-means clustering is that it requires a fixed number of clusters. There are methods to estimate the ideal number of clusters in the data, but none of them can guarantee that this number will match the number of distinct colours in the image.

In the end, we decided to use a fixed number of clusters and remove any extra clusters in a later stage that identifies similar RGBs. If the difference between two RGBs falls below a given threshold, we treat both values as the same colour, taking the weighted arithmetic mean as its RGB value.
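
Here is a minimal sketch of this stage using scikit-learn’s KMeans. The cluster count and merge threshold are illustrative placeholders rather than our production values, and the merge step already uses the perceptual Delta E distance described in the Colour Naming section below.

```python
import numpy as np
from sklearn.cluster import KMeans
from skimage.color import rgb2lab, deltaE_ciede2000

def _delta_e(rgb_a, rgb_b):
    # Perceptual distance between two RGB triplets (see the Colour Naming section).
    lab_a = rgb2lab(np.asarray(rgb_a, dtype=float).reshape(1, 1, 3) / 255.0)
    lab_b = rgb2lab(np.asarray(rgb_b, dtype=float).reshape(1, 1, 3) / 255.0)
    return deltaE_ciede2000(lab_a, lab_b)[0, 0]

def dominant_colours(pixels, n_clusters=8, merge_threshold=10.0):
    """Cluster an (N, 3) array of foreground RGBs with a fixed k, then merge
    clusters whose centres are perceptually close. Returns (rgb, weight) pairs."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(pixels)
    centres = km.cluster_centers_                                  # (k, 3) mean RGBs
    weights = np.bincount(km.labels_, minlength=n_clusters).astype(float)

    merged = []
    # Visit clusters from heaviest to lightest and fold each one into the first
    # existing colour closer than the threshold, averaging centres by weight.
    for centre, weight in sorted(zip(centres, weights), key=lambda cw: -cw[1]):
        for colour in merged:
            if _delta_e(centre, colour["rgb"]) < merge_threshold:
                total = colour["weight"] + weight
                colour["rgb"] = (colour["rgb"] * colour["weight"] + centre * weight) / total
                colour["weight"] = total
                break
        else:
            merged.append({"rgb": centre, "weight": weight})

    total = sum(c["weight"] for c in merged)
    return [(c["rgb"].round().astype(int), c["weight"] / total) for c in merged]
```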

To do this later processing, we needed to know how to compare RGB values. More on that in the next section…

Colour Naming

This is perhaps the trickiest step in the process, not so much technically as because of its inherently subjective nature. Colour naming is not a trivial subject: it is a matter of human perception, not a physical property like temperature or pressure. When classifying RGB values we frequently found ourselves scratching our heads, asking whether the value under analysis was red or orange, blue or navy, pink or beige.

Our plan was to create a list of pre-classified RGBs, and find the closest colour for the RGB under consideration.

Our first approach was to use the Euclidean distance. We quickly found that we were on shaky ground: the RGB colour space doesn’t represent how humans perceive colour, since it isn’t perceptually uniform. Changing a colour value by a given amount doesn’t necessarily produce a change of the same magnitude in visual perception (Figure 1).

Figure 1. Euclidean distance for the RGBs [245, 0, 0], [160, 40, 0], and [90, 100, 10]. Some shades of red are closer to green than to other reds, making the Euclidean distance inadequate to properly identify colour.

We needed a way to identify similar colours, and for that a perceptually uniform colour space was crucial: in such a space, equal numerical steps between colours should be perceived as equal visual steps. We opted to work with the CIELAB colour space. CIELAB provides a mathematical representation of the visible spectrum that is strongly correlated with the human visual system, along with a revised version of the Euclidean distance: Delta E, in its CIEDE2000 formulation (Figure 2).

Figure 2. CIEDE2000 Delta E distance for the RGBs [245, 0, 0], [160, 40, 0], and [90, 100, 10]. Now the shades of red are closer to each other and the distance to green is larger.
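
To make the difference concrete, here is a small sketch that reproduces the comparison from Figures 1 and 2. It uses scikit-image, which is a choice on our part; any CIEDE2000 implementation would do.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

bright_red = np.array([245.0, 0.0, 0.0])
dark_red   = np.array([160.0, 40.0, 0.0])
greenish   = np.array([90.0, 100.0, 10.0])

def to_lab(rgb):
    # skimage expects an image-shaped float array in [0, 1], hence the reshape.
    return rgb2lab(rgb.reshape(1, 1, 3) / 255.0)

# Euclidean distance straight in RGB space (Figure 1): the darker red comes out
# about as close to the greenish value as it is to the brighter red.
print(np.linalg.norm(dark_red - bright_red), np.linalg.norm(dark_red - greenish))

# CIEDE2000 in CIELAB (Figure 2): the two reds should now be markedly closer
# to each other than either is to the greenish value.
print(deltaE_ciede2000(to_lab(dark_red), to_lab(bright_red))[0, 0],
      deltaE_ciede2000(to_lab(dark_red), to_lab(greenish))[0, 0])
```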

Having a robust method for quantitative colour comparison provides the foundation for naming the colour of an RGB value. Our strategy is to translate the RGBs into LAB values and compute the Delta E against a pre-classified list of colours; the closest entry gives the desired colour label.
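
A minimal naming sketch, again using scikit-image: the palette below is a deliberately tiny, illustrative stand-in for our pre-classified list, not the real thing.

```python
import numpy as np
from skimage.color import rgb2lab, deltaE_ciede2000

# Illustrative palette only; the real list is longer and manually curated.
NAMED_COLOURS = {
    "red": (220, 20, 60), "orange": (255, 140, 0), "green": (0, 128, 0),
    "blue": (0, 0, 255), "navy": (0, 0, 128), "pink": (255, 192, 203),
    "beige": (245, 245, 220), "black": (0, 0, 0), "white": (255, 255, 255),
}

def to_lab(rgb):
    return rgb2lab(np.asarray(rgb, dtype=float).reshape(1, 1, 3) / 255.0)

def name_colour(rgb):
    """Return the palette name whose CIEDE2000 distance to `rgb` is smallest."""
    target = to_lab(rgb)
    return min(NAMED_COLOURS,
               key=lambda name: deltaE_ciede2000(target, to_lab(NAMED_COLOURS[name]))[0, 0])

print(name_colour((160, 40, 0)))  # names the darker red from the figures above
```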

The same CIELAB Delta E can be applied in the previous step to remove extra clusters, since that step also involves comparing RGB values and quantifying their difference in order to merge clusters that would result in the same, or a very similar, perceived colour.

Results

Here are a few examples of fashion images that have been run through our classification pipeline, together with the final results — an RGB average value, a colour label and the relative weight for each dominant colour cluster that has been identified.

Conclusion

By splitting the colour classification problem into a few basic steps, leveraging machine learning methods and complementing them with basic colour perception theory, we’ve been able to devise a process that quickly and robustly attaches colour labels to fashion images.

In our internal testing, this piecemeal approach fares much better than more straightforward deep learning classifiers, which goes to show how, in many cases, applying domain expertise allows us to get the most out of machine learning algorithms.
