Stories by David Lourenço Mestre on Medium

The Trials and Tribulations of Automating Colour Classification

David Lourenço Mestre — Wed, 29 May 2019 09:39:52 GMT

Introduction

One of the growing trends in the search industry is the use of machine learning techniques to fill gaps in manually curated merchant catalogues by incorporating information using, for example, computer vision or natural language processing algorithms.

Within the search space, fashion catalogues are often large, with thousands or even tens of thousands of products. These catalogues have the natural gaps and inconsistencies that result from the manual curation of such a large number of items. At the same time fashion is in essence something deeply visual, so it is an excellent playground for computer vision algorithms.

One of the limitations often encountered is colour classification: catalogue labels for colours can often be inconsistent or lacking due to the different copy systems and responsibilities employed when building the catalogue. Here at empathy.co, to overcome this issue, we have devised an automatic process to assign consistent colour labels that can be used to complement merchant catalogues.

In this article, we’ll consider the different challenges when building an automatic colour extraction tool: the options available to build such a tool, how to keep the time it takes to process each image within an acceptable time frame, how subjective and nebulous colour naming might be, and so on.

Where to start

Our first attempt to automate colour extraction was to design a full-fledged neural network. While it was successful with single colour clothing articles, we quickly found out that articles with multiple colours were leading to ambiguous results. Improving the neural network would mean going down the rabbit hole of creating and curating a new training dataset to improve the results. Easily months of work!

We decided instead to opt for what initially looked like a less fancy strategy, but one that was informed by the different steps involved in deriving the desired colour classification. (1) We’re not interested in the whole image, just in the foreground and the pixels that compose the main object. (2) We’re not interested in the complete list of colours in the image, just a handful of broad colour groups. (3) We wanted a name for each one of those broad RGB groups.

Finally, we were looking for a method that would work out of the box in a variety of circumstances: a fairly complex background or a neutral one, a white foreground on a darker background, or a foreground composed of black and white segments over different types of background.

Background rejection: Grab Cut

Initially we started working with lighter methods: simple thresholding, Otsu’s method, gradients. And since we’re pretty nerdy when it comes to deep learning, we also tried Mask R-CNN. Some of the methods didn’t fare well when certain conditions on the relation between the background and the foreground weren’t met, while others, such as Mask R-CNN, were too slow. We ended up working with a method based on graph cuts: grab cut.

Why grab cut? Isn’t it an interactive algorithm? Yes, but we made the assumption that the targeted object would be in the middle of the image, and by running a first pass we can identify the most likely position where the object might stand, and get a rectangle that contains the object.

Once we have that box, we can classify all the pixels outside it as background, just as a user of grab cut would set hard constraints for the segmentation by marking some pixels as background. That way we give the algorithm its seeds for background and foreground.

Once we have completed this step, the grab cut algorithm makes an initial prediction for the labels of the remaining pixels: it goes through each unlabelled pixel and estimates whether it belongs to the foreground or the background.

The idea behind grab cut is that image segmentation is akin to energy minimisation: the minimum of the energy function should correspond to a good segmentation.

The segmentation is defined by a cost function that sums two terms. The first is the cost of assigning each pixel to the foreground or the background, based on how well the pixel fits the Gaussian mixture model of each label. The second encourages similar neighbouring pixels to share the same label.

The minimisation is iterative. In every cycle the Gaussian mixture models are re-estimated so that they reflect the refined background and foreground distributions, and the process runs until it converges.

Once the process stops, we have a binary mask with the same height and width as the image processed with grab cut.

Clustering

At this point our image is represented as an array with shape (N, 3), where N is the number of foreground pixels and each pixel is a 3-channel RGB vector. Since we might have thousands of distinct colours, we need a method to reduce them and find the dominant RGBs. Consider a red and green shirt: to the human eye it shows two clear and unambiguous colours. At the pixel level, however, we do not have just two RGBs but two clouds with a high level of variance, large sets of shades of red and green. Since we do not have any ground truth classes, we cluster pixels together based on how similar they are.

We opted to work with k-means, one of the most popular and efficient clustering algorithms. In k-means, a cluster is characterised by a centre that is the arithmetic mean of all the points in the cluster, and each point is closer to its own cluster’s centre than to any other centre. K-means requires a notion of distance between data points; for that we used the Euclidean distance between RGBs.
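To make the description concrete, here is a minimal pure-Python sketch of k-means on RGB triples with Euclidean distance. It is not the production code: the function names are hypothetical, and the deterministic farthest-point initialisation is an assumption chosen to keep the example reproducible (libraries usually offer random or k-means++ initialisation).

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Return (centres, weights); weights are the fraction of points per cluster."""
    # Assumption: farthest-point initialisation, so the result is deterministic.
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centres)))

    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the closest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centres[i]))
            clusters[nearest].append(p)

        # Update step: each centre becomes the arithmetic mean of its points.
        new_centres = [
            tuple(sum(channel) / len(cluster) for channel in zip(*cluster))
            if cluster else centres[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centres == centres:
            break
        centres = new_centres

    weights = [len(cluster) / len(points) for cluster in clusters]
    return centres, weights
```

For a toy red and green "shirt", `kmeans([(200, 10, 10), (210, 0, 5), (190, 20, 15), (10, 200, 10), (0, 210, 20)], k=2)` yields one centre per colour cloud together with the share of pixels in each.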

The main limitation of k-means clustering is that the number of clusters has to be fixed in advance. There are some methods to determine the ideal number of clusters in the data, but none of them can guarantee that the number of clusters will match the number of distinct colours in the image.

In the end, we decided to use a fixed number of clusters and remove the possible extra clusters by running a later stage algorithm to identify similar RGBs. If we find that the difference between two RGBs is below a specific threshold, we assume both values to be the same colour with an RGB value as the weighted arithmetic mean.
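A sketch of that merging stage, under stated assumptions: clusters are given as `(rgb, weight)` pairs, the function name `merge_similar` and the greedy largest-first order are hypothetical, and the default distance is plain Euclidean distance, used here only as a placeholder for whichever colour difference is chosen.

```python
import math


def merge_similar(clusters, threshold, distance=math.dist):
    """Greedily merge (rgb, weight) clusters whose colours differ by less than threshold."""
    merged = []
    # Visit the heaviest clusters first so they act as merge targets.
    for rgb, weight in sorted(clusters, key=lambda c: c[1], reverse=True):
        for i, (m_rgb, m_weight) in enumerate(merged):
            if distance(rgb, m_rgb) < threshold:
                total = m_weight + weight
                # Weighted arithmetic mean of the two colours.
                mean = tuple((a * m_weight + b * weight) / total
                             for a, b in zip(m_rgb, rgb))
                merged[i] = (mean, total)
                break
        else:
            merged.append((tuple(rgb), weight))
    return merged
```

With `threshold=30`, the clusters `[((200, 0, 0), 0.5), ((210, 0, 0), 0.3), ((0, 200, 0), 0.2)]` collapse to a single red with weight 0.8 and an untouched green.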

To do this latter processing, we needed to know how to parse RGB information. More on that in the next section…

Colour Naming

This is perhaps the trickiest step in the process, not necessarily technically, but because of its inherently subjective nature. Colour naming is not a trivial subject, as it’s a matter of human perception and not a physical property like temperature or pressure. When classifying RGB values we frequently found ourselves scratching our heads and asking whether the value we were analysing was red or orange, blue or navy, pink or beige.

Our plan was to create a list of pre-classified RGBs, and find the closest colour for the RGB under consideration.

Our first approach was to employ Euclidean distance. We quickly found out that we were on shaky ground: the RGB colour space isn’t perceptually uniform, so it doesn’t represent how we humans perceive colour. A change of a given size in a colour value doesn’t necessarily produce a change of the same magnitude in visual perception (Figure 1).

Figure 1. Euclidean distance for the RGBs [245, 0, 0], [160, 40, 0], and [90, 100, 10]. Some shades of red are closer to green than to other reds, making the Euclidean distance inadequate to properly identify colour.

We needed a way to identify similar colours, and for that a perceptually uniform colour space was crucial. In such a space, equal distances between colour values should be perceived as equal steps in colour. We opted to work with the CIELAB colour space together with the CIEDE2000 colour difference. CIELAB provides a mathematical description of visible colour that correlates strongly with the human visual system, and CIEDE2000 provides a revised alternative to the Euclidean distance: Delta E (Figure 2).

Figure 2. CIEDE2000 Delta E distance for the RGBs [245, 0, 0], [160, 40, 0], and [90, 100, 10]. Now the shades of red are closer together and the distance to green is larger.

Having a robust method for quantitative colour comparison provides the foundation for naming the colour of an RGB value. Our strategy is to translate the RGBs into LAB values and apply Delta E against a pre-classified list of colours; the closest one gives the desired colour label.
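The lookup can be sketched in plain Python. Two things here are assumptions rather than the method described above: the tiny `PALETTE` is a hypothetical stand-in for a hand-curated list, and the distance is the older CIE76 Delta E (Euclidean distance in CIELAB), swapped in because it fits in a few lines, whereas CIEDE2000 adds several correction terms on top of it.

```python
import math

# Hypothetical reference palette; a real list would be much larger.
PALETTE = {
    "red": (255, 0, 0),
    "orange": (255, 165, 0),
    "green": (0, 128, 0),
    "blue": (0, 0, 255),
    "black": (0, 0, 0),
    "white": (255, 255, 255),
}


def srgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 white point)."""
    def linearise(c):
        c = c / 255
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4

    r, g, b = (linearise(c) for c in rgb)
    # Linear RGB to XYZ, normalised by the D65 reference white.
    x = (0.4124 * r + 0.3576 * g + 0.1805 * b) / 0.95047
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = (0.0193 * r + 0.1192 * g + 0.9505 * b) / 1.08883

    def f(t):
        return t ** (1 / 3) if t > 0.008856 else 7.787 * t + 16 / 116

    fx, fy, fz = f(x), f(y), f(z)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(rgb_a, rgb_b):
    """CIE76 colour difference: Euclidean distance in CIELAB (not CIEDE2000)."""
    return math.dist(srgb_to_lab(rgb_a), srgb_to_lab(rgb_b))


def name_colour(rgb):
    """Return the palette name whose colour is closest to rgb."""
    return min(PALETTE, key=lambda name: delta_e76(rgb, PALETTE[name]))
```

Even with this simpler distance, the three RGBs from the figures behave as described: in raw RGB, [160, 40, 0] is slightly closer to [90, 100, 10] than to [245, 0, 0], while in CIELAB the two reds are clearly the closer pair.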

The same CIELAB Delta E can be applied in the previous step to remove extra clustering data, since it also involves comparing RGB values and quantifying the difference to remove clusters that will result in the same or a very similar perceived colour.

Results

Here are a few examples of fashion images that have been run through our classification pipeline, together with the final results — an RGB average value, a colour label and the relative weight for each dominant colour cluster that has been identified.

Conclusion

By splitting the colour classification problem into a few basic steps, leveraging machine learning methods and complementing it with basic colour perception theory, we’ve been able to devise a process that can be used to quickly, and robustly, attach colour labels to fashion images.

In our internal testing, this piecemeal approach fares much better than more straightforward deep learning classifiers, which goes to show how, in many cases, applying domain expertise allows us to maximise the benefit of machine learning algorithms.

The Trials and Tribulations of Automating Colour Classification was originally published in Empathy.co on Medium, where people are continuing the conversation by highlighting and responding to this story.

Machine Learning Model Evaluation and Hyper-Parameter Tuning: Looking Beyond Accuracy

David Lourenço Mestre — Fri, 22 Mar 2019 10:56:45 GMT

Last month, we discussed the impact of gradient descent on deep learning optimisation and reviewed the importance of optimising the cost function. In this post, we will be exploring in detail how to evaluate model results, as well as best practice for optimising hyper-parameters.

To do this, we will be building a model for image recognition. We will show how to compute metrics to assess the quality of the model and some optimisation techniques for hyper-parameters.

Model Evaluation

Just like a student revising for an exam, your machine learning model will have to go through a process of learning and training before being ready to complete its intended task. This training will enable it to generalise and derive patterns from actual data, but how can we assess whether or not our model provides a good representation of our data? How can we validate the model and predict how it will perform with data it hasn’t seen before?

When assessing a model, accuracy is the most frequently used metric. It gives a general understanding of how many data samples are misclassified. However, this information can be deceptive and can give us a false sense of security.

Normally, you’d split your data into a training set and a test set. You’d train the model on the training set, and would measure accuracy while testing the model on the test set. This is the fastest way to evaluate the model’s performance, but it’s not the best one.

Bias-variance tradeoff

Although there are many different metrics that can be used to measure a model’s performance, keeping bias and variance low is always essential. We define bias as any systematic difference between the output of our model and the ‘true’ value. Variance refers to how much the model’s predictions change when it is trained on different samples of the data (see figure for illustration).

Mastering Machine Learning with scikit-learn, pg. 14

So, in which situations will we have to combat high bias or high variance?

If the model is overtrained, or too complex for a given training dataset, it will memorise patterns and even the input noise. In such situations, we will have high variance (overfitting) and the model will perform poorly with unseen data.

In the opposite scenario, the model will perform poorly and produce similar errors on both training and testing data. In this situation, we will see high bias (underfitting). The model will be too inflexible and won’t have enough features to fully represent the data.

We ideally want to achieve both low bias and low variance: model predictions that are very close or identical to the true values, on unseen data as well as on the training data. Unfortunately, efforts to reduce one often increase the other, so we have to find a compromise. This balance between bias and variance is called the bias-variance tradeoff.

While it might not always be possible to find enough data to prevent overfitting, or to know exactly how complex a model should be, plotting the training and testing accuracies as functions of the number of training samples might help.

K-fold cross-validation

In this section, we will use Keras to wrap a neural network, and leverage sklearn to run a k-fold cross-validation. For the neural network, we will use the LeNet architecture. It was one of the first prominent deep convolutional architectures, it’s fairly easy to code, and it’s not too computationally expensive. The architecture consists of two sets of convolutional and subsampling (also known as pooling) layers, followed by a flattening convolutional layer, then two dense (fully connected) layers.

Deep Learning with Keras, pg. 78

Before going through the code in detail, it may be useful to define what k-fold cross-validation is and why you should use it.

K-fold cross-validation is one of the most common methods for confirming an estimated hypothesis on data, and for assessing how accurately a model performs and its ability to generalise. In k-fold cross-validation, you randomly split the training data into ‘k’ equal-sized folds. In each iteration, one of the folds is used for performance evaluation, while the rest is used for training. This process is executed ‘k’ times so that we obtain ‘k’ models and performance estimates.

Statistically, the average performance measured over k-fold cross-validation gives a proper estimate of how well a model does its task in general.
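The procedure can be sketched without any library. The helper names below are hypothetical, and `train_and_score` stands for any user-supplied function that trains on the training indices and returns a score on the validation indices; scikit-learn’s `cross_val_score` packages the same loop.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # random split, reproducible via seed
    folds = [indices[i::k] for i in range(k)]  # k folds of near-equal size
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


def cross_validate(train_and_score, n_samples, k):
    """Run k-fold cross-validation and return (mean score, standard deviation)."""
    scores = [train_and_score(train, validation)
              for train, validation in k_fold_indices(n_samples, k)]
    return statistics.mean(scores), statistics.stdev(scores)
```

Every sample is used for validation exactly once and for training k - 1 times, which is what makes the averaged score a steadier estimate than a single train/test split.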

Cross-Validation Code

First, we’ll import the data from Keras. As a training set, we will be using the Fashion-MNIST dataset. Fashion-MNIST is a dataset consisting of Zalando’s product images. It has a training set of 60,000 samples and a test set of 10,000 samples. Each image within these sets is 28 pixels by 28 pixels.

Before passing the data to our model, we must declare the number of channels (also known as the depth of the image) and reshape the samples to 60,000 x [1, 28, 28] to suit the convolutional requirements. Note that the dataset is composed of greyscale images, which is why we have just one channel.

https://medium.com/media/36384c8ac4e536ceb6921b04021a94ea/href

In the following step we define a class for the convolutional neural network (LeNet):

https://medium.com/media/e2a653b1555cf2dc58bf10ac1eed3931/href

To perform cross-validation, we will import the function ‘cross_val_score’ from Sklearn. This function takes the classifier, the samples (X), the labels (y), and the number of folds (cv) as inputs:

https://medium.com/media/33772f4514c65f1374a653ea211a771c/href

After running cross-validation, the function returns a list of accuracies for the five folds. In order to know how it performs on average, we look at the mean and the standard deviation:

As a side note, this metric measures the percentage of data samples properly classified or, to be more precise, the proportion of correct predictions over all the samples. In the background, Keras classifies each sample, yields a vector with the probability of each class, and selects the class with the highest probability as the model’s prediction. Finally, Keras compares the prediction against the true value.
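As a small illustration of that computation (a sketch, not what Keras runs internally), accuracy can be derived from per-class probability vectors like this; the function name and the toy numbers are made up for the example.

```python
def accuracy(probabilities, true_labels):
    """Fraction of samples whose most probable class matches the true label."""
    # The predicted class is the index of the highest probability in each row.
    predictions = [max(range(len(row)), key=row.__getitem__) for row in probabilities]
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


probabilities = [
    [0.1, 0.7, 0.2],
    [0.8, 0.1, 0.1],
    [0.3, 0.3, 0.4],
    [0.2, 0.5, 0.3],  # predicted class 1, true class 2: the only miss
]
print(accuracy(probabilities, [1, 0, 2, 2]))  # 0.75
```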

After running cross-validation for 5 folds, and extracting the mean accuracy and the standard deviation, we have a more accurate assessment of the model’s performance and of how robust it is on average. We can see that the classifier achieves on average 90% accuracy. This value fluctuates from iteration to iteration with a standard deviation of roughly 0.4%. We can conclude that we have a low variance and a relatively low bias. Still, we encourage you to benchmark different CNN architectures with the Fashion-MNIST dataset: https://github.com/zalandoresearch/fashion-mnist#benchmark

Hyper-Parameters Optimisation

When creating a neural network we have two types of parameters. There are the parameters that are learned during training, such as weights, and the parameters that are hard-coded and optimised separately. This second class of parameters is called hyper-parameters. Examples include the dropout rate, learning rate, number of epochs, batch size, optimiser, number of layers and number of nodes.

Fine-tuning the hyper-parameters might improve predictions, but there isn’t a rule that will tell you how many layers, the number of epochs, or the batch size to use on your neural network.

Finding an optimal solution (or even a sub-optimal one) often involves repeatedly training different versions of a model with different sets of hyper-parameters. Fortunately, there are a few techniques for hyper-parameter optimisation. One of the simplest is grid-search.

Grid-Search

Grid-Search is a popular method for identifying the best hyper-parameter set for a model. It’s a brute force search method that takes a set of possible values for each hyper-parameter we want to tune. The machine evaluates the performance for each combination. In the end, it will return the selection with the best performance.

It’s a time-consuming method, but since there can be multiple local minima and acceptable solutions for the hyper-parameters (and it’s easy to end up working with a suboptimal combination), a grid-search might improve your model’s performance. In a real scenario, we start with a broad, wide-ranging set of values for each hyper-parameter. After identifying which values and region to explore, we can narrow the range of the grid-search.
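The brute-force idea fits in a few lines of plain Python. In this sketch `grid_search` and `toy_score` are hypothetical names; in practice the scoring function would train the model with the given hyper-parameters and return, for example, its mean cross-validation accuracy.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate every combination in grid; return (best_params, best_score, all_results)."""
    names = list(grid)
    results = {}
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        results[values] = score_fn(**params)
    best_values = max(results, key=results.get)
    return dict(zip(names, best_values)), results[best_values], results


def toy_score(learning_rate, batch_size, epochs):
    """Made-up score that peaks at learning_rate=0.01 and batch_size=64."""
    return -abs(learning_rate - 0.01) - abs(batch_size - 64) / 1000


grid = {"learning_rate": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128], "epochs": [10]}
best_params, best_score, all_results = grid_search(toy_score, grid)
```

The cost grows with the product of the list lengths (here 3 x 3 x 1 = 9 evaluations), which is why narrowing the ranges after a coarse first pass pays off.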

Grid-Search Example

We will carry on working with the sample code we were using before, but modified so we can pass a set of values for the learning rate, the epochs, and the batch size.

Even though we are using a low number of folds for the cross-validation (the cv parameter) and a modest number of values for the 3 hyper-parameters to reduce waiting times, it will still take over an hour to run the following configuration. We recommend setting n_jobs to -1 to run the grid search in parallel on all processors:

https://medium.com/media/fa3ff3c10d2872c5702cac1738dee3dd/href

After running the grid-search we can get the best combination of parameters, the best score, and a dictionary with each combination’s results:

In the end, the best combination turned out to be the one we had previously used to run the cross-validation.

Conclusion

In this article, we went through some basic ideas on how to evaluate a machine learning model and how to fine-tune its hyper-parameters.

We explored a very common strategy for model evaluation, k-fold cross-validation, which, by averaging individual model performance, will identify whether a model suffers from high bias or high variance.

We also explored how to search for the right set of hyper-parameters by applying grid-search, which exhausts all the combinations of the candidate values to find the best-performing set of hyper-parameters.

If you have any thoughts or questions about model evaluation or hyper-parameter optimisation please feel free to get in touch.

Key Takeaways

Accuracy measured on a training and a testing dataset is the fastest and most frequently used way to assess a model, but it’s not the best.
Whatever metric you use to measure performance, you need to ensure you keep bias and variance as low as possible.
If a model is overtrained or too complex, you will see high variance. If a model is too inflexible and produces similar errors in both training and testing data, you will see high bias.
Attempts to reduce one often increase the other. A compromise is needed, known as the bias-variance tradeoff.
Average performance measured over k-fold cross-validation gives a proper estimate of how well a model does its task in general.
Neural networks have two types of parameters: those learned during training, and those that are hard-coded and optimised separately. The latter are hyper-parameters.
Optimising hyper-parameters often requires repeatedly training different versions of a model with different sets of hyper-parameters.
Grid-search is one of the simplest techniques for hyper-parameter optimisation.

Machine Learning Model Evaluation and Hyper-Parameter Tuning: Looking Beyond Accuracy was originally published in Empathy.co on Medium, where people are continuing the conversation by highlighting and responding to this story.

Gradient Descent: An Algorithm for Deep Learning Optimisation

David Lourenço Mestre — Tue, 05 Feb 2019 08:46:36 GMT

Optimisation algorithms, used to minimise a cost function against training data, are the cornerstone of many mathematical methods, from simple curve fitting to deep machine learning.

Within the realm of machine learning, gradient descent and its variants are undoubtedly among the most popular optimisation algorithms. In this article we want to explore the mathematical foundations of gradient descent alongside the practicalities of its implementation, using sample code running in TensorFlow. Stay tuned.

Cost Function

Let’s start with a typical classification problem like training a model to assign labels to data. For this we’ll use a neural network that will be trained with a set of pre-labelled data. One of the key ingredients during the supervised learning process is the cost function that has to be optimised.

The cost function for a neural network depends on the weights and biases that we want to shape, and it measures how well the model performs with respect to the training dataset. The most common cost function can be expressed as a sum of the squared differences:

where ŷ, or ‘y hat’, represents the prediction of the neural network, y^i the value from the data point i and the sum runs over m data points in the training set.
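The expression itself is not reproduced in this text. One common way to write such a cost, using the symbols defined above, is the following; the normalisation factor 1/(2m) is an assumption (the text only says it is a sum of squared differences, later referred to as a mean squared error), and any positive constant in front leaves the minimising weights unchanged.

```latex
J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{i} - y^{i} \right)^{2}
```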

We now have to identify the weights and biases that minimise the cost function. At the beginning of the training we would have set random values for our parameters and, through a process of back propagation and iteration, we can adjust the weights and biases to reduce the value of the cost function. This is where gradient descent enters.

But before going any further, why gradient descent? Why do we need to optimise the cost function in the first place? Why not run lots of possible and random weights?

We could try out thousands of values for the weights on the cost function and see which ones work best. While this could work for one weight, once we increase the number of layers and synapses we can’t escape the problems that arise within a high-dimensional space. It becomes computationally infeasible to get the minimum value of the cost function in an acceptable amount of time.

Let’s assume that we’re working with the following neural network that has 16 weights:

Trying 10,000 values for each weight would mean running 10⁶⁴ combinations (10,000¹⁶). In a real-case scenario, we would have hundreds or thousands of weights. Not very efficient, right?

Gradient Descent

Raschka, Sebastian. Python Machine Learning, Packt Publishing, p. 36

So, what is gradient descent? At a theoretical level, gradient descent is a first-order iterative method for finding a minimum of a function, and it is very well suited to continuous, differentiable and convex cost functions.

In each iteration the weights of the neural network are modified by a value proportional to the negative gradient of the cost function at the current point, to get closer to the function’s minimum:

Where the w’ on the left side is the vector of neural network weights after a learning step and ∇J(w) is the first-order derivative, the gradient. The proportionality constant 𝜼 is normally referred to as the learning rate.

The equivalent expression for the weight of a single neuron w_j is:

Where we now have the partial derivative of the cost function.
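The two update rules described above are not reproduced in this text. Written with the same symbols (w’ for the updated weights, 𝜼 for the learning rate, ∇J(w) for the gradient), the standard gradient descent step takes this form, first for the whole weight vector and then for a single weight w_j:

```latex
w' = w - \eta \, \nabla J(w)
\qquad \text{and, for a single weight,} \qquad
w_j' = w_j - \eta \, \frac{\partial J}{\partial w_j}
```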

For our simple cost function, we can perform the calculation explicitly in the simplest case of a linear transformation:

However, in most situations the gradient will be calculated numerically.
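As a worked sketch of the linear case mentioned above, assume the prediction is a plain weighted sum of the inputs and the cost is the squared-error cost with a 1/(2m) factor. Both are assumptions, since the original expressions are not reproduced here. The chain rule then gives the partial derivative and the resulting update directly:

```latex
\hat{y}^{i} = \sum_{j} w_j \, x_j^{i},
\qquad
J(w) = \frac{1}{2m} \sum_{i=1}^{m} \left( \hat{y}^{i} - y^{i} \right)^{2}
\\[6pt]
\frac{\partial J}{\partial w_j}
  = \frac{1}{m} \sum_{i=1}^{m} \left( \hat{y}^{i} - y^{i} \right) x_j^{i}
\\[6pt]
w_j' = w_j - \frac{\eta}{m} \sum_{i=1}^{m} \left( \hat{y}^{i} - y^{i} \right) x_j^{i}
```

Each weight moves in proportion to the prediction error, weighted by the input that the weight multiplies.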

The learning rate

The learning rate determines how many iterations will be needed to adjust the parameters with respect to the cost function, because it controls how far the weights move at each iteration. Smaller rates take longer to converge to a minimum. Too large a rate may lead to large fluctuations around a minimum, or even to divergence from the optimal solution.
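The effect is easy to reproduce on a toy problem. The sketch below uses the made-up one-dimensional cost J(w) = (w - 3)², whose gradient is 2(w - 3); the function name and the three rates are chosen only for illustration.

```python
def gradient_descent(gradient, start, learning_rate, steps):
    """Run plain gradient descent and return every visited value of w."""
    w = start
    history = [w]
    for _ in range(steps):
        w = w - learning_rate * gradient(w)
        history.append(w)
    return history


def gradient(w):
    """Gradient of the toy cost J(w) = (w - 3)^2, which is minimised at w = 3."""
    return 2 * (w - 3)


slow = gradient_descent(gradient, start=0.0, learning_rate=0.01, steps=50)
good = gradient_descent(gradient, start=0.0, learning_rate=0.1, steps=50)
diverging = gradient_descent(gradient, start=0.0, learning_rate=1.1, steps=50)
```

After 50 steps the middle rate is essentially at the minimum, the small rate is still far from it, and the large rate overshoots by more at every step and moves away from the solution.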

Non-convex error functions

A word of advice: gradient descent works well for a convex function, but if the surface of the cost function that we’re trying to minimise is not convex, gradient descent may get stuck in a local minimum without ever reaching the global minimum (the optimal solution for the relationship between the cost function and the weights and biases).

We can visualise it as a mountain range with different valleys. Regardless of the number of iterations, the algorithm might end up moving around the same valley without ever reaching the lowest point. Or it may converge to another stationary point that we’re not interested in, a saddle point, which can be visualised as a plateau in our mountain example.
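A one-dimensional sketch makes the "valleys" tangible. The cost f(x) = x⁴ - 3x² + x below is a made-up non-convex example with a deeper valley on the left and a shallower one on the right; which one gradient descent ends up in depends only on where it starts.

```python
def f(x):
    """Toy non-convex cost with two valleys."""
    return x ** 4 - 3 * x ** 2 + x


def f_prime(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1


def descend(start, learning_rate=0.01, steps=2000):
    """Plain gradient descent on f from the given starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x


left = descend(-2.0)   # ends in the deeper valley (the global minimum)
right = descend(2.0)   # ends in the shallower valley (a local minimum)
```

Both runs stop where the derivative vanishes, but only the one started on the left reaches the lowest point of the curve.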

Code Implementation

Now for the fun part: let’s see how to run gradient descent inside a neural network. For simplicity, we’ll import the Iris dataset from scikit-learn, which is perfect for testing purposes, and split it into two sets, one for training and another for testing:

https://medium.com/media/aa3fe1c1a6fd9bc3ddf08193178dfdfb/href

Since we’re implementing a neural network we should normalise the data, so let’s transform it with MinMaxScaler:

https://medium.com/media/c5da4ec3b7428f8924c57228447f1fb5/href

As good practice, we should one-hot encode (vectorise) each label; if we don’t, the model might conclude that the labels have some kind of order:

https://medium.com/media/5d00c4e16e0ff4a489d64a6d478d765f/href
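The embedded snippet is not reproduced in this text. As a rough illustration of the idea rather than the original code, vectorising labels in the one-hot sense can be sketched in plain Python; the function name is hypothetical.

```python
def one_hot(labels):
    """Map each label to a vector with a single 1 at the position of its class."""
    classes = sorted(set(labels))
    position = {label: i for i, label in enumerate(classes)}
    return [[1 if position[label] == i else 0 for i in range(len(classes))]
            for label in labels]


print(one_hot([0, 2, 1, 2]))  # [[1, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 1]]
```

Every class sits at the same distance from every other class, so no ordering such as 0 < 1 < 2 is implied by the encoding.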

In the following step we initialise two variables, one for the number of data points, the other for the number of labelling classes, and at this point we pre-define the two tensors as placeholders (upon execution, those placeholders will be fed with data):

https://medium.com/media/683869b7671750d3e0f41a03ebccff23/href

We now need to pass some key parameters. These are the learning rate that we mentioned before, the number of epochs to iterate the gradient descent (the back-propagation process), a regularisation factor to avoid overfitting, and the number of nodes for each network layer.

https://medium.com/media/8a4231be43fdbd856b5640f94e9c1a08/href

In a real-case scenario we would have to find the optimal values of these parameters for our neural network.

Next, we define the tensors for the weights and the biases:

https://medium.com/media/bf30d15db3292c957cd3f68cbf1e3643/href

We can now build the neural network and define the cost function. The prediction (y_hat) will be computed as the output of the last fully connected layer.

https://medium.com/media/d95dfae0dcaa31267a4282204f445604/href

As the cost function we have used the mean squared error, as described before, but in a real application of a classification problem like this one other functions like the cross entropy/softmax would be more appropriate.

The real action, and the most time-consuming element, lies in the training loop:

https://medium.com/media/e88ecbb6466cc719237d131b7f8762a8/href

Beyond Gradient Descent

To avoid getting stuck in undesirable stationary points, or having to deal with computationally expensive matrices, there are more advanced ways to compute or optimise the gradient. For computing the gradient, let’s look at Stochastic GD and Mini-Batch GD; for optimising it, we can use Adam (Adaptive Moment Estimation).

Stochastic Gradient Descent

SGD may be considered an approximation of GD in which each iteration is evaluated on a single data sample instead of the whole training set. Updating the parameters based on a single sample significantly reduces the cost of each iteration for large datasets, and it tends to reach convergence faster. The main disadvantage is that the journey towards a minimum tends to be more erratic and noisy.

Mini-Batch

We can see mini-batch as a compromise between simple gradient descent and the stochastic version. Instead of using a single training sample or the full dataset to minimise the error function, it uses a small, randomly chosen subset of the data, thus taking the best of both methods.
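The three variants differ only in how many samples feed each update, which a single function can show. This is a sketch on a made-up problem (fitting y = 2x + 1 with a mean squared error cost); the function name, learning rate and epoch count are illustrative assumptions.

```python
import random


def minibatch_gd(xs, ys, batch_size, learning_rate=0.3, epochs=500, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    data = list(zip(xs, ys))
    for _ in range(epochs):
        rng.shuffle(data)  # batches are drawn from a fresh random order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


xs = [i / 7 for i in range(8)]
ys = [2 * x + 1 for x in xs]

sgd = minibatch_gd(xs, ys, batch_size=1)         # stochastic: one sample per update
mini = minibatch_gd(xs, ys, batch_size=4)        # mini-batch: a few samples per update
full = minibatch_gd(xs, ys, batch_size=len(xs))  # batch GD: all samples per update
```

On this noise-free toy data all three settings recover w ≈ 2 and b ≈ 1; on real data the smaller batches trade a noisier path for cheaper updates.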

Adam

Adam, which is not an acronym, derives its name from adaptive moment estimation. As with the above methods, it’s a first-order gradient method, but in contrast to them it allows the learning rate of each network weight to be adapted as learning progresses. It can be seen as a fusion of two other popular methods: AdaGrad (Adaptive Gradient Algorithm), well suited for sparse data, and RMSProp (Root Mean Square Propagation), which works well for online and non-stationary problems.
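For a single weight, the Adam update can be sketched as follows. The function name is hypothetical, the default values of `beta1`, `beta2` and `epsilon` are the commonly used ones, and the toy gradient belongs to the made-up cost (w - 3)².

```python
import math


def adam_minimise(gradient, start, learning_rate=0.1, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, steps=500):
    """Minimise a one-dimensional function with the Adam update rule."""
    w, m, v = start, 0.0, 0.0
    for t in range(1, steps + 1):
        g = gradient(w)
        m = beta1 * m + (1 - beta1) * g      # running mean of the gradient
        v = beta2 * v + (1 - beta2) * g * g  # running mean of the squared gradient
        m_hat = m / (1 - beta1 ** t)         # bias correction for the zero start
        v_hat = v / (1 - beta2 ** t)
        # The step is scaled by the recent gradient magnitude.
        w -= learning_rate * m_hat / (math.sqrt(v_hat) + epsilon)
    return w


w_min = adam_minimise(lambda w: 2 * (w - 3), start=0.0)
```

In a network, `m` and `v` are kept separately for every weight, which is what gives each weight its own effective learning rate.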

Conclusion

In this article we’ve introduced some of the key concepts of gradient descent:

We saw how the gradient descent method iteratively minimises the cost function, converging to the global minimum when the function is convex,
We worked through a mathematical example of a simple linear transformation to show how to go from a cost function to the gradient descent,
We’ve seen that the learning rate defines how fast or slow the algorithm moves towards the optimal weights values,
We’ve gone through a code sample that illustrates how to integrate gradient descent with a neural network.

For anyone planning to explore machine learning, having a grasp on how to minimise a cost function will be essential, and we hope that this article has helped to explain and explore some of the core ideas behind gradient descent.

Gradient Descent: An Algorithm for Deep Learning Optimisation was originally published in Empathy.co on Medium, where people are continuing the conversation by highlighting and responding to this story.

How to Solve the Common Problems in Image Recognition

David Lourenço Mestre — Fri, 14 Sep 2018 07:57:15 GMT

Introduction

Most classification problems related to image recognition are plagued with well-known and established problems. For example, frequently there won’t be enough data to properly train a classification system, the data might have some underrepresented classes, and most commonly, working with unscrutinised data will imply working with poorly labelled data.

Data is the key that determines whether your efforts will fail or succeed. These systems don’t just need more data than humans to learn and distinguish different classes, they need thousands of times more to do the job.

Deep learning relies on enormous amounts of high-quality data to predict future trends and behaviour patterns. The data sets need to be representative of the classes that we intend to predict; otherwise, the system will generalise the skewed class distribution, and the bias will ruin your model.

These problems normally share a common cause: the difficulty of finding, extracting and storing large quantities of data and, on a second level, of cleansing, curating and processing that data.

While we can increase computing power and data storage capacity, one machine will not stand a chance when running a complex and large convolutional neural network against a large data set. It might not have enough space and, most likely, will not have enough computing power to run the classification system. It will also require access to parallel/distributed computing through cloud resources, and to understand how to run, organise and set complex clusters.

Yet, having enough data and the power to process is not enough to prevent these problems.

In this post, we’ll explore and discuss different techniques that can address the problems that arise when working with small data sets, how to mitigate class imbalance, and how to prevent over-fitting.

Transfer Learning

Data might be the new coal, to quote Neil Lawrence, and we know that deep learning algorithms need large sets of labelled data to train a fully-fledged network from scratch, but we often fail to fully comprehend how much data that means. Just finding the amount of data that meets your needs might be an endless source of frustration, but there are some techniques, such as data augmentation or transfer learning, that will save you a lot of the energy and time spent finding data for your model.

Transfer learning is a popular and very powerful approach which, in short, can be summed up as the process of learning from a pre-trained model that was trained on a larger data set. That means leveraging an existing model and changing it to suit your own goals. This method involves cutting off the last few layers of a pre-trained model and retraining them with your small data set. It has the following advantages:

It creates a new model over an older one with verified efficiency for image classification tasks. For example, a model can be built upon a CNN architecture such as Inception-v3 (a CNN developed by Google) and pre-trained with ImageNet;
It reduces the training time as it allows the reuse of parameters to achieve a performance that could take weeks to reach.

Unbalanced Data

Often the proportion of one group of labels inside a data set is unbalanced versus the others, and often this minority group of labels is the set of categories that we are interested in, precisely because of its rarity. For example, suppose we have a binary classification problem where class X represents 95% of the data and class Y the other 5%. The model is then more sensitive to class X and less sensitive to class Y: a classifier can reach an accuracy of 95% by basically predicting class X every time.

Clearly, accuracy is not an appropriate score here. In this situation, we should consider the cost of the errors, the precision, and the recall. A sensible starting point is a 2-D representation of the different types of errors, in other words, a confusion matrix. In the context of our classification outcome, it can be described as a method to illustrate the actual labels versus the predicted labels, as illustrated in the diagram below.

By storing, for each label, the number of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN) obtained from the model’s predictions, we can estimate the performance for each label using recall and precision. Precision is defined as the ratio:

$$\text{Precision} = \frac{TP}{TP + FP}$$

Recall is defined as the ratio:

$$\text{Recall} = \frac{TP}{TP + FN}$$
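
To make the 95%/5% example concrete, the following is a minimal sketch in plain Python. The helper names `confusion_counts` and `precision_recall` are hypothetical and only serve the illustration; the labels and the "always predict X" classifier mirror the example above.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN for one label treated as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1
            else:
                fp += 1
        else:
            if actual == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN); 0.0 if undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# 95 samples of class X and 5 of class Y; the classifier always answers X.
y_true = ["X"] * 95 + ["Y"] * 5
y_pred = ["X"] * 100

accuracy = sum(a == p for a, p in zip(y_true, y_pred)) / len(y_true)
tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive="Y")
precision_y, recall_y = precision_recall(tp, fp, fn)

print(accuracy)   # 0.95
print(recall_y)   # 0.0
```

The accuracy looks healthy, yet the recall for class Y shows that the rare class is never found.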

Recall and/or precision will disclose an underlying problem, but not solve it. However, there are different methods to mitigate the problems associated with a marked imbalance in the distribution of classes:

By assigning distinct coefficients (weights) to each label, so that errors on the minority class count for more;
By resampling the original dataset, either by oversampling the minority class and/or under-sampling the majority class. That said, oversampling can be prone to over-fitting, as classification boundaries will be stricter and small data sets will introduce bias;
By applying the SMOTE method (Synthetic Minority Oversampling Technique), which alleviates this problem: instead of replicating the data of the less frequent classes, it applies the same ideas behind data augmentation and creates new synthetic samples by interpolating between neighbouring instances from the minority class.
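
The interpolation idea behind SMOTE can be sketched with NumPy as follows. This is a simplified illustration, not the reference implementation: the function name `smote_like_oversample`, the parameter `k` and the toy data are assumptions made for the example.

```python
import numpy as np


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between a randomly chosen
    minority sample and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    diffs = minority[:, None, :] - minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)            # a sample is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)
        other = neighbours[base, rng.integers(k)]
        gap = rng.random()                     # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[other] - minority[base])
    return synthetic


minority = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
new_rows = smote_like_oversample(minority, n_new=6)
print(new_rows.shape)  # (6, 2)
```

Every synthetic row lies on a segment between two real minority samples, so the new points stay inside the region the minority class already occupies.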

Over-fitting

As we know, our model learns/generalises key features of a data set through backpropagation and by minimising a cost function. Each complete pass back and forth over the training set is called an epoch, and with each epoch the model is trained and the weights are adjusted to minimise the cost of the errors. In order to test the accuracy of the model, a common rule is to split the data set into a training set and a validation set.

The training set is used to create and tune the model, which embodies a proposition based on the patterns underlying the training set; the validation set tests how well the model performs on unseen samples.

In a real case, though, the validation error tends to show more ups and downs:

At the end of each epoch we test the model with the validation set, and at some point the model starts memorising the features in the training set, while the cost and the accuracy for the samples in the validation set get worse. When we reach this stage, the model is overfitting.

Selecting how large and complex the network should be is a determining factor for overfitting. Complex architectures are more prone to overfitting, but there are some strategies to prevent it:

Raising the number of samples on the training set; if the network is trained with more real cases it will generalise better;
Stopping the backpropagation when overfitting happens is another option, which implies checking the cost function and the accuracy on the validation set for each epoch;
Applying a regularisation method is another popular choice to avoid overfitting.
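
The second strategy, usually called early stopping, can be sketched in a few lines of plain Python. The function name `early_stopping_epoch`, the `patience` parameter and the list of validation losses are hypothetical values chosen for illustration.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (stop_epoch, best_epoch): training stops once the validation
    loss has failed to improve for `patience` consecutive epochs."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch


# Validation loss per epoch: it improves, then starts to drift upwards.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.58, 0.66]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=2)
print(stop_epoch, best_epoch)  # 5 3
```

In practice the weights saved at `best_epoch` would be the ones kept, since later epochs only fit the training set more closely.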

L2 Regularisation

L2 regularisation is a method that can be used to reduce the complexity of a model by penalising large individual weights: a term proportional to the sum of the squared weights is added to the cost function. By setting this penalty we decrease the dependence of our model on the training data.
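
As a minimal sketch, assuming the common form of the penalty (a coefficient, here called `lam`, times the sum of squared weights), the regularised cost and the matching gradient step could look like this. The function names and the numbers are illustrative only.

```python
import numpy as np


def l2_regularised_loss(weights, data_loss, lam):
    """Total cost = data loss + lam * sum of squared weights."""
    return data_loss + lam * np.sum(weights ** 2)


def l2_gradient_step(weights, data_grad, lam, learning_rate):
    """One gradient descent step. The penalty adds 2 * lam * w to the gradient,
    which pulls every weight towards zero on each update."""
    return weights - learning_rate * (data_grad + 2.0 * lam * weights)


weights = np.array([3.0, -0.5, 0.1])
total = l2_regularised_loss(weights, data_loss=1.0, lam=0.01)

# With a zero data gradient, the step only shrinks the weights.
stepped = l2_gradient_step(weights, data_grad=np.zeros(3), lam=0.01, learning_rate=0.1)
print(total)
print(stepped)
```

Large weights contribute most to the penalty, so they are the ones shrunk the most in absolute terms.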

Dropout

Dropout is another common option for regularisation. It’s used on the hidden units of higher layers, so that we end up with a different architecture at each training step. Basically, the system randomly selects neurons to be removed during training. As a consequence, no neuron can rely on any other being present, the remaining outputs are rescaled, and the network is forced to learn more general patterns from the data.
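
The sketch below shows one widely used variant, often called inverted dropout, in NumPy. The function name `dropout` and the drop rate are assumptions for the example; deep learning frameworks ship their own implementations.

```python
import numpy as np


def dropout(activations, drop_rate, rng, training=True):
    """Inverted dropout: zero a random subset of units during training and
    rescale the survivors so the expected activation stays the same."""
    if not training or drop_rate == 0.0:
        return activations
    keep_prob = 1.0 - drop_rate
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob


rng = np.random.default_rng(0)
hidden = np.ones((4, 1000))                     # pretend hidden-layer outputs
dropped = dropout(hidden, drop_rate=0.5, rng=rng)
at_inference = dropout(hidden, drop_rate=0.5, rng=rng, training=False)

print(np.unique(dropped))   # only 0.0 and 2.0 remain
print(dropped.mean())       # close to 1.0 on average
```

At inference time nothing is dropped, which is why the rescaling is applied during training instead.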

Conclusion

As we’ve seen, there are various methods and techniques to solve the most common classification problems in image recognition, each with its benefits and potential drawbacks. There are problems such as unbalanced data and over-fitting, and quite frequently there will not be enough data available but, as we’ve explained, their effect can be mitigated with transfer learning, sampling methods, and regularisation techniques.

This is an area that we continue to explore as we develop our own Imaginize image recognition technology. This new product feature has been designed to help our eCommerce customers improve the classification, tagging and findability of their products through being able to automatically identify and recognise colours and categories.

How to Solve the Common Problems in Image Recognition was originally published in Empathy.co on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Challenging Dimensions of Image Recognition (2º part)

David Lourenço Mestre — Mon, 28 May 2018 07:49:56 GMT

Machine Learning and Artificial Intelligence have been gaining traction and popularity over the last few years, and beyond Google’s tech scene. As eCommerce continues to grow at such an electric pace, we see more and more features, from recommendations and search ranking to alert systems, that are driven by deep learning and artificial intelligence technologies.

One of the most interesting and challenging features lies within object detection and recognition, in particular with regard to the well-known problems associated with Deep Learning/Machine Learning, such as data curation, data pre-processing, data collection, model performance, memory errors and so on.

In my last post, we started to look at some of the complexities around image recognition, and we’ll continue to explore this area here by examining how we can tackle some of these issues, as well as some of the principles behind Deep Learning.

Neural Networks

So, what are neural networks, and what is this deep learning all about? How can they recognize these features and patterns? Let’s use an illustrative example.

The above image is a famous perceptual illusion. Depending on the features your brain opts to select and process, you’ll see either a vase, or two faces looking at each other. Your brain will classify the image as either one or the other. A trained convolutional neural network works along similar principles; it will slice the image and look for certain features and patterns that it was trained to identify.

To a certain extent, classic Artificial Neural Networks mimic how the brain works. More specifically, they mimic the concept behind the inter-neuron connections. A neuron receives signals from many adjacent neurons upstream; if the input signals surpass its activation threshold, it transmits the signal. These are some of the building blocks behind an Artificial Neural Network. We have highly interconnected processing layers composed of artificial neurons, where each one receives input values from different “synapses”. Each connection has a weight assigned, and if the sum of all the weighted values/connections surpasses the activation threshold then the neuron fires its output to the next connected neuron.
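
A single artificial neuron of this kind fits in a few lines of plain Python. The weights and threshold below are hand-picked, hypothetical values (they make the neuron behave like a logical AND), and the function name `neuron_fires` is only for illustration.

```python
def neuron_fires(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else 0


weights = [0.6, 0.6]
threshold = 1.0
print(neuron_fires([1, 1], weights, threshold))  # 1
print(neuron_fires([1, 0], weights, threshold))  # 0
```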

Unlike in the brain, the weights are assigned randomly when the process starts. During the learning stage the weights are updated and optimised through backpropagation to minimise the error and, while doing so, the predictions converge to the sample labels that have been used to train the model. This process goes back and forth until the ANN (or CNN) reaches the expected output and the model captures the relations within the data. That’s how an artificial neural network learns and identifies common patterns and attributes within each class.
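
The learning loop can be illustrated with a deliberately tiny case: one sigmoid neuron trained with gradient descent on the logical OR of two inputs. Everything here (the toy data, the mean squared error, the learning rate and the number of epochs) is an assumption made for the sketch, not a description of a production network.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: the logical OR of two binary inputs.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])

weights = rng.normal(scale=0.1, size=2)   # random starting point
bias = 0.0
learning_rate = 0.5


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


losses = []
for epoch in range(2000):
    predictions = sigmoid(X @ weights + bias)        # forward pass
    error = predictions - y
    losses.append(float(np.mean(error ** 2)))        # mean squared error
    # Backward pass: chain rule through the squared error and the sigmoid.
    grad_z = 2.0 * error * predictions * (1.0 - predictions) / len(y)
    weights -= learning_rate * (X.T @ grad_z)
    bias -= learning_rate * grad_z.sum()

predicted_labels = (sigmoid(X @ weights + bias) > 0.5).astype(int)
print(losses[0], losses[-1])
print(predicted_labels)
```

The error recorded at the last epoch is lower than at the first, and the thresholded predictions move towards the training labels.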

Data Augmentation

After an extensive AI winter, Deep Learning erupted once more due to a massive increase in computing power and a huge amount of data. However, despite our era of pervasive big data, there are large limiting factors in accessing reliable data. For example, how to find the right type of data to suit a well-defined tree of classes and, importantly, how to have a balanced number of images per class. While some classes might have enough data, some will not meet the criteria and will be underrepresented.

A further crucial point is that the images may not capture all of the elements that should represent each class. The data available has to characterize the diversity and general features, patterns and attributes that we might find in real cases.

For example, let’s imagine that we have a small corpus of structured data with two classes: “Jackets” and “Sweatshirts.” Now, let’s assume that each Jacket is large enough to occupy almost the full image area, while the Sweatshirts compose just a tiny fraction of each image. After dividing the original dataset into a training set and a testing set we apply our fancy deep-learning tricks and reach an accuracy of 97%, hurrah!

What if we then, however, feed the recently trained model with a Sweatshirt that nearly occupies the whole picture, and the model says it’s a Jacket? But wait, didn’t we get an accuracy of 97%? So why would we get such a result?

Deep learning, at its core, is really just looking for the set of features and patterns that best represent each class and, in this case, the area occupied by the item of clothing is one of the main features. In the end, any model is only as good as the data used to feed it. A high-quality data set is crucial and, as we know, the more (and more diverse) data a DL algorithm has access to, the more effective it can be.

To guarantee that the model generalises well, and that it ignores noise and irrelevant features, we may choose to augment the data using the existing data set. There are different techniques to do this, such as rotating, scaling, flipping, changing the lighting conditions, or cropping. For our two-label dataset, applying different scales would certainly improve the model’s ability to generalise. While data augmentation consistently leads to improved generalisation, it will not replace a good dataset from the beginning.
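
A few of these transformations can be sketched with plain NumPy array operations. The function name `augment`, the 80% crop and the brightness range are hypothetical choices; image libraries offer richer versions (arbitrary rotation angles, proper rescaling) that this sketch does not attempt.

```python
import numpy as np


def augment(image, rng):
    """Return a few randomly transformed copies of an H x W x 3 uint8 image."""
    height, width = image.shape[:2]
    flipped = image[:, ::-1]                               # horizontal flip
    rotated = np.rot90(image, k=int(rng.integers(1, 4)))   # 90, 180 or 270 degrees
    # Random crop covering 80% of each side (a crude change of scale).
    crop_h, crop_w = int(height * 0.8), int(width * 0.8)
    top = int(rng.integers(0, height - crop_h + 1))
    left = int(rng.integers(0, width - crop_w + 1))
    cropped = image[top:top + crop_h, left:left + crop_w]
    # Brightness change, clipped back into the valid 0-255 range.
    factor = rng.uniform(0.7, 1.3)
    relit = np.clip(image.astype(float) * factor, 0, 255).astype(np.uint8)
    return {"flipped": flipped, "rotated": rotated, "cropped": cropped, "relit": relit}


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(10, 20, 3), dtype=np.uint8)   # stand-in for a photo
variants = augment(image, rng)
for name, variant in variants.items():
    print(name, variant.shape)
```

Each call produces different crops, rotations and lighting, so one original image yields many training samples.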

Parallelism

While working with Deep Learning there are different libraries and frameworks available: Deep Learning Pipelines, Keras, and TensorFlow are examples. We decided to look at TensorFlow (TF), as it is one of the most mature options available. TF is a low-level API with a steep learning curve; it was developed by the researchers and engineers at Google Brain, and open-sourced in 2015.

One of the first problems we came across, and we already knew to expect it, was that we would not be able to fit our training tasks into memory, and that it would take days or weeks to finish a simple training task on a regular CPU. We would have to distribute TF and assign the graph across different machines. Although TF offers a native solution, making it work can be as fun as a toothache! It requires manually managing a cluster of machines, enabling the gRPC protocol, manually configuring the devices, and so on.

In 2017 Yahoo open-sourced TensorFlowOnSpark, significantly reducing the pain thanks to the many pros that Spark brings: data integration through RDDs, S3 and HDFS incorporated within the pipelines, and an almost effortless integration with GPU and CPU clusters on-premise or in the cloud. Nevertheless, making it work still requires a slow process of trial and error, and there are still some hurdles to overcome.

For example, it doesn’t come with a cluster manager or a distributed I/O layer out of the box so users have to manually set up a distributed TF cluster and feed data into it. This also means it comes with all the problems of having 300 error messages in the Yarn logs when you just missed a typo in the code! It’s also not easy to identify the correct configuration on the command line, and there are some slightly obscure steps such as having to install libhdfs.so on all machines. Still, any TF program can be modified to work with TensorFlowOnSpark.

TensorFlowOnSpark launches distributed model training as a single Spark job, and automatically feeds data from Spark RDDs or DataFrames into the TensorFlow job. The cluster architecture can be divided into three kinds of nodes:

A PS (Parameter Server) node that hosts the models
A master worker that coordinates the training operations
Workers responsible for managing the training steps

We initially worked with clusters using CPU machines but, in the end, we decided to work exclusively with GPU instances, as even a single GPU machine offers impressive results. A GPU can be seen as a cluster of multiple computational units, and its specific capabilities can be exploited to further speed up calculations.

Conclusions

Tuning neural network settings is not a trivial process. There are many hyperparameters, for example activation functions, learning rate, batch size, momentum, number of epochs and different types of regularisation, and furthermore the layers of the network can vary in number, type, and width. Fine-tuning therefore requires extensive bookkeeping and, even if there are different methods to find the optimal configuration, the large scope of potential combinations, especially when working with a low number of machines, makes the operation not for the faint of heart.
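
To give a feel for how quickly the combinations grow, here is a small sketch that enumerates a grid with `itertools.product`. The search space is hypothetical; the values are placeholders, not recommendations.

```python
from itertools import product

# Hypothetical search space; the values are placeholders, not recommendations.
search_space = {
    "activation": ["relu", "tanh", "sigmoid"],
    "learning_rate": [0.1, 0.01, 0.001],
    "batch_size": [32, 64, 128],
    "momentum": [0.0, 0.9],
    "epochs": [10, 50],
    "l2_penalty": [0.0, 0.0001, 0.001],
    "hidden_layers": [1, 2, 3, 4],
}

names = list(search_space)
configurations = [dict(zip(names, values)) for values in product(*search_space.values())]
print(len(configurations))  # 1296
```

Even with two to four candidate values per setting, an exhaustive search already means more than a thousand training runs, which is why keeping a record of what has been tried matters.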

Computational time and memory usage are high. Even a fairly small dataset will require expensive machines, in the cloud or on-premise, and some cloud GPU machines might fail when working with large image widths and heights.

The data used to train a model has to be nearly perfect, meet exceptionally comprehensive and high quality standards, and finding, or creating, an acceptable dataset is often the biggest hurdle. Working with an incomplete or poor dataset will be an endless source of frustration.

In the next post, we’ll continue this exploration by looking at Faster R-CNN, and how we can use it to detect fashion items.

The Challenging Dimensions of Image Recognition (2º part) was originally published in Empathy.co on Medium, where people are continuing the conversation by highlighting and responding to this story.

The Challenging Dimensions of Image Recognition (part 1)

David Lourenço Mestre — Fri, 02 Mar 2018 09:06:20 GMT

As consistent product data greases the wheels of eCommerce, the inconsistency that emerges from dealing with a large number of fashion items raises different challenges. How to organise and standardise the product data received from different retailers? Or, how to supply the missing information and ensure accuracy across distinctive catalogues?

For example, the colour for the same jacket may be classified by Retailer A as “salmon”, “light-red” by Retailer B, and Retailer C might use an abbreviation such as “RD”. Also the descriptive category/name for the same jacket might vary across the suppliers.

To save us from the human effort required to curate and clean data, one method is to use automated solutions that standardise information through robust tools to clean up the data received from different sources and extract attributes and categories from fashion items.

There are, however, problems to overcome when using this method. For example, if a random image contains clothing, how do you predict the clothing type through multi-class classification, how do you predict the colours, and how do you find attributes such as stripes, types of sleeves and collars?

For a human eye, it might be an easy task to identify a huge range of objects and detect features such as colour or to distinguish a coat from a shirt. To recognise patterns and similarities comes naturally to us but for a computer, image recognition is still a significant challenge.

And while working with convolutional neural networks, it’s also important to cover classical architectures to know where this takes us.

Image Segmentation

Optimising a solution for fashion classification is a fundamental vision problem. There are multiple techniques; first, it’s important to work with image segmentation, and use this approach to analyse an image based on the abrupt changes in the homogeneity attributes of pixels. Image segmentation should be used as a first step in image analysis and recognition.

It’s also essential to note that when processing images, it’s key to focus on the features that represent the dominant object. However, this is not always as simple as it sounds, as each image may well have a background of some sort: a plain white backdrop, or a complex setting such as a street scene.

To address this problem, there are many different methods available for image segmentation; one option is to look for a light solution. Watershed is a good background removal solution, and one of the most common algorithms used for image segmentation. It extracts information from an image by grouping pixels into regions of similarity.

Starting from user defined markers, the Watershed algorithm treats the grayscale image as if it was composed of “high” and “low” areas. The algorithm floods the “low” areas using the values set on the markers. The values above the threshold markers are then used to extract the dominant object from the image, by creating an alpha channel that reflects the proportion of foreground and background.

Original Image

Alpha Channel

However, Watershed doesn’t always work well for images with fairly complex backgrounds.

Another solution is to work with GrabCut. Albeit slower than Watershed, it performs well on complex backgrounds. GrabCut tends to be used as an interactive foreground extraction tool, but it can be tweaked to work in an autonomous fashion. When applying it, it’s fair to make the assumption that the product will be centred within the image.

By setting a box over an image, the algorithm defines everything outside of the box as known background, and the data inside it is classified as unknown. The algorithm does an initial classification based on these entry values and tries to estimate which class the unknown pixels belong to. Through an iterative process, GrabCut applies a probabilistic function to identify the probable foreground and the probable background. The process is repeated until the classification converges.

Clustering

An image can be viewed as a large array of discrete pixels. For an image encoded with three channels (Red, Green and Blue), each pixel represents a colour. Therefore, clustering is a good method to extract the colours from a fashion article and to divide and group similar pixels.

After all, when we as people define something as Red or Blue, we are also clustering specific wavelengths into similar groups, and categorising those groups with a symbolic name.

To cluster image data, one technique is to work with an unsupervised machine learning algorithm such as k-means. K-means relocates centroids to minimise the sum of squared distances between each centroid and the pixels assigned to it.

The process runs k-means with a number of centroids equal to the number of colours we want to extract, where each centroid represents one of the k clusters. On each iteration, the algorithm computes the Euclidean distance from each pixel in the image to every cluster mean and assigns the pixel to the nearest one.

After some iterations, each centroid moves from a random position to a local optimum (the centre of its cluster), in a process that recalculates the centre of all clusters on each iteration. The final centroid positions can then be used as a proxy for the colours that will define the palette.
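
The assignment and update steps can be sketched in NumPy as follows. This is a bare-bones illustration rather than a tuned implementation: the function name `kmeans_palette`, the fixed number of iterations and the synthetic two-colour "image" are assumptions made for the example.

```python
import numpy as np


def kmeans_palette(pixels, k, n_iter=20, seed=0):
    """Cluster an (N, 3) array of RGB values and return (centroids, labels)."""
    rng = np.random.default_rng(seed)
    pixels = np.asarray(pixels, dtype=float)
    # Start from k distinct colours picked at random from the image.
    unique_colours = np.unique(pixels, axis=0)
    centroids = unique_colours[rng.choice(len(unique_colours), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every pixel to every centroid.
        distances = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned pixels.
        for j in range(k):
            members = pixels[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids, labels


# Synthetic "image": 60 reddish pixels and 40 bluish pixels with a little noise.
rng = np.random.default_rng(1)
red = np.array([200.0, 30.0, 30.0]) + rng.normal(scale=3.0, size=(60, 3))
blue = np.array([20.0, 40.0, 180.0]) + rng.normal(scale=3.0, size=(40, 3))
pixels = np.vstack([red, blue])

centroids, labels = kmeans_palette(pixels, k=2)
shares = np.bincount(labels, minlength=2) / len(labels)
dominant = centroids[shares.argmax()]
print(np.round(centroids))
print(shares)
```

The centroids give the palette, and the share of pixels assigned to each one indicates which colour dominates the item.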

The challenge with k-means is that it’s necessary to specify the number of clusters before running the algorithm, and estimating the number of clusters from an array of RGB values can be computationally expensive and a slow procedure. And even though there are some methods to assess the number of clusters, such as Elbow or Silhouette, none offer a wholly accurate estimation for the ideal number of clusters.

Conclusions

While attaining respectable results, GrabCut and clustering are slow, due to the iterative refinement stage in GrabCut, as well as having to determine the ideal number of clusters beforehand, which is a problem in its own right. This makes both methods problematic when used to process large catalogues. In the next post, we’ll continue this exploration by looking at deep learning, and how we can use this method to automate some of our tools.

The Challenging Dimensions of Image Recognition (part 1) was originally published in Empathy.co on Medium, where people are continuing the conversation by highlighting and responding to this story.