The Challenging Dimensions of Image Recognition (Part 2)

David Lourenço Mestre
Published in Empathy.co
7 min read · May 28, 2018

Machine Learning and Artificial Intelligence have been gaining traction and popularity over the last few years, well beyond Google's tech scene. As eCommerce continues to grow at such an electric pace, we see more and more features, from recommendations and search ranking to alert systems, that are driven by deep learning and artificial intelligence technologies.

One of the most interesting and challenging of these features is object detection and recognition, particularly in regards to the well-known problems associated with Deep Learning/Machine Learning: data collection, data curation, data pre-processing, model performance, memory errors and so on.

In my last post, we started to look at some of the complexities around image recognition, and we'll continue to explore this area here by examining how we can tackle some of these issues, as well as some of the principles behind Deep Learning.

Neural Networks

So, what are neural networks, and what is this deep learning all about? How can they recognize these features and patterns? Let's use an illustrative example.

The above image is a famous perceptual illusion (Rubin's vase). Depending on the features your brain opts to select, the features it processes, you'll see either a vase, or two faces looking at each other. Your brain will classify the image as either one or the other. A trained convolutional neural network works along similar principles; it will slice the image and look for certain features and patterns that it was trained to identify.

To a certain extent, classic Artificial Neural Networks mimic how the brain works. More specifically, they mimic the concept behind inter-neuron connections. A biological neuron receives signals from many adjacent neurons upstream and, if the combined input surpasses its activation threshold, it transmits the signal onwards. These are some of the building blocks behind an Artificial Neural Network: highly interconnected processing layers composed of artificial neurons, where each one receives input values from different “synapses”. Each connection has a weight assigned, and if the sum of all the weighted inputs surpasses the activation threshold, the neuron fires its output to the next connected neurons.
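To make the weighted-sum idea concrete, here is a minimal sketch of a single artificial neuron in Python; the input values, weights and step activation are illustrative choices, not from any particular library:

```python
import numpy as np

def neuron(inputs, weights, bias, threshold=0.0):
    """A single artificial neuron: a weighted sum of the inputs plus
    a bias, passed through a simple step activation."""
    weighted_sum = np.dot(inputs, weights) + bias
    # The neuron "fires" (outputs 1) only if the weighted sum
    # surpasses the activation threshold.
    return 1.0 if weighted_sum > threshold else 0.0

# Three input "synapses", each with its own weight
inputs = np.array([0.9, 0.3, 0.5])
weights = np.array([0.4, -0.2, 0.7])
print(neuron(inputs, weights, bias=-0.1))  # -> 1.0 (0.55 > 0.0)
```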

Unlike in the brain, the weights are assigned randomly when the process starts. During the learning stage the weights are updated and optimised through backpropagation to minimise the error and, while doing so, the predictions converge towards the sample labels that have been used to train the model. This process goes back and forth until the ANN (or CNN) reaches the expected output and the model captures the relations within the data. That's how an artificial neural network learns and identifies common patterns and attributes within each class.
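As a rough illustration of that loop, the sketch below (plain TensorFlow/Keras on toy data; the layer sizes and optimiser are arbitrary choices, not a recommendation) starts from randomly initialised weights and lets backpropagation adjust them to fit the labels:

```python
import numpy as np
from tensorflow import keras

# Toy labelled data: 100 samples with 4 features, 2 classes
x_train = np.random.rand(100, 4)
y_train = np.random.randint(0, 2, size=(100,))

# Weights start out random; training updates them via backpropagation
model = keras.Sequential([
    keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    keras.layers.Dense(2, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Each epoch runs forward passes, measures the error against the
# labels, and backpropagates it to nudge the weights closer
model.fit(x_train, y_train, epochs=10, batch_size=16)
```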

Data Augmentation

After an extensive AI winter, Deep Learning erupted once more due to a massive increase in computing power and a huge amount of data. However, despite our era of pervasive big data, there are large limiting factors in accessing reliable data. For example, how to find the right type of data to suit a well-defined tree of classes and, importantly, how to have a balanced number of images per class. While some classes might have enough data, some will not meet the criteria and will be underrepresented.

A crucial point is that the images may not capture all of the elements that should represent each class. The available data has to characterize the diversity of general features, patterns and attributes that we might find in real cases.

For example, let’s imagine that we have a small corpus of structured data with two classes: “Jackets” and “Sweatshirts.” Now, let’s assume that each Jacket is large enough to occupy almost the full image, while the Sweatshirts take up just a tiny fraction of each image. After dividing the original dataset into a training set and a testing set we apply our fancy deep-learning tricks and reach an accuracy of 97%, hurrah!

What if we then feed the freshly trained model a Sweatshirt that occupies nearly the whole picture, and the model says it’s a Jacket? But wait, didn’t we get an accuracy of 97%? So why would we get such a result?

Deep-learning, at its core, is really just looking for the set of features and patterns that best represent each class and, in this case, the area occupied by the item of clothing is one of the main features. In the end, any model is only as good as the data used to train it. A high-quality data set is crucial and, as we know, the more (and more diverse) data a DL algorithm has access to, the more effective it can be.

To guarantee that we have good general properties, and that the model ignores the noise and irrelevant features, we may choose to augment the data using the existing data set. There are different techniques to do this, such as rotating, scaling, flipping, changing the lighting conditions, or cropping. For our two-label dataset, applying different scales would certainly improve the model’s general properties. While data augmentation often improves generalisation, it is no substitute for a good dataset in the first place.
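As a sketch of what such augmentation can look like in practice, here is a configuration using Keras’ ImageDataGenerator; the parameter values are arbitrary examples, not tuned settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Each parameter corresponds to one augmentation technique:
# rotation, scaling (zoom), flipping, lighting changes, and
# shifts that effectively re-crop the image.
datagen = ImageDataGenerator(
    rotation_range=20,             # random rotations up to 20 degrees
    zoom_range=0.3,                # random scaling in/out by up to 30%
    horizontal_flip=True,          # random horizontal flips
    brightness_range=(0.7, 1.3),   # random lighting conditions
    width_shift_range=0.1,         # random horizontal shifts (crop-like)
    height_shift_range=0.1,        # random vertical shifts (crop-like)
)

# Given arrays x_train (images) and y_train (labels), the generator
# yields randomly transformed variants of the originals each epoch:
# model.fit(datagen.flow(x_train, y_train, batch_size=32), ...)
```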

Parallelism

While working with Deep Learning there are different libraries and frameworks available: Deep Learning Pipelines, Keras, and TensorFlow, are examples. We decided to look at TensorFlow (TF), as it is one of the most mature options available. TF is a low-level API with a steep learning curve; it was developed by the researchers and engineers at the Google Brain team and open-sourced in 2015.

One of the first problems we came across, and one we already knew to expect, was that we would not be able to fit our training tasks into memory, and that a simple training task would take days or weeks to finish on a regular CPU. We would have to distribute TF and assign the graph across different machines. Although TF offers a native solution, making it work can be as fun as a toothache! It requires manually managing a cluster of machines, enabling the gRPC protocol, manually configuring the devices, and so on.
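To give a flavour of that manual configuration, here is roughly what the native (TF 1.x) approach asks for; the host names are placeholders, and a matching script, with the right job name and task index, has to run on every machine in the cluster:

```python
import tensorflow as tf

# The same cluster definition must be replicated, by hand,
# on every machine taking part in the training job.
cluster = tf.train.ClusterSpec({
    "ps": ["ps0.example.com:2222"],          # parameter server
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222"],  # training workers
})

# Each machine starts a gRPC server for its own role in the cluster;
# job_name and task_index have to be set correctly on each host.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
```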

In 2017, Yahoo open-sourced TensorFlowOnSpark, significantly reducing the pain thanks to the many pros that Spark brings: data integration through RDDs, S3 and HDFS incorporated within the pipelines, and an almost effortless integration with GPU and CPU clusters, on-premise or in the cloud. Nevertheless, making it work still requires a slow process of trial and error, and there are still some hurdles to overcome.

For example, it doesn’t come with a cluster manager or a distributed I/O layer out of the box, so users have to manually set up a distributed TF cluster and feed data into it. This also means it comes with all the problems of having 300 error messages in the Yarn logs when there’s just a typo in the code! It’s also not easy to identify the correct configuration on the command line, and there are some slightly obscure steps, such as having to install libhdfs.so on all machines. Still, any TF program can be modified to work with TensorFlowOnSpark.

TensorFlowOnSpark launches distributed model training as a single Spark job, and automatically feeds data from Spark RDDs or DataFrames into the TensorFlow processes. The cluster architecture can be divided into three kinds of nodes (a launch sketch follows the list):

  • A PS (Parameter Server) node that hosts the model’s parameters
  • A master worker that coordinates the training operations
  • Workers responsible for managing the training steps
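A minimal sketch of how such a job is launched through the TensorFlowOnSpark API; the training function, the data and the cluster sizes here are placeholders, not our actual configuration:

```python
from pyspark import SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Runs on every executor. ctx.job_name says whether this node is
    # the "ps" (parameter server), the master worker (task 0) or a
    # plain worker; the TensorFlow training graph would be built here.
    pass

sc = SparkContext.getOrCreate()

# One Spark job hosts the whole distributed TF cluster:
# 4 executors in total, 1 of them acting as the parameter server.
cluster = TFCluster.run(sc, main_fun, tf_args=None,
                        num_executors=4, num_ps=1,
                        tensorboard=False,
                        input_mode=TFCluster.InputMode.SPARK)

# In SPARK input mode, RDD partitions are fed straight to the workers
dummy_rdd = sc.parallelize([([0.0] * 4, 0)])  # placeholder (features, label)
cluster.train(dummy_rdd, num_epochs=1)
cluster.shutdown()
```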

We initially worked with clusters of CPU machines but, in the end, we decided to work exclusively with GPU instances, as even a single GPU machine offers impressive results. A GPU can be seen as a cluster of multiple computational units, and its specific capabilities can be exploited to further speed up calculations.

Conclusions

Tuning neural network settings is not a trivial process. There are many hyperparameters, for example activation functions, learning rate, batch size, momentum, number of epochs and different types of regularization; furthermore, the layers of the network can vary in number, type, and width. Fine-tuning therefore requires extensive bookkeeping and, even though there are different methods for finding the optimal configuration, the large scope of potential combinations, especially when working with a low number of machines, makes the operation not for the faint of heart.
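For a sense of how quickly that combination space blows up, even a deliberately tiny illustrative grid (the values below are arbitrary) already yields dozens of full training runs:

```python
from itertools import product

# A deliberately small grid; real searches are much larger
activations = ["relu", "tanh"]
learning_rates = [0.1, 0.01, 0.001]
batch_sizes = [32, 64]
momentums = [0.0, 0.9]

combinations = list(product(activations, learning_rates,
                            batch_sizes, momentums))
print(len(combinations))  # 24 full training runs, before even
                          # varying epochs, regularization or layers
```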

Computational time and memory usage are also high. Even a fairly small dataset will require expensive machines, cloud or on-premise, and some cloud GPU machines might fail when working with large image widths and heights.

The data used to train a model has to be nearly perfect, meeting exceptionally comprehensive and high-quality standards, and finding, or creating, an acceptable dataset is often the biggest hurdle. Working with an incomplete or poor dataset will be an endless source of frustration.

In the next post, we’ll continue this exploration by looking at Faster R-CNN, and how we can use it to detect fashion items.

