From automatic image cropping to face recognition and smart object detection Cloud services

In recent years we have experienced explosive growth in mobile multimedia applications. In these applications, images play an increasingly important role in sharing, expressing and exchanging information in our daily lives.

Accompanying this revolution, mobile handheld devices of varying capabilities have made considerable progress thanks to their portability and mobility. We can now all easily capture and share personal photos on these small-form-factor devices anywhere and anytime (hence the success of user-generated content platforms like Flickr or Instagram).

However, many hurdles still need to be crossed. In particular, due to the size limitation of mobile devices (it is precisely their small size that makes them “mobile”), new technologies are needed to facilitate the browsing of large pictures on small screens. Put simply: how do we best view images on a small display, and how do we automate as much of the browsing process as possible?

A lot of effort has gone into this problem (starting with the work of Laurent Itti, Christof Koch and Ernst Niebur), and many related issues still need improvement. I take this chance to thank in particular my two Italian professors at Università degli Studi di Milano-Bicocca, Raimondo Schettini and Francesca Gasparini.

With image recognition technology on the rise (see my personal examples with AWS Rekognition that I tweeted in December 2016) and Artificial Intelligence features automating the labeling and identification of patterns, I saw a good opportunity (and a kind of related retrospective) to bring back my past technical work on developing such algorithms. I hope this helps you understand the basics of how similar services work and what they provide.

As I mentioned, the issue I will try to illustrate is the following: adapting pictures to small displays. During one of my past university projects (back in 2006) I studied some of the approaches for automatic browsing of large pictures on mobile devices; the aim was to introduce a new solution that best satisfies the user’s requirements. The algorithm was entirely implemented in Matlab (the well-known multi-purpose numerical analysis environment developed by MathWorks) and was tested on several sample images.

The key idea behind the proposed method is to use a neurodynamic model of visual attention to drive the detection of the salient regions in an image.

It is in fact well known that only part of a given scene is processed by our visual attention in full detail, while the remainder is left relatively unprocessed.

One example of this visual attention mechanism is the way we rapidly direct and shift our gaze towards interesting parts of the visual input (which may be a face in the image, text, animals or simply a generic panorama…).

One of the aims of this work was to automatically browse different salient regions of the scene, simulating this particular visual shift.

The attention model used in the solution is the so-called saliency map based model (the most salient parts of an image are highlighted according to our psycho-visual characteristics), which has been proposed as a computational model of focal visual attention.

From the saliency maps, and depending on the type of input image, a strategy was developed to crop the image into a set of significant regions. These cropped sub-images are then ordered with respect to their saliency.

To automate the entire browsing process, a visualization technique is applied: the auto-cropping of these salient regions. A generic image is decomposed into a set of spatial information elements which are displayed serially to help users browse or search through the whole large image, as in the sketch below.
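As a rough sketch of this serial-browsing idea (my own illustrative Python, not the original Matlab implementation), the snippet below crops a list of salient regions out of an image and yields them in decreasing order of saliency:

```python
import numpy as np

def browse_salient_regions(image, regions):
    """Yield cropped sub-images in decreasing order of saliency.

    image   -- H x W x 3 numpy array
    regions -- list of (saliency_value, (left, top, right, bottom)) tuples,
               e.g. produced by the saliency analysis described later
    """
    # Order the attention objects by saliency value, highest first,
    # then crop each one; the caller displays one crop at a time.
    for saliency, (left, top, right, bottom) in sorted(
            regions, key=lambda r: r[0], reverse=True):
        yield image[top:bottom, left:right]

# Usage: the more salient region (0.9) is presented first.
img = np.zeros((480, 640, 3), dtype=np.uint8)
crops = list(browse_salient_regions(img, [(0.3, (0, 0, 100, 100)),
                                          (0.9, (200, 150, 400, 350))]))
```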

Definition of a visual attention model

The visual attention model for an image is defined as a set of attention objects.

An attention object (AO) is an information carrier that catches part of the user’s attention as a whole. An attention object often represents a semantic object in our mind, such as a human face, a moving car, a penguin, a word, etc.

At this point, three attributes can be assigned to each attention object: region-of-interest (ROI), attention value (AV) and minimal perceptible size (MPS). Each of them is introduced in the following sections.

Region-Of-Interest

The notion of Region-Of-Interest (ROI), first introduced by JPEG 2000, refers in our model to a spatial region or segment within an image that corresponds to an attention object. As shown in the illustration below, ROIs can have arbitrary shapes (squares, triangles, rectangles…), and the ROIs of different attention objects are allowed to overlap. We can therefore think of a ROI as a set of pixels in the original image.

However, regular-shaped ROIs can be denoted by their geometric parameters instead of pixel sets for simplicity. For example, a rectangular ROI can be defined by the set of coordinates {Left, Top, Right, Bottom} or {Left, Top, Width, Height}, while a circular ROI can be defined as {Center_x, Center_y, Radius}.

Because we know each ROI’s position within the entire image, we can then decide which region of interest to manipulate or simply visualize.
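For illustration, here is one way such parametrized ROIs could be represented in code (a minimal Python sketch; the class and field names are my own, not part of any standard):

```python
from dataclasses import dataclass

@dataclass
class RectROI:
    """Rectangular ROI defined by {Left, Top, Right, Bottom}."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # Pixel-membership test: is (x, y) inside the rectangle?
        return self.left <= x < self.right and self.top <= y < self.bottom

    @property
    def area(self) -> int:
        return (self.right - self.left) * (self.bottom - self.top)

@dataclass
class CircleROI:
    """Circular ROI defined by {Center_x, Center_y, Radius}."""
    center_x: float
    center_y: float
    radius: float

    def contains(self, x: float, y: float) -> bool:
        return (x - self.center_x) ** 2 + (y - self.center_y) ** 2 <= self.radius ** 2
```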

Attention value

Different attention objects obviously carry different amounts of information (they are not even the same size), so they are of different importance.

Therefore, the attention value (AV) is introduced as an indicator of the weight each attention object contributes to the information contained in the original image.

To understand this, think of an image containing a green valley with a sheep in the middle of it. Clearly we are more interested in viewing the sheep first, and only then the valley around it.
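One simple way to make this concrete (my own sketch, not a formula from the original work) is to normalize raw importance weights so that the attention values of all objects in an image sum to one:

```python
def attention_values(raw_weights):
    """Normalize raw importance weights into attention values that sum to 1."""
    total = sum(raw_weights)
    return [w / total for w in raw_weights]

# The sheep dominates the valley in the example above.
print(attention_values([8.0, 2.0]))  # -> [0.8, 0.2]
```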

Minimal perceptible size

For image adaptation, several techniques can be applied: resolution scaling, spatial cropping, quality compression and color reduction. When fitting an image to a small screen, a natural and simple approach is to directly down-sample it to reduce its spatial size, but much information is lost in the resolution reduction.

Obviously, the information an attention object conveys strongly depends on its area of presentation. If an attention object is scaled down too much, it may no longer be perceptible enough for users to catch the information the author intends to deliver (in the example above, to clearly distinguish the coffee cup, the girl’s face and the shop sign). To avoid this problem we introduce the minimal perceptible size (MPS), which represents the minimal allowable spatial area of an attention object. The MPS is used as a threshold to determine whether an attention object should be sub-sampled or cropped during adaptation.

Suppose an image contains N attention objects, AOi for i = 1, 2, …, N, where AOi denotes the i-th attention object within the image. The MPS of AOi indicates the minimal perceptible size of AOi, which can be expressed as the area of the scaled-down region.

For instance, consider an attention object containing a human face whose original resolution is 75x90 pixels. A developer of small-screen services may set its MPS to 25x30 pixels, the smallest resolution at which the face region can be shown without severely degrading its perceptibility (otherwise the face may no longer be recognizable).
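Here is a hedged sketch of how the MPS could act as that threshold during adaptation, mirroring the 75x90 face example above (the data structure and function names are mine, not the original implementation):

```python
from dataclasses import dataclass

@dataclass
class AttentionObject:
    roi: tuple              # (left, top, right, bottom) in the original image
    attention_value: float  # AV: relative importance of this object
    mps: int                # MPS: minimal perceptible area, in pixels

def can_downsample(ao: AttentionObject, scale: float) -> bool:
    """True if the AO stays perceptible after scaling by `scale`;
    if not, the adaptation should crop the region instead."""
    left, top, right, bottom = ao.roi
    scaled_area = (right - left) * (bottom - top) * scale * scale
    return scaled_area >= ao.mps

# A 75x90 face with an MPS of 25x30 pixels survives scaling by 1/3...
face = AttentionObject(roi=(0, 0, 75, 90), attention_value=0.6, mps=25 * 30)
print(can_downsample(face, 1 / 3))  # True: exactly 25x30 pixels remain
print(can_downsample(face, 1 / 4))  # False: crop instead of down-sampling
```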

Saliency attention model

By analyzing an image, we can extract many visual features (including color, shape and texture) that can be used to generate a saliency-based attention model. In addition, special objects like human faces and text tend to attract most of the user’s attention. In this section, I discuss some visual attention models used for modeling image attention, as well as a framework to integrate them.

A saliency-based visual attention model for scene analysis is introduced here. First, three feature maps must be generated (color contrast, intensity contrast and orientation contrast) using the approaches described below; the final saliency map is then built from them.
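As a taste of how such a map can be computed, here is a highly simplified, single-channel sketch (center-surround contrast on the intensity channel only; the full model also builds the color and orientation maps and normalizes across scales, which I omit here):

```python
import cv2
import numpy as np

def intensity_saliency(image_bgr, levels=5):
    """Crude center-surround saliency computed on the intensity channel."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY).astype(np.float32) / 255.0
    h, w = gray.shape

    # Gaussian pyramid: each level is a coarser "surround" view of the scene.
    pyramid = [gray]
    for _ in range(levels):
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    saliency = np.zeros((h, w), dtype=np.float32)
    for coarse in pyramid[2:]:
        # Upsample the coarse level back to full resolution and accumulate
        # the absolute center-surround difference.
        surround = cv2.resize(coarse, (w, h), interpolation=cv2.INTER_LINEAR)
        saliency += np.abs(gray - surround)

    # Normalize the combined map to [0, 1].
    return cv2.normalize(saliency, None, 0.0, 1.0, cv2.NORM_MINMAX)
```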

As mentioned earlier, the saliency attention is determined by the number of salient regions and by their brightness, area and position in the gray-level saliency map, as shown below (the image shows me in front of a restaurant in Paris).

However, in order to reduce adaptation time, we binarize the saliency map to find the regions that most likely attract human attention. Since people often pay more attention to the region near the image center, a normalized Gaussian template centered on the image can be used to assign the position weight.

In a clockwise rotation:

(a) Input image, (b) saliency map, (c) thresholded saliency map, (d) erosion applied to the thresholding result, (e) connected-component labeling and (f) connected-component labeling with erosion applied.
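Steps (c) through (f) map quite directly onto standard OpenCV operations. Below is an illustrative sketch, including the Gaussian position weighting mentioned above (the threshold and kernel values are my own guesses, not the parameters of the original implementation):

```python
import cv2
import numpy as np

def salient_regions(saliency, thresh=0.5):
    """Binarize a [0, 1] saliency map, clean it with erosion, and return
    labeled regions scored by brightness, area and position."""
    h, w = saliency.shape

    # (c) Threshold the saliency map into a binary mask.
    binary = (saliency >= thresh).astype(np.uint8)

    # (d) Erode to remove isolated noisy pixels.
    binary = cv2.erode(binary, np.ones((3, 3), np.uint8))

    # (e)/(f) Connected-component labeling of the remaining blobs.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)

    # Normalized Gaussian template centered on the image: regions near
    # the center receive a higher position weight.
    gx = cv2.getGaussianKernel(w, w / 4)
    gy = cv2.getGaussianKernel(h, h / 4)
    position_weight = gy @ gx.T
    position_weight /= position_weight.max()

    regions = []
    for i in range(1, n):  # label 0 is the background
        mask = labels == i
        brightness = saliency[mask].mean()
        area = stats[i, cv2.CC_STAT_AREA]
        cx, cy = centroids[i]
        # Region score combines brightness, area and position, as above.
        score = brightness * area * position_weight[int(cy), int(cx)]
        left, top = stats[i, cv2.CC_STAT_LEFT], stats[i, cv2.CC_STAT_TOP]
        regions.append((score, (left, top,
                                left + stats[i, cv2.CC_STAT_WIDTH],
                                top + stats[i, cv2.CC_STAT_HEIGHT])))
    return sorted(regions, reverse=True)
```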

Face attention model

The face is one of the most salient features of human beings, and the appearance of dominant faces in images certainly attracts viewers’ attention. It is no coincidence that faces appear in the majority of our everyday photos. Therefore, a face attention model should be integrated into the image attention model to enhance performance. By employing a face detection algorithm (try the Google Vision API and Google Face Detection service), we can obtain face information and object identification within an image (the success of the process depends on the pose, the position of the face, the quality of the image, etc.).
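To experiment locally without calling a cloud API, one could use OpenCV’s bundled Haar cascade as a rough stand-in for such a service (a classical detector, far less robust than the cloud offerings just mentioned):

```python
import cv2

def face_attention_objects(image_bgr, face_av=1.0):
    """Detect faces and return them as (attention_value, roi) tuples,
    ready to be merged with the saliency-based regions computed earlier."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    # Give faces a fixed, high attention value so they rank above most
    # purely saliency-derived regions when the two models are integrated.
    return [(face_av, (x, y, x + w, y + h)) for (x, y, w, h) in faces]
```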

Of course, the basics of this saliency map algorithm have already been implemented and integrated into many visual processing projects (Machine Learning, Computer Vision, Augmented Reality) and are being included in many Cloud Computing providers’ offerings, but the main reason I wrote this retrospective was to keep attention high on the evolution of AI services.

And personally, I think this kind of business trend is only at its starting point!