TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Developing an AI-based Android app for image annotation

7 min read · Mar 17, 2021


Image by author

Project Definition & Summary

Big data and emerging technologies have a profound impact on our everyday lives that is hard to overlook. In particular, the amount of visual information, such as photos and social media images, has grown exponentially in recent years, motivating the development of software that can efficiently handle high-volume image collections. Image tagging is one of the common computer vision problems: generating textual tags based on the visual content of images. The ability to automatically retrieve relevant textual labels for images makes it possible to automate image labeling (e.g. online catalogs), organize high-volume image content (e.g. photo managers) and improve existing image search and sharing applications (e.g. recommendation engines).

To address these problems, an Android application, ImageTagger, was developed to automatically generate relevant textual descriptions of images. For that, a combination of deep learning and classical machine learning techniques was applied. Feature extraction was implemented using a pre-trained convolutional neural network (CNN), and relevant tags were derived using a Gaussian naive Bayes classifier. Here, a general-purpose image classification dataset, Tiny ImageNet, was used to generate image embeddings with MobileNet V2, and the performance of the latter was evaluated using the accuracy metric, which is widely used for multi-class classification problems. A merit of this approach is that no hyper-parameter tuning is needed, as the chosen classifier has no hyper-parameters.

The resulting Android app allows users to upload an image they would like to have described and to get relevant tags for it. Moreover, ImageTagger lets users update existing tags and add new ones to an image. Over time, the system learns tags from new input images in an online manner based on explicit user feedback and gets better at predicting descriptions on its own.

Methodology

Utilizing computer vision for embedding generation

Here, a combination of deep learning and classical machine learning techniques has been applied. The overall analysis can be split into two steps, namely feature extraction and classification. The feature extraction part was implemented using a deep convolutional neural network (CNN), and the labels to be assigned were computed using a Gaussian naive Bayes model. Figure 1 represents the image processing and computer vision pipelines needed to perform Bayesian inference on previously unseen data. First, the image is passed through a CNN that produces a corresponding image embedding x. Then, per-class statistics (mean μ and standard deviation σ) are computed and stored in the database.

Figure 1. Data processing and computer vision components of ImageTagger
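The pipeline in Figure 1 can be sketched in a few lines of code. The snippet below is a minimal illustration rather than the app's actual implementation: it assumes a PyTorch/torchvision MobileNet V2 backbone whose classifier head is replaced so that the 1280-dimensional embedding is exposed, and shows how the per-class mean and standard deviation could be computed from the embeddings of one class.

```python
import torch
import torchvision.models as models

# Backbone: pre-trained MobileNet V2 with the classification head removed,
# so that the forward pass returns the 1280-dimensional image embedding.
backbone = models.mobilenet_v2(pretrained=True)
backbone.classifier = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def embed(images):
    """Map a batch of preprocessed images (N, 3, H, W) to embeddings (N, 1280)."""
    return backbone(images)

def class_statistics(class_embeddings):
    """Per-class statistics (mean and standard deviation) stored in the app database."""
    mu = class_embeddings.mean(dim=0)
    sigma = class_embeddings.std(dim=0) + 1e-6  # epsilon guards against zero variance
    return mu, sigma
```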

Results

Learning concepts over thousands of images

In order to generate image embeddings, a general-purpose image classification dataset, Tiny ImageNet, was used. The dataset is a smaller version of the full ImageNet, containing more than 100k images uniformly distributed across 200 classes belonging to diverse categories, such as animals, devices, clothes, and others. The use of Tiny ImageNet thus allowed the CNN model to learn representations and descriptions of images from different knowledge domains.

For the purposes of the project, a CNN model pre-trained on the ImageNet dataset was used. In order to unify image representation (shapes and color distributions), input images were scaled and center-cropped to 256 x 256 px and standardized per channel using the full ImageNet statistics. The same procedure was applied for both training and evaluation.
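As a minimal sketch of this preprocessing step, assuming a torchvision-based implementation and the commonly used ImageNet per-channel statistics (the exact values used in the project are not stated):

```python
from torchvision import transforms

# Scale, center-crop to 256 x 256 px, and standardize each channel with
# ImageNet statistics; the same transform is applied for training and evaluation.
preprocess = transforms.Compose([
    transforms.Resize(256),         # scale the shorter side to 256 px
    transforms.CenterCrop(256),     # keep the central 256 x 256 region
    transforms.ToTensor(),          # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],  # assumed ImageNet per-channel mean
                         std=[0.229, 0.224, 0.225]),  # assumed ImageNet per-channel std
])
```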

Figure 2. 2D-projection of normalized image embeddings

Figure 2 shows a low-dimensional representation of embeddings of images belonging to ten randomly selected Tiny ImageNet classes, generated using the MobileNet V2 model. The qualitative analysis clearly indicates that semantically similar entities are also close in the embedding space, so that related images (e.g. food-related) cluster together on the projection surface (e.g. espresso, ice cream and pizza are in the lower left of the figure).
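The post does not state which projection method produced Figure 2; the sketch below assumes t-SNE from scikit-learn applied to L2-normalized embeddings as one plausible way to obtain such a 2D view.

```python
import numpy as np
from sklearn.manifold import TSNE

def project_2d(embeddings):
    """Project (N, D) image embeddings onto a 2D plane for visualization."""
    normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return TSNE(n_components=2, init="pca", random_state=0).fit_transform(normalized)
```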

Generating relevant tags for the input image

Next, based on the pre-computed statistics extracted from the image embeddings, class probabilities are calculated using a Gaussian naive Bayes classifier (Eq. 1), and the results are filtered by keeping only the largest probabilities that sum up to a total of at least 0.9. If the user modifies any of the assigned tags, a post-inference correction is made: the application database containing the per-class statistics is updated using Welford's online algorithm (Eq. 2, 3) with a momentum term α=0.1.
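Equations 1-3 appear as images in the original post and are not reproduced here. The sketch below illustrates the described logic under stated assumptions: a diagonal Gaussian likelihood with a uniform class prior for Eq. 1, and an exponentially weighted (momentum α=0.1) variant of Welford's update for Eqs. 2 and 3; the exact formulas used in the app may differ.

```python
import numpy as np

def class_log_probs(x, means, stds):
    """Log-likelihood of embedding x (D,) under per-class diagonal Gaussians (C, D).

    A uniform class prior is assumed, so these scores can be compared directly.
    """
    log_lik = -0.5 * (((x - means) / stds) ** 2 + 2 * np.log(stds) + np.log(2 * np.pi))
    return log_lik.sum(axis=1)

def select_tags(log_probs, labels, top_p=0.9):
    """Keep the most probable tags whose cumulative probability reaches top_p."""
    probs = np.exp(log_probs - log_probs.max())
    probs /= probs.sum()
    kept, total = [], 0.0
    for i in np.argsort(probs)[::-1]:
        kept.append(labels[i])
        total += probs[i]
        if total >= top_p:
            break
    return kept

def update_statistics(mu, sigma, x, alpha=0.1):
    """Assumed EMA-style update of a class's mean/std after user feedback (Eq. 2, 3)."""
    diff = x - mu
    new_mu = mu + alpha * diff
    new_var = (1 - alpha) * (sigma ** 2 + alpha * diff ** 2)
    return new_mu, np.sqrt(new_var)
```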

Model Evaluation & Justification

Four network architectures were considered as possible backbones for the feature extraction step: ResNet-18, DenseNet-121, MobileNet V2 and ShuffleNet V2. In order to choose the most suitable CNN model for generating image embeddings, model performance was evaluated (Figure 3) on the Tiny ImageNet dataset using the accuracy metric, i.e. the ratio of correct predictions to the total number of predictions made by the model. Since the authors of the dataset provide official training and validation splits, there was no need for a cross-validation approach. As a result, MobileNet V2 was chosen for further analysis, as this architecture demonstrated the best performance.

Figure 3. Performance of CNN models on the Tiny ImageNet dataset
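For reference, the accuracy metric can be computed with a generic evaluation loop such as the one below; this is a sketch that accepts any model and validation DataLoader (constructing the Tiny ImageNet loader is omitted), not the project's actual evaluation script.

```python
import torch

@torch.no_grad()
def top1_accuracy(model, loader, device="cpu"):
    """Ratio of correct predictions over a validation DataLoader (top-1 accuracy)."""
    model.eval()
    correct = total = 0
    for images, labels in loader:
        predictions = model(images.to(device)).argmax(dim=1).cpu()
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
    return correct / total
```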

The size of a model serves as an indirect measure of its computational complexity, and lightweight neural architectures are preferable since the model is to be deployed on mobile devices. In order to assess model size, the number of trainable parameters was calculated, as shown in Figure 4. The results indicate that the ShuffleNet V2 architecture has the smallest model size, followed by MobileNet V2, DenseNet-121 and ResNet-18.

Figure 4. Comparison of the amount of trainable parameters and performance of CNN models
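Counting trainable parameters is straightforward; the comparison below uses torchvision's default variants of the four backbones (e.g. ShuffleNet V2 x1.0), which is an assumption about the exact configurations that were evaluated.

```python
import torch
from torchvision import models

def count_trainable_parameters(model):
    """Number of trainable parameters, used here as a proxy for model size."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Compare the four candidate backbones (randomly initialized; weights do not
# affect the parameter count).
for name, builder in [("ResNet-18", models.resnet18),
                      ("DenseNet-121", models.densenet121),
                      ("MobileNet V2", models.mobilenet_v2),
                      ("ShuffleNet V2", models.shufflenet_v2_x1_0)]:
    print(f"{name}: {count_trainable_parameters(builder()) / 1e6:.1f}M parameters")
```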

Finally, the MobileNet V2 architecture was chosen based on the combination of its accuracy and size, as it provides the highest performance (Fig. 3) while having a comparatively small number of trainable parameters.

Refinement — Accelerating a network using quantization

As deep neural networks are quite computationally expensive, inference speed usually becomes a bottleneck, especially on mobile devices. To address this issue and increase the model's throughput, model quantization was applied to MobileNet V2. This approach reduces computational complexity by using low-precision arithmetic, performing calculations on half-precision floating-point (float16) or integer (int8) numbers. The optimization substantially reduced the time needed to perform the feature extraction step (from 212.0 +/- 12.6 ms to 50.1 +/- 9.1 ms on a Nokia 7.1) at the cost of a marginal decrease in validation accuracy (from 56.09% to 55.67%).
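One possible way to obtain an int8-quantized MobileNet V2 is via torchvision's quantized model zoo, as sketched below; the exact quantization workflow used for the deployed app is not specified in the post, and absolute timings will differ from the on-device numbers reported above.

```python
import time
import torch
from torchvision.models import quantization as qmodels

# Float baseline and int8-quantized MobileNet V2 (pre-trained on ImageNet).
float_model = qmodels.mobilenet_v2(pretrained=True, quantize=False).eval()
int8_model = qmodels.mobilenet_v2(pretrained=True, quantize=True).eval()

# Rough single-image CPU latency comparison.
x = torch.randn(1, 3, 256, 256)
with torch.no_grad():
    for name, model in [("float32", float_model), ("int8", int8_model)]:
        start = time.perf_counter()
        for _ in range(20):
            model(x)
        print(f"{name}: {(time.perf_counter() - start) / 20 * 1000:.1f} ms per image")
```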

Designing Mobile App Architecture

Following the development of the computer vision components, ImageTagger's UI was designed (Figure 5) and implemented, as demonstrated later on. On the left is an overview of the initial picture selection process, which results in the chosen picture being added to a private image collection. The user can upload an image already stored on the device or take a new photo. On the right, the flowchart describes how pictures are retrieved from the image collection and displayed to the user, who can then choose one of them to get an automated description and edit the assigned labels.

Figure 5. User interactions flowcharts of ImageTagger

Live Demo of ImageTagger

Here is how the final Android app looks (Fig. 6):

Figure 6. Live demo: ImageTagger for Android

Conclusions & Future Work

Here, the ImageTagger application for Android was developed to automatically generate relevant textual annotations for user images. Further accuracy improvements can be achieved by fine-tuning the CNN on a target dataset (e.g. MS COCO) or by training it on a larger dataset (e.g. YFCC100M). Additionally, model speed can benefit from modern devices that support GPU-based inference.

Future developments could introduce photo sharing between users via client-server or peer-to-peer technologies (e.g. torrent, blockchain), allowing distributed image search and retrieval to be performed in a secure and privacy-preserving manner.

Getting ImageTagger on your device

In order to launch ImageTagger on your mobile device, follow the checklist below and make sure you complete all the steps to install it successfully:

  1. Download ImageTagger from GitHub to your Android device
  2. Open the app on your Android device
  3. Enjoy the automatic image tagging

Note: your mobile device is expected to run Android 8.1 or above, and installation of APKs from unknown sources must be allowed.

Concerned about privacy and security? Suspicious about installing apps from unknown sources? Just clone the project repository and build ImageTagger yourself using Android Studio 4.1 or newer.
