Chameleon: An intelligent and adaptable image segmentation pipeline

The computer vision technology behind our solution

Paul Jan Rupprecht
Merantix Momentum Insights
Dec 11, 2020


Computer Vision (CV) has played an essential role in the latest developments within the field of Artificial Intelligence (AI). Its applications have grown rapidly, ranging from medical and automotive to industrial use cases. CV has become such a natural part of everyday life that we often do not even notice when and where the technology is deployed. Market studies project that the CV market will reach a volume of USD 48.6 billion by 2022 (Forbes, 2019).

At Merantix Labs we have just launched a highly adaptable and powerful CV solution focusing on image segmentation across industries and functions: Chameleon. But before we enter the Chameleon enclosure, let us take the path step by step and start with some basics. Many different techniques have emerged in the field of CV; image classification, object detection, and image segmentation are three of the most prominent. Even though all of them aim to imitate the human ability of visual perception by deriving meaningful information from digital images, some decisive differences between them are briefly outlined below.

Computer vision technologies

Image Classification

Image classification describes the process of assigning a category to an image. An image classification algorithm learns a mapping from features extracted from an input image to the corresponding label (class) and can therefore be categorized as a supervised Machine Learning (ML) technique. Classification models can easily be extended to assign multiple labels to a single image (multi-label classification), as sketched below.
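To make the distinction concrete, here is a minimal PyTorch sketch (the feature extractor and all layer sizes are hypothetical): a single-label classifier applies a softmax over mutually exclusive classes, while a multi-label classifier applies an independent sigmoid per label.

```python
import torch
import torch.nn as nn

# Hypothetical feature extractor output: a batch of 8 feature vectors.
features = torch.randn(8, 512)

# Single-label classification: softmax over mutually exclusive classes.
single_label_head = nn.Linear(512, 10)
class_probs = torch.softmax(single_label_head(features), dim=1)  # each row sums to 1

# Multi-label classification: one independent sigmoid per label, so an
# image can be tagged with several classes at once.
multi_label_head = nn.Linear(512, 10)
label_probs = torch.sigmoid(multi_label_head(features))          # each value in [0, 1]
predicted_tags = label_probs > 0.5
```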

Object Detection

Although classification algorithms can precisely classify input images, their output does not contain any information about where objects are located within the image. Furthermore, classification models struggle when multiple objects of the same class appear in a single image; classifiers naturally do not support counting operations within a given image (e.g. how many objects are visible?). Object detection models address this: they detect objects within an image and predict their locations as bounding boxes. These models output a list of bounding boxes framing the objects searched for, each defined by its upper-left corner coordinate and the box size, so the length of the returned list equals the number of objects detected. As a result, the predictions of an object detection algorithm include the locations of objects but lack specific information about their shape.
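As an illustration, a detector's output for one image might be represented as follows. The `BoundingBox` class and the example detections are purely hypothetical, but they show how counting reduces to the length of the returned list.

```python
from dataclasses import dataclass

@dataclass
class BoundingBox:
    # Upper-left corner coordinate plus box size, as described above.
    x: int
    y: int
    width: int
    height: int
    label: str

# A hypothetical detector output for one image: one entry per detected object.
detections = [
    BoundingBox(x=34, y=50, width=120, height=80, label="car"),
    BoundingBox(x=210, y=48, width=115, height=78, label="car"),
]

# Counting objects of a class is simply the length of the filtered list.
num_cars = sum(1 for box in detections if box.label == "car")  # -> 2
```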

Image Segmentation

Image segmentation algorithms, on the other hand, deliver a far more granular understanding of the entire image, as they provide information for every single pixel. These algorithms help to understand the global image scene: each pixel has to be associated with an entity while simultaneously considering information from the entire image and the pixel's neighborhood. Accordingly, image segmentation can be described as the process of grouping pixels that represent a specific object or category in a scene. Within this process, the digital image is partitioned into various parts based on its characteristics.

Moreover, segmentation algorithms can be further divided into semantic and instance segmentation algorithms. While semantic segmentation involves detecting objects and grouping pixels according to a corresponding category, instance segmentation goes even further by detecting multiple independent instances of objects within a defined category, making it possible to distinguish between individual objects of the same category.

Figure 1: Comparison of object detection and image segmentation methods (Langechuan, 2020)

The resulting segments either display the shape of single instances (instance segmentation) or the shape of a pre-defined category (semantic segmentation) and can be clearly distinguished from an object detection result (bounding boxes). These main differences in prediction results for a given image are exemplified above (Figure 1).
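A toy example makes the difference between the two mask types concrete; the arrays below are illustrative only, with two "car" regions in a six-pixel-wide image.

```python
import numpy as np

# Semantic segmentation: one class id per pixel; both cars share id 1.
semantic_mask = np.array([
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
])

# Instance segmentation: pixels of the same class are split into
# separate instance ids, so the two cars become ids 1 and 2.
instance_mask = np.array([
    [0, 1, 1, 0, 2, 2],
    [0, 1, 1, 0, 2, 2],
])

# Distinguishing single objects of the same category is now possible:
num_instances = instance_mask.max()  # -> 2
```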

Existing approaches to Image Segmentation

The challenge of segmenting images can be solved by training deep learning models. Images used for model training are annotated pixel-wise, meaning that each pixel of the image is assigned a corresponding object label. The labeling process for image segmentation models is therefore naturally more expensive than for a classifier. Unlike an object detection algorithm, an image segmentation model learns to transform the input image into a predicted mask of the same size. The generated mask is a matrix with one value per pixel containing the assigned category (semantic segmentation) or instance (instance segmentation). Accordingly, image segmentation can be understood as a pixel-wise classification task that allows a precise comprehension of the scene.
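The following PyTorch sketch illustrates this pixel-wise view (shapes and class counts are arbitrary): the standard cross-entropy loss applies directly to a per-pixel logit map and a label mask of matching spatial size.

```python
import torch
import torch.nn as nn

# Hypothetical model output: per-pixel class scores for a batch of
# 4 images, 3 categories, 64x64 pixels.
logits = torch.randn(4, 3, 64, 64)

# The ground-truth mask assigns one class id to every pixel.
target_mask = torch.randint(0, 3, (4, 64, 64))

# Semantic segmentation reduces to a classification loss applied per pixel:
# nn.CrossEntropyLoss accepts (N, C, H, W) logits and (N, H, W) targets.
loss = nn.CrossEntropyLoss()(logits, target_mask)

# The predicted mask has the same height and width as the input image.
predicted_mask = logits.argmax(dim=1)  # shape: (4, 64, 64)
```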

Fully Convolutional Network

Long et al. (2015) developed the Fully Convolutional Network (FCN), which contains only convolutional layers and is trained end-to-end for image segmentation.

Figure 2: FCN architecture (Long et al., 2015)

In its first layers, the proposed network produces feature maps of small spatial size with dense representations. Feature maps of a Convolutional Neural Network (CNN) generally capture the result of applying a convolutional filter to the image or to the previous layer, respectively. To output an image of the same size as the input, the authors proposed a final upsampling step that ensures a pixel-wise network prediction.
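A minimal fully convolutional model in this spirit might look as follows. This is a toy sketch, not the original FCN: it uses bilinear interpolation for the final upsampling, whereas the paper learns the upsampling with transposed convolutions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFCN(nn.Module):
    """A toy fully convolutional network in the spirit of Long et al. (2015)."""

    def __init__(self, num_classes: int):
        super().__init__()
        # Strided convolutions produce small, dense feature maps.
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        # A 1x1 convolution maps features to per-class scores.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, w = x.shape[-2:]
        scores = self.classifier(self.features(x))
        # The final upsampling restores the input resolution, giving a
        # pixel-wise prediction of the same size as the input image.
        return F.interpolate(scores, size=(h, w), mode="bilinear",
                             align_corners=False)

logits = MiniFCN(num_classes=5)(torch.randn(1, 3, 128, 128))
print(logits.shape)  # torch.Size([1, 5, 128, 128])
```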

U-Net

The FCN architecture was further improved by Ronneberger et al. (2015), initially for use in biological microscopy imaging. The authors created a network called U-Net, composed of two parts: a contracting part to compute image features and an expanding part to spatially localize patterns in the given image.

Figure 3: U-Net architecture (Ronneberger et al., 2015)

The contracting part of the network is a VGG-like architecture in which image features are extracted at multiple scales by 3x3 convolutions. The expanding part applies transposed convolutions to increase the spatial height and width of the preceding feature maps, so these upsampling layers restore the spatial resolution of the output. Thanks to appropriate data augmentation techniques, U-Net can be trained on a fairly small labelled data set. The network output is a precise semantic segmentation map of the given input image.
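The sketch below condenses this idea to a single encoder/decoder level: features are contracted by pooling, expanded again by a transposed convolution, and concatenated with the skip connection from the contracting path. It is an illustrative miniature, not the full U-Net.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """A one-level U-Net sketch: contract, expand, and concatenate skips."""

    def __init__(self, num_classes: int):
        super().__init__()
        self.encoder = nn.Sequential(  # contracting path: 3x3 convolutions
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # The transposed convolution doubles spatial height and width.
        self.up = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.decoder = nn.Sequential(  # expanding path after the skip concat
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        skip = self.encoder(x)                        # full-resolution features
        up = self.up(self.bottleneck(self.down(skip)))
        # The skip connection reinjects spatial detail lost by pooling.
        return self.decoder(torch.cat([up, skip], dim=1))

mask_logits = TinyUNet(num_classes=2)(torch.randn(1, 3, 64, 64))
print(mask_logits.shape)  # torch.Size([1, 2, 64, 64])
```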

DeepLabv3

Another widely used semantic segmentation architecture is DeepLabv3, which improves upon its predecessors (DeepLab, DeepLabv2) through several modifications (Chen et al., 2017). The proposed model employs cascaded and parallel modules of atrous (dilated) convolution with upsampled filters to extract dense feature maps and to capture long-range context. The authors modified the ResNet architecture to retain high-resolution feature maps by means of atrous convolutions.

Figure 4: DeepLabv3 architecture — Parallel modules with ASPP (Chen et al., 2017)

The parallel atrous convolution modules form the Atrous Spatial Pyramid Pooling (ASPP) module. Finally, the concatenated ASPP outputs are processed by another 1x1 convolution, leading to a pixel-wise output (the segmented image).
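A simplified ASPP module might be sketched as follows. The real DeepLabv3 module additionally includes a 1x1 convolution branch, global image pooling, and batch normalization, which are omitted here for brevity.

```python
import torch
import torch.nn as nn

class SimpleASPP(nn.Module):
    """Parallel atrous convolutions in the spirit of DeepLabv3's ASPP module."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Increasing dilation rates enlarge the receptive field to capture
        # long-range context without reducing spatial resolution.
        self.branches = nn.ModuleList([
            nn.Conv2d(in_channels, out_channels, kernel_size=3,
                      padding=rate, dilation=rate)
            for rate in (1, 6, 12, 18)
        ])
        # A 1x1 convolution fuses the concatenated branch outputs.
        self.project = nn.Conv2d(4 * out_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

features = torch.randn(1, 256, 32, 32)      # hypothetical backbone output
print(SimpleASPP(256, 64)(features).shape)  # torch.Size([1, 64, 32, 32])
```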

Chameleon: An AI Solution to Detect, Localize and Classify Objects in your Images

Given the outlined research developments in the field of image segmentation, and considering the wide application possibilities of CV in general, we set out to develop a concise and highly adaptable image segmentation pipeline named Chameleon. Chameleon is a precise, flexible, and reliable image segmentation solution for analyzing images and video footage.

How Chameleon works

Our pipeline conducts performance-enhancing preprocessing steps such as image normalization and dividing high-resolution images into patches. Powerful data augmentation techniques are applied in a semi-automated manner to artificially create variations of the images; artificially expanding the dataset in this way facilitates robust, high-performing models and reduces the risk of overfitting. Chameleon uses heuristics to semi-automate the selection of an appropriate pre-trained ML model from Chameleon's model zoo. This specially developed model zoo provides a range of proven image segmentation models (e.g. U-Net, DeepLabv3) that adapt to data of different dimensions. In addition to these widely applied image segmentation architectures, Chameleon also includes models optimized for specific domains and tasks (e.g. Detnet (Wollmann et al., 2019), GRUU-Net (Wollmann et al., 2019)). Problem-specific model hyperparameters are pre-configured autonomously for a given task.
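The patching step, for example, can be pictured like this. The code is an illustrative sketch under simple assumptions (non-overlapping tiles, per-image normalization), not Chameleon's actual implementation.

```python
import numpy as np

def normalize(image: np.ndarray) -> np.ndarray:
    # Scale pixel values to zero mean and unit variance.
    return (image - image.mean()) / (image.std() + 1e-8)

def to_patches(image: np.ndarray, patch: int) -> list[np.ndarray]:
    # Divide a high-resolution image into non-overlapping square patches.
    h, w = image.shape[:2]
    return [
        image[y:y + patch, x:x + patch]
        for y in range(0, h - patch + 1, patch)
        for x in range(0, w - patch + 1, patch)
    ]

image = np.random.rand(1024, 2048, 3).astype(np.float32)
patches = to_patches(normalize(image), patch=256)
print(len(patches))  # 4 * 8 = 32 patches of 256x256 pixels
```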

The integration of an adaptable training component and a model ensemble strategy that combines the outputs of several models ensures predictions at maximum performance. Due to its flexible architecture, Chameleon outperforms alternative solutions: it can handle extremely large images and long-tail distributions (class imbalance), and it can segment even the smallest objects and finest structures.
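The exact ensembling scheme is not spelled out here; a common strategy, shown below purely as an assumption, is to average the models' per-pixel class probabilities before taking the argmax.

```python
import torch

def ensemble_masks(prob_maps: list[torch.Tensor]) -> torch.Tensor:
    # Average the per-pixel class probabilities of several models and
    # take the argmax to obtain the combined segmentation mask.
    return torch.stack(prob_maps).mean(dim=0).argmax(dim=1)

# Hypothetical softmax outputs of three models for one 128x128 image.
outputs = [torch.softmax(torch.randn(1, 5, 128, 128), dim=1) for _ in range(3)]
combined_mask = ensemble_masks(outputs)  # shape: (1, 128, 128)
```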

Applications

Chameleon's application scenarios thus range from visual quality control and medical imaging to face mask detection in public spaces. As described above, image segmentation allows a granular understanding of the objects an image contains. For some use cases (e.g. visual quality control: defective or non-defective part) it makes sense to add a downstream classification task that labels the input image accordingly. Chameleon can do that.
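As a hypothetical illustration of such a downstream step in visual quality control, an image-level label could be derived from the predicted defect mask with a simple rule; the threshold and the rule itself are invented for this sketch and do not describe Chameleon's internals.

```python
import numpy as np

def classify_part(defect_mask: np.ndarray, threshold: float = 0.001) -> str:
    # A simple downstream rule: flag the part as defective if the
    # fraction of pixels segmented as "defect" exceeds a threshold.
    defect_fraction = (defect_mask > 0).mean()
    return "defective" if defect_fraction > threshold else "non-defective"

mask = np.zeros((512, 512), dtype=np.uint8)
mask[100:120, 200:220] = 1    # a small segmented defect region
print(classify_part(mask))    # -> "defective"
```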

Due to its flexible interfaces and expandable architecture, Chameleon can also be customized with other downstream tasks such as instance segmentation, object counting, object tracking in video data, or even volume estimation, which makes the solution even more powerful and widely applicable in the field of CV.

Our solution is built on highly scalable infrastructure with proven high-performance models, and it is quick and easy to set up. There are just four steps from your dataset to your individual, monitored Chameleon application:

Figure 5: Chameleon Workflow (Merantix Labs, 2020)

Integration

The user can upload any labeled or unlabeled image or video data; unlabeled data can be annotated with the provided labeling tool. By incorporating domain knowledge, the segmentation pipeline is customized to the specific data requirements and the requested downstream tasks. There are several deployment options. The first is an API with an uptime guarantee that delivers model predictions at any time. Alternatively, the containerized solution can run in the user's own cloud environment or on any on-premise data storage, which again emphasizes the flexible and adaptable character of the solution. Both options handle all processed data in a highly secure manner.

Conclusion

With the described solution we aim to establish a new standard in the field of vision-based AI. The centerpiece of Chameleon, its semi-automated ML component, helps to find the right balance between standardization and customization. Accordingly, Chameleon is suitable for extremely hard problems (large images, class imbalance, fine structures, …) while also tackling a wide range of everyday ones. Our scalable segmentation pipeline processes images with high performance, and just like a real chameleon, it can be adapted to any user's need by adding specific downstream tasks.

By now it should be clear why we named it Chameleon.

References

Chen, L.-C., Papandreou, G., Schroff, F., & Adam, H. (2017). Rethinking Atrous Convolution for Semantic Image Segmentation. arXiv preprint arXiv:1706.05587.

Forbes. (2019, April 8). 7 Amazing Examples Of Computer And Machine Vision In Practice. https://www.forbes.com/sites/bernardmarr/2019/04/08/7-amazing-examples-of-computer-and-machine-vision-in-practice/?sh=460b3a5a1018

Langechuan, P. (2020, April 29). Single Stage Instance Segmentation — A Review: A glimpse into the future of real-time instance segmentation. Towards Data Science. https://towardsdatascience.com/single-stage-instance-segmentation-a-review-1eeb66e0cc49

Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition, 3431–3440.

Merantix Labs. (2020, November 1). Chameleon: An AI solution to detect, localize, and classify your images. https://www.merantixlabs.com/

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Convolutional networks for biomedical image segmentation. International Conference on Medical image computing and computer-assisted intervention, 234–241.

Wollmann, T., Gunkel, M., Chung, I., Erfle, H., Rippe, K., & Rohr, K. (2019). GRUU-Net: Integrated convolutional and gated recurrent neural network for cell segmentation. Medical image analysis, 56, 68–79.

Wollmann, T., Ritter, C., Dohrke, J. N., Lee, J. Y., Bartenschlager, R., & Rohr, K. (2019). Detnet: Deep Neural Network For Particle Detection In Fluorescence Microscopy Images. 2019 IEEE 16th International Symposium on Biomedical Imaging (ISBI 2019), 517–520.
