Developers’ Hands-on | Accelerating Segment Anything with Quantization

Published in OpenVINO™ toolkit · Jul 21, 2023

Author: Ethan Yang

1. Background

“Segment Anything, We All Get Unemployed!” Phrases like this have recently gone viral on social media, all referring to the Segment Anything Model (SAM). What exactly is SAM? What can it do? Is it really that powerful? Let’s find out in this article!

SAM is a powerful artificial intelligence image segmentation model developed by Meta AI. It can automatically identify which pixels in an image belong to which object and segment the different objects in an image, which makes it widely applicable to tasks such as analyzing scientific images and editing photos.

SAM’s complete pipeline consists of an image encoder model and a combined prompt encoder + mask decoder model, each of which can be exported as a separate static model. The image encoder accounts for the bulk of the computation during inference, so improving its execution efficiency is one of the main optimization directions for SAM applications.

Figure 1: SAM model task pipeline
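To make the two-stage structure more concrete, below is a minimal sketch of how the pipeline could be driven with the OpenVINO™ Python API. The model file names, the input shape, and the preprocessing are illustrative assumptions, not the exact code of the notebook.

import numpy as np
from openvino.runtime import Core

core = Core()

# Stage 1: the heavy image encoder runs once per image and produces embeddings.
encoder = core.compile_model("sam_image_encoder.xml", "CPU")
dummy_image = np.zeros((1, 3, 1024, 1024), dtype=np.float32)  # assumed NCHW input
image_embeddings = encoder(dummy_image)[encoder.output(0)]

# Stage 2: the lightweight prompt encoder + mask decoder runs once per prompt,
# reusing the cached embeddings, so interactive prompting stays fast.
decoder = core.compile_model("sam_mask_decoder.xml", "CPU")
# The decoder inputs (point/box prompts plus image_embeddings) are model-specific
# and omitted here.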

In this blog, we will demonstrate how to quantize and compress the SAM encoder with the OpenVINO™ NNCF model compression tool to improve inference performance on the CPU.

2. Quantization Introduction

Before we dive into the practical implementation, we should introduce the concept of quantization. Quantization maps the numerical range of model parameters from FP32 to INT8 or INT4 without changing the model structure: the same information is represented with a smaller bit-width, which compresses the model size and reduces memory consumption. During network execution, the runtime automatically calls hardware instructions or kernel functions specialized for low-bit data, which improves performance.

Figure 2: Representation bit-width of different precision data
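As a rough numerical illustration of the FP32-to-INT8 mapping described above (a simplified symmetric scheme, not NNCF’s exact algorithm):

import numpy as np

# Toy example: linear symmetric quantization of a few FP32 values to INT8.
weights = np.array([0.42, -1.30, 0.07, 2.15], dtype=np.float32)

scale = np.abs(weights).max() / 127.0          # map the FP32 range onto [-127, 127]
q_weights = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q_weights.astype(np.float32) * scale

print(q_weights)    # [ 25 -77   4 127]
print(dequantized)  # close to the original values, stored in a quarter of the memory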

The Intel® AVX-512 VNNI extension fuses the INT8 multiply-accumulate sequence that previously required three separate instructions into a single instruction. The newer AMX instruction set stacks multiple VNNI units to deliver a further multiple-fold throughput improvement per cycle.

Figure 3: Instruction set optimization for INT8 matrix multiplication and addition operations

3. NNCF Post-Training Quantization Mode

NNCF, short for Neural Network Compression Framework, is the solution within the OpenVINO™ toolkit dedicated to model compression and acceleration. It includes a variety of compression algorithms such as quantization, pruning, and binarization. NNCF can be used in two modes: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). While QAT requires the original training script and dataset, PTQ compresses the trained model file directly, without additional training scripts or labeled datasets; this simplified PTQ workflow is a new feature introduced with NNCF in the OpenVINO™ 2023.0 release. PTQ requires only the two steps below (plus an optional accuracy-control step):

1. Prepare a calibration dataset. During quantization, the calibration data is used only to compute the range and distribution of activations and does not need to be labeled; for an image recognition task, roughly 200–300 image files are enough. In addition, a DataLoader object and a transform_fn data conversion function need to be defined: the DataLoader reads each element of the calibration dataset, while transform_fn converts the read element into a direct input for OpenVINO™ model inference.

import torch
import nncf

calibration_loader = torch.utils.data.DataLoader(...)

def transform_fn(data_item):
    images, _ = data_item
    return images

calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)

2. Run model quantization. First, load the model object, then bind it to the calibration dataset through the nncf.quantize() interface to start the quantization task. NNCF supports several model object types, including openvino.runtime.Model, torch.nn.Module, onnx.ModelProto, and tensorflow.Module.

model = ... #OpenVINO/ONNX/PyTorch/TF object
quantized_model = nncf.quantize(model, calibration_dataset)

3. (Optional) Accuracy control mode. If the model exported by NNCF in the default mode shows a larger accuracy drop than expected, the accuracy control mode can be used for post-training quantization. In this mode, a labeled test dataset is required to evaluate how sensitive each layer is to the accuracy loss introduced by quantization; the most sensitive layers are then gradually reverted to their original precision until the model reaches the desired accuracy. This way the model size can still be compressed while accuracy is preserved. For details, please refer to the NNCF documentation.
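A minimal sketch of how this mode can be invoked is shown below; it assumes a labeled validation_loader and a user-written validate() function returning the accuracy metric of interest (both are placeholders here).

import nncf

# Sketch of accuracy-aware post-training quantization (assumed setup).
def validate(compiled_model, validation_data):
    # Run the compiled model over the labeled validation data and
    # return a single accuracy metric (placeholder implementation).
    ...

validation_dataset = nncf.Dataset(validation_loader, transform_fn)
quantized_model = nncf.quantize_with_accuracy_control(
    model,
    calibration_dataset=calibration_dataset,
    validation_dataset=validation_dataset,
    validation_fn=validate,
    max_drop=0.01,  # tolerate at most a 0.01 absolute drop in the metric
)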

4. Segment Anything + NNCF Practical Application

Next, let’s take a step-by-step look at how to use NNCF’s PTQ mode to complete the quantization of the SAM encoder.

Project can be found here.

1. Define the data loader

In this example, the coco128 dataset, which contains 128 .jpg images, is used as the calibration dataset. Since the data loader must be a torch DataLoader object when quantizing ONNX or IR static models, we inherit from torch.utils.data.Dataset and build a dataset class that implements the __getitem__ method for iterating over each object in the dataset and the __len__ method for returning the number of objects. Finally, a DataLoader is created with torch.utils.data.DataLoader.

import cv2
import torch
from pathlib import Path
from torch.utils import data

class COCOLoader(data.Dataset):
    def __init__(self, images_path):
        self.images = list(Path(images_path).iterdir())

    def __getitem__(self, index):
        image_path = self.images[index]
        image = cv2.imread(str(image_path))
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        return image

    def __len__(self):
        return len(self.images)

coco_dataset = COCOLoader(OUT_DIR / 'coco128/images/train2017')
calibration_loader = torch.utils.data.DataLoader(coco_dataset)

2. Define the data format conversion module

The next step is to define the data conversion module. We can reuse the previously defined preprocess_image function to preprocess the data. Note that calibration_loader returns each data item as a torch tensor, a type that the OpenVINO™ Python interface does not accept, so it first needs to be converted to a numpy array.

import numpy as np

def transform_fn(image_data):
    image = image_data.numpy()
    processed_image = preprocess_image(np.squeeze(image))
    return processed_image

calibration_dataset = nncf.Dataset(calibration_loader, transform_fn)

3. Run NNCF quantization

To ensure the accuracy of the quantized model, we use the original FP32 ONNX format model as the input object instead of the FP16 IR format model. Then, the model is passed into the nncf.quantize interface for quantization. This interface has several important additional parameters:

● model_type: the model type, used to enable special quantization strategies. For transformer models, for example, accuracy needs to be prioritized.

● preset: the quantization mode. The default is PERFORMANCE, which applies symmetric quantization to both weights and activations to maximize model performance. Here we use the MIXED mode, which applies symmetric quantization to weights and asymmetric quantization to activations; it strikes a balance between accuracy and performance and is suitable for models containing non-ReLU or asymmetric activation layers.

from openvino.runtime import serialize

# Load the FP32 ONNX model
model = core.read_model(onnx_encoder_path)
quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    model_type=nncf.parameters.ModelType.TRANSFORMER,
    preset=nncf.common.quantization.structs.QuantizationPreset.MIXED,
)
# Save the quantized model as an OpenVINO IR (.xml/.bin) file
ov_encoder_path_int8 = "sam_image_encoder_int8.xml"
serialize(quantized_model, ov_encoder_path_int8)

Since the SAM encoder has a complex network structure and the quantization process traverses the parameters of each layer multiple times, quantization can take quite a while. A machine with more than 32 GB of memory is recommended; if memory is insufficient, you can reduce the size of the calibration dataset with the subset_size parameter, for example setting it to 100, as shown below.
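Capping the calibration subset is just an extra argument to the same call (a sketch based on the code above):

quantized_model = nncf.quantize(
    model,
    calibration_dataset,
    model_type=nncf.parameters.ModelType.TRANSFORMER,
    preset=nncf.common.quantization.structs.QuantizationPreset.MIXED,
    subset_size=100,  # use only 100 calibration samples to reduce memory pressure
)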

4. Model accuracy comparison:

Next, we compare the inference results of the INT8 and FP16 models:

Figure 4: Prompt mode FP16 vs. INT8 comparison results
Figure 5: Auto mode FP16 vs. INT8 comparison results

It can be seen that in both prompt and auto modes, the INT8 model shows almost no change in accuracy compared to the FP16 model.

Note: In auto mode, masks are displayed in randomly generated colors.
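If a quantitative check is preferred over visual inspection, one simple option (not part of the original notebook) is to compute the IoU between masks produced by the two models for the same image and prompt; fp16_mask and int8_mask below are assumed boolean arrays.

import numpy as np

def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection-over-union between two boolean masks of the same shape."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union > 0 else 1.0

# Example (assumed inputs): masks obtained from the FP16 and INT8 pipelines.
# print(f"FP16 vs. INT8 mask IoU: {mask_iou(fp16_mask, int8_mask):.3f}")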

5. Performance comparison:

Finally, we compare the performance indicators using the benchmark_app tool provided by OpenVINO™:

Figure 6: Benchmark results (FP16)
Figure 7: Benchmark results (INT8)

The results show that on the CPU, the INT8 model delivers roughly a 30% throughput improvement over the FP16 model, while the model size shrinks from around 350 MB to less than 100 MB.
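For reference, the comparison can be reproduced with commands along these lines (the FP16 model path is illustrative; additional benchmark_app options such as -hint or -t can be added as needed):

benchmark_app -m sam_image_encoder.xml -d CPU
benchmark_app -m sam_image_encoder_int8.xml -d CPU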

5. Conclusion

Given SAM’s outstanding automatic segmentation capability, it is expected to be deployed in more and more application scenarios. During productization, developers usually focus on striking a balance between performance and accuracy to arrive at a more cost-effective solution. By quantizing and compressing the Segment Anything encoder, the OpenVINO™ NNCF tool significantly improves the model’s runtime efficiency and reduces its storage footprint without noticeably impacting model accuracy.

Notices & Disclaimers

Intel technologies may require enabled hardware, software or service activation.

No product or component can be absolutely secure.

Your costs and results may vary.

© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
