Photo by Michael Dziedzic on Unsplash

Model Quantization Diagnosis with Neural Insights

A New Tool for Analyzing Neural Network Quantization

Intel(R) Neural Compressor
Jul 6, 2023

Agata Radys, Suyue Chen, and Bartosz Myrcha, Intel Corporation

Model compression

One of the biggest challenges in deep learning model optimization is accuracy loss. It is hard to reduce the storage space and computation a model consumes without losing accuracy. Neural models are now used in an increasing number of applications, often with low-latency requirements, so optimizing them while limiting quality loss has become critical.

One of the most popular model optimization methods is quantization. It approximates a floating-point model with a low-bit-width model, which significantly reduces storage and computation costs.
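As a rough illustration of the idea (not Neural Compressor's actual implementation), the sketch below quantizes a float32 tensor to int8 with a single scale and zero point and then dequantizes it back; the reconstruction error it prints is the kind of approximation quantization introduces.

import numpy as np

# Hypothetical fp32 tensor standing in for a layer's weights.
x = np.random.randn(4, 4).astype(np.float32)

# Affine (asymmetric) quantization to int8: map [min, max] onto [-128, 127].
scale = (x.max() - x.min()) / 255.0
zero_point = np.round(-128 - x.min() / scale)
q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)

# Dequantize and measure the approximation error.
x_hat = (q.astype(np.float32) - zero_point) * scale
print("max abs error:", np.abs(x - x_hat).max())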

Quantization diagnosis

After model compression, we would like to check whether the result is satisfactory. We should compare the model's performance to verify that it runs faster, and check its accuracy to make sure the model did not lose quality during compression.

To verify those parameters, we can use Neural Insights, a component of Intel® Neural Compressor. Intel® Neural Compressor performs model optimization while diagnosing the accuracy impact of the optimizations. Neural Insights is an application for visualizing the diagnosis data generated by Intel Neural Compressor. It compares the performance and accuracy of the model before and after optimization by showing profiling results, the model graph, op details, and histograms of weights and activations. The workflow is shown in the diagram below.

First, we configure a few scripts to generate diagnostic information, run them, and check the results in the terminal. Then we check whether the result is satisfactory and repeat the steps if needed.

Supported feature matrix

How to run quantization diagnosis with ONNX ResNet50

Install Neural Compressor

git clone https://github.com/intel/neural-compressor.git
cd neural-compressor
pip install -r requirements.txt
python setup.py install
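Alternatively, released versions of Intel Neural Compressor are published on PyPI, so pip install neural-compressor should also work; note, however, that the example scripts used below ship with the cloned repository.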

Prepare the model

First you need to prepare the environment and download the ResNet-50 model. Link to example.
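If you prefer to produce the ONNX file yourself instead of downloading it, a minimal sketch (assuming PyTorch and torchvision are installed, and using resnet50_v1.onnx as a hypothetical output name) could look like this:

import torch
import torchvision

# Load a pretrained ResNet-50 and export it to ONNX.
model = torchvision.models.resnet50(weights="IMAGENET1K_V1").eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, "resnet50_v1.onnx",
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},
)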

Install dependency packages

pip install -r examples/onnxrt/image_recognition/resnet50_torchvision/quantization/ptq_static/requirements.txt

Prepare the dataset

Download the ILSVRC2012 ImageNet validation dataset and the labels:

wget http://dl.caffe.berkeleyvision.org/caffe_ilsvrc12.tar.gz
tar -xvzf caffe_ilsvrc12.tar.gz val.txt

Run the quantization script

If you want to use Neural Insights (GUI mode), you need to install this component before executing the quantization script, as described in the GUI mode section. Then execute the script with the quantization API in another terminal with the --diagnose flag.

python examples/onnxrt/image_recognition/resnet50_torchvision/quantization/ptq_static/main.py \
    --model_path=/path/to/resnet50_v1.onnx \
    --dataset_location=/path/to/ImageNet/ \
    --label_path=/path/to/val.txt \
    --tune \
    --diagnose

Terminal (non-GUI) mode without Neural Insights

When the Neural Insights module is not installed, the results are displayed in the terminal, as in the example below.

In the activations summary you can see a table with the OP name, MSE (mean squared error), and activation minimum and maximum. The table is sorted by MSE.

In the weights summary table, there are parameters such as minimum, maximum, mean, standard deviation, and variance for both the input and the optimized model. This table is also sorted by MSE.

How to diagnose accuracy loss

Neural Compressor diagnosis mode provides weights and activation data that includes several useful metrics for diagnosing potential losses of model accuracy.

Parameter Descriptions

The data is presented in the terminal in the form of a table where each row describes a single OP in the model. The measures are defined as follows.

Mean squared error (MSE) — measures the difference between the input and optimized model weights for each OP:

MSE = (1/n) Σᵢ (xᵢ − yᵢ)²

Input model min — minimum value of the input OP tensor data:

min = minᵢ xᵢ

Input model max — maximum value of the input OP tensor data:

max = maxᵢ xᵢ

Input model mean — mean value of the input OP tensor data:

μₓ = (1/n) Σᵢ xᵢ

Input model standard deviation — standard deviation of the input OP tensor data:

σₓ = √((1/n) Σᵢ (xᵢ − μₓ)²)

Input model variance — variance of the input OP tensor data:

σₓ² = (1/n) Σᵢ (xᵢ − μₓ)²

where,
xᵢ — input OP tensor data,
yᵢ — optimized OP tensor data,
μₓ — input model mean,
σₓ — input model standard deviation,
n — number of elements in the tensor.
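To make these definitions concrete, here is a small NumPy sketch (with a random tensor standing in for one op's fp32 data and a coarsely rounded copy standing in for the optimized data) that computes the same statistics the diagnosis table reports:

import numpy as np

x = np.random.randn(1000).astype(np.float32)  # input (fp32) OP tensor data
y = np.round(x * 16) / 16                     # stand-in for the optimized OP tensor data

print("MSE:", np.mean((x - y) ** 2))
print("Input model min:", x.min())
print("Input model max:", x.max())
print("Input model mean:", x.mean())
print("Input model standard deviation:", x.std())
print("Input model variance:", x.var())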

Diagnosis suggestions

  1. Check the nodes in MSE order. A high MSE usually means a higher chance that accuracy was lost during quantization, so falling back those ops may recover some accuracy.
  2. Check the min-max data range. A dispersed data range usually leads to higher accuracy loss, so we can also try to fall back those ops.
  3. Check the other data for outliers, fall back the affected ops, and test the quantization accuracy.

Note: These debugging rules are only a reference; sometimes the cause of an accuracy regression is more complex.

Fallback setting example

from neural_compressor import quantization, PostTrainingQuantConfig

# Fall back the activations of the selected op to fp32.
op_name_dict = {'v0/cg/conv0/conv2d/Conv2D': {'activation': {'dtype': ['fp32']}}}

# diagnosis=True keeps generating the diagnosis data for the new configuration.
config = PostTrainingQuantConfig(
    diagnosis=True,
    op_name_dict=op_name_dict,
)
q_model = quantization.fit(model, config, calib_dataloader=dataloader, eval_func=eval)

GUI mode with Neural Insights

Install Neural Insights

Neural Insights must be installed before executing the quantization script. Full installation instructions can be found here.

pip install -r neural_insights/requirements.txt
python setup.py install neural_insights

Start the Neural Insights server

To start the Neural Insights server, run the neural_insights command. The server generates a self-signed TLS (Transport Layer Security) certificate and prints instructions on how to access the Web UI (User Interface):

Neural Insights Server started.
Open address https://10.11.12.13:5000/?token=338174d13706855fc6924cec7b3a8ae8

The server-generated certificate is not trusted by your web browser, so you need to accept its use. Before you add any workload, the page will be empty.

When the script execution is done, you can see the model quantization details in the Neural Insights tool.

On the left, you can analyze the visualization of the model graph. Nodes with plus signs are node groups; they can be expanded by clicking. When a regular node is clicked, its attributes and properties are displayed to the left of the graph.

In the top right corner there is a summary of the accuracy results: the original model (fp32), the optimized model (int8), and the ratio between the two. In this case, accuracy slightly improved after quantization.

Under the accuracy results there is a table with all the ops in the model, sorted by MSE (mean squared error). The activation minimum and maximum values are also displayed in the table. When an op row is clicked, it is highlighted in the graph visualization and an additional window is displayed below the graph, as shown below.

How to read the histograms

Activation histogram

Activation histograms show the number of occurrences on the Y axis and the activation value on the X axis. Usually there are no negative values because of activation functions (for example, ReLU (Rectified Linear Unit)), as shown in the previous image.

Weights histogram

Histograms show how often the values appear: the Y axis shows the number of occurrences and the X axis the weight values. Each histogram is drawn for a different channel in the layer. By looking at weight histograms you can detect vanishing gradients (an empty histogram) and exploding gradients (a noticeably wide histogram), which may indicate that the model is not learning properly.
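If you want to reproduce such a plot outside the tool, a rough sketch (assuming Matplotlib is installed and using a random tensor in place of a real convolution weight) could look like this:

import numpy as np
import matplotlib.pyplot as plt

# Stand-in for a conv layer's weights with shape (out_channels, in_channels, kH, kW).
weights = np.random.randn(8, 3, 3, 3).astype(np.float32)

# One histogram per output channel, similar to the per-channel view in Neural Insights.
fig, axes = plt.subplots(2, 4, figsize=(12, 5), sharex=True)
for channel, ax in enumerate(axes.flat):
    ax.hist(weights[channel].ravel(), bins=50)
    ax.set_title(f"channel {channel}")
    ax.set_xlabel("weight value")
    ax.set_ylabel("occurrences")
fig.tight_layout()
plt.show()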

Diagnosis suggestions

  1. Check the weights or activation histogram to see whether the data is spread over a large range, like the right graph shown below, which has a much larger min-max range than the left one. That means more data values are mapped into the same range of integers, so such ops are good candidates for fallback.
  2. Check the weights histogram at the channel level. You may find different distributions between channels. In this case, try per-channel quantization (for weights only), so each channel gets its own zero point and scale factor.
  3. Examine the table data and histograms, looking for any outliers or abnormal distributions, and try falling back those ops.

Config setting example

from neural_compressor import quantization, PostTrainingQuantConfig

# Quantize the selected op's weights with per-channel granularity.
op_name_dict = {'v0/cg/conv0/conv2d/Conv2D': {'weights': {'granularity': ['per_channel']}}}

# diagnosis=True keeps generating the diagnosis data for the new configuration.
config = PostTrainingQuantConfig(
    diagnosis=True,
    op_name_dict=op_name_dict,
)
q_model = quantization.fit(model, config, calib_dataloader=dataloader, eval_func=eval)

Future work

We plan to support more frameworks; for now, Neural Insights is available for TensorFlow and ONNX. Check out our GitHub repository for more information. If you have any suggestions or feedback, feel free to create a pull request, submit issues, or reach us by email: neural.insights@intel.com.
