Streamlining Model Optimization as a Service with Intel Neural Compressor

Introducing Our New Neural Solution Component

Intel(R) Neural Compressor
Intel Analytics Software
4 min read · Jun 21, 2023

Yi Liu, Kaihui Tang, Sihan Chen, Liang Lv, Feng Tian, and Haihao Shen, Intel Corporation

In today’s fast-paced world of deep learning (DL), model compression techniques play a critical role in enhancing efficiency and reducing computational requirements. Intel Neural Compressor (INC) is a cutting-edge tool that offers a wide range of popular model compression techniques, including quantization, pruning, distillation, and neural architecture search on mainstream frameworks. The tool has validated thousands of models by leveraging the Neural Coder zero-code optimization solution and automatic accuracy-driven quantization strategies. Because model compression is compute-intensive and must often be applied to hundreds or even thousands of DL models, we frequently receive requests for seamless, as-a-service integration into existing HPC systems: for example, serving and parallel task handling would enable efficient processing of multiple optimization requests. In this blog, we introduce Neural Solution, a new component of Intel Neural Compressor that offers model optimization as a service. Neural Solution simplifies the model quantization process and boosts execution efficiency in accuracy-aware tuning by running on multiple nodes.

What is Neural Solution?

Neural Solution provides task- and tuning-level parallelism, coordinating the optimization task queue and leveraging distributed tuning to speed up the optimization process (Figure 1). It also offers a seamless integration interface that eliminates repetitive environment setup and code adaptation, simplifying the optimization process for users. Neural Solution automatically schedules the optimization task queue by coordinating available resources and tracking the execution status of each task. This concurrent scheduling ensures optimal resource utilization and allows multiple optimization tasks to run simultaneously.

One major challenge in model quantization is identifying the optimal configuration that satisfies the accuracy requirement, which is time-consuming. To enable faster turnaround, Neural Solution lets users parallelize the tuning process across multiple nodes simply by specifying the number of workers in the task request. It also offers a convenient interface for seamless integration into different applications or platforms: it exposes both RESTful and gRPC APIs, empowering users to submit quantization tasks, query the optimization progress, and obtain tuning results with ease.
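As a quick illustration of the RESTful interface, the Python sketch below submits a task and prints the response. It is a minimal sketch only: it assumes the service is running locally on the default RESTful port (8000), uses the task_request.json file prepared in the walkthrough below, and relies on the third-party requests library purely for illustration.

import json

import requests  # third-party HTTP client, used here only for illustration

# Load the task request prepared in the walkthrough below and submit it
# to a locally running Neural Solution service (default RESTful port 8000).
with open("task_request.json") as f:
    task = json.load(f)

resp = requests.post("http://localhost:8000/task/submit/", json=task)
print(resp.json())  # contains the task_id used for later status queries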

Moreover, for Hugging Face models, Neural Solution eliminates the need for any code modifications during the optimization process by seamlessly integrating the functionality of Neural Coder. This significantly lowers the barrier to entry for users who may not possess extensive coding expertise.

Figure 1. How the Neural Solution works

Get Started with Neural Solution

Let’s start with an end-to-end example that quantizes a text classification model from Hugging Face.

Install Neural Solution

# get source code
git clone https://github.com/intel/neural-compressor
cd neural-compressor

# install Neural Solution
pip install -r neural_solution/requirements.txt
python setup.py neural_solution install

Note: More installation options and details can be found in the Neural Solution documentation.

Start the Neural Solution Service

# Start the Neural Solution service with the default configuration; logs are saved in the "serve_log" folder.
neural_solution start

# Start Neural Solution service with custom configuration
neural_solution start --task_monitor_port=22222 --result_monitor_port=33333 --restful_api_port=8001

# Help Manual
neural_solution -h
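The curl examples below assume the default configuration, in which the RESTful API is reachable at http://localhost:8000. If you start the service with a custom --restful_api_port, adjust the URLs accordingly.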

Submit Task

First, prepare a JSON file that includes request content.

[user@server hf_model]$ cd path/to/neural_solution/examples/hf_model
[user@server hf_model]$ cat task_request.json
{
  "script_url": "https://github.com/huggingface/transformers/blob/v4.21-release/examples/pytorch/text-classification/run_glue.py",
  "optimized": "False",
  "arguments": [
    "--model_name_or_path bert-base-cased --task_name mrpc --do_eval --output_dir result"
  ],
  "approach": "static",
  "requirements": [],
  "workers": 1
}
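The workers field controls how many nodes participate in the distributed, accuracy-aware tuning; setting it to a value greater than 1 parallelizes the tuning process across nodes, as described earlier.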

Next, submit the task request to the service. It returns the submission status and a task ID for future use.

[user@server hf_model]$ curl -H "Content-Type: application/json" --data @./task_request.json  http://localhost:8000/task/submit/

# response when the task is submitted successfully
{
  "status": "successfully",
  "task_id": "cdf419910f9b4d2a8320d0e420ac1d0a",
  "msg": "Task submitted successfully"
}

Query the Result

After the task has been submitted successfully, you can query its status and result using HTTP requests with the specified task ID. The query output is in JSON format and includes the optimization results and the path to the quantized model, as in the example below. You can retrieve the optimized model from that path for deployment.

[user@server hf_model]$ curl  -X GET  http://localhost:8000/task/status/{task_id}

# the returned task status
{
  "status": "done",
  "optimized_result": {
    "optimization time (seconds)": "58.15",
    "accuracy": "0.3162",
    "duration (seconds)": "4.6488"
  },
  "result_path": "/path/to/projects/Neural Solution service/workspace/fafdcd3b22004a36bc60e92ec1d646d0/q_model_path"
}
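If you prefer to wait for completion programmatically, a simple polling loop over the same status endpoint also works. The sketch below makes the same assumptions as the earlier snippet (local service on the default port 8000, the requests library, and the task ID returned at submission); in practice you would also handle failure states rather than looping indefinitely.

import time

import requests  # third-party HTTP client, used here only for illustration

task_id = "cdf419910f9b4d2a8320d0e420ac1d0a"  # returned by the submit request
url = f"http://localhost:8000/task/status/{task_id}"

# Poll the status endpoint until the optimization task reports "done".
while True:
    result = requests.get(url).json()
    if result["status"] == "done":
        print(result["optimized_result"])
        print("quantized model at:", result["result_path"])
        break
    time.sleep(10)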

Stop the Service

neural_solution stop

This example shows the productivity benefit of Neural Solution: the user does not need to write any code to apply such optimizations, which significantly lowers the barrier to model compression.

Conclusion and Future Work

Currently, Neural Solution offers users the convenience of using Intel Neural Compressor as a service, improving the experience of applying advanced model optimization technologies such as quantization and accuracy-aware tuning. In the future, we plan to expand Optimization-as-a-Service to support more optimization methods and recipes, such as pruning, distillation, and SmoothQuant on large language models (LLMs). Additionally, we aim to facilitate deployment through Docker for greater adaptability. If you have any related suggestions or ideas, please contact us at inc.maintainers@intel.com.

Visit Intel Neural Compressor to learn more and get started.
