Advancing Large Language Models on Intel Platforms
The Evolution of Intel NeuralChat-7B LLM
Kaokao Lv, Zhenwei Liu, Wenxin Zhang, Haihao Shen, and Hanwen Chang, Intel Corporation
We released Intel NeuralChat-7B v3-1 on Hugging Face, where it ranked as the top 7B LLM in November 2023.
We have continued improving NeuralChat-7B and have released new versions with better performance.
In this article, we describe the evolution of Intel NeuralChat-7B on Intel Gaudi2 and present the latest benchmarking results. We also provide instructions to accelerate inference performance through 4-bit quantization on Intel Xeon processors.
The Evolution of Intel NeuralChat on Gaudi2
We performed a series of experiments to improve the performance of the Intel/neural-chat-7b-v3-1 model. The fine-tuning recipe can be summarized as:
- Continued fine-tuning on diverse instruction datasets
- Alignment with direct preference optimization (DPO)
- Hyperparameter tuning for LoRA fine-tuning
- Model combination
Now, let’s dive into the details.
NeuralChat-7B v3-1
This model was fine-tuned from mistralai/Mistral-7B-v0.1 on the open-source dataset Open-Orca/SlimOrca and then DPO-aligned with our released preference dataset Intel/orca_dpo_pairs.
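DPO optimizes the model directly on preference pairs (a chosen and a rejected response) without training a separate reward model. The following is a minimal NumPy sketch of the DPO objective for a single pair; the function name, the beta value, and the example log-probabilities are illustrative, not taken from our training setup.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a response under the
    policy being trained (logp_*) or the frozen reference model (ref_logp_*).
    """
    # Implicit reward margins: how much more the policy prefers each
    # response relative to the reference model.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    # Negative log-sigmoid of the scaled margin difference.
    logits = beta * (chosen_margin - rejected_margin)
    return -np.log(1.0 / (1.0 + np.exp(-logits)))

# The loss shrinks as the policy favors the chosen response more strongly.
loss_no_pref = dpo_loss(-10.0, -10.0, -10.0, -10.0)   # policy identical to reference
loss_aligned = dpo_loss(-8.0, -14.0, -10.0, -10.0)    # chosen clearly preferred
```

When the policy matches the reference model, the loss is exactly log 2; it decreases as the policy shifts probability mass toward the chosen responses.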
NeuralChat-7B v3-2
Starting from Intel/neural-chat-7b-v3-1, we continued fine-tuning on the mathematics dataset meta-math/MetaMathQA. We also merged our fine-tuned DPO LoRA weights for the ['q_proj', 'k_proj', 'v_proj'] modules. In the table below, the biggest improvement is in the gsm8k metric, which increases from 40 to 55.
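Merging a LoRA adapter folds the low-rank update back into the base weight, so inference needs no extra adapter weights. The following NumPy sketch shows the merge for the attention projection modules listed above; the dimensions, alpha, and rank are toy values for illustration only.

```python
import numpy as np

def merge_lora(W, A, B, alpha, r):
    """Fold a trained LoRA adapter into the base weight: W' = W + (alpha / r) * B @ A."""
    return W + (alpha / r) * (B @ A)

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8                       # toy sizes; real q/k/v_proj are much larger
merged = {}
for name in ["q_proj", "k_proj", "v_proj"]:  # the modules merged for v3-2
    W = rng.normal(size=(d, d))              # base projection weight
    A = rng.normal(size=(r, d))              # LoRA down-projection
    B = rng.normal(size=(d, r))              # LoRA up-projection
    merged[name] = merge_lora(W, A, B, alpha, r)
```

Because B is conventionally zero-initialized, the merged weight equals the base weight at the start of training and only diverges as the adapter learns.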
NeuralChat-7B v3-3
The difference between Intel/neural-chat-7b-v3-3 and Intel/neural-chat-7b-v3-2 lies in the merged DPO LoRA modules: ['k_proj', 'v_proj'] and ['q_proj', 'k_proj', 'v_proj'], respectively. The table below shows how this adjustment of the LoRA modules benefits gsm8k and truthfulqa_mc for Intel/neural-chat-7b-v3-3.
NeuralChat-7B v3-3-Slerp
For the model Intel/neural-chat-7b-v3-3-Slerp, we applied spherical linear interpolation (Slerp), which can combine the strengths of two or more models by interpolating their weights. As the table shows, the gsm8k metric improves significantly.
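Unlike plain averaging, Slerp interpolates along the arc between two weight vectors, preserving their norm when the inputs are unit vectors. A minimal sketch of the interpolation applied to flattened weight tensors (the function and fallback threshold are illustrative):

```python
import numpy as np

def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two flattened weight tensors.

    t = 0 returns w0, t = 1 returns w1, intermediate t follows the arc
    between them rather than the straight line.
    """
    w0 = np.asarray(w0, dtype=float)
    w1 = np.asarray(w1, dtype=float)
    # Angle between the two weight vectors.
    cos_omega = np.dot(w0, w1) / (np.linalg.norm(w0) * np.linalg.norm(w1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < eps:
        # Nearly parallel vectors: fall back to linear interpolation.
        return (1.0 - t) * w0 + t * w1
    return (np.sin((1.0 - t) * omega) * w0 + np.sin(t * omega) * w1) / np.sin(omega)
```

In practice the interpolation is applied per weight tensor across the models being combined, with t controlling how much each parent model contributes.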
INT4 Inference on Xeon
Intel Extension for Transformers provides an efficient runtime to accelerate LLM inference through state-of-the-art model compression techniques such as 4-bit quantization. The following sample code uses the neural-chat-7b-v3-1 model to perform INT4 inference on an Intel Xeon processor.
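The original sample listing is not reproduced here. To illustrate what weight-only INT4 quantization does numerically, the sketch below implements group-wise round-to-nearest quantization in pure NumPy: each group of consecutive weights shares one floating-point scale, and the weights themselves are stored as 4-bit integers. The group size and symmetric scheme are illustrative assumptions, not the exact configuration used by the runtime.

```python
import numpy as np

def quantize_int4(w, group_size=32):
    """Group-wise round-to-nearest symmetric INT4 quantization.

    Returns integer codes in [-8, 7] plus one FP scale per group of
    `group_size` consecutive weights.
    """
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # map max magnitude to 7
    q = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize_int4(q, scales, shape):
    """Recover an approximate FP weight tensor from codes and scales."""
    return (q.astype(np.float32) * scales).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64)).astype(np.float32)  # toy weight matrix
q, s = quantize_int4(w.ravel())
w_hat = dequantize_int4(q, s, w.shape)
```

Storing 4-bit codes plus per-group scales cuts weight memory roughly 4x versus FP16, which is what lets the runtime keep a 7B model's weights resident in far less memory bandwidth during decoding.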
Here is the result:
Besides INT4 inference, BF16 and INT8 inference are also available to accelerate LLMs using Intel AI software such as Intel Neural Compressor and Intel Extension for PyTorch on Intel platforms. Users can choose the appropriate inference approach given their accuracy and performance targets.
Ethics Statement
We developed Intel NeuralChat and made the code, model, and dataset available to the community for commercial use. Throughout the fine-tuning stage, we endeavored to mitigate risks such as hallucination, toxicity, and other ethical concerns, but like other LLMs, NeuralChat is not free from such issues. Additionally, we applied low-precision quantization to accelerate inference, ensuring that quantization had a negligible impact on accuracy compared to the baseline. Our goal is to collaborate with the community to enhance these aspects, making AI a positive force for everyone, everywhere.
Summary
We are excited to release Intel NeuralChat, a commercially accessible 7B chat model, to the LLM community. The NeuralChat model exhibits superior performance on standard benchmarks for generative language models. We anticipate that NeuralChat will extend the boundaries of deploying 7B chat models and inspire additional researchers and developers to collaborate in democratizing open-source LLMs.