Photo by Google DeepMind on Unsplash

Supervised Fine-Tuning and Direct Preference Optimization on Intel Gaudi2

Demonstrating a Top-Ranked 7B Chat Model on the LLM Leaderboard

Intel(R) Neural Compressor
4 min read · Nov 14, 2023


Kaokao Lv, Wenxin Zhang, and Haihao Shen, Intel Corporation

Intel Extension for Transformers provides robust support for cross-platform training and inference, with a particular emphasis on Intel Gaudi2 accelerators, which are designed to expedite large language model (LLM) training and inference. In this article, we provide a comprehensive walkthrough of applying supervised fine-tuning and direct preference optimization (DPO) on Intel Gaudi2. We also present benchmark results that are comparable to, or better than, those of other open-source LLMs of similar size published on the open LLM leaderboard.

Model: https://huggingface.co/Intel/neural-chat-7b-v3

Dataset: https://huggingface.co/datasets/Open-Orca/SlimOrca

Preference Dataset: https://huggingface.co/datasets/Intel/orca_dpo_pairs

Codebase: https://github.com/intel/intel-extension-for-transformers

Hardware

The Intel Gaudi2 AI accelerator was developed by Habana Labs for state-of-the-art deep learning training and inference. It has 96 GB of integrated memory and is available in servers containing eight Gaudi2 mezzanine cards via the Intel Developer Cloud or for on-premises infrastructure from Supermicro and IEI.

Training

To start the supervised fine-tuning, we select the latest mistralai/Mistral-7B-v0.1 model on Hugging Face as the base LLM for two reasons: the pretrained model has strong benchmark results, and it is commercially friendly under the Apache 2.0 license.

Supervised Fine-Tuning

We select the latest high-quality instruction dataset, Open-Orca/SlimOrca on Hugging Face, and leverage the fine-tuning pipeline provided in Intel Extension for Transformers to perform training with DeepSpeed ZeRO-2. The fine-tuning code and training loss curve are shown below:

Fine-tuning Code
Training Loss Curve
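
For readers who want a feel for this step in code, here is a minimal supervised fine-tuning sketch using Hugging Face Transformers and TRL's SFTTrainer. It is illustrative only: the hyperparameters are assumptions, the format_example helper is hypothetical, and the actual pipeline in Intel Extension for Transformers adds the Gaudi2- and DeepSpeed ZeRO-2-specific configuration shown in the screenshot above.

# Minimal SFT sketch (illustrative; not the exact pipeline used for NeuralChat).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import SFTTrainer

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# SlimOrca stores each example as a list of {"from", "value"} conversation turns.
dataset = load_dataset("Open-Orca/SlimOrca", split="train")

def format_example(example):
    # Hypothetical helper: flatten the conversation into a single training string.
    text = ""
    for turn in example["conversations"]:
        text += f"### {turn['from']}:\n{turn['value']}\n"
    return {"text": text}

dataset = dataset.map(format_example)

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./neural-chat-sft",      # assumed output path
        per_device_train_batch_size=2,       # assumed hyperparameters
        gradient_accumulation_steps=8,
        num_train_epochs=2,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()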

Direct Preference Optimization

We apply the DPO algorithm, which is stable and computationally lightweight, to better align the model with human preferences. DPO expresses the probability of the human preference data directly in terms of the optimal policy, replacing the separate reward model required by reinforcement learning from human feedback, and formulates a maximum likelihood objective for a parameterized policy. The preference dataset contains 12k examples selected from the Orca-style dataset Open-Orca/OpenOrca; the completions generated by GPT-4 or GPT-3.5 are regarded as the chosen responses. We use the llama-2-13b-chat model to automatically generate the corresponding rejected responses, on the intuition that higher-quality rejected responses may also be better for alignment. Refer to Intel/orca_dpo_pairs and the DPO example for more details on the dataset and DPO training code. The launch script is shown below:

Launch Script
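
The screenshot above shows the actual launch command. As a rough illustration of the same DPO step in Python, here is a minimal sketch built on TRL's DPOTrainer; the checkpoint path, prompt construction, and hyperparameters are assumptions, and the DPO example in the repository may organize this differently.

# Minimal DPO sketch (illustrative; see the DPO example in
# Intel Extension for Transformers for the supported flow).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

sft_checkpoint = "./neural-chat-sft"  # hypothetical path to the SFT model from the previous step
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
ref_model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)  # frozen reference policy

# Intel/orca_dpo_pairs provides "system", "question", "chosen", and "rejected" columns.
dataset = load_dataset("Intel/orca_dpo_pairs", split="train")
dataset = dataset.map(
    lambda ex: {"prompt": ex["system"] + "\n" + ex["question"]},
    remove_columns=["system", "question"],
)

trainer = DPOTrainer(
    model,
    ref_model,
    beta=0.1,  # strength of the implicit KL constraint to the reference policy
    train_dataset=dataset,
    tokenizer=tokenizer,
    max_length=1024,
    max_prompt_length=512,
    args=TrainingArguments(
        output_dir="./neural-chat-dpo",   # assumed output path
        per_device_train_batch_size=1,    # assumed hyperparameters
        gradient_accumulation_steps=8,
        learning_rate=5e-7,
        bf16=True,
    ),
)
trainer.train()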

Benchmark Results

We submitted our model to the open_llm_leaderboard, which uses the Eleuther AI Language Model Evaluation Harness, a unified framework to test generative language models on a large number of different evaluation tasks. Our model performed quite well:

NeuralChat-7b-v3 Ranked First on the 7B-sized LLM Leaderboard (November 13th, 2023)
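
If you want to reproduce leaderboard-style scores locally, the evaluation harness also exposes a Python API. The following is a rough sketch only: the task selection, few-shot count, and batch size are assumptions, and the leaderboard pins its own harness version and settings, so numbers may differ.

# Rough local-evaluation sketch with the Eleuther AI LM Evaluation Harness.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=Intel/neural-chat-7b-v3",
    tasks=["arc_challenge"],  # 25-shot ARC is one of the leaderboard tasks
    num_fewshot=25,
    batch_size=8,
)
print(results["results"])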

Inference

NeuralChat is fully compatible with Transformers. Use the same launcher code with the model name “Intel/neural-chat-7b-v3” to perform inference in FP32:

FP32 Inference
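
In case the screenshot is hard to read, the pattern is the standard Transformers generation flow. A minimal sketch follows; the prompt format shown is only an example.

# FP32 inference sketch with plain Hugging Face Transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Intel/neural-chat-7b-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

prompt = "### User:\nWhat is Intel Gaudi2?\n### Assistant:\n"  # example prompt format
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))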

To improve inference performance, follow these instructions to enable BF16 inference using Optimum-Habana:

BF16 Inference
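
As a rough sketch of what BF16 generation looks like on a Gaudi2 device with Optimum-Habana, assuming the Habana PyTorch bridge is installed; the linked instructions remain the authoritative reference.

# BF16 inference sketch on Gaudi with Optimum-Habana (illustrative only).
import torch
import habana_frameworks.torch.core  # registers the Habana (hpu) PyTorch backend
from transformers import AutoModelForCausalLM, AutoTokenizer
from optimum.habana.transformers.modeling_utils import adapt_transformers_to_gaudi

adapt_transformers_to_gaudi()  # patch Transformers with Gaudi-optimized implementations

model_name = "Intel/neural-chat-7b-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model = model.eval().to("hpu")  # move the model to the Gaudi device

prompt = "### User:\nWhat is Intel Gaudi2?\n### Assistant:\n"  # example prompt format
inputs = tokenizer(prompt, return_tensors="pt").to("hpu")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))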

Ethics Statement

We created NeuralChat and released the code, model, and dataset to the community for commercial use. We strove to address the risks of hallucination, toxicity, and other potential ethical issues during fine-tuning; like other LLMs, however, NeuralChat is not free from such issues. We also carefully performed low-precision quantization for inference acceleration and ensured that the quantized model did not deviate too far from the baseline. We hope to collaborate with the community on these issues to make AI beneficial for everyone, everywhere.

Concluding Remarks

We are excited to release NeuralChat, a commercially friendly 7B chat model, to the LLM community. The model outperforms the original base model on typical generative language model benchmarks. We expect NeuralChat to help push the limits of 7B chat model deployment and to motivate more researchers and developers to open-source their LLMs.

We encourage you to give it a try and submit your fine-tuned model to the LLM leaderboard. Please give a star ⭐ to the Intel Extension for Transformers repository if you find it useful. You are also welcome to create pull requests or submit issues and questions to the repository.
