LLM DP fine-tuning with Sarus in a Databricks workspace

Luca Canale · Sarus Blog · Jul 30, 2024

Databricks provides an easy solution to specialize existing foundation models and has published a series of tutorials showcasing its capabilities. Fine-tuning an LLM nevertheless provides no privacy guarantee on its own (see our recent post on GPT-4 fine-tuning); differentially private (DP) fine-tuning should be used to obtain such guarantees.
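As a quick reminder, a randomized training procedure M is (ε, δ)-differentially private if, for any two training sets D and D′ differing in a single record and any set S of possible output models:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ

The smaller ε and δ, the less any single training example can influence the released model, which is exactly the guarantee standard fine-tuning lacks.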

In this post, we show how seamlessly Sarus and Databricks combine to fine-tune an open-source model without ever having to worry about leaking the training set.

We consider the setup and data from the Databricks tutorial: we fine-tune Mistral 7B on question/answer pairs built from the Databricks documentation. The dataset is prepared following the tutorial, split into a train and a test set, and saved as JSON.
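For illustration, the preparation step can be as simple as the following sketch. The example rows, paths, and the 90/10 split ratio are placeholder assumptions; the actual preparation follows the Databricks tutorial.

```python
import pandas as pd

# Hypothetical question/answer pairs built from the Databricks documentation.
qa = pd.DataFrame({
    "question": ["What is a Delta table?"],
    "answer": ["A table stored in the open-source Delta Lake format."],
})

# Train/test split saved as JSON lines (the 90/10 ratio is an assumption).
train = qa.sample(frac=0.9, random_state=0)
test = qa.drop(train.index)
train.to_json("/dbfs/data/train.json", orient="records", lines=True)
test.to_json("/dbfs/data/test.json", orient="records", lines=True)
```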

This is where Sarus comes in:

  • In the Databricks workspace, define a YAML config for the model and the training parameters. For memory and time efficiency, we train a quantized LoRA version of Mistral and leverage DeepSpeed for multi-GPU training.
  • In a notebook, install sarus-llm, import the DPFinetuningRecipe from sarus_llm, and run the training (see the sketch after this list).
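To make the two steps concrete, here is a minimal sketch. The YAML keys, paths, and the from_config/run calls are illustrative assumptions; only the sarus-llm package, the DPFinetuningRecipe class, and the quantized LoRA + DeepSpeed setup come from this post.

```python
import yaml

# Hypothetical config schema with model, training, and data sections.
config = {
    "model": {
        "name": "mistralai/Mistral-7B-v0.1",    # assumed base-model identifier
        "quantization": "4bit",                 # quantized (QLoRA-style) training
        "lora": {"r": 16, "alpha": 32},         # hypothetical LoRA parameters
    },
    "training": {
        "epochs": 3,                            # assumed value
        "batch_size": 8,                        # assumed value
        "deepspeed": True,                      # multi-GPU training via DeepSpeed
        "dp": {"epsilon": 3.0, "delta": 1e-5},  # the DP budget used in this post
    },
    "data": {"train": "/dbfs/data/train.json"},
}
with open("finetune_config.yaml", "w") as f:
    yaml.safe_dump(config, f)

# In the notebook: %pip install sarus-llm, then run the recipe.
from sarus_llm import DPFinetuningRecipe  # class named in this post

recipe = DPFinetuningRecipe.from_config("finetune_config.yaml")  # assumed signature
recipe.run()
```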

The loss can be followed in TensorBoard, where standard metrics are logged along with the running DP epsilon.
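In a Databricks notebook, TensorBoard can be launched inline with the standard magics; the log directory below is a hypothetical location where the recipe would write its metrics.

```python
# Start TensorBoard inside the notebook to follow the loss and the
# running DP epsilon (the log directory is an assumption).
%load_ext tensorboard
%tensorboard --logdir /dbfs/logs/dp_finetuning
```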

Once training is done, sampling follows a similar procedure:

  • A YAML config points to the checkpointed model and fixes the sampling parameters, such as the temperature or the top_k/top_p values.
  • The SampleRecipe is imported from sarus_llm and executed (sketched below).
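Again as a sketch, with illustrative config keys and an assumed checkpoint path:

```python
import yaml

# Hypothetical sampling config: checkpoint to load and decoding parameters.
sampling_config = {
    "checkpoint": "/dbfs/checkpoints/mistral-7b-dp",  # assumed checkpoint path
    "sampling": {"temperature": 0.7, "top_k": 50, "top_p": 0.95},
}
with open("sample_config.yaml", "w") as f:
    yaml.safe_dump(sampling_config, f)

from sarus_llm import SampleRecipe  # class named in this post

sampler = SampleRecipe.from_config("sample_config.yaml")  # assumed signature
answers = sampler.run()
```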

In the end, we fine-tuned one model without DP and one with DP (epsilon = 3, delta = 1e-5), sampled answers on the test set for both, and reproduced the same evaluation as in the Databricks tutorial: GPT-4 judges the quality of each answer based on its correctness and its similarity to the reference answer, orchestrated with MLflow, again as in the tutorial.
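MLflow ships LLM-judge metrics for exactly this kind of evaluation; here is a sketch with placeholder data (the judge-model URI and column layout follow MLflow's evaluate API; the example rows are made up):

```python
import mlflow
import pandas as pd
from mlflow.metrics.genai import answer_correctness, answer_similarity

# Placeholder rows; in practice these come from the test split and the
# answers sampled from the fine-tuned models.
eval_df = pd.DataFrame({
    "inputs": ["How do I create a Delta table?"],
    "ground_truth": ["Use CREATE TABLE ... USING DELTA."],
    "predictions": ["Create it with CREATE TABLE ... USING DELTA."],
})

with mlflow.start_run():
    results = mlflow.evaluate(
        data=eval_df,
        targets="ground_truth",
        predictions="predictions",
        extra_metrics=[
            answer_correctness(model="openai:/gpt-4"),  # GPT-4 as the judge
            answer_similarity(model="openai:/gpt-4"),
        ],
    )
print(results.metrics)
```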

As shown in the graph below, both models obtain the best grade in 90% of cases, in terms of both correctness and similarity. The mean utility is, as expected, slightly lower for the DP model, but privacy is guaranteed!
