How to Fine-Tune LLMs on Custom Data? Code Explained

vTeam.ai
Data Science in your pocket
2 min read · Nov 22, 2023

We have talked a lot about LLMs and LangChain in past blogs. It’s high time we address the elephant in the room: how to actually use these LLMs for real-world business use cases. A major step here is fine-tuning pre-trained LLMs on the custom datasets a company owns. How is it done? We will walk through it today. But before that, let’s clear up some jargon:

  • Pretraining in machine learning involves training a model on a large general dataset to understand language or patterns before adapting it to specific tasks. Pretraining initializes the model’s understanding of the data and its structure.
  • Fine-tuning is a process where a pre-trained model is further trained on a smaller, task-specific dataset. This adjustment helps the model specialize in a particular task, making it more accurate and effective in handling specific types of data and questions.
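To make the distinction concrete, here is a minimal sketch (assuming the Hugging Face transformers library) of loading a pre-trained checkpoint as the starting point for fine-tuning; flan-t5-small is the model used later in this post:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Fine-tuning starts from an existing pre-trained checkpoint;
# here we load flan-t5-small, the model used in this post.
model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```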


So, without wasting any time, let’s get started. We will be training flan-t5-small on summarization data (in JSON format) that has the following fields:

  • id: document id
  • dialogue: Conversation between two people
  • summary: Summary of the conversation

More information about the dataset can be found here: https://huggingface.co/datasets/samsum
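Before training, the dataset needs to be loaded and tokenized. Here is a minimal sketch using the datasets and transformers libraries; the max lengths and the "summarize: " prefix are illustrative assumptions, not values taken from the original post:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Load the samsum dataset (columns: id, dialogue, summary).
dataset = load_dataset("samsum")
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")

def preprocess(batch):
    # Prefix the task so flan-t5 treats the input as summarization
    # (the prefix and max lengths here are assumptions for illustration).
    inputs = ["summarize: " + d for d in batch["dialogue"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=["id", "dialogue", "summary"]
)
```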

So, we will be fine-tuning the model on a summarization task. The final training setup looks something like the sketch below.
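As a rough illustration, here is a hedged sketch of the training step using Hugging Face’s Seq2SeqTrainer, continuing from the tokenized dataset above; the hyperparameters are illustrative placeholders, not the values from the full blog post:

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
# The collator pads inputs and labels dynamically per batch.
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

# Hyperparameters below are assumptions for illustration only.
args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-small-samsum",
    learning_rate=3e-4,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    evaluation_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=collator,
    tokenizer=tokenizer,
)
trainer.train()
```

The trainer runs evaluation on the validation split once per epoch and saves checkpoints to the output directory.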

Find the full code here: vteam.ai/blog/posts/how-to-fine-tuning-ll-ms-on-custom-data-codes-explained

Originally published at https://vteam.ai.
