How to Fine-Tune LLMs on Custom Data? Code Explained
We have talked a lot about LLMs and LangChain in past blogs. It’s high time we talk about the elephant in the room, i.e. how to actually use these LLMs for real-world business use cases. One major step here is to fine-tune pretrained LLMs on the custom datasets a company owns. How to do it? We will walk through it today. But before that, let’s get a few pieces of jargon cleared up:
- Pretraining in machine learning involves training a model on a large general dataset to understand language or patterns before adapting it to specific tasks. Pretraining initializes the model’s understanding of the data and its structure.
- Fine-tuning is a process where a pre-trained model is further trained on a smaller, task-specific dataset. This adjustment helps the model specialize in a particular task, making it more accurate and effective in handling specific types of data and questions.
READ FULL BLOG HERE: vteam.ai/blog/posts/how-to-fine-tuning-ll-ms-on-custom-data-codes-explained
So, without wasting any time, let’s get started. We will be fine-tuning flan-t5-small on summarization data (JSON format) with the following fields:
- id: document id
- dialogue: A conversation between two people
- summary: Summary of the conversation
More information about the dataset can be found here: https://huggingface.co/datasets/samsum
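To make the data format concrete, here is a minimal sketch of how one samsum-style record can be turned into an (input, target) pair for seq2seq fine-tuning. The field names match the dataset schema above; the record content and the `to_training_pair` helper are made up for illustration.

```python
# Hypothetical samsum-style record; field names mirror the dataset schema
# (id, dialogue, summary), but the content here is invented.
record = {
    "id": "0001",
    "dialogue": "Amanda: I baked cookies. Do you want some?\nJerry: Sure!",
    "summary": "Amanda baked cookies and will share some with Jerry.",
}

def to_training_pair(rec):
    """Turn one record into an (input, target) pair for seq2seq fine-tuning."""
    # T5-family models like flan-t5 are trained with an explicit task
    # prefix on the input text, so we prepend "summarize: " here.
    return "summarize: " + rec["dialogue"], rec["summary"]

source, target = to_training_pair(record)
print(source)  # the model's input text
print(target)  # the reference summary the model learns to produce
```

The same mapping is applied to every record before tokenization, so the model always sees the task prefix followed by the dialogue.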
So, we will be fine-tuning the model on a summarization task.
Find the full code here: vteam.ai/blog/posts/how-to-fine-tuning-ll-ms-on-custom-data-codes-explained
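As a rough sketch of what the fine-tuning step looks like, the snippet below runs a single gradient update of flan-t5-small on one hand-written dialogue/summary pair using the Hugging Face `transformers` and `torch` libraries. It is a minimal illustration, not the full training pipeline from the blog: the example record is invented, and a real run would iterate over the whole dataset for several epochs (typically via `Seq2SeqTrainer`).

```python
# Minimal fine-tuning sketch: one optimizer step of flan-t5-small on a
# single made-up samsum-style example. Assumes `transformers` and `torch`
# are installed and the model can be downloaded from the Hugging Face Hub.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

examples = [
    {"dialogue": "Amanda: I baked cookies. Do you want some?\nJerry: Sure!",
     "summary": "Amanda baked cookies and will share some with Jerry."},
]

# T5-style models expect a task prefix on the input text.
inputs = tokenizer(
    ["summarize: " + ex["dialogue"] for ex in examples],
    padding=True, truncation=True, max_length=512, return_tensors="pt",
)
labels = tokenizer(
    [ex["summary"] for ex in examples],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
).input_ids
# Mask pad tokens in the labels with -100 so they are ignored by the loss.
labels[labels == tokenizer.pad_token_id] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(**inputs, labels=labels).loss  # cross-entropy over summary tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"training loss after one step: {loss.item():.4f}")
```

In practice you would wrap this in a training loop (or hand the tokenized dataset to `Seq2SeqTrainer`), evaluate with ROUGE on the validation split, and save checkpoints along the way.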
Originally published at https://vteam.ai.