Is Your Data Ready to Fine-Tune Your Business-Specific Large Language Models?

Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer

For the last year I have had many conversations about fine-tuning and about using Retrieval-Augmented Generation (RAG) to provide context to Large Language Models (LLMs). We have proven that it is possible to fine-tune Falcon and Llama within Snowflake, and we even have a QuickStart for that purpose. To implement RAG, we have shown how to use the vector data type and embeddings within Snowflake and how to Build an End-to-End RAG Application using Snowflake Cortex. Using Snowpark Container Services, you can also run a vector database like Weaviate within Snowflake, keeping all your data secured within the same platform.
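To give a flavor of what that looks like in practice, here is a minimal retrieval sketch in Snowpark Python using Cortex functions over a table of embedded chunks. The table and column names (DOC_CHUNKS, CHUNK_TEXT, CHUNK_VEC), the connection placeholders, and the choice of embedding and completion models are all illustrative assumptions, not a reference implementation.

```python
# Minimal RAG retrieval sketch with Snowflake Cortex via Snowpark Python.
# DOC_CHUNKS / CHUNK_TEXT / CHUNK_VEC are hypothetical names for this example.
from snowflake.snowpark import Session

connection_parameters = {
    "account": "<account>", "user": "<user>", "password": "<password>",
    "warehouse": "<warehouse>", "database": "<database>", "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# One-time setup: a table holding text chunks and their embeddings.
session.sql("""
    CREATE TABLE IF NOT EXISTS DOC_CHUNKS (
        CHUNK_TEXT VARCHAR,
        CHUNK_VEC  VECTOR(FLOAT, 768)
    )
""").collect()

# Populate embeddings for any chunks that do not have one yet.
session.sql("""
    UPDATE DOC_CHUNKS
    SET CHUNK_VEC = SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', CHUNK_TEXT)
    WHERE CHUNK_VEC IS NULL
""").collect()

# Retrieve the closest chunks for a question and pass them as context.
question = "What is our return policy for enterprise customers?"
context_rows = session.sql("""
    SELECT CHUNK_TEXT
    FROM DOC_CHUNKS
    ORDER BY VECTOR_COSINE_SIMILARITY(
        CHUNK_VEC,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('e5-base-v2', ?)
    ) DESC
    LIMIT 3
""", params=[question]).collect()

context = "\n".join(row["CHUNK_TEXT"] for row in context_rows)
answer = session.sql(
    "SELECT SNOWFLAKE.CORTEX.COMPLETE('mistral-large', ?) AS RESPONSE",
    params=[f"Answer using only this context:\n{context}\n\nQuestion: {question}"],
).collect()[0]["RESPONSE"]
print(answer)
```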

All of these are great experiments, but the big question is about the data you will use to fine-tune the models that are specific to your industry and business. Do you have the right data foundation? Is that data clean and usable? My colleagues Harini and Zohar took a good approach to summarizing clinical trial protocols with an interesting training dataset, and they also showed a RAG-based approach for a clinical trial assistant. In my conversations with System Integrators and clients, I find consensus that creating industry-specific models tailored to each customer’s needs is the direction to take. Many have heard my joke: “I do not want an LLM to do my kids’ homework; I want an LLM (maybe not so large) that understands my business, my products, and my clients, and can make the right recommendations.” But the big question is whether you have the right data foundation to either fine-tune a model or build the right context with RAG for those industry-specific models.

This paper from Microsoft, RAG vs Fine-Tuning: Pipelines, Tradeoffs, and a Case Study on Agriculture, discusses not only RAG vs fine-tuning but also a pipeline consisting of multiple stages: extracting information from PDFs, generating questions and answers, using them for fine-tuning, and leveraging GPT-4 to evaluate the results. Without the first step of acquiring and processing the right dataset, the other steps are irrelevant. While the paper discusses tools like BeautifulSoup (which reminds me of some work I did many years ago, classifying ski resorts from a Spanish website just for fun), my expectation is that you will use your own proprietary documents and data sets, not just data from the web.
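To make those first stages concrete, here is a small sketch, assuming the proprietary documents are PDFs: it extracts raw text with pypdf and writes question/answer pairs into a prompt/completion JSONL layout commonly used for fine-tuning. The generate_qa_pairs helper is a hypothetical placeholder; in the paper that step is driven by an LLM.

```python
# Sketch of the early pipeline stages: extract text from PDFs, then emit
# question/answer pairs as JSONL records for later fine-tuning.
# generate_qa_pairs is a hypothetical stand-in for the LLM-driven step.
import json
from pathlib import Path

from pypdf import PdfReader


def extract_text(pdf_path: Path) -> str:
    """Pull raw text out of every page of a PDF."""
    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def generate_qa_pairs(text: str) -> list[dict]:
    """Placeholder: in practice an LLM reads the extracted text and
    proposes question/answer pairs grounded in it."""
    return [{"question": "What does this document cover?",
             "answer": text[:200]}]


def build_training_file(pdf_dir: str, out_file: str) -> None:
    """Write prompt/completion records, one JSON object per line."""
    with open(out_file, "w", encoding="utf-8") as out:
        for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
            text = extract_text(pdf_path)
            for pair in generate_qa_pairs(text):
                record = {"prompt": pair["question"], "completion": pair["answer"]}
                out.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    build_training_file("proprietary_docs", "finetune_data.jsonl")
```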

We can see how choosing the right approach for chunking your documents will be a key decision. The paper discusses several tools to extract the right information from those documents, and a combination of different tools will likely be needed. In this blog entry we showed a simple way to chunk PDFs so you can ask questions of your own documents using Snowflake Cortex, and how effective RAG is when it provides the right context, but better techniques and more work will be needed to implement the right RAG strategy. The best choice varies widely by use case: you can chunk at sentences, paragraphs, documents, or even specific character cutoffs, and you may want to pre-process some documents first with tools like Snowflake Document AI to extract specific information. That will be a point of optimization for RAG.
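As a tiny illustration of those options, here is a sketch of two of them, chunking by paragraph and chunking by fixed character windows with overlap; the sizes are arbitrary starting points and would themselves be something to tune for your RAG pipeline.

```python
# Two simple chunking strategies for RAG; the sizes and overlap below are
# arbitrary starting points, not recommendations.

def chunk_by_paragraph(text: str) -> list[str]:
    """Split on blank lines, keeping each paragraph as one chunk."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def chunk_by_characters(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Fixed-size character windows with overlap, so context is not cut
    cleanly at an arbitrary boundary."""
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
    return chunks


if __name__ == "__main__":
    sample = open("my_document.txt", encoding="utf-8").read()
    print(len(chunk_by_paragraph(sample)), "paragraph chunks")
    print(len(chunk_by_characters(sample)), "character-window chunks")
```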

Snowflake engineering is working on building the most powerful SQL Large Language Model in the world. In their four-part blog series they describe some of the challenges they found along the way, and how moving from solved academic benchmarks to real databases is a complex task. Their approach of treating the evaluation data set as data living in a database, just like any other data, is quite interesting and makes life easier. They show some of the results of those evaluations.

This provided a systematic approach to data generation and evaluation that delivered continued improvements to their models.
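To picture what “evaluation data living in a database” can look like, here is a hypothetical sketch: an SQL_EVAL_SET table of questions and gold queries, iterated with Snowpark, scoring each example by whether the generated query returns the same rows as the gold one (execution accuracy). The table, its columns, and the generate_sql placeholder are illustrative assumptions, not Snowflake’s internal harness.

```python
# Sketch of treating an evaluation set as ordinary table data.
# SQL_EVAL_SET (QUESTION, GOLD_SQL) and generate_sql are hypothetical.
from snowflake.snowpark import Session


def generate_sql(question: str) -> str:
    """Placeholder for the text-to-SQL model being evaluated."""
    return "SELECT 1"


def execution_accuracy(session: Session) -> float:
    """Score each example by whether the generated query returns the same
    rows as the gold query (a common text-to-SQL metric)."""
    examples = session.table("SQL_EVAL_SET").collect()
    correct = 0
    for row in examples:
        try:
            predicted = session.sql(generate_sql(row["QUESTION"])).collect()
            gold = session.sql(row["GOLD_SQL"]).collect()
            correct += int(sorted(map(tuple, predicted)) == sorted(map(tuple, gold)))
        except Exception:
            pass  # a query that fails to run simply scores as wrong
    return correct / len(examples) if examples else 0.0
```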

Therefore, while we may be debating RAG vs fine-tuning, the key question is what data foundation we have to feed our LLMs. As many say, there is no AI Strategy without a Data Strategy.

Snowflake System Integrators will help you get ready to embark on your GenAI journey, making sure you are set up for success!

Carlos.-
