Identifying Intents Efficiently with Serverless Fine-tuning

Fine-tuning is dead. Long live fine-tuning.

Here we go again. A mere 3 months after the release of Llama 3, Meta have altered the game again with the release of their new suite of 3.1 models. The titan that is Llama3.1-405B will naturally grab the headlines, but Llama3.1-8B and Llama3.1-70B continue the ongoing trend of highly efficient smaller models, enhancing the practicality of LLMs for a broader set of use cases. We're moving away from the AI-at-all-costs mentality of 12 months ago toward sensible, value-oriented applications.

Driven by efficiency and a growing maturity across AI capabilities, the unsung hero of Generative AI is also having a moment of re-emergence. Fine-tuning enables organisations to rapidly improve a foundation model's accuracy on bespoke tasks, matching the capabilities of enormous models for a fraction of the cost. The serverless Cortex AI fine-tuning service goes one step further, lowering the barrier to creating your own fine-tuned model, and making it easier still to run that bespoke model upon completion. Here at Snowflake, we've committed to enabling all three Llama3.1 models across the Cortex AI layer, making it easier than ever to introduce these powerful models to your bespoke knowledge base.

This blog will serve as your guide to fine-tuning, demonstrating Cortex AI's user-friendliness and showing how it can significantly cut down your LLM inference costs.

Rather than frame this all around technology, let's explore a use case that commonly benefits from fine-tuning: call center optimisation and intent analysis. During the early stages of my career, I found myself on a tour of various large call centers across the UK. The halcyon days of upgrading clients from Windows XP to Windows 7. Imagine thousands upon thousands of agents, continuously handling an immense volume of calls. Mountains of highly valuable unstructured transcripts built on this flow of conversation between customers and the business. Each agent is also meticulously monitored, from calls answered, to time of conversation, to minutes lost to bathroom breaks. A data-generating paradise, if nothing else.

Press 1 for help, press 2 to help yourself — powered by AI.

Impressive as they are, it'll come as no surprise to anybody who's recently attempted to track down a missing parcel: most businesses would rather not talk to you. For the most part, it's not personal. In fact, I'm sure you're wonderful. But in a high-volume business, handling your call takes time, and time is money. From a business point of view (and occasionally the customer's), wouldn't it be far better if you could solve your problems all on your own? No more awkward conversations with other human beings desperate to pee?

This is where intent comes into the picture. If a company can predict and account for the reason you're dialling in advance, deflecting you to an automated system or online resource, then it can significantly drive down the number of calls answered, saving everyone precious time. Large language models, introduced to a business's desired outcomes, can analyse customer conversations at scale, allowing you to monitor the impact and value of journey-optimisation processes in near real time.

“They call me the model whisperer – why should I fine-tune?”

Firstly, congratulations. Secondly, much like our scenario, we're all about gaining maximum efficiency. The challenge we're always looking to solve is that this new generation of LLMs is fantastic at general-purpose applications, but (hopefully) has no knowledge of your unique proprietary data.

Now, the use of highly capable models in conjunction with tailored prompting can do wonders. However, the per-token cost associated with using a large general-purpose model (such as Llama3.1-405B or Mistral-Large) can be significant, thanks to their enormous size and scale. Equally, extensive prompt engineering brings with it an additional cost per prompt, amplified further if there's an over-reliance on techniques such as RAG. For some use cases, the prompt size and scale may be inconsequential. But in our intents example, individual call numbers can spiral into the hundreds of thousands per day, so it's imperative that the runtime processing is as streamlined as possible.

Back to fine-tuning. By introducing a model to a sample set of highly curated data, business definitions, and example outcomes, you can enable more nimble models (such as Llama3.1-8B and Mistral-7B) to excel far beyond their base accuracy, retain their low-latency response times, and do it all for a fraction of the inference cost.

This is achieved through a process called Parameter-Efficient Fine-Tuning (PEFT). Think of it as adjusting a small number of additional parameters that you introduce to the model, aiding its understanding of new domain-specific knowledge in the process. Until recently, fine-tuning a model felt like a dark art. Cortex AI alleviates the common challenges through an incredibly straightforward approach to creating your own customised model. Simply call the serverless fine-tuning process directly on a table containing example prompts and completions (questions and answers), referencing the model name you wish to tune. Cortex handles the rest, including the various nuances associated with each model, and even stores the result directly in the model registry for immediate token-based inference.
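Cortex keeps the adapter machinery under the hood, but a rough LoRA-style illustration shows why PEFT is so cheap. The dimensions and rank below are illustrative placeholders, not Cortex internals:

```python
# Rough illustration of why PEFT is cheap: a LoRA-style adapter
# factorises an update to a d x k weight matrix into two low-rank
# matrices (d x r and r x k), so only a tiny fraction of parameters
# is ever trained. All figures here are illustrative assumptions.

def lora_adapter_params(d: int, k: int, r: int) -> int:
    """Trainable parameters for a rank-r adapter on a d x k weight."""
    return d * r + r * k

d, k, r = 4096, 4096, 8            # assumed hidden size and low rank
full = d * k                       # parameters in the frozen base weight
adapter = lora_adapter_params(d, k, r)

print(full)                            # frozen parameters: 16,777,216
print(adapter)                         # trainable parameters: 65,536
print(round(100 * adapter / full, 2))  # roughly 0.39% of the original
```

The base weights stay frozen; only the tiny adapter moves, which is what keeps serverless fine-tuning fast and inexpensive.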

No need to provision expensive compute to house that model or wrangle a complex code base; Snowflake abstracts away much of the pain. The glorious ease of use and cost efficiency of serverless, all for a bespoke model unique to your enterprise and its data.

“You Rang?”

To demonstrate this in practice, we're going to use the LLM Serverless Fine-Tuning Solution developed by Dash Desai and Vino Duraisamy. At its core, this solution classifies support tickets by first analysing their content with a large model, then using that generated data to guide the fine-tuning of a smaller, cheaper model.

Fine-tuning by standing on the shoulders of giant models.

It's not uncommon for Generative AI workloads to start off with a validation exercise using some of the largest models in the business, primarily as a means of assessing feasibility. You'll encounter scenarios where a model of the Llama3.1-405B or Mistral-Large class may excel at a given task, while the much smaller Llama3.1-8B and Mistral-7B fall short. In the example below, we can see how the large model clearly follows instructions, while the small model adds unnecessary flavour to the response.

7B is wrong, it doesn’t wanna be right.

It may be tempting at this point to stick with the much larger model for simplicity, but simplicity comes with a premium. Mistral-7B's cost per million tokens also happens to be 42x cheaper than its larger, weightier counterpart. If we can fine-tune to raise that model's quality and accuracy to match the task, the savings are enormous given the scale we're looking to process in our scenario.
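Some back-of-the-envelope arithmetic shows why that 42x matters at call-center volumes. The prices, call counts, and token counts below are purely illustrative placeholders, not quoted Cortex rates:

```python
# Back-of-the-envelope inference costs at call-center scale.
# Every number here is an illustrative assumption -- check the
# current Snowflake Cortex pricing table for real figures.

calls_per_day = 200_000
tokens_per_call = 1_500                      # assumed prompt + completion

large_price_per_m = 4.20                     # hypothetical $ per million tokens
small_price_per_m = large_price_per_m / 42   # the 42x ratio from above

daily_tokens = calls_per_day * tokens_per_call

large_daily = daily_tokens / 1_000_000 * large_price_per_m
small_daily = daily_tokens / 1_000_000 * small_price_per_m

print(f"${large_daily:,.2f}/day vs ${small_daily:,.2f}/day")
```

At those assumed volumes the per-day gap alone dwarfs the one-off cost of a fine-tuning run.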

To fine-tune well, we'll need to build a small training corpus of example prompts and completions. The emphasis is on small here; we're tuning, not training. The dataset required will ultimately depend on the particular use case, but the general consensus is: don't overdo it. 50–100 examples can be enough for a model to absorb the task, and LLMs have been observed to learn new skills from even fewer. Quality is far more important than quantity.

In the case of our intent and classification example, a call may have an identifiable outcome we could immediately handle (e.g. "how do I check my remaining minutes?" : balance check), it could be a more nuanced topic ("why did I receive roaming charges…" : complaint — roaming), or even a situation where routing to a human would be a positive outcome ("I'd like to cancel my account" : account cancellation). Ideally, the scenarios we're looking to identify and extract should be present in the training dataset to aid accuracy.
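To make that concrete, a handful of training rows in the prompt/completion shape described above might look like this. The wording and labels are illustrative; in practice these rows would live in a Snowflake table with PROMPT and COMPLETION columns:

```python
import json

# A few hypothetical training rows in prompt/completion form.
# Intent labels mirror the examples discussed above.
examples = [
    {"prompt": "How do I check my remaining minutes?",
     "completion": "balance check"},
    {"prompt": "Why did I receive roaming charges on my last bill?",
     "completion": "complaint - roaming"},
    {"prompt": "I'd like to cancel my account.",
     "completion": "account cancellation"},
]

# Print one JSON object per line, the shape you'd load into the table.
for row in examples:
    print(json.dumps(row))
```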

When fine-tuning, foundational ML approaches still apply. The training data should hold representative examples of the outcomes required, and there should be enough random sampling to prevent overfitting. Again, start small: you're not attempting to teach a model a vast new body of knowledge (in the way you would introduce knowledge with RAG); you're guiding a model to leverage its existing knowledge and skills more efficiently.

80:20 training to validation split with random sampling
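That split can be sketched in a few lines. The sketch below assumes the examples sit in a plain Python list; in practice you might sample directly in SQL over your Snowflake table:

```python
import random

def train_validation_split(rows, train_frac=0.8, seed=42):
    """Shuffle, then split rows into training and validation sets."""
    shuffled = rows[:]                     # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical corpus of 100 curated examples.
rows = [f"example_{i}" for i in range(100)]
train, valid = train_validation_split(rows)
print(len(train), len(valid))  # 80 20
```

Shuffling before cutting is what gives you the random sampling; the fixed seed just keeps the split repeatable between runs.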

So start with a small model and an appropriate training set as the baseline; we can always iterate and introduce more data later. All we now need to do is call the Cortex fine-tuning service and point it at our curated data. The hard-work portion of this exercise is over: Cortex will create the job and manage the process through to completion with maximum efficiency.
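The kickoff boils down to a single SQL statement built around the CORTEX.FINETUNE function. The table and model names below are hypothetical, and in practice the statement would be executed through a Snowflake session (for example `session.sql(...)` in Snowpark); check the Cortex fine-tuning documentation for the current signature:

```python
# Sketch of kicking off a serverless Cortex fine-tuning job.
# Table and model names are hypothetical placeholders.

base_model = "mistral-7b"
tuned_name = "support_intents_mistral_7b"   # hypothetical registry name

finetune_sql = f"""
SELECT SNOWFLAKE.CORTEX.FINETUNE(
  'CREATE',
  '{tuned_name}',
  '{base_model}',
  'SELECT prompt, completion FROM intents_train',
  'SELECT prompt, completion FROM intents_validation'
);
""".strip()

print(finetune_sql)
```

Swapping `base_model` for another supported model is all it takes to launch a parallel tuning run against the same training data.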

We could fine-tune 4 models simultaneously using the same training set just by swapping a parameter. Cortex is fun!

Once the service finishes updating the appropriate parameters, the model is logged to the Snowflake Model Registry as a first-class model object. Considering that there may be multiple fine-tuned models for any given task, this is vital for discovery and management.

Better still, the fine-tuned model is made available for immediate inference through the same Cortex Complete function as the foundation models. Not only does this make it incredibly simple to call, test, and evaluate the newly specialised model; fine-tuned models also benefit from token-based pricing. No need to acquire provisioned throughput for each use case: simply train your model and pay as and when you need to use it.
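A minimal sketch of running the tuned model over a batch of transcripts. The model, table, and column names are hypothetical; the generated statement would run inside Snowflake:

```python
# Once registered, the tuned model is callable through the same
# COMPLETE function as the foundation models. Names are hypothetical.

def complete_sql(model: str, source_table: str, text_col: str) -> str:
    """Build a Cortex COMPLETE statement that classifies every row."""
    return (
        f"SELECT {text_col}, "
        f"SNOWFLAKE.CORTEX.COMPLETE('{model}', {text_col}) AS intent "
        f"FROM {source_table}"
    )

print(complete_sql("support_intents_mistral_7b",
                   "call_transcripts", "transcript"))
```

Because this is plain token-based inference, the same statement scales from a ten-row smoke test to the full daily transcript table without any provisioning step.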

To wrap up: in almost no time and with very little effort, we're able to take a small, efficient model and, through careful tuning on a high-quality training set, raise its accuracy to match that of far more expensive models. Serverless fine-tuning makes this easy, slotting powerful, small, tailored models directly into Snowflake's unique Cortex AI layer. As enterprise adoption of LLMs evolves and matures, fine-tuning will be an integral part of ensuring that workloads (such as classification and intent analysis) remain cost-efficient and accurate, a powerful tool in AI-led success.
