Training LLMs Like a Pro: Guide to Unleashing Business Impact with LLMs

Karthik Vadhri
Published in Intuition Matters
11 min read · Nov 28, 2023

Data Science, with its multidisciplinary nature, continues to expand its wings and has ushered in a new era of innovation. The latest buzzword, Generative AI, has seen a massive uptick in the last six months. This data-trained technology has grown rapidly in a very short timeframe and reignited the debate about machines replacing humans. The debate gained momentum as Gen AI models outperformed humans on various tasks, achieving feats like passing MBA exams. No technology has been as transformative as AI when it comes to the idea of machines replacing humans, and at its core it involves teaching a model to emulate human thought processes based on extensive historical data.

Missing out on chatting about LLMs and Gen AI? That’s like ordering pizza and forgetting the cheese — missing the best part!

For the data scientist in me, this opens up a huge learning opportunity and another chance to demonstrate that data is the new oil. With every evolution in the field of AI, and every improvement in models' ability to learn and adapt, the significance of data becomes increasingly evident, akin to a valuable natural resource.

With the rise of the new technological paradigm in the form of LLMs, and ChatGPT becoming a buzzword among leaders across industries, there is a fear in the data science fraternity that this transformative wave may run into the same issues as its predecessors: data quality, limited adoption, and the struggle to make a truly meaningful impact. The intent of this article is to ensure you don't fall into the trap of failure when discussing LLM use cases with business leaders and building POCs.

Question answering chatbots over an internal knowledge base are among the most commonly known use cases of LLMs & Gen AI across industries, so let's start there. Note that the numbers below are purely hypothetical and are intended only to illustrate the context and provide perspective.

Context: A company with 40,000 employees has a large team of HR agents who support employees with questions about policies and other HR-related topics.

Problem Statement: Can we create a chatbot that uses the company’s internal knowledge base to answer employee questions on the go, without the need for employees to create tickets?

Scope: There are about 1,000 KB articles, and about 70K tickets are raised. Preliminary analysis showed that 40% of the tickets have self-serve potential.

Estimated Business Impact — Assuming only 50% of the potential self-serve tickets can be resolved by this QnA bot, and the cost of resolving a ticket is at minimum $50 (which is on the lower side), the estimated business impact is 70,000 tickets × 40% × 50% × $50 = 700K USD.
Make sure to estimate the potential business impact in the early phases. The calculation is based on the cost incurred in resolving a ticket (total cost of the employee divided by the man-hours spent on the ticket), multiplied by the number of tickets the bot can potentially resolve.
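As a quick sanity check, this back-of-the-envelope calculation can be scripted so stakeholders can tweak the assumptions themselves. The figures below are the hypothetical numbers from this scenario, not real data.

```python
# Back-of-the-envelope impact estimate; all figures are the hypothetical
# numbers from the scenario above, not real data.
total_tickets = 70_000        # tickets raised
self_serve_share = 0.40       # share of tickets with self-serve potential
bot_resolution_rate = 0.50    # assumed share the bot actually resolves
cost_per_ticket = 50          # USD, a conservative fully-loaded cost

impact = total_tickets * self_serve_share * bot_resolution_rate * cost_per_ticket
print(f"Estimated business impact: ${impact:,.0f}")  # $700,000
```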

Primary Users — Employees

  • KPIs:
    - % reduction in Tickets after the launch
    - Employee Experience

As outlined earlier, this article is not intended to explain all the fundamentals of LLMs; instead, the focus is on identifying the potential pitfalls and the right questions to ask to avoid them. With this context, let's explore the various phases of building with LLMs and the right questions to consider. This article is divided into 2 sections: the first concentrates on building a system capable of answering questions from KB articles, while the second emphasises leveraging this model to generate tangible business impact.

Guide to Unleashing Business Impact with LLMs

Part 1: Building the LLM to Answer Employee Questions

1. Understand the quality of data being used to train the models

One important step comes before jumping into the cool stuff, and it is the same as in any AI / Data Science use case: Data. In the case of LLMs, it is the text data from internal KB articles.
The age-old paradigm with regards to data still applies here: Garbage In, Garbage Out.
Below are a few pointers to be wary of before jumping into building an LLM.

  1. Is the content in the Knowledge Base articles up to date, and when were they last updated? Do we need to spend effort updating the KBs so that the answers drawn from them are accurate?
  2. What do the viewer statistics on these KBs look like?
  3. Number of tokens per article.
    Understanding the distribution of tokens will help create a strategy for chunking the KBs and estimate the cost of creating embeddings for them. Please note that the number of tokens and the number of words are not the same; one word can be made up of multiple tokens. A quick way to profile token counts is sketched below.
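Here is a minimal sketch of profiling token counts per article, assuming OpenAI's tiktoken tokenizer as a stand-in for whichever model's tokenizer you end up using. The articles are hypothetical placeholders; in practice you would load them from your KB export.

```python
# A minimal sketch of profiling token counts per KB article, assuming
# OpenAI's tiktoken tokenizer; swap in the tokenizer of your chosen model.
import tiktoken

articles = {  # hypothetical KB articles; load these from your KB export
    "leave-policy": "Employees accrue 1.5 days of paid leave per month...",
    "expense-policy": "Submit expense reports within 30 days of travel...",
}

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-3.5/4 models
token_counts = {name: len(enc.encode(text)) for name, text in articles.items()}

# Print the articles from largest to smallest to spot chunking candidates.
for name, count in sorted(token_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {count} tokens")
```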

2. Choosing the right Model Tuning Method

Although GPT-4 is available out of the box and can answer a lot of questions, it doesn't know or understand the domain specifics of your HR policies. Hence, your best bet is to take a pre-trained model (GPT-4, LLaMA, etc.) and figure out a way to teach it the context of your data.
At present, there are 3 prominent techniques for model tuning:

  1. Comprehensive Fine-tuning: This approach involves adapting all LLM parameters based on task-specific data.
  2. Parameter-efficient Fine-tuning (PEFT): It focuses on modifying specific parameters for more efficient adaptation, conserving computational resources.
  3. RAG (Retrieval Augmented Generation): This method combines prompt refinement with the retrieval of information from databases, resulting in answers enriched with contextual information.

This article on the Galileo Blog clearly outlines how to choose between RAG and fine-tuning, highlighting their strengths, weaknesses, and the factors that can help you make an informed choice for your LLM project.
Below is an infographic representation of how RAG works.

Image Source — Qendel AI

Considering the cost, ease of implementation, transparency & reduced hallucinations, RAG is the recommended choice for the Question Answering Model.

3. Prompt Engineering

Prompt engineering is the process of designing clear and specific instructions for the LLM to generate the best possible results. By carefully crafting prompts, we can guide the LLM to understand our intentions and produce relevant, accurate, and useful outputs. Prompting is, in essence, a way of instructing the LLM on how to respond to a question.

Prompt engineering is both a science and an art

Prompt engineering can help mitigate the impact of improper questions asked by end users; however, it cannot eliminate it completely.
Here are a few tips for creating effective prompts for LLMs:

  1. Include your Vision & Mission Statement, which helps articulate your goals & objectives clearly.
  2. Give the model a specific role, like “act as an HR agent”, to guide the LLM’s tone, style, and responses.
  3. Explicitly state your expectations for the LLM’s response. Use phrases like “Be elaborate,” “Forget everything else & respond based on the context below,” or “Respond only if you can find an answer from the content below” to provide clear guidance.
  4. Use straightforward language, avoiding metaphors and slang, to ensure the LLM understands your instructions accurately.
  5. Specify the desired tone and politeness of the response. For example, you can instruct the model to “respond in a formal tone.”
  6. Specify the format you require for the LLM’s output, such as XML, to facilitate further processing or presentation.
  7. Be mindful of the token limit, which is the maximum number of tokens (not words) the LLM can process in a single prompt and response. Structure your prompt concisely and add a phrase like “Limit response to [number] words” to ensure the LLM has sufficient tokens left for the response.

Careful and thoughtful prompt design is a valuable practice for optimising the performance of LLMs. It helps reduce hallucinations, ensures fairness & bias are handled correctly, gives you control over the output, and allows you to customise the style, tone, or format of responses.
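Putting several of these tips together, here is an illustrative system prompt for the HR bot. The company name, the exact wording, and the answer_context placeholder are all hypothetical examples, not a prescription.

```python
# An illustrative system prompt combining several of the tips above; the
# company name and the answer_context placeholder are hypothetical.
SYSTEM_PROMPT = """You are an HR agent for Acme Corp. Respond in a formal,
polite tone. Respond only if you can find an answer in the context below;
otherwise say you don't know and suggest raising a ticket.
Limit your response to 150 words.

Context:
{answer_context}
"""

# At query time, fill in the retrieved KB chunks before calling the LLM.
prompt = SYSTEM_PROMPT.format(answer_context="<retrieved KB chunks go here>")
```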

4. Choosing the right Embedding Technique & Retrieval Mechanism

This step is a big deal — it’s the crux of the RAG technique. We’re talking about how we make the algorithm understand what’s written and how we find the right content when someone asks a question. Below are a few important aspects to consider:

  1. Vectorizing KB Articles
    Selecting the right embedding model means contemplating options like Ada by Open AI, INSTRUCTOR-XL, etc. The choice should be made considering the budget (open source vs licensed), number of tokens, etc. A vector database is used to store these embeddings for faster retrieval.
  2. Segmenting Knowledge Base Articles — Dividing the Knowledge Base articles into smaller chunks for effective retrieval. The decision on chunked vs un-chunked articles, and the size of each chunk, should be taken keeping in mind the token limits of the LLMs.
  3. Retrieval Mechanism
    This is the core of RAG, making sure the answers are to the point. It is about figuring out how to find the articles/chunks that are most related to what the user is asking.

Two pivotal questions need to be addressed here:
a) What similarity metric is used to compare the embeddings?
b) How many similar articles or chunks should be included in the prompt as context?
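To make this concrete, below is a minimal retrieval sketch using the sentence-transformers library as a stand-in for whichever embedding model you choose. Cosine similarity answers question (a), and the top_k parameter answers question (b); the chunks are hypothetical placeholders.

```python
# A minimal retrieval sketch, assuming sentence-transformers as a stand-in
# for the chosen embedding model; the chunk texts are hypothetical.
from sentence_transformers import SentenceTransformer, util

chunks = [
    "Employees accrue 1.5 days of paid leave per month.",
    "Expense reports must be submitted within 30 days of travel.",
    "Parental leave is 16 weeks, fully paid.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
# In production these embeddings would be indexed in a vector database.
chunk_embeddings = model.encode(chunks, convert_to_tensor=True)

question = "How much paid leave do I get?"
query_embedding = model.encode(question, convert_to_tensor=True)

# Cosine similarity is the metric (question a); top_k controls how many
# chunks go into the prompt as context (question b).
hits = util.semantic_search(query_embedding, chunk_embeddings, top_k=2)[0]
for hit in hits:
    print(f"score={hit['score']:.3f}  {chunks[hit['corpus_id']]}")
```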

5. Choice of LLMs for generating responses

In my previous article, Unveiling the Intuition behind LLMs and Gen AI : Demystifying the Hype, I discussed various LLM architectures & how to choose the right architecture for a use case. Since the use case at hand is a question answering bot, which falls under the category of text generation, we will need an autoregressive model. A few examples in this category are GPT-3.5, Bard, Bloom, etc. Below are a few points to consider when choosing an LLM:

  1. Budget Constraint: This helps make the choice between open source and licensed LLMs, which can significantly impact your budget.
  2. Performance Efficiency: Consider how efficiently the LLM performs for your specific task or application.
  3. Token Limits: Different LLMs come with varying token limits, which cap the number of tokens the LLM can process in a single prompt or response. Choose one that aligns with your content requirements.
  4. Available Infrastructure: Assess the infrastructure available to you for deploying and running the LLM. If you have limited hardware resources, consider cloud-based or API-based solutions.

Additionally, the infographic from Janna Lipenkova on Towards Data Science has gained a lot of traction when it comes to choosing the right LLM.

Choosing the right language model for your NLP use case
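For illustration, here is a minimal generation sketch assuming the OpenAI Python SDK (v1.x); the prompt content is a hypothetical placeholder, and other providers expose similar chat-style APIs.

```python
# A minimal generation sketch, assuming the OpenAI Python SDK (v1.x).
# The prompt content is a hypothetical placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "Act as an HR agent. Answer only from the provided context."},
        {"role": "user",
         "content": "Context: <retrieved chunks>\n\nQuestion: How much paid leave do I get?"},
    ],
)
print(response.choices[0].message.content)
```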

Part 2: Ensuring the LLM is Used to Create Actual Business Impact

6. Metrics for comparing/evaluating LLMs

Evaluating LLMs requires a holistic approach that considers both technical and task-specific metrics. This combination is critical in achieving an in-depth assessment of how well LLMs perform. Technical metrics establish a baseline for the model’s general linguistic capabilities, while task-specific metrics assess how well it works in particular areas or tasks.

Traditionally, metrics like accuracy, precision, recall, etc. have served as the standard for evaluating machine learning models. However, in the case of LLMs, the classical concept of accuracy is not directly applicable.

Technical metrics include BLEU score, perplexity, and others tracked on forums like Hugging Face's leaderboard, which offers a platform to submit models for automated evaluation on a GPU cluster against industry-accepted benchmarks for overall language understanding and generation.

These technical metrics, however, might not capture the nuances of specific tasks, such as question answering or text summarisation. These limitations can be addressed by task-specific metrics, which are analogous to accuracy and specificity in traditional ML models and help assess the LLM's performance on particular tasks. Below are a few task-specific metrics to consider:
a) Similarity: evaluate the similarity of the generated response to the intended response.
b) Consistency: measure the LLM's ability to produce consistent responses across multiple prompts with minor changes.
c) Verbosity: how many words are being generated.

These metrics require creating a sample set of QnAs to test the model, which can be created either by subject matter experts or from historical tickets.
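As an illustration, here is a sketch of the similarity and verbosity metrics, scoring a generated answer against an SME-written reference using embedding cosine similarity. The sentence-transformers model is an assumption; any embedding model would do, and the QnA pairs are hypothetical.

```python
# A sketch of the similarity and verbosity metrics: score generated answers
# against SME-written references with embedding cosine similarity.
# sentence-transformers is an assumption; any embedding model works.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

eval_set = [  # hypothetical QnA pairs created by SMEs or mined from tickets
    ("How many leave days per month?",
     "Employees accrue 1.5 paid leave days per month."),
]
generated = ["You accrue 1.5 days of paid leave every month."]

for (question, reference), answer in zip(eval_set, generated):
    ref_emb = model.encode(reference, convert_to_tensor=True)
    gen_emb = model.encode(answer, convert_to_tensor=True)
    similarity = util.cos_sim(ref_emb, gen_emb).item()
    verbosity = len(answer.split())  # crude verbosity proxy: word count
    print(f"Q: {question}\n  similarity={similarity:.3f}, verbosity={verbosity} words")
```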

7. Creating a consumption layer

How will end users consume the output of the question answering model? Accessibility of the model will be crucial. It might seem easy to say it will be integrated into the ticket creation workflow for employees; however, that requires integrating with a 3rd-party tool like ServiceNow, which is a tedious effort and requires difficult conversations.

Leadership would need to be absolutely convinced that the model works before those conversations start. Hence, consider creating a quick and simple interface, either in the form of a web application (using Python web frameworks like Django, Streamlit, Flask, etc.) or, if your organisation is Slack-heavy, a Slack bot, which would be the most convenient to build.
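For instance, a bare-bones Streamlit interface could look like the sketch below, where answer_question is a hypothetical wrapper around the retrieval and generation pipeline from Part 1.

```python
# A minimal Streamlit sketch of the consumption layer; answer_question is a
# hypothetical wrapper around the retrieval + LLM pipeline from Part 1.
import streamlit as st

def answer_question(question: str) -> str:
    # Placeholder: retrieve relevant chunks and call the LLM here.
    return "This is where the RAG pipeline's answer would appear."

st.title("HR Policy Assistant")
question = st.text_input("Ask a question about HR policies:")
if question:
    with st.spinner("Searching the knowledge base..."):
        st.write(answer_question(question))
```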

8. Setting Expectations with Business Stakeholders:

Business leaders will focus primarily on how this model will increase the efficiency or productivity of employees. The expectation is often that the LLM is a magic wand, and that simply creating a model using GPT will drive business impact. But as a data scientist you know the hurdles in this part, so be sure to think it through and set expectations right in advance, rather than stumbling during the meeting.

  1. Communicate the anticipated cost & get approvals for budget
    It is important to estimate the costs and keep business stakeholders informed about the one-time cost of developing the LLM solution & the ongoing cost of running it in production. Below are a few considerations for cost estimation:
    a) Compute Infrastructure — Estimate the required compute resources (GPU instances) and their hourly cost based on cloud pricing and instance type.
    b) Data Storage: Storing large datasets for training & fine-tuning LLMs can incur costs for cloud-based storage solutions.
    c) Token Usage Patterns: Most LLM providers have a token-based pricing model.
    For example, if you choose GPT 3.5 (4K context) from Open AI, here is how to estimate the cost (a reusable sketch follows this list):
    Estimated Input Length: 500 tokens (approximately 375 words) per prompt
    Estimated Output Length: 500 tokens (approximately 375 words) per response
    Average Interactions per Day: 100 interactions
    Pricing Model: $0.002 per 1,000 tokens
    Daily Token Usage: (500 input + 500 output tokens) * 100 interactions = 100,000 tokens
    Daily Cost: 100,000 tokens / 1,000 * $0.002 = $0.20
    Monthly Cost: $0.20/day * 30 days = $6.00
    Annual Cost: $6.00/month * 12 months = $72.00

    Estimating the cost upfront helps assess whether investing in this project is worthwhile and has the potential for a positive return, or whether the application is merely riding the wave of excitement without any substantial future impact.
  2. Support required from SMEs for validation
    When you train the LLM to respond to questions, only the subject matter experts have the relevant context on what types of questions to ask, and they can give feedback on whether or not the generated answers are accurate. The SMEs can also create a sample set of questions & answers to run the evaluation metrics discussed in the Evaluation Metrics section.
  3. Coach the users to ask the questions correctly.
    As noted earlier, prompt engineering can mitigate the impact of improper questions asked by end users, but it cannot eliminate it completely. It therefore becomes important to coach users on how to ask the right questions.
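Here is a small sketch generalising the cost arithmetic above into a reusable function; the pricing and usage figures are the hypothetical numbers from this article, so plug in your own before presenting to stakeholders.

```python
# A small cost estimator generalising the worked example above; pricing and
# usage figures are the hypothetical numbers from this article.
def estimate_annual_cost(tokens_in: int, tokens_out: int,
                         interactions_per_day: int,
                         price_per_1k_tokens: float) -> float:
    daily_tokens = (tokens_in + tokens_out) * interactions_per_day
    daily_cost = daily_tokens / 1000 * price_per_1k_tokens
    return daily_cost * 30 * 12  # 30-day months, as in the example above

print(f"${estimate_annual_cost(500, 500, 100, 0.002):,.2f}")  # $72.00
```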

Alongside the above-mentioned caveats, the development and implementation of LLMs must carefully address various use-case-specific factors, including security and data privacy. These concerns are particularly relevant given the potential for LLMs to process and generate sensitive information. A few other use cases where LLMs can add real value include text summarisation, language translation, etc.

As LLMs continue to evolve at a fast pace, it is crucial for data scientists to remain vigilant in identifying and addressing the ethical and societal implications. By proactively addressing these concerns, we can harness the power of LLMs while ensuring their responsible and beneficial use.

Hope you enjoyed reading this and now understand the questions to consider & the pitfalls to avoid when thinking about LLM use cases. Please feel free to comment with your experience with LLMs and your success stories of driving “real” business impact with Gen AI.

Follow Intuition Matters to get a deeper understanding of core concepts behind the buzz words through the power of intuition.
