Preventing Front Page Headlines: LLM Use Case Assessments

Başak Tuğçe Eskili
Published in Marvelous MLOps
5 min read · Feb 18, 2024

Co-written with Vechtomova Maria

All companies want to do something with LLMs and use GenAI in their pitch. Sometimes it’s top-down, when C-level executives want their company to catch up with the trends, and sometimes bottom-up, as an initiative of data science teams.

We have seen some successful examples from big companies, such as Zalando’s fashion assistant, implemented to help customers discover the most relevant products based on their query, and Booking.com’s AI Trip Planner, which guides travellers towards a destination or accommodation that fits their preferences.

Conversely, we have also seen some failures, or let’s call them “things that went wrong”.

Pak ‘n Save’s recipe bot was designed with good intentions: to help customers creatively use up leftovers. The user enters ingredients and gets a recipe generated by the LLM app. However, the lack of control over user queries led to the app generating a recipe for chlorine gas, and Pak ‘n Save ended up looking bad in the news.

Another example comes from the chatbot of DPD, a delivery firm. Users asked the chatbot to criticise DPD, and it obliged with negative comments about the company. This is definitely not what a company wants: its own chatbot disparaging the brand.

So how can you prevent making headlines on the front page?

We propose introducing impact, risk, and maturity assessments. These should be conducted for any machine learning project; in this article, we focus on the questions that matter when LLMs are involved. Impact and risk assessments must be done before a POC, and the maturity assessment after the POC.

Impact Assessment

Every project should start with an impact assessment. For a potential LLM use case, carefully answer the following questions:

  • Business problem: What is the business problem your organization is facing?
  • Need for LLM: Is using LLMs necessary to solve the identified problem?
  • Alignment with Objectives: Does the project align with the objectives of your team and the company? Consider factors such as strategic priorities, customer needs, and operational efficiency.
  • Estimated Costs: What is the estimated financial investment required for deploying LLMs: token usage, model licenses, infrastructure, and maintenance? (A rough cost sketch follows this list.)
  • Estimated Timeline: What are the stages of project implementation, from initial planning and development to deployment and integration into existing systems? Consider factors such as resource availability and technical complexities.
  • Estimated business impact/benefit: What are the potential benefits? Think about estimated improvements in productivity, cost savings, revenue generation, enhanced customer experience, and competitive advantage.
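
To make the cost question concrete, a back-of-the-envelope estimate can already be done at this stage. Below is a minimal sketch in Python; the request volumes, token counts, and per-token prices are purely illustrative assumptions, not any provider’s actual pricing:

```python
# Rough monthly cost estimate for an LLM use case.
# All numbers below are illustrative assumptions; plug in your own volumes and prices.
requests_per_day = 10_000
input_tokens_per_request = 800       # prompt + retrieved context
output_tokens_per_request = 300      # generated answer
price_per_1k_input_tokens = 0.0005   # USD, assumed
price_per_1k_output_tokens = 0.0015  # USD, assumed

daily_cost = requests_per_day * (
    input_tokens_per_request / 1000 * price_per_1k_input_tokens
    + output_tokens_per_request / 1000 * price_per_1k_output_tokens
)
monthly_cost = daily_cost * 30
print(f"Estimated token cost: ~${monthly_cost:,.0f} per month")
```

Token cost is only one component; licenses, infrastructure, and maintenance should be estimated on top of it.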

Risk Assessment

Risk assessments often get forgotten, but they are crucial to conduct before the POC phase. In the case of LLMs, answer the following questions:

  • Will LLM applications directly interact with customers? Customer-facing apps carry higher risks related to data privacy, accuracy, and user experience.
  • What is the strategy to deal with hallucinations? LLMs may generate inaccurate, misleading responses.
  • What are the privacy implications of using LLMs, particularly regarding the handling of sensitive or personally identifiable information?
  • What is the strategy to identify and prevent the LLM from discriminating?
  • What are possible scenarios for how user data might be exposed to unauthorized parties? What is the strategy to mitigate the risks?
  • What is the strategy to deal with possible FM provider outages and unexpected API design changes?
  • How do you make sure that the risk of errors or issues going unnoticed is minimized?
  • What are possible scenarios of users misusing the LLM for different purposes? What is the mitigation strategy? (A guardrail sketch follows this list.)
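
As a minimal illustration of the misuse and hallucination questions above, the sketch below shows one possible shape for input and output guardrails around an LLM call. The `call_llm` function and the blocklists are hypothetical placeholders; in practice you would likely rely on a moderation endpoint or a dedicated guardrails library rather than simple keyword checks:

```python
# Minimal sketch of input/output guardrails around an LLM call.
# `call_llm` and the blocklists are hypothetical placeholders.
BLOCKED_INPUT_TERMS = {"bleach", "ammonia", "chlorine"}  # assumed domain-specific blocklist
BLOCKED_OUTPUT_TERMS = {"toxic", "poisonous"}            # assumed output checks

def call_llm(prompt: str) -> str:
    # Placeholder for the actual foundation model call.
    return f"Recipe based on: {prompt}"

def guarded_completion(user_query: str) -> str:
    query_lower = user_query.lower()
    if any(term in query_lower for term in BLOCKED_INPUT_TERMS):
        return "Sorry, I can't help with that request."
    answer = call_llm(user_query)
    if any(term in answer.lower() for term in BLOCKED_OUTPUT_TERMS):
        return "Sorry, I can't provide a safe answer to that request."
    return answer

print(guarded_completion("leftover rice and chicken"))
print(guarded_completion("water, bleach and ammonia"))
```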

Maturity Assessment

This is a checklist to measure the maturity of your LLM application. Ideally, you should conduct it at the POC or MVP stage. It is an extension of the MLOps maturity assessment we introduced earlier.

There we cover crucial aspects such as documentation, code quality, monitoring, and traceability & reproducibility. The LLMOps-specific questions below address the maturity of fine-tuning and RAG systems.

For any third-party foundation model endpoint usage, it’s possible to look up:

  • Which endpoint & version is used
  • The structure of request & response
  • Token usage cost
  • Latency
  • Prompts that are sent to the endpoint and the corresponding responses
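
A minimal sketch of how such a lookup could be enabled: a thin logging wrapper around the endpoint call that records the endpoint, model version, prompt, response, latency, and token-based cost. The endpoint URL, model version, pricing, and `call_endpoint` function are hypothetical placeholders:

```python
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_calls")

# Hypothetical endpoint metadata; replace with your provider's details.
ENDPOINT = "https://api.example-fm.com/v1/chat"
MODEL_VERSION = "example-model-2024-01"
PRICE_PER_1K_TOKENS = 0.002  # assumed

def call_endpoint(prompt: str) -> dict:
    # Placeholder for the real API call; returns a response with token counts.
    return {"text": "...", "prompt_tokens": 120, "completion_tokens": 80}

def logged_completion(prompt: str) -> str:
    start = time.perf_counter()
    response = call_endpoint(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    total_tokens = response["prompt_tokens"] + response["completion_tokens"]
    logger.info(json.dumps({
        "endpoint": ENDPOINT,
        "model_version": MODEL_VERSION,
        "prompt": prompt,
        "response": response["text"],
        "latency_ms": round(latency_ms, 1),
        "total_tokens": total_tokens,
        "estimated_cost_usd": round(total_tokens / 1000 * PRICE_PER_1K_TOKENS, 6),
    }))
    return response["text"]
```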

For generating embeddings (for the vector database), it’s possible to look up:

  • Computational latency (for computing embeddings)
  • Latency (if an endpoint is used to get embeddings)
  • How documents are parsed
  • How chunks are created (the size)
  • What model is used to generate the embeddings
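
A minimal sketch of a chunking step that keeps these settings traceable. The chunk size, overlap, and embedding model name are illustrative assumptions:

```python
# Minimal sketch of document chunking with traceable settings.
# Chunk size, overlap, and the embedding model name are assumptions for illustration.
CHUNK_SIZE = 500   # characters per chunk
CHUNK_OVERLAP = 50
EMBEDDING_MODEL = "example-embedding-model-v1"  # hypothetical model name

def chunk_document(doc_id: str, text: str) -> list[dict]:
    chunks = []
    step = CHUNK_SIZE - CHUNK_OVERLAP
    for i, start in enumerate(range(0, len(text), step)):
        chunks.append({
            "doc_id": doc_id,
            "chunk_index": i,
            "text": text[start:start + CHUNK_SIZE],
            "chunk_size": CHUNK_SIZE,
            "embedding_model": EMBEDDING_MODEL,
        })
    return chunks
```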

For storing embeddings in a vector database, it is possible to look up:

  • How the database is updated with new documents
  • What metadata is saved together with the chunks
  • Which document and part of the document a chunk is from
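
Building on the chunk dictionaries from the previous sketch, the snippet below shows one way to store each chunk with metadata that keeps it traceable to its source document. The in-memory `vector_store` list stands in for a real vector database:

```python
# Minimal sketch of storing chunks with metadata so each embedding stays traceable
# to its source document. The in-memory list stands in for a real vector database.
vector_store: list[dict] = []

def upsert_chunk(chunk: dict, embedding: list[float]) -> None:
    vector_store.append({
        "embedding": embedding,
        "text": chunk["text"],
        "metadata": {
            "doc_id": chunk["doc_id"],            # which document the chunk came from
            "chunk_index": chunk["chunk_index"],  # which part of the document
            "embedding_model": chunk["embedding_model"],
        },
    })
```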

For retrieving embeddings from the vector database, it is possible to look up:

  • How many chunks are retrieved
  • The strategy for combining chunks
  • The strategy for integrating retrieved chunks into the prompt
  • The relevancy ranking
  • Metadata filtering
  • Which metadata is retrieved with embeddings
  • Which similarity algorithm is used
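
A minimal retrieval sketch covering top-k selection, optional metadata filtering, and cosine similarity as the (assumed) similarity algorithm. The `store` argument is a list of entries shaped like those in the storage sketch above:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve(query_embedding: list[float], store: list[dict],
             top_k: int = 3, doc_id: str | None = None) -> list[dict]:
    # Optional metadata filtering before similarity ranking.
    candidates = [
        item for item in store
        if doc_id is None or item["metadata"]["doc_id"] == doc_id
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return ranked[:top_k]  # the top-k chunks, with their metadata, feed the prompt
```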

For any prompt (prompt engineering), it’s possible to look up:

  • How the user query is enriched before it is sent to the foundation model (the prompt structure).
  • How external context is integrated into the user query (in the RAG case).
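
A minimal sketch of how retrieved chunks could be integrated into the user query in the RAG case, using the chunk entries from the retrieval sketch above. The system instruction and prompt template are illustrative, not a recommended prompt:

```python
# Minimal sketch of integrating retrieved chunks into the prompt (RAG case).
SYSTEM_INSTRUCTION = "Answer using only the provided context. Say 'I don't know' otherwise."

def build_prompt(user_query: str, retrieved_chunks: list[dict]) -> str:
    context = "\n\n".join(
        f"[{c['metadata']['doc_id']}#{c['metadata']['chunk_index']}] {c['text']}"
        for c in retrieved_chunks
    )
    return (
        f"{SYSTEM_INSTRUCTION}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {user_query}"
    )
```

Logging the fully assembled prompt alongside the endpoint call (see the logging sketch above) makes it possible to reconstruct exactly what the model saw.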

For fine-tuned model deployment, it’s possible to look up:

  • Corresponding code commit on Git
  • The infrastructure used for fine-tuning & serving
  • What model artifact is produced
  • What training data is used
  • The retraining strategy, including how often retraining happens
  • The training methodology (supervised, self-supervised, RLHF)
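
One way to make this information easy to look up is to record it as structured metadata alongside every fine-tuning run. The sketch below uses a plain dataclass; all field names and values are hypothetical examples:

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class FineTuningRun:
    git_commit: str           # corresponding code commit
    base_model: str
    training_data_uri: str    # where the training data lives
    methodology: str          # e.g. "supervised", "self-supervised", "rlhf"
    infrastructure: str       # compute used for fine-tuning & serving
    artifact_uri: str         # where the produced model artifact is stored
    retraining_schedule: str  # e.g. "monthly", "on data drift"

# All values below are hypothetical examples.
run = FineTuningRun(
    git_commit="abc1234",
    base_model="example-base-7b",
    training_data_uri="s3://bucket/training/v3/",
    methodology="supervised",
    infrastructure="1x A100, example-serving-cluster",
    artifact_uri="s3://bucket/models/finetuned-v3/",
    retraining_schedule="quarterly, or earlier on data drift",
)
print(json.dumps(asdict(run), indent=2))
```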

You can find the list of assessments here.

Conclusion

A year ago, the first companies started experimenting with LLMs. By now, some have applications running in production, often without proper guardrails in place. With the proposed assessments, you can understand where you are on the maturity scale (what percentage of the questions are answered with “yes”) and take action to make your LLM application more mature.
