LLM Bootcamp Notes — Part 4: LLMOps

Anirudh Gokulaprasad
4 min read · Sep 16, 2023


The LLM Bootcamp series by FullStackDeepLearning offers great insights into the world of Generative AI by taking a very structured approach to the topic. The series goes from introducing LLMs all the way up to approaches and recommendations for production-grade LLM & Generative AI solutions.

I have consolidated my learning notes into a series of 4 articles, one for each of the 4 core focus areas for LLMs from the Bootcamp series — this part 4 article is on LLMOps.

CHOOSING THE BASE MODEL

The best model for a given use case depends on the trade-offs between:

  • The model’s out-of-the-box quality
  • Inference speed & latency
  • Cost associated with the model
  • Tuning & extensibility of the model
  • Data security, privacy & model licensing

Proprietary vs Open-Source — The decision between a proprietary and an open-source model boils down to a few rule-of-thumb points:

Proprietary:

  • Generally higher quality than open-source models
  • No open-source licensing friction to navigate
  • Minimal infrastructure overhead

Open-source:

  • Easier customisation
  • Data security and privacy can be ensured, since the model can run on your own infrastructure
  • Permissive licenses like Apache 2.0 allow you to do more or less whatever you want
  • Restricted licenses like CC BY-SA 3.0 attach conditions to use, including commercial use, but do not outright prohibit it
  • Non-commercial licenses like CC BY-NC-SA 4.0 prohibit commercial use

Recommendations for choosing the base model:

  • Start projects with GPT-4 and prototype in Python (a minimal sketch follows this list)
  • If cost/latency is a factor, then consider downsizing
  • GPT-3.5 & Claude are good comparable options for this
  • If more speed and lower cost are required, then Anthropic can be considered
  • If fine-tuning the model is needed, then Cohere can be considered
  • Use open-source only if it is really required
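
As a rough illustration of the prototyping recommendation, here is a minimal sketch assuming the official openai Python SDK (v1+) and an OPENAI_API_KEY in the environment; the complete() helper and the model strings are purely illustrative, not something prescribed by the bootcamp.

    # Minimal prototyping sketch (assumes the openai Python SDK v1+ and an
    # OPENAI_API_KEY in the environment; model names are illustrative).
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    def complete(prompt: str, model: str = "gpt-4") -> str:
        """Send a single-turn prompt and return the model's text reply."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Start with GPT-4; if cost/latency becomes an issue, downsize by swapping
    # the model string (e.g. "gpt-3.5-turbo") without touching the call sites.
    print(complete("Summarise LLMOps in one sentence."))

Keeping the model name as a parameter makes the later downsizing step a configuration change rather than a rewrite.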

ITERATION & PROMPT MANAGEMENT

Iteration and experiment management are a key part of the traditional deep learning workflow, but there are a few subtle differences in the LLM workflow:

  • Experiments are quick — whereas in traditional deep learning we train a model and wait hours for it to complete
  • Experiments are sequential, since changes to prompts are made, observed, and analysed one at a time — whereas in traditional deep learning hundreds of experiments are run in parallel with different models and hyperparameters in order to train/tune the model
  • Experimentation with prompts is limited — whereas in traditional deep learning there are hundreds of hyperparameter combinations and model settings to evaluate

The current state of prompt engineering and iteration does not require any advanced tooling to manage it. One of the main reasons is that there is no robust automatic evaluation method to tell whether one prompt is better than another by evaluating the model’s output — if such an evaluation emerges in the future, prompt-management tooling will become the need of the hour.

Three levels of prompt/chain tracking:

  • Level 1: Do nothing — good enough for v0 and early experimentation
  • Level 2: Track prompts via git — to keep track of what has already been tried (a minimal sketch follows this list)
  • Level 3: Track prompts in a specialised tool — to run parallel evaluations of multiple prompts at once, decouple prompt changes from the deployment cycle, or involve non-technical stakeholders
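
A minimal sketch of what Level 2 might look like: prompt templates stored as plain-text files in a git-tracked prompts/ directory (a layout I am assuming here, not one mandated by the bootcamp), so every prompt change is captured in version history alongside the code that uses it.

    # Level 2 sketch: prompts live as plain-text templates under a git-tracked
    # prompts/ directory (hypothetical layout), e.g. prompts/summarise.txt.
    from pathlib import Path

    PROMPT_DIR = Path("prompts")

    def load_prompt(name: str, **variables: str) -> str:
        """Load a prompt template by name and fill in its {placeholders}."""
        template = (PROMPT_DIR / f"{name}.txt").read_text()
        return template.format(**variables)

    # prompts/summarise.txt might contain:
    #   Summarise the following support ticket in two sentences:
    #   {ticket}
    prompt = load_prompt("summarise", ticket="My invoice total looks wrong...")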

HOW TO TEST LLMs

Build an evaluation dataset -

  • Start incrementally — begin with ad-hoc prompts and gradually organise them into a small dataset that can be used to benchmark/evaluate the model on the task
  • Ask an LLM for help — LLMs are good at generating input-output test cases, and an LLM can also act as an auto-evaluator (a sketch follows this list)
  • Build it gradually as the task/functionality expands — add more diverse data, e.g. underrepresented requests and topics, answers from another model that users disliked, and edge-case questions from users
  • Quantify the quality of the test set — a bit arbitrary as of now, but coverage of the test set is a good starting point: it should cover a reasonable range of what will be seen in production, such as the most popular queries, edge cases, and unclear-intent questions
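
A sketch of how such a dataset could be grown and scored, assuming a JSONL file of prompt/reference pairs and reusing the hypothetical complete() helper from the earlier sketch; the LLM-as-judge check is only a rough proxy, not a robust evaluation.

    # Incrementally grown eval set stored as JSONL (file name and fields are
    # illustrative); judge() uses an LLM as a rough auto-evaluator and relies
    # on the complete() helper sketched earlier.
    import json
    from pathlib import Path

    EVAL_FILE = Path("eval_cases.jsonl")  # one {"prompt": ..., "reference": ...} per line

    def add_case(prompt: str, reference: str) -> None:
        """Append a new test case, e.g. an underrepresented or edge-case query."""
        with EVAL_FILE.open("a") as f:
            f.write(json.dumps({"prompt": prompt, "reference": reference}) + "\n")

    def judge(prompt: str, answer: str, reference: str) -> bool:
        """Ask an LLM whether the candidate answer matches the reference."""
        verdict = complete(
            f"Question: {prompt}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Does the candidate answer match the reference? Reply YES or NO."
        )
        return verdict.strip().upper().startswith("YES")

    def run_eval() -> float:
        """Score the current model/prompt over the whole eval set."""
        cases = [json.loads(line) for line in EVAL_FILE.read_text().splitlines()]
        passed = sum(judge(c["prompt"], complete(c["prompt"]), c["reference"]) for c in cases)
        return passed / len(cases)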

DEPLOYING & MONITORING

Improving the output of LLMs in production:

  • Self-critique — ask another LLM whether the answer is correct. E.g., the Guardrails library can achieve this (a sketch follows this list)
  • Give multiple answers to the user and accept the one that the user chooses as relevant
  • Give multiple answers and then combine the outputs into one
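
A bare-bones sketch of the self-critique idea (the Guardrails library offers a more structured way to do this); it reuses the hypothetical complete() helper from earlier and simply feeds the critique back into a retry.

    # Self-critique sketch: a second LLM call checks the first answer and, if
    # it finds a problem, the critique is fed back into a regeneration attempt.
    def answer_with_self_critique(question: str, max_attempts: int = 2) -> str:
        answer = complete(question)
        for _ in range(max_attempts):
            verdict = complete(
                f"Question: {question}\nProposed answer: {answer}\n"
                "Is this answer correct and complete? Reply OK, or explain the problem."
            )
            if verdict.strip().upper().startswith("OK"):
                break
            answer = complete(f"{question}\n\nAvoid this problem: {verdict}")
        return answer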

What goes wrong with productionizing LLMs:

  • Difficult-to-use UI
  • Latency-dependent wait times
  • Hallucinations
  • Overly long answers that dodge or miss the point of the question
  • Toxicity in the generated results
  • Prompt injection attacks

CONTINUOUS IMPROVEMENTS

Any ML model is subject to continuous improvement over time, and the possibilities for doing so are well documented. For LLMs, there are 2 main ways to keep improving:

  • Account for user feedback to make the prompts better (a sketch follows this list)
  • Fine-tune the model, if possible, for a highly specific use case
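
A minimal sketch of the feedback side of this loop, assuming a simple JSONL log with an illustrative schema: disliked outputs can later be reviewed, turned into eval cases, or used to refine the prompts.

    # Feedback log sketch (schema and file name are illustrative): record each
    # interaction with a thumbs-up/down signal for later review.
    import json
    import time
    from pathlib import Path

    FEEDBACK_LOG = Path("feedback.jsonl")

    def log_feedback(prompt: str, answer: str, liked: bool) -> None:
        record = {"ts": time.time(), "prompt": prompt, "answer": answer, "liked": liked}
        with FEEDBACK_LOG.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def disliked_examples() -> list[dict]:
        """Candidates for prompt fixes, new eval cases, or fine-tuning data."""
        records = [json.loads(line) for line in FEEDBACK_LOG.read_text().splitlines()]
        return [r for r in records if not r["liked"]]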

Footnotes

To read the part 1 article of this series — Click here

To read the part 2 article of this series — Click here

To read the part 3 article of this series — Click here

Note — This article is a distilled consolidation of my understanding of the topic. If you find any conceptual errors, please leave feedback so that I can fix them. Cheers!
