LLM Bootcamp Notes — Part 4: LLMOps

Anirudh Gokulaprasad
4 min read · Sep 16, 2023


The LLM Bootcamp series by FullStackDeepLearning offers great insights into the world of Generative AI by taking a very structured approach to the topic. The series goes from introducing LLMs all the way up to approaches and recommendations for production-grade LLM & Generative AI solutions.

I have consolidated my learning notes into a series of 4 articles, one for each of the 4 core focus areas for LLMs from the Bootcamp series — this part 4 article is on LLMOps.

CHOOSING THE BASE MODEL

The best model for a given use case depends on the trade-offs between:

  • The model’s out-of-the-box quality
  • Inference speed & latency
  • Cost associated with the model
  • Tuning & extensibility of the model
  • Data security, privacy & model licensing

Proprietary vs Open-Source — The decision between a proprietary and an open-source model boils down to a few rule-of-thumb points:

Proprietary:

  • Generally higher quality than open-source models
  • No open-source licensing friction to navigate
  • Minimal infrastructure overhead

Open-source:

  • Easier customisation
  • Data security and privacy can be ensured, since the model can run on your own infrastructure
  • Permissive licenses like Apache 2.0 allow you to do more or less whatever you want
  • Restricted licenses like CC BY-SA 3.0 attach conditions to use, including commercial use, but do not outright prohibit it
  • Non-commercial licenses like CC BY-NC-SA 4.0 prohibit commercial use

Recommendations for choosing the base model:

  • Start projects with GPT-4 and prototype in Python (a minimal sketch follows this list)
  • If cost/latency is a factor, then consider downsizing
  • GPT-3.5 & Claude are good comparable options for this
  • If more speed and lower cost are required, then Anthropic can be considered
  • If fine-tuning the model is needed, then Cohere can be considered
  • Use open-source only if it is really required
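
As a rough illustration of the prototyping recommendation, here is a minimal sketch assuming the official openai Python SDK (v1+) and an OPENAI_API_KEY in the environment; the complete() helper and the model strings are purely illustrative, not something prescribed by the bootcamp.

    # Minimal prototyping sketch (assumes the openai Python SDK v1+ and an
    # OPENAI_API_KEY in the environment; model names are illustrative).
    from openai import OpenAI

    client = OpenAI()  # picks up OPENAI_API_KEY from the environment

    def complete(prompt: str, model: str = "gpt-4") -> str:
        """Send a single-turn prompt and return the model's text reply."""
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

    # Start with GPT-4; if cost/latency becomes an issue, downsize by swapping
    # the model string (e.g. "gpt-3.5-turbo") without touching the call sites.
    print(complete("Summarise LLMOps in one sentence."))

Keeping the model name as a parameter makes the later downsizing step a configuration change rather than a rewrite.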

ITERATION & PROMPT MANAGEMENT

Iteration and experiment management are a key part of the traditional deep learning workflow, but there are a few subtle differences in the LLM workflow:

  • Experiments are quick — whereas in traditional deep learning we train a model and wait hours for it to complete
  • Experiments are sequential, since changes to prompts are made, observed, and analysed one at a time — whereas in traditional deep learning hundreds of experiments are run in parallel with different models and hyperparameters in order to train/tune the model
  • Experimentation with prompts is limited — whereas in traditional deep learning there are hundreds of hyperparameter combinations and model settings to evaluate

The current state of prompt engineering and iteration does not require any advanced tooling to manage it. One of the main reasons is that there is no robust automatic evaluation method to tell whether one prompt is better than another by evaluating the model’s output — if such an evaluation emerges in the future, prompt-management tooling will become the need of the hour.

Three levels of prompt/chain tracking:

  • Level 1: Do nothing — good enough for v0 and early experimentation
  • Level 2: Track prompts via git — to keep track of what has already been tried (a minimal sketch follows this list)
  • Level 3: Track prompts in a specialised tool — to run parallel evaluations of multiple prompts at once, decouple prompt changes from the deployment cycle, or involve non-technical stakeholders
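
A minimal sketch of what Level 2 might look like: prompt templates stored as plain-text files in a git-tracked prompts/ directory (a layout I am assuming here, not one mandated by the bootcamp), so every prompt change is captured in version history alongside the code that uses it.

    # Level 2 sketch: prompts live as plain-text templates under a git-tracked
    # prompts/ directory (hypothetical layout), e.g. prompts/summarise.txt.
    from pathlib import Path

    PROMPT_DIR = Path("prompts")

    def load_prompt(name: str, **variables: str) -> str:
        """Load a prompt template by name and fill in its {placeholders}."""
        template = (PROMPT_DIR / f"{name}.txt").read_text()
        return template.format(**variables)

    # prompts/summarise.txt might contain:
    #   Summarise the following support ticket in two sentences:
    #   {ticket}
    prompt = load_prompt("summarise", ticket="My invoice total looks wrong...")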

HOW TO TEST LLMs

Build an evaluation dataset -

  • Start incrementally — begin with ad-hoc prompts and gradually organise them into a small dataset that can be used to benchmark/evaluate the model on the task
  • Ask an LLM for help — LLMs are good at generating input-output test cases, and an LLM can also act as an auto-evaluator (a sketch follows this list)
  • Build it gradually as the task/functionality expands — add more diverse data, e.g. underrepresented requests and topics, answers from another model that users disliked, and edge-case questions from users
  • Quantify the quality of the test set — a bit arbitrary as of now, but coverage of the test set is a good starting point: it should cover a reasonable range of what will be seen in production, such as the most popular queries, edge cases, and unclear-intent questions
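
A sketch of how such a dataset could be grown and scored, assuming a JSONL file of prompt/reference pairs and reusing the hypothetical complete() helper from the earlier sketch; the LLM-as-judge check is only a rough proxy, not a robust evaluation.

    # Incrementally grown eval set stored as JSONL (file name and fields are
    # illustrative); judge() uses an LLM as a rough auto-evaluator and relies
    # on the complete() helper sketched earlier.
    import json
    from pathlib import Path

    EVAL_FILE = Path("eval_cases.jsonl")  # one {"prompt": ..., "reference": ...} per line

    def add_case(prompt: str, reference: str) -> None:
        """Append a new test case, e.g. an underrepresented or edge-case query."""
        with EVAL_FILE.open("a") as f:
            f.write(json.dumps({"prompt": prompt, "reference": reference}) + "\n")

    def judge(prompt: str, answer: str, reference: str) -> bool:
        """Ask an LLM whether the candidate answer matches the reference."""
        verdict = complete(
            f"Question: {prompt}\nReference answer: {reference}\n"
            f"Candidate answer: {answer}\n"
            "Does the candidate answer match the reference? Reply YES or NO."
        )
        return verdict.strip().upper().startswith("YES")

    def run_eval() -> float:
        """Score the current model/prompt over the whole eval set."""
        cases = [json.loads(line) for line in EVAL_FILE.read_text().splitlines()]
        passed = sum(judge(c["prompt"], complete(c["prompt"]), c["reference"]) for c in cases)
        return passed / len(cases)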

DEPLOYING & MONITORING

Improving the output of LLMs in production:

  • Self-critique — ask another LLM whether the answer is correct. E.g., the Guardrails library can achieve this (a sketch follows this list)
  • Give multiple answers to the user and accept the one that the user chooses as relevant
  • Give multiple answers and then combine the outputs into one
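
A bare-bones sketch of the self-critique idea (the Guardrails library offers a more structured way to do this); it reuses the hypothetical complete() helper from earlier and simply feeds the critique back into a retry.

    # Self-critique sketch: a second LLM call checks the first answer and, if
    # it finds a problem, the critique is fed back into a regeneration attempt.
    def answer_with_self_critique(question: str, max_attempts: int = 2) -> str:
        answer = complete(question)
        for _ in range(max_attempts):
            verdict = complete(
                f"Question: {question}\nProposed answer: {answer}\n"
                "Is this answer correct and complete? Reply OK, or explain the problem."
            )
            if verdict.strip().upper().startswith("OK"):
                break
            answer = complete(f"{question}\n\nAvoid this problem: {verdict}")
        return answer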

What goes wrong with productionizing LLMs:

  • Difficult-to-use UI
  • Latency-dependent wait times
  • Hallucinations
  • Overly long answers that dodge or miss the point of the question
  • Toxicity in the generated results
  • Prompt injection attacks

CONTINUOUS IMPROVEMENTS

Any ML model is subject to continuous improvement over time, and the possibilities for doing so are well documented. For LLMs, there are 2 main ways to keep improving:

  • Account for user feedback to make the prompts better (a sketch follows this list)
  • Fine-tune the model, if possible, for a highly specific use case
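
A minimal sketch of the feedback side of this loop, assuming a simple JSONL log with an illustrative schema: disliked outputs can later be reviewed, turned into eval cases, or used to refine the prompts.

    # Feedback log sketch (schema and file name are illustrative): record each
    # interaction with a thumbs-up/down signal for later review.
    import json
    import time
    from pathlib import Path

    FEEDBACK_LOG = Path("feedback.jsonl")

    def log_feedback(prompt: str, answer: str, liked: bool) -> None:
        record = {"ts": time.time(), "prompt": prompt, "answer": answer, "liked": liked}
        with FEEDBACK_LOG.open("a") as f:
            f.write(json.dumps(record) + "\n")

    def disliked_examples() -> list[dict]:
        """Candidates for prompt fixes, new eval cases, or fine-tuning data."""
        records = [json.loads(line) for line in FEEDBACK_LOG.read_text().splitlines()]
        return [r for r in records if not r["liked"]]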

Footnotes

To read the part 1 article of this series — Click here

To read the part 2 article of this series — Click here

To read the part 3 article of this series — Click here

Note — This article is a distilled consolidation of my understanding of the topic. If you find any conceptual errors, please leave feedback so that I can fix them. Cheers!
