LLMOps vs MLOps
LLMOps: similar to, yet different from, MLOps
While there is great interest in Large Language Models (LLMs) and the possibilities they offer, operationalizing LLMs requires dedicated effort and a different set of skills and tools than traditional machine learning (ML). Although MLOps is well established, discussions have emerged around LLMOps. In this blog, I aim to differentiate and draw parallels between the two, leaving it to your judgment whether to call it LLMOps or to consider MLOps sufficient.
I will delve into the stages of LLMOps, highlighting their unique aspects and how they differ from traditional MLOps. Given the early adoption phase of LLM technology and the practical limitations of development, it is reasonable to assume that most of us will use foundation models and fine-tune them rather than build large language models from scratch. Thus, I will start from choosing a foundation model and skip the traditional data pipeline and foundation-model building stages.
LLMOps is a discipline that focuses on the operational aspects of deploying and maintaining LLMs. This includes tasks such as:
- Choosing the right foundation model
- Prompt engineering
- Evaluating LLMs
- Deploying LLMs
- Monitoring LLMs
1. Choose your horse (Foundation Model):
Factors such as training data volume, the number of model parameters, and the type of data (text, code, images, human instructions, feedback) affect model selection. These considerations, combined with your specific task, will help determine the appropriate foundation model. Foundation models can be loosely classified into proprietary models and open-source models.
- Proprietary models (e.g., GPT-4, PaLM, Claude, Command-medium) are usually accessed through APIs, resulting in less infrastructure overhead. As of today, they generally perform better across a wide variety of general-purpose tasks.
- Open-source models:
Open-source models require more infrastructure effort to build serving services, and licensing constraints may limit their usage for specific use cases. However, task-specific open-source models are lightweight and a great way to start.
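To make the distinction concrete, here is a rough sketch of the two access patterns in Python. The OpenAI client and Hugging Face Transformers are used purely as illustrations, and the model names and prompts are placeholders, not recommendations.

```python
# Proprietary model behind a vendor API: no serving infrastructure to run yourself.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4",  # placeholder model name
    messages=[{"role": "user", "content": "Summarize MLOps in one sentence."}],
)
print(response.choices[0].message.content)

# Open-source model served from your own infrastructure; "gpt2" is just a small
# stand-in that runs locally. A real deployment would pick a task-appropriate model.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("MLOps is", max_new_tokens=40)[0]["generated_text"])
```

With the API route, the operational burden shifts toward cost, rate limits, and vendor dependence; with the open-source route, it shifts toward serving, scaling, and licensing.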
2. Prompt Engineering:
Prompt engineering is the practice of developing and optimizing prompts to efficiently use language models for a variety of applications.
Prompt Design focuses on providing instructions and context to the model to achieve the desired task.
Prompt engineering involves experimenting with different prompts, which can be tracked through experiment-tracking tools or simply with Git. Experiment-tracking tools like TruLens, MLflow, Comet, and Weights & Biases offer non-technical users a friendly interface and the ability to learn from previous experiments. Just as in deep learning, tracking experiments in prompt engineering can have a significant impact.
Depending on the task, LLM outputs require subjective evaluation by humans, so recording acceptability ratings for answers becomes a crucial aspect unique to LLMOps, unlike traditional MLOps, which relies on objective metrics.
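As a minimal sketch of what prompt-experiment tracking can look like, the snippet below logs a prompt variant, the model settings, the response, and a human acceptability rating with MLflow. The prompt, question, model name, and rating value are all made up for illustration.

```python
import mlflow
from openai import OpenAI

client = OpenAI()
prompt_template = "You are a support agent. Answer concisely:\n\n{question}"
question = "How do I reset my password?"  # example input, not from a real dataset

with mlflow.start_run(run_name="prompt-v2"):
    # Record the prompt variant and model settings so runs can be compared later.
    mlflow.log_param("model", "gpt-4")
    mlflow.log_param("temperature", 0.2)
    mlflow.log_text(prompt_template, "prompt_template.txt")

    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.2,
        messages=[{"role": "user", "content": prompt_template.format(question=question)}],
    )
    answer = response.choices[0].message.content
    mlflow.log_text(answer, "answer.txt")

    # Subjective human acceptability rating (say, 1-5) logged like any other metric.
    mlflow.log_metric("human_acceptability", 4)
```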
3. Evaluating LLMs
Evaluating LLMs is vital to ensure the effectiveness of prompts and the impact of fine-tuning. It involves measuring performance across a wide range of user inputs and data distributions. LLM evaluation differs from traditional ML model testing, as access to the training data is typically unavailable when using foundation models. Additionally, evaluating accuracy is not straightforward, since responses are often free-form text and highly context- and behavior-driven. The design of evaluation metrics can vary based on experimental needs.
There are multiple ways to design an evaluation-metrics mechanism depending on your experiment; I will expand on those in later posts. The references at the end of this post are a great read to get started.
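To give a feel for the basic shape of such an evaluation loop, here is a minimal sketch. The test cases, the model choice, and the keyword-overlap scoring rule are all assumptions for illustration; in practice you might use human ratings, an LLM-as-judge, or task-specific checks instead.

```python
from openai import OpenAI

client = OpenAI()

def ask(question: str) -> str:
    """Query the model under evaluation (GPT-4 used here only as an example)."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

# A tiny hand-built test set; the inputs and expected phrases are made up.
test_cases = [
    {"input": "What is our refund window?", "must_mention": ["30 days"]},
    {"input": "Do you ship internationally?", "must_mention": ["international"]},
]

def score(answer: str, must_mention: list[str]) -> float:
    """Fraction of required phrases appearing in the answer (a crude proxy metric)."""
    hits = sum(phrase.lower() in answer.lower() for phrase in must_mention)
    return hits / len(must_mention)

scores = [score(ask(case["input"]), case["must_mention"]) for case in test_cases]
print(f"mean score: {sum(scores) / len(scores):.2f}")
```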
4. Deployment
Deploying LLM APIs can be straightforward, but complexity arises when intricate logic is involved in the API calls. Techniques such as self-evaluation, generating multiple outputs through sampling, and ensemble or swarm approaches can enhance the quality of LLM outputs.
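As one example of the sampling idea, the sketch below draws several completions at a higher temperature and returns the most common answer (a simple self-consistency vote). The model name and prompt are placeholders, and real deployments often add a separate self-evaluation step where the model scores its own candidates.

```python
from collections import Counter
from openai import OpenAI

client = OpenAI()

def answer_with_majority_vote(question: str, n: int = 5) -> str:
    """Sample several completions and return the most frequent one."""
    resp = client.chat.completions.create(
        model="gpt-4",    # placeholder model name
        temperature=0.8,  # higher temperature to get diverse samples
        n=n,              # request several samples in a single call
        messages=[{"role": "user", "content": question}],
    )
    answers = [choice.message.content.strip() for choice in resp.choices]
    return Counter(answers).most_common(1)[0][0]

print(answer_with_majority_vote("Is 1013 a prime number? Answer yes or no."))
```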
5. Monitoring
Monitoring Language Models involves assessing user satisfaction, establishing performance metrics such as response length, and identifying prevalent production issues. Streamlined feedback methods, such as user ratings, can facilitate the collection of user feedback. It is important to note that this monitoring focuses on the performance of the serving mechanism rather than the actual model performance. This stage also serves as a means to identify potentially harmful responses. Compared with traditional ML, this stage is similar to model-endpoint performance monitoring.
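A minimal sketch of that serving-side monitoring might look like the following; the field names, rating scale, and plain stdlib logging are assumptions, and a real setup would more likely ship these signals to your observability or feedback stack.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("llm_serving")

def log_llm_interaction(prompt: str, response: str, latency_s: float,
                        user_rating: int | None = None) -> None:
    """Record serving-side signals: latency, prompt/response length, optional rating."""
    logger.info(
        "llm_call latency_s=%.2f prompt_chars=%d response_chars=%d user_rating=%s",
        latency_s, len(prompt), len(response), user_rating,
    )

# Example usage: wrap a model call and attach a thumbs-up (1) / thumbs-down (0) rating.
start = time.time()
response_text = "Sure, here is how to reset your password..."  # placeholder model output
log_llm_interaction("How do I reset my password?", response_text,
                    time.time() - start, user_rating=1)
```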
To summarize, all the aforementioned steps should be repeated across the development, controlled-user, and production environments, so these stages are cyclic in nature. Of the five stages, prompt engineering and evaluation stand out as distinct and intricate, with their individual workflows evolving and becoming more complex.
Certainly, you are entitled to call it LLMOps if you believe it adequately distinguishes the operational aspects specific to LLM development. Just as DevOps and MLOps have unique characteristics despite sharing similarities, LLMOps encompasses its own distinct considerations and complexities. It is essential to use terminology that accurately reflects the nuances of the field and fosters clear communication within the community.
References:
- https://www.linkedin.com/pulse/large-language-model-testing-comprehensive-overview-chris-clark/
- https://blog.google/inside-google/googlers/ask-a-techspert/what-is-generative-ai/
- https://hai.stanford.edu/sites/default/files/2023-03/Generative_AI_HAI_Perspectives.pdf
- https://www.technologyreview.com/2023/04/27/1072102/the-future-of-generative-ai-is-niche-not-generalized/
- https://openai.com/blog/best-practices-for-deploying-language-models
- https://fullstackdeeplearning.com/llm-bootcamp/spring-2023/llmops/
- https://www.comet.com/site/products/llmops/
- Imagen: https://cloud.google.com/vertex-ai/docs/generative-ai/image/overview