The Ultimate Must-Haves and Nice-to-Haves for MLOps & LLMOps

Başak Tuğçe Eskili
Published in Marvelous MLOps
6 min read · Jul 1, 2024

In one of our first articles, we shared the essential components required for MLOps, which can serve as a guide for getting started. This is particularly useful for large corporations that require a more streamlined approach and benefit from a “golden path” framework.

MLOps is not limited to those components; in fact, there are hundreds of tools for different purposes in the Data & AI ecosystem that can enhance your MLOps architecture. Still, it’s advisable to stay at the fundamental level and add only the necessary pieces. Especially in big organizations, where incorporating a new tool can be difficult and slow, using existing tools is more efficient.

In this article, we’ll give a more comprehensive version of the MLOps and LLMOps toolbelt and cover all critical components required for deploying and managing machine learning models and LLMs in a production environment. In this context, a tool or a service means a component in your full architecture.

1. Version Control

Version Control is important for the traceability and reproducibility of the code base for ML models. It allows you to easily pinpoint the exact code responsible for a specific run or error. It facilitates teamwork and collaboration. Version control systems also provide features such as branch protection rules and approval processes to ensure code quality.

Examples: GitHub, GitLab, BitBucket, Azure DevOps.

2. CI/CD

Continuous integration and continuous deployment (CI/CD) pipelines are essential for seamless workflows. Use them to automate tests and enforce code quality and consistency before deployment. Deployment to production should only occur through the CD pipeline, maintaining a controlled and reliable release process.

Examples: GitHub Actions, GitLab CI, Azure Pipelines, Jenkins, CircleCI.
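
To make this concrete, here is a sketch of the kind of check a CI pipeline runs on every commit: a pytest-style unit test guarding a preprocessing function. The clean_features function and its inline implementation are illustrative; in practice you would import it from your project package.

```python
# A sketch of a CI quality gate: a pytest unit test for a
# (hypothetical) preprocessing step.
import pandas as pd


def clean_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows with missing values (illustrative implementation)."""
    return df.dropna()


def test_clean_features_drops_missing_rows():
    raw = pd.DataFrame({"age": [25, None, 40], "income": [50000, 60000, None]})
    cleaned = clean_features(raw)
    assert cleaned.isna().sum().sum() == 0  # no missing values survive
    assert len(cleaned) == 1                # only the complete row remains
```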

3. Workflow orchestration

An end-to-end ML cycle contains multiple steps, such as preprocessing, feature engineering, model training, and model deployment. A workflow orchestrator helps to manage dependencies between these steps, automate tasks, and ensure that tasks run in the correct order.

Examples: Apache Airflow, Databricks Workflows, AWS Step Functions.
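
As an illustration, here is a minimal Airflow DAG sketch that wires three steps in order. The task bodies are placeholders, and the syntax targets recent Airflow 2.x versions.

```python
# A minimal sketch of an ML pipeline as an Airflow DAG; the task
# functions are placeholders for real preprocessing/training/deployment.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def preprocess():
    ...  # load raw data, build features


def train():
    ...  # fit the model, log it to the registry


def deploy():
    ...  # promote the model to serving


with DAG(
    dag_id="ml_training_pipeline",
    start_date=datetime(2024, 7, 1),
    schedule="@daily",  # Airflow 2.4+; earlier versions use schedule_interval
    catchup=False,
):
    t1 = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t2 = PythonOperator(task_id="train", python_callable=train)
    t3 = PythonOperator(task_id="deploy", python_callable=deploy)

    t1 >> t2 >> t3  # dependencies enforce the execution order
```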

4. Model Registry & experiment tracking

For proper model management and deployment, it is important to store and version your trained model artifacts with their associated metadata. This way, you can track different versions of your models, reproduce previous experiments, and maintain consistency across different environments (development, acceptance, production). Versioning also aids debugging and provides a clear history of changes.
Experimentation allows data scientists to try different algorithms and hyperparameters to optimize model performance. It is an iterative process of developing and testing models while keeping detailed records of each experiment and its parameters. With experiment tracking, each run can be reproduced and compared to others.

Examples: MLflow, Neptune.ai, Weights & Biases, Comet.ml.
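
For example, with MLflow a training run can log its parameters and metrics and register the resulting model under a versioned name. This is a minimal sketch: the model name and dataset are illustrative, and registering a model assumes a tracking server with model registry support.

```python
# A minimal MLflow sketch: track a run's parameters and metrics, then
# register the model. "churn-classifier" is an illustrative name.
import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=42)

with mlflow.start_run():
    mlflow.log_param("n_estimators", 100)
    model = RandomForestClassifier(n_estimators=100).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    # Logs the artifact and creates a new version in the model registry
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="churn-classifier",
    )
```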

5. Container Registry

In some use cases, you need to store and manage Docker images, which are used in your ML workflows. These images could be used for model training, testing, and serving. With versioned Docker images, you ensure the environment stays consistent across different stages of the ML lifecycle, supporting reproducibility and scalability.

Examples: Azure Container Registry, Docker Hub, Amazon ECR.
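
As a sketch, a versioned image can be built and pushed with the Docker SDK for Python (pip install docker); the registry URL and tag here are assumptions, and the docker CLI achieves the same with `docker build` and `docker push`.

```python
# A hedged sketch of building and pushing a versioned training image
# with the Docker SDK for Python; registry URL and tag are assumptions.
import docker

client = docker.from_env()

# Build the image from the Dockerfile in the current directory
image, _ = client.images.build(path=".", tag="myregistry.azurecr.io/train:1.0.0")

# Push the versioned tag to the registry (assumes prior authentication)
client.images.push("myregistry.azurecr.io/train", tag="1.0.0")
```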

6. Compute (Model training & serving)

To run your processing, training, and evaluation scripts and to serve your model in real-time use cases, you need a compute resource. It can be on-premises or in the cloud. The most important requirement is that your scripts should run consistently across different environments — development, acceptance, production, etc. — without requiring modifications.

Examples: Azure ML, Amazon SageMaker, Google Vertex AI, Databricks, Kubernetes.
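
For the real-time serving side, a minimal sketch with FastAPI might look like the following; the model artifact and feature names are hypothetical. Packaged in a Docker image, the same service runs identically across environments.

```python
# A minimal real-time serving sketch with FastAPI; the model artifact
# and feature names are hypothetical. Run with: uvicorn app:app
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical trained artifact


class Features(BaseModel):
    age: float
    income: float


@app.post("/predict")
def predict(features: Features):
    prediction = model.predict([[features.age, features.income]])
    return {"prediction": prediction.tolist()}
```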

7. Feature Store & Serving

A feature store is a central repository for managing and serving features used in ML models. It enables the reuse of features across different models and teams, ensures consistency, supports large datasets, and handles high query volumes.

Examples: Feast, Hopsworks, Databricks Feature Store, Amazon SageMaker Feature Store.
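
As an illustration, fetching features at inference time with Feast looks roughly like this; the feature view, feature names, and entity key are assumptions.

```python
# A hedged sketch of online feature retrieval with Feast; the feature
# view ("customer_stats"), feature names, and entity key are assumptions.
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # points at a Feast feature repo

features = store.get_online_features(
    features=[
        "customer_stats:avg_order_value",
        "customer_stats:orders_last_30d",
    ],
    entity_rows=[{"customer_id": 1001}],
).to_dict()
```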

8. Monitoring

Monitoring ML systems requires more than monitoring standard software applications. It involves regular checks on model performance to catch and address unexpected predictions by tracking metrics such as model accuracy, latency, and relevant KPIs. Setting up alerts and creating dashboards to monitor system health and performance is always good practice, ensuring that any issues are promptly identified and addressed.

Examples: ELK Stack, Splunk, Prometheus + Grafana, Amazon SageMaker Model Monitor.
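
For instance, a serving process can expose metrics with the official Prometheus Python client, which Prometheus then scrapes and Grafana visualizes. The metric names below are illustrative and the sleep stands in for real inference.

```python
# A sketch of instrumenting a model-serving process with the official
# Prometheus Python client; metric names are illustrative.
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Number of predictions served")
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency")


@LATENCY.time()
def predict(features):
    PREDICTIONS.inc()
    time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference
    return 0.5


if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrapes :8000/metrics
    while True:
        predict([1.0, 2.0])
```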

9. Labeling

Labeled data is required for ML models, especially for supervised learning tasks, and it is not always available out of the box. The quality and accuracy of labels have a direct impact on the performance of the model. Labeling tools can provide a convenient interface and quality assurance, and allow for collaboration between multiple annotators.

Examples: Amazon SageMaker Ground Truth, Labelbox, Scale AI.

10. Responsible AI

Responsible AI is an approach that considers ethical and legal points when developing and deploying artificial intelligence (AI). The goal is to create safe, reliable, and ethical AI applications.

For small or large organizations, implementing Responsible AI within the end-to-end ML cycle ensures compliance with regulations, mitigates risks and biases, prevents unfair AI systems, and builds public trust. This is not only important for protecting an organization’s reputation; it also shapes how AI is perceived by society.

Examples: Guardrails AI, Arthur, Fiddler, Amazon Bedrock Guardrails.

11. Vector Database

With the increased popularity of LLMs, vector databases have become indispensable for many LLM-based use cases. A vector database stores data as mathematical representations, also known as embeddings, alongside their metadata. There are many vector database providers.

Examples: Qdrant, Weaviate, OpenSearch, Pinecone.
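
As a sketch, storing and querying embeddings with the Qdrant Python client looks roughly like this (API at the time of writing). The toy 4-dimensional vectors are stand-ins; a real application would use embeddings produced by a model.

```python
# A hedged sketch of storing and querying embeddings with qdrant-client;
# the collection name, payloads, and toy vectors are illustrative.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

client = QdrantClient(":memory:")  # in-memory instance for local testing

client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)
client.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=1, vector=[0.1, 0.2, 0.3, 0.4], payload={"title": "intro"}),
        PointStruct(id=2, vector=[0.4, 0.3, 0.2, 0.1], payload={"title": "faq"}),
    ],
)
hits = client.search(collection_name="docs", query_vector=[0.1, 0.2, 0.3, 0.4], limit=1)
print(hits[0].payload)  # nearest neighbour with its metadata
```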

12. Model Hub

A model hub is a collection of pre-trained models or endpoints that can be used directly in an LLM application. These models, whether open-source or proprietary, are provided by different organizations. They are often preferred over training a large language model in-house, because in-house training requires significant time, computational resources, and huge amounts of data.

Examples: Amazon SageMaker JumpStart, Amazon Bedrock, Hugging Face, GitHub.
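
For example, with Hugging Face a pre-trained model can be pulled from the hub and used in a few lines; gpt2 is a small public model chosen purely for illustration.

```python
# A minimal Hugging Face sketch: use a pre-trained model from the hub
# instead of training one in-house.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("MLOps is", max_new_tokens=20)[0]["generated_text"])
```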

13. Human in the Loop

In some ML systems, involving humans in the decision-making or validation process can improve performance, especially where the model’s predictions or actions are uncertain, risky, or require human judgment. Large Language Models (LLMs) particularly benefit from this approach. A related technique is Reinforcement Learning from Human Feedback (RLHF), where human-provided feedback is used to fine-tune models.

Examples: Amazon A2I, Amazon SageMaker Ground Truth.
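
A common human-in-the-loop pattern is to route low-confidence predictions to a review queue. A plain-Python sketch, where the threshold, model interface, and in-memory queue are hypothetical:

```python
# A sketch of confidence-based routing for human review; the threshold,
# model interface, and in-memory queue are illustrative assumptions.
CONFIDENCE_THRESHOLD = 0.8
review_queue = []  # in practice: a labeling task in A2I or similar


def predict_with_review(model, features):
    """Return a label, or None when the case is deferred to a human."""
    confidence = max(model.predict_proba([features])[0])
    label = model.predict([features])[0]
    if confidence < CONFIDENCE_THRESHOLD:
        review_queue.append({"features": features, "model_label": label})
        return None  # a human annotator makes the final call
    return label
```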

14. LLM Monitoring

Monitoring and analyzing LLM applications is necessary to ensure their performance and security. Tracking prompts, responses, latency, and cost improves the explainability of models and helps detect issues quickly by offering end-to-end visibility.

Examples: LangCheck, HoneyHive, Langtrace AI.
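
Tool APIs vary, but the underlying signals are simple. Here is a plain-Python sketch of a wrapper that records latency, payload sizes, and failures for every LLM call; the wrapped client function is a hypothetical stand-in and is assumed to take and return strings.

```python
# A plain-Python sketch of the signals an LLM monitoring tool collects:
# per-call latency, prompt/response sizes, and errors. The wrapped
# function is a hypothetical stand-in for your LLM client.
import functools
import logging
import time

logger = logging.getLogger("llm_monitoring")


def monitored(fn):
    @functools.wraps(fn)
    def wrapper(prompt, **kwargs):
        start = time.perf_counter()
        try:
            response = fn(prompt, **kwargs)  # assumed str -> str
            logger.info(
                "llm_call ok latency=%.2fs prompt_chars=%d response_chars=%d",
                time.perf_counter() - start, len(prompt), len(response),
            )
            return response
        except Exception:
            logger.exception("llm_call failed after %.2fs",
                             time.perf_counter() - start)
            raise
    return wrapper
```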

15. Prompt Engineering

Prompt engineering is the practice of designing the inputs, mainly for LLMs, to produce optimal outputs. Even though LLMs mimic human answers, adapting the instructions helps to get higher-quality, more relevant, and more secure answers.

Examples: Promptflow, MLflow.
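
For example, a reusable template that fixes the role, output format, and guard clauses tends to produce more consistent answers than a free-form question. The ticket-summary task below is illustrative.

```python
# A minimal prompt-engineering sketch: a reusable template that pins
# the role, output format, and guard clauses. The task is illustrative.
SUMMARY_PROMPT = """You are a support analyst. Summarize the ticket below.

Rules:
- Answer in at most 3 bullet points.
- If the ticket contains no actionable issue, reply "NO ACTION".
- Do not include personal data in the summary.

Ticket:
{ticket_text}
"""

prompt = SUMMARY_PROMPT.format(ticket_text="Customer cannot reset password...")
```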

16. LLM Frameworks

LLM frameworks provide the architecture and software tooling to develop and deploy LLM-based applications. They reduce complexity and make it easier to build AI applications on top of LLMs.

Examples: LlamaIndex, LangChain, Hugging Face Agents.
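
As a hedged sketch (LangChain’s API evolves quickly; this follows the LCEL style of recent versions, and the model name is an assumption), composing a prompt template with a model looks like this:

```python
# A hedged LangChain sketch in the LCEL style; the model name is an
# assumption and OPENAI_API_KEY is expected in the environment.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Explain {concept} to a data scientist in two sentences."
)
llm = ChatOpenAI(model="gpt-4o-mini")

chain = prompt | llm  # compose template and model into one runnable
print(chain.invoke({"concept": "feature stores"}).content)
```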

Conclusion

As we said before, we are not fans of end-to-end “MLOps tools” that claim to do it all. In essence, it is all about how you use these tools and integrate them with other systems. In any organization, introducing even a single new tool can result in complicated sourcing and security discussions. Investing in MLOps is always worthwhile, in both money and time, but it is also important to keep the investment at the most efficient level.

The goal is not to create a complex architecture but to design a simple system with all the necessary functionalities.
