LLMOps: My Thesis & Market Map

Rachit Kansal
12 min read · Jul 1, 2023



Table of Contents

1) Introduction
2) Why does LLMOps matter now?
3) Issues with LLMs in production?
4) LLMOps Market Map
5) My predictions for LLMOps
6) Concluding Remarks

Introduction

In my previous article last year, I explored MLOps and highlighted four verticals that I believe offer exceptional investment opportunities.

However, over the past year, machine learning has moved at a breakneck pace, with many arguing that AI has finally crossed the inflection point that has been promised for decades. The emergence of Large Language Models (LLMs), notably exemplified by OpenAI’s ChatGPT, has captivated the public’s imagination. Generative AI has even shaken the enterprise market, with almost all companies, big or small, actively exploring avenues to integrate AI capabilities into their products and services. This is also reflected in the large amounts of money being poured into the generative AI ecosystem by VCs and corporates (such as Amazon’s new Generative AI fund).

(Source: Pitchbook Report)

As I have argued previously, the proliferation of any new technology relies heavily on the availability of robust tooling and infrastructure. In the world of enterprise technology, particularly managed services, the selling point goes beyond mere accuracy or an advanced feature set, as competitors can swiftly catch up in those aspects. The true differentiator is offering a superior developer experience: streamlined setup, effortless usage, and reduced overhead. In the case of Large Language Models (LLMs) and AI, there is a pressing need for precisely such a superior developer experience, along with higher accuracy.

AI Value Chain (Image by Author)

So, what is LLMOps? “LLMOps” encompasses everything that a developer needs to make LLMs work — development, deployment, and maintenance. Basically, it is a new set of tools and best practices to manage the lifecycle of LLM-powered applications.

Many companies and individuals are already using LLMs to power their applications: Notion (writing assistant), GitHub (programming assistant), Microsoft (office assistant), Michelle Huang (chatting with self), Bhaskar Tripathi (reading PDFs), and many more! And as expected, taking LLMs into production is not easy. As Chip Huyen puts it:

“It’s easy to make something cool with LLMs, but very hard to make something production-ready with them.”

Why does LLMOps matter now?

LLMs have evolved at a breakneck pace since Google published the original Transformer paper in 2017. In fact, you can look at this amazing infographic to learn more about the evolution of LLMs to date.

Rise of LLMs (Image by Author)

The case I’m trying to make is that LLMs have come of age — even with several chinks in their armor, today’s LLMs are good enough for many tasks. Hence, to enable the usage of LLMs by businesses and individuals, we need robust tools and platforms.

But before diving into the challenges and opportunities in LLMOps, I want to point out my inherent assumptions while thinking about this space:

  • I see a future where LLMs go beyond just generating text, images, music, etc. to directly calling APIs, executing code, or modifying system resources — LLMs will become the new interaction layer for software (see the sketch after this list)
  • LLMs will generate and trigger complex, dynamic workflows
  • LLMs will integrate, run inference, and work with each other, across modalities such as text, image, code, audio, and video

As a result, LLM Infrastructure is a space that is ripe for innovation and consequently investment!
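
To make the first assumption above concrete, here is a minimal, illustrative sketch of an LLM acting as an interaction layer: the model proposes a tool call, the application executes it, and the result is fed back into the next prompt. The call_llm function, the tool registry, and the canned outputs are all placeholders rather than any particular vendor's API.

```python
# Sketch: LLM as an interaction layer via a simple tool-calling loop.
import json

def get_weather(city: str) -> str:
    # Stand-in for a real external API call
    return f"Sunny in {city}"

TOOLS = {"get_weather": get_weather}

_canned_outputs = iter([
    '{"tool": "get_weather", "args": {"city": "Paris"}}',
    "It is sunny in Paris today.",
])

def call_llm(prompt: str) -> str:
    # Placeholder for any chat-completion API; replays canned outputs so the
    # sketch runs end to end without a real model.
    return next(_canned_outputs)

def run_agent(user_request: str, max_steps: int = 5) -> str:
    context = user_request
    for _ in range(max_steps):
        response = call_llm(context)
        try:
            call = json.loads(response)                  # model asked for a tool
            result = TOOLS[call["tool"]](**call["args"])
            context += f"\nTool {call['tool']} returned: {result}"
        except (json.JSONDecodeError, KeyError, TypeError):
            return response                              # plain text = final answer
    return context

print(run_agent("What is the weather in Paris?"))
```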

Issues with LLMs in production?

  1. LLMs are expensive: Large language models are very expensive to train, since they require constant experimentation and re-training on new datasets (OpenAI’s GPT models were initially trained on data only up to 2021) to prevent the model from going stale. More importantly, it is the inference costs that are steep (Google could lose $30B)!
  2. Fine-tuning is hard: Only a few companies are mature enough to continually fine-tune their models and keep their data pipelines healthy, especially in today’s world where most data is shared across code, services, product teams, or even organizations. For all their goodness, LLMs can hence become an architectural nightmare in production — you cannot just train once and deploy forever.
  3. LLMs hallucinate: This is a fancy way of saying — LLMs can lie! This is a major concern (especially in an era where fake news, deepfakes, etc. are so common) because LLMs can spread vast amounts of misinformation (even ChatGPT) due to their proliferation. Further, trying to understand why hallucinations happen is hard because the way LLMs derive their output is a black box. However, at a high level, we know that data quality, the generation method, and the input context affect hallucinations.
  4. Scale and latency: Client-side orchestration is a much easier problem to solve than server-side orchestration. The real challenge is solving for the massive scale requirements of modern applications. Imagine having to train and deploy an LLM in a distributed setting with caching, throttling, authn/authz, and other critical enterprise features to provide adequate SLAs (required for any large application) in terms of API responsiveness and throughput — it is not trivial! (A minimal caching sketch follows this list.)
  5. Privacy and security: We have already seen multiple instances of security concerns over LLMs (e.g., the Samsung leak). Prompt injection has also become a popular and effective technique to bypass the rudimentary security of LLMs. Enterprise adoption will not gain steam without much stronger security measures across the LLM stack.
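
As a small illustration of points 1 and 4 above, here is a minimal sketch of caching identical requests so that repeat prompts do not trigger repeat inference. The call_llm function is a stand-in for any completion API; a production setup would use a shared cache (e.g., Redis) and likely semantic caching over embeddings rather than an in-process dictionary.

```python
# Sketch: exact-match response cache to cut repeat-inference cost and latency.
import hashlib

_CACHE: dict[str, str] = {}

def call_llm(prompt: str, model: str = "some-model") -> str:
    return f"response to: {prompt}"  # placeholder for a real, slow API call

def complete(prompt: str, model: str = "some-model") -> str:
    # Cache keyed on a hash of (model, prompt); only identical prompts hit.
    key = hashlib.sha256(f"{model}|{prompt}".encode()).hexdigest()
    if key not in _CACHE:
        _CACHE[key] = call_llm(prompt, model)
    return _CACHE[key]

print(complete("Summarize our refund policy."))
print(complete("Summarize our refund policy."))  # served from cache
```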

LLMOps Market Map

As I continue following the evolution of large language models, startups are developing innovative products across the entire infrastructure stack (the LLM stack itself has evolved from the traditional NLP stack). However, not all sub-spaces within the infrastructure stack are equally exciting. Below, I lay out my map of the LLMOps space and my predictions for it!

LLMOps Market Map (Image by Author)

Client-side Orchestration

A. Frameworks: These are products that help developers orchestrate the client-side integrations of a generative AI application. This would include tools that help developers hook deployed models up to external software APIs and other foundation models, and facilitate end-user interactions. Client frameworks also provide mechanisms that enable end-users to chain prompts with external APIs. Such frameworks help break an overarching task into a series of smaller subtasks, map each subtask to a single step for the model to complete, and use the output from one step as input to the next. Further, these frameworks can provide marketplaces of pre-built client-side workflows and integrations.
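
As a rough illustration of such chaining, the sketch below splits a task into steps and feeds each step's output into the next prompt. The call_llm function is a placeholder for any completion API, and the steps are made up for the example.

```python
# Sketch: prompt chaining, where each step's output becomes the next input.
def call_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"  # placeholder for a real API

def summarize_and_translate(document: str) -> str:
    steps = [
        "Extract the key facts from this text:\n{input}",
        "Write a three-sentence summary of these facts:\n{input}",
        "Translate this summary into French:\n{input}",
    ]
    current = document
    for template in steps:
        current = call_llm(template.format(input=current))
    return current

print(summarize_and_translate("Long customer support transcript goes here..."))
```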

B. Prompt Management: First up, what is prompt engineering? It is a technique to tweak your intent, i.e., your questions to the LLM, such that the output matches your expectations as closely as possible. The OpenAI Cookbook provides a bunch of tips to improve your prompts. Now, back to why prompt management is critical. I believe that applications will be composed of multiple LLMs, enabling developers to select the most appropriate model for their specific task and other considerations such as domain knowledge, speed, cost, etc. In fact, there is a growing consensus that applications will have a multi-modal architecture glued together by an orchestration layer. Prompts will be the central piece of such a layer, and thus there will be a need for prompt engineering tools that are flexible enough to accommodate a variety of use cases, easy to use (maybe low-code/no-code?), simple to evaluate, traceable and debuggable, capable of lifecycle management and versioning, and compatible with the plethora of language models.

Prompt Chaining (Image by Author)
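
Returning to prompt management, here is a minimal sketch of what lifecycle management and versioning could look like: templates are registered under a name and version so they can be evaluated, traced, and rolled back. The names and structure are illustrative only, not any particular product's API.

```python
# Sketch: a tiny versioned prompt registry.
from dataclasses import dataclass

@dataclass(frozen=True)
class PromptTemplate:
    name: str
    version: int
    template: str

    def render(self, **kwargs) -> str:
        return self.template.format(**kwargs)

REGISTRY: dict[tuple[str, int], PromptTemplate] = {}

def register(pt: PromptTemplate) -> None:
    REGISTRY[(pt.name, pt.version)] = pt

def latest(name: str) -> PromptTemplate:
    versions = [v for (n, v) in REGISTRY if n == name]
    return REGISTRY[(name, max(versions))]

register(PromptTemplate("support_reply", 1, "Answer politely: {question}"))
register(PromptTemplate("support_reply", 2, "Answer politely and cite sources: {question}"))
print(latest("support_reply").render(question="How do I reset my password?"))
```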

Vector/Data Management (Persistence + Embeddings)

An effective way to leverage LLMs is to generate embeddings from context, i.e., information, and then develop ML applications on top of these embeddings. The idea is to use these mathematical representations of text for common operations such as search, clustering, recommendations, anomaly detection, etc. This is done by running similarity checks on these mathematical vectors. However, these embeddings (which reduce complex texts into mathematical vectors) can become very large, since the documents/information may have thousands of tokens. Hence, we need vector databases to efficiently store and retrieve embeddings. The popularity of vector databases has gone up in recent times, driven partly by the AI hype. The data storage and retrieval for LLMs will continue to evolve as the generative AI space itself matures. Hence, there is scope for large-scale innovation and growth. We are already seeing a lot of VC money being poured into vector databases. It would be interesting to see which implementation comes out on top!

Illustration of Vector embeddings (Image by Redis)
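
To make the embedding workflow concrete, the sketch below turns texts into vectors and answers a query by cosine similarity. The embed function is a placeholder (a deterministic pseudo-embedding so the example runs without a real model); at scale, a vector database replaces the brute-force scan.

```python
# Sketch: embedding-based similarity search over a handful of documents.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Placeholder: a pseudo-embedding seeded by the text hash, used only so
    # the sketch runs without a real embedding model.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

docs = ["refund policy", "shipping times", "password reset steps"]
doc_vectors = np.stack([embed(d) for d in docs])

def search(query: str, k: int = 2) -> list[str]:
    scores = doc_vectors @ embed(query)          # cosine similarity (unit vectors)
    return [docs[i] for i in np.argsort(-scores)[:k]]

print(search("how do I get my money back?"))
```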

Experimentation

Training, fine-tuning, and inference of models are both hard and very costly. To put it in context, a back-of-the-envelope calculation illustrates how moving from traditional search to LLMs could lead to a ~65% reduction in operating income for Google. Hence, we need novel tools and techniques to reduce the costs of training, fine-tuning, and inference (as evidenced by this recent Stanford paper). Apart from cost, inference latency is critical for enterprise adoption. Deterministic applications serve APIs in microseconds; no one wants to wait several seconds just to get a response or trigger an action in an application. While training and inference are more straightforward, fine-tuning has issues beyond just performance. Fine-tuning involves updating the parameters of the underlying foundation model by retraining it on more data (which can also be a more targeted dataset as opposed to a general one). An appropriately fine-tuned model can increase prediction accuracy, improve model performance, and reduce training costs. However, fine-tuning a model is not as easy as it looks! If not done properly, it may even lead to worse outputs. Fine-tuning a model not only requires deep technical expertise, but also plenty of storage and compute resources. Fine-tuning too much can also introduce overfitting (i.e., the model gets too specialized on a specific dataset and fails to capture the overarching generic patterns) and even cause hallucinations. Hence, we need sophisticated tools that are easy to use, provide fine-tuning strategies, and offer evaluation methods to test those strategies.
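
As one small, hedged example of the kind of guardrail such tools could provide, the sketch below stops fine-tuning when validation loss stops improving, which limits the overfitting described above. The model, data loaders, and hyperparameters are assumed to exist and are placeholders, not a recipe for any specific LLM.

```python
# Sketch: fine-tuning loop with validation-based early stopping (PyTorch style).
import torch

def fine_tune(model, train_loader, val_loader, epochs=10, patience=2, lr=1e-5):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    best_val, stalled = float("inf"), 0
    for epoch in range(epochs):
        model.train()
        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(**batch).loss      # assumes the model returns a loss
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(model(**b).loss.item() for b in val_loader) / len(val_loader)

        if val_loss < best_val:
            best_val, stalled = val_loss, 0
            torch.save(model.state_dict(), "best_checkpoint.pt")
        else:
            stalled += 1
            if stalled >= patience:         # stop before the model overfits
                break
    return model
```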

Server-side Orchestration

Server-side orchestration includes all the pieces of code, i.e., the machinery that executes in the backend to run the model — deployment, training, inference, monitoring, and security.

A. Deployment: When thinking about leveraging foundation models, enterprises can either use managed models (such as OpenAI, Anthropic, Cohere, etc.) or deploy their own. However, deploying a model is non-trivial and costly. You need to scale the model architecture, upgrade models to newer versions, switch between multiple models, etc. Further, deploying and training models requires powerful, on-demand GPU infrastructure. Enterprises will have to weigh the pros and cons of cloud-based versus on-premise model deployment with respect to cost, latency, privacy, etc. Automated deployment pipelines (CI/CD) would power streamlined training, fine-tuning, and inference capabilities, as well as traditional software functionality such as upgrades and rollbacks with minimal user disruption.
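
As a rough sketch of what such a deployment layer might expose, the example below serves completions behind a stable API and allows switching the active model version (e.g., for a rollback) without redeploying. FastAPI is used purely for illustration; the model registry and generate call are placeholders.

```python
# Sketch: serving layer with a switchable active model version.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

MODELS = {"v1": "llm-v1-weights", "v2": "llm-v2-weights"}  # placeholder handles
ACTIVE_VERSION = "v2"

class CompletionRequest(BaseModel):
    prompt: str

def generate(model_handle: str, prompt: str) -> str:
    return f"[{model_handle}] output for: {prompt}"  # placeholder for real inference

@app.post("/complete")
def complete(req: CompletionRequest) -> dict:
    return {"model": ACTIVE_VERSION,
            "completion": generate(MODELS[ACTIVE_VERSION], req.prompt)}

@app.post("/rollback/{version}")
def rollback(version: str) -> dict:
    # Switch the active version without redeploying the service.
    global ACTIVE_VERSION
    if version in MODELS:
        ACTIVE_VERSION = version
    return {"active": ACTIVE_VERSION}
```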

B. Observability: In production systems, it is critical that we can observe, evaluate, optimize, and debug the code. With LLMs (or AI in general), the issue of observability gets exacerbated by their black-box nature. Observability involves tracking and understanding performance, including identifying failures, outages, and downtime, evaluating system health (or LLM health), and, in the case of LLMs, deciphering outputs — for example, explaining why the model came to a certain decision. However, LLMs present some unique challenges. Firstly, it is very challenging to determine what “good” performance actually means for the model. Here, one would probably need to analyze user interactions just to assess model performance. Further, closed-source black-box models are even harder to understand and explain since we don’t have access to the architecture or the training data. Hence, we need new testing and comparison frameworks such as “HELM” by Stanford, Evals by OpenAI, etc. to provide standardization. LLMs (and, in general, all machine learning models) also exhibit issues such as model drift — a deterioration in model performance due to changes in the underlying data distribution (aka stale data). Monitoring the model could help in updating it with fresh data to mitigate model drift. Thus, tracking model performance and usage is essential to debug potential issues, fine-tune the model, or even change the underlying model architecture.
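
As a minimal illustration, the sketch below wraps each LLM call and logs latency, rough token counts, and the prompt/response pair, which is the raw material for the kind of monitoring described above. The logging sink and call_llm are placeholders; real setups would ship these events to an observability platform.

```python
# Sketch: instrumenting LLM calls with structured logs.
import time
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("llm_observability")

def call_llm(prompt: str) -> str:
    return f"response to: {prompt}"  # placeholder for real inference

def observed_completion(prompt: str, model: str = "some-model") -> str:
    start = time.perf_counter()
    response = call_llm(prompt)
    event = {
        "model": model,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        "prompt_tokens": len(prompt.split()),       # crude proxy for token counts
        "response_tokens": len(response.split()),
        "prompt": prompt,
        "response": response,
    }
    log.info(json.dumps(event))
    return response

observed_completion("Summarize last quarter's incidents.")
```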

C. Privacy: Here, I’m using privacy as an overarching term for model safety, security, and compliance. With stringent privacy and security laws such as GDPR, CCPA, HIPAA, and many more across the globe, governments are putting privacy at center stage for any new technological innovation. For enterprises to trust and deploy generative AI models, they need tools that provide accurate evaluations of model fairness, bias, and toxicity (generating unsafe or hateful content), in addition to privacy guardrails. Enterprises are now increasingly concerned about extraction of training data, corrupted training data, and leakage of proprietary, sensitive data (e.g., the Samsung case). Apart from this, LLMs, just like traditional machine learning models, are prone to adversarial attacks. Hence, we need products that can protect against prompt injection, data leakage, and toxic language generation; provide data privacy through anonymization; provide access control (such as RBAC) for LLMs; implement adversarial training and defensive distillation; and much more. Such products can help in detecting anomalies and protect the production model by maintaining its integrity.
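
As one small example of a privacy guardrail, the sketch below redacts obvious PII (emails, phone numbers) from prompts before they leave the enterprise boundary. Real anonymization needs far more than two regexes; this only illustrates where such a filter sits in the call path.

```python
# Sketch: redacting basic PII from prompts before calling the model.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def call_llm(prompt: str) -> str:
    return f"response to: {prompt}"  # placeholder for real inference

def safe_completion(prompt: str) -> str:
    return call_llm(redact(prompt))

print(redact("Contact jane.doe@example.com or +1 (555) 123-4567"))
```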

Visualizing LLMOps (Image by Author)

My predictions for LLMOps

  1. Two winners in the vector database space: Just as we saw with SQL, document, graph, etc. databases — I foresee that two major players will emerge in the vector database space (one closed source and one open source player). While the database market is large (and thus has a long tail of players), we have always seen only two/three major players taking most of the market share.
  2. Multiple client-side orchestration frameworks will emerge: LLMs are still evolving at a rapid pace, and client orchestration frameworks will evolve with them. New and multiple approaches will emerge that enable developers to integrate APIs, auth systems, databases, etc. Prompt management might get rolled up to become a part of robust orchestration offerings. Ultimately, the products that provide an excellent developer experience with robust integration support will excel.
  3. Individual LLMOps products will converge, paving the way for robust platforms: Currently, most LLMOps products tackle one (or a few) aspects of the LLM stack, such as prompting, deployment, monitoring, etc. I feel that most of these startups will expand and converge to provide a broader range of capabilities — prompting, distributed training, deployment, monitoring, versioning, etc. together as an end-to-end solution. This means most startups that may not be competitors today will end up being competitors. This also means there will be significant M&A among startups (e.g., Databricks’ acquisition of MosaicML).
  4. Significant M&A activity in the LLM observability space: There are a bunch of startups working on the monitoring aspects of LLMs. However, individually they wouldn’t be as valuable to enterprises, since the LLM will be just one part of the overall tech stack. Hence, enterprises would look for complete observability and monitoring platforms, and I see players like Datadog, New Relic, Splunk, Elastic, etc. scooping up one (or more) LLM monitoring startups to bolster their monitoring portfolios.
  5. LLM Deployment startups will be crushed by cloud providers: Startups focusing on “only” the deployment of LLMs will struggle (including startups providing serverless GPUs). This is because I don’t see the economics of such a service working out, due to smaller scale and, more importantly, frequent GPU upgrade costs. Finally, big cloud providers have spent years providing compute at cheap rates, which will be hard to match. Further, most enterprises already have one or more cloud accounts and thus have stickiness to the corresponding cloud platform.

Concluding Remarks

Using LLMs in production is hard! We have to tackle both challenging and unknown issues. Further, if a particular task is simple enough, then it is hard to justify a more expensive, less explainable, and slower system over traditional solutions. Hence, as the generative AI space keeps maturing with newer innovations and disrupting the industry, we will need powerful LLM infrastructure to support the ecosystem. Thus, LLMOps is one of the most exciting spaces.

If you are investing in this space, ideating or building your own venture, or have thoughts about AI or this article, I’d love to hear from you!

Rachit Kansal

Disclaimer: The opinions expressed in this blog are solely those of the writer and not of this platform. The writer is not a member of or associated with any of the firms mentioned in the blog. The views in this blog are solely my own and do not represent any of my current or prior workplaces.

Additional references and reading material:

[1] https://home.mlops.community/public/events/llms-in-production-conference-2023-04-13

[2] https://shabazpatel.substack.com/p/rise-of-large-language-model-operations

[3] https://cobusgreyling.medium.com/prompt-chaining-large-language-models-a7c9b66eb5f9

[4] https://gradientflow.com/the-future-of-prompt-engineering-getting-the-most-out-of-llms/

[5] https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/

[6] https://duarteocarmo.com/blog/llms-lessons-learned

[7] https://recsysml.substack.com/p/staleness-of-recommendation-models

[8] https://speakerdeck.com/inesmontani/incorporating-llms-into-practical-nlp-workflows

[9] https://medium.com/@vbsowmya/llms-in-production-the-missing-discussion-ff8035f45a68

[10] https://twitter.com/sh_reya/status/1641106353971421185?s=46&t=SKZm3ndslERqEDd3XEsvyw

[11] https://machinelearningmastery.com/a-gentle-introduction-to-hallucinations-in-large-language-models/

[12] https://medium.com/@datasciencedisciple/why-do-large-language-models-hallucinate-d78dfac0f842
