Easily Deploy Multiple LLMs in a Cloud Native Environment

Published in Intel Tech · Apr 23, 2024 · 8 min read

Take the complexity out of deploying cloud native LLMs with LangChain and Intel Developer Cloud

By Arun Gupta and Ezequiel Lanza

Imagine you could give a personal assistant to each employee. Productivity across your organization would spike as employees see how AI can help them and feel empowered to focus on strategic thinking. This dream scenario is possible with powerful AI technology, such as department-specific chatbots that deliver fast, high-quality results.

However, businesses often need to weave together multiple large language models (LLMs) to support diverse use cases. Because each model may have different compute and storage needs, or domain-specific knowledge that internal departments use in unique ways, the complexity can quickly skyrocket.

The right set of tools can take the complexity out of the deployment process. Here we’ll explore a reference architecture for building and deploying multiple LLMs in a single user interface with Kubernetes and LangChain. You can also find a thorough explanation of each step in the corresponding GitHub repo, which is structured as an educational resource with recipe files for you to download and create your own containers.

We demo the complete steps for this approach in a KubeCon + CloudNativeCon Europe 2024 presentation.


Step 1: Define Your Model

Hugging Face makes it easy to download models, so you can start inferencing locally right away, but how do you know which LLM to choose? There are three important considerations to keep in mind as you evaluate your options, in addition to deciding whether you should deploy the model locally or externally (which we’ll address in the next step).

· Performance: Before downloading a model, you can see how it stacks up against industry benchmarks by comparing it to other models on the Hugging Face Leaderboard, a public ranking system of open LLMs.

· Community support: You don’t want to choose a model that no one uses or maintains. Look for a model with widespread community adoption, an active contributor base, and strong documentation with helpful resources like tutorials. These are all signs of a thriving community that can offer the help you need down the road.

· Ethical considerations: Sometimes models generate biased results around traits like ethnicity or gender. Choosing a model that trains on diverse data and is transparent about its processes can help you mitigate bias and ensure your results are fair.

Depending on your use case, you may also want to optimize your model. A tool like Intel® Extension for Transformers, for example, can help you shrink the memory needed to run a 7-billion-parameter Llama 2 chatbot model from 26 GB to 7 GB.
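As a rough sketch of what that optimization can look like, the snippet below loads a chat model with 4-bit weight-only quantization through Intel Extension for Transformers' drop-in replacement for the Transformers API. It assumes intel-extension-for-transformers and transformers are installed; the model ID is only an example (Llama 2 checkpoints on Hugging Face are gated, so substitute any causal LM you have access to).

from transformers import AutoTokenizer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

model_id = "meta-llama/Llama-2-7b-chat-hf"  # example ID; swap in your own model

tokenizer = AutoTokenizer.from_pretrained(model_id)
# load_in_4bit quantizes the weights on load, sharply reducing memory use
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

inputs = tokenizer("What can you help me with today?", return_tensors="pt").input_ids
outputs = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))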

Step 2: Choose a Consumption Model

Once you’ve picked a model, you need to consider where you’ll consume it. Local models can be stored on your internal servers or even laptops if they are optimized, while external models allow you to use LLMs hosted by a third party. Factors such as cost, storage capacity, and how you plan to handle sensitive data will help determine which consumption model is right for you.

For example, if your model will be used by financial or legal departments, which are often subject to data privacy regulations, you may need to use a local model so you can inference without ever sending sensitive data outside your organization. Local models give you more control over how and when you use them, enabling your team to use models offline and customize them to your business, a process known as fine-tuning. Local models can also offer cost efficiencies, as fee-based external models typically charge per inbound and outbound token: you pay both to send a query and to receive a response. You'll want to ensure that your prompts are well engineered and phrased precisely to draw out the correct answer. Otherwise, you'll be sending multiple prompts and inflating your fees.

However, external models have their own advantages: they require far less local compute power and storage. Say you have a 7-billion-parameter Llama model, which requires about 26 GB of RAM. To inference the model locally, you'd need sufficient storage space on your server plus enough compute power from local CPUs and GPUs to support both model inference and your application. With an external consumption model, however, you'd only need enough compute power to run your application — simply make an API call and start inferencing. Additionally, because a third party manages the infrastructure, external models are typically simpler to set up, faster to deploy, and easier to scale.
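The mechanics of external consumption really are that simple. As a hedged sketch, assuming your hosted provider exposes an OpenAI-compatible chat completions endpoint (the URL, model name, and environment variable below are placeholders, not a specific vendor's API), the call looks something like this:

import os
import requests

# Placeholder endpoint for a hypothetical hosted LLM provider
API_URL = "https://api.example-llm-provider.com/v1/chat/completions"
API_KEY = os.environ["LLM_API_KEY"]  # assumed to hold your provider key

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "example-7b-chat",  # placeholder model name
        "messages": [{"role": "user", "content": "Summarize our PTO policy."}],
    },
    timeout=60,
)
print(response.json()["choices"][0]["message"]["content"])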

You can use a pricing calculator to start estimating the cost of an external model based on the provider and number of input and output tokens you need.
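If you'd rather do a quick back-of-the-envelope estimate yourself, the math is straightforward. The per-token prices below are illustrative placeholders; substitute your provider's published rates and your own expected query volume.

# Rough monthly cost estimate for an external model (prices are placeholders)
PRICE_PER_1K_INPUT = 0.0005   # USD per 1,000 input tokens (example rate)
PRICE_PER_1K_OUTPUT = 0.0015  # USD per 1,000 output tokens (example rate)

def monthly_cost(queries_per_day, avg_input_tokens, avg_output_tokens, days=30):
    """Estimate monthly spend for a given query volume."""
    daily = queries_per_day * (
        avg_input_tokens / 1000 * PRICE_PER_1K_INPUT
        + avg_output_tokens / 1000 * PRICE_PER_1K_OUTPUT
    )
    return daily * days

# Example: 500 queries a day, ~400 input and ~300 output tokens each
print(f"${monthly_cost(500, 400, 300):.2f} per month")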

Step 3: Package Your Models

As you narrow down which LLMs you'll use to support your most important use cases, you need a unified way to manage all of them as efficiently as possible. LangChain is an open source framework that simplifies building and managing multiple types of LLM applications in a single user interface. You can plug in any of the more than 80 supported open source LLM integrations, including local and external models and optimized and unoptimized models, and layer on advanced techniques like retrieval-augmented generation (RAG).

The LangChain API Chain combines the prompt template and LLM pipeline to generate better results.

However, as you can see, an LLM application is not just a model. Complete applications include a model, parameters, and a tokenizer, which Hugging Face provides in multiple configurations called pipelines. In addition, LangChain offers a prompt template that delivers more context to the pipeline to generate better results. LangChain offers an API called Chain, among others, that pulls together your prompt template and pipeline, so interacting with the model becomes as simple as using “chain.invoke” to send your question and generate a response.
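Here's a minimal sketch of that flow using LangChain's Hugging Face integration: a pipeline, a prompt template, and a chain you can invoke. The model ID and prompt text are examples, and the imports assume a recent LangChain release that splits the framework into langchain_core and langchain_community packages.

from transformers import pipeline
from langchain_core.prompts import PromptTemplate
from langchain_community.llms import HuggingFacePipeline

# Hugging Face pipeline: bundles the model, tokenizer, and generation settings
hf_pipeline = pipeline(
    "text-generation",
    model="Intel/neural-chat-7b-v3-1",  # example model ID; use any causal LM
    max_new_tokens=128,
)
llm = HuggingFacePipeline(pipeline=hf_pipeline)

# The prompt template adds context around the user's question
prompt = PromptTemplate.from_template(
    "You are a helpful HR assistant. Answer concisely.\n\nQuestion: {question}\nAnswer:"
)

# Chain the template and the pipeline; interacting is just chain.invoke
chain = prompt | llm
print(chain.invoke({"question": "How do I request PTO?"}))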

Step 4: Containerize Your Model

Ask a developer how they like to deploy their LLMs, and more often than not they’ll say Kubernetes. Cloud native infrastructure has become the de facto way to deploy LLMs because it offers a few key advantages.

· Scalability and portability: After you’ve configured a model in a cluster on your desktop, you can seamlessly scale it across platforms — such as Amazon EKS, Microsoft Azure, or the Intel® Developer Cloud — and production environments, from on-premises to the edge and from small to large node clusters.

· Resource management: AI models consume lots of memory and compute power. Kubernetes gives you more control to fine-tune your resources via CPU and memory limits, resource quotas, and priority classes to funnel power to your most important models first.

· Observability: Many open source projects are using telemetry to enhance visibility and provide more insight into AI models.

Just as with the model packaging process in the previous step, LangChain provides an API to help you containerize your model. Start by connecting your container to the file server via a persistent volume claim (PVC) or a persistent volume (PV). Now when you run the container, the “POST” API downloads the model from the file server and places it inside the container.

Each time you run the container, the local or external model (represented by the pipelines here) is downloaded from the file server into the container.
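What that download step looks like in practice depends on your setup. As a purely hypothetical sketch, the container could expose a small FastAPI service whose POST endpoint copies a packaged model from the file server (mounted through the PVC) into the container's local model directory; the paths and endpoint name here are illustrative assumptions, not part of the reference architecture itself.

import shutil
from pathlib import Path
from fastapi import FastAPI

app = FastAPI()

# Hypothetical paths: the PVC is mounted at /mnt/models, and the container
# keeps its working copy of each model under /opt/llm/models
FILE_SERVER_DIR = Path("/mnt/models")
LOCAL_MODEL_DIR = Path("/opt/llm/models")

@app.post("/models/{model_name}")
def download_model(model_name: str):
    """Copy a packaged model from the mounted file server into the container."""
    src = FILE_SERVER_DIR / model_name
    dst = LOCAL_MODEL_DIR / model_name
    if not dst.exists():
        shutil.copytree(src, dst)
    return {"model": model_name, "path": str(dst)}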

Step 5: Integrate Multiple Models

Now that you’ve packaged and containerized your model, you may want to integrate additional LLMs that are fine-tuned to new use cases. To do so, you need an LLM proxy — an architecture layer that unites distinct LLMs so you can interact with multiple models at once.

Take for example your organization’s intranet. It connects employees across your organization to the tools they need to complete their work, communicate across business units, and access important information about their employment, such as pay stubs, PTO requests, and tax forms. Attaching your intranet UI to an LLM proxy gives your employees access to AI models optimized for each business unit.

In addition to providing management tools for model provisioning and governance, the LLM proxy may also include additional intelligence to help field AI prompts and direct them to the best LLM to solve the problem.

The LLM proxy connects front-end architecture with the LLMs running on the back end.
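A very simple version of that routing logic might look like the sketch below: a FastAPI proxy that maps each department to the in-cluster service URL of the corresponding model backend. The service names, ports, and routing rule are assumptions for illustration; a production proxy would add authentication, rate limiting, and smarter prompt-based routing.

import requests
from fastapi import FastAPI, HTTPException

app = FastAPI()

# Hypothetical in-cluster service URLs, one per department-specific model pod
MODEL_BACKENDS = {
    "finance": "http://finance-llm:8000/generate",
    "legal": "http://legal-llm:8000/generate",
    "hr": "http://hr-llm:8000/generate",
}

@app.post("/chat")
def chat(department: str, prompt: str):
    """Route the prompt to the LLM serving the requested department."""
    backend = MODEL_BACKENDS.get(department)
    if backend is None:
        raise HTTPException(status_code=404, detail=f"No model for {department}")
    resp = requests.post(backend, json={"prompt": prompt}, timeout=120)
    return resp.json()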

To build this in a Kubernetes environment, first create an ingress controller or load balancer, such as NGINX, to expose the front end and the proxy so that the browser can communicate with them. Everything beneath the proxy, such as local models or APIs for external models, does not need to be exposed to the browser and can be deployed internally as pods on the cluster. Once you push your containers to a container registry like Docker Hub, they’ll be accessible when you run your cluster.

Demo: Deploying a Chatbot on Kubernetes in Intel Developer Cloud

In this quick demo, we’ll show you how to deploy a chatbot that can switch between multiple models using LangChain and Intel Kubernetes Service (IKS). IKS is an Intel® Developer Cloud service that lets you test applications on the latest Intel® hardware as soon as it’s available, rather than waiting for a cloud service provider to adopt the hardware. You can also download the necessary files from the GitHub repo to launch your own containers and follow along. As you’ll see, the results appear in a simple mock UI built with React.

The LangChain application running in Intel Kubernetes Service (IKS) allows the user to easily switch between three models.

Try It for Yourself

AI models can help improve employee productivity across your organization, but one model rarely fits all use cases. LangChain makes it easy to use multiple LLMs in one environment, allowing employees to choose which model is right for each situation. Explore the GitHub repo to get started using LangChain and IKS.

About the Authors

Ezequiel Lanza, Open Source AI Evangelist, Intel

Ezequiel Lanza is an open source AI evangelist on Intel’s Open Ecosystem team, passionate about helping people discover the exciting world of AI. He’s also a frequent AI conference presenter and creator of use cases, tutorials, and guides to help developers adopt open source AI tools. He holds an MS in data science. Find him on X at @eze_lanza and LinkedIn at /eze_lanza

Arun Gupta, vice president and general manager, Open Ecosystem, Intel

Dedicated to growing the open ecosystem at Intel, Arun Gupta is a strategist, advocate, and practitioner who has spent two decades helping companies such as Apple and Amazon embrace open source principles. He is currently chairperson of the Cloud Native Computing Foundation Governing Board.

Follow us!

Medium, Podcast, Open.intel, X, LinkedIn
