How to Deploy a Self-Hosted LLM on EKS and Why You Should
Are you tired of worrying about skyrocketing token costs in production? Are you concerned about how external vendors handle your sensitive data? This post will guide you through deploying a self-hosted LLM on EKS, giving you control and cost efficiency. We’ll explore everything from why you might want to self-host to the tools and metrics essential for the setup. Plus, we’ll demonstrate how to set up a simple chat application to interact with your model.
Why Self-Hosting?
While advanced language models from vendors like OpenAI and Anthropic are super impressive, they’re not always wallet-friendly. Experimenting and developing might not attract the FinOps team’s attention, but when you shift to production, the costs associated with pay-per-token pricing models can really start to add up, and fast.
Yes, the bigger models are expensive, and you should always evaluate your actual needs rather than defaulting to the largest model. But even the smaller models will eventually become expensive at scale.
At Next Insurance, we started with GPT-3.5-turbo because it worked just fine and was significantly cheaper than GPT-4. But even with GPT-3.5, we noticed that our costs were doubling month over month at an alarming rate.
But hey, it’s not all about the money. Self-hosting offers other benefits that are just as important:
- Data Security — All sensitive information, especially personally identifiable information (PII), stays secure within our network. This setup eliminates concerns about sending data out or worrying about what external vendors might do with it.
- Developer Freedom — Self-hosting gives our developers the freedom to explore and innovate without the constraints of escalating costs and external data privacy concerns. This freedom supports a creative environment where technological experimentation is encouraged, leading to more innovative solutions.
Sure, you might not find many open-source models that can match GPT-4, but there are plenty of alternatives suitable for tasks typically handled by GPT-3.5. Some of these models are even better and cost just a fraction of the price. Deploying these on your own network allows you to control the data. Most importantly, you pay a fixed computing price instead of paying per usage, making costs much more predictable and manageable.
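To make the fixed-versus-per-token trade-off concrete, here is a back-of-the-envelope sketch. All prices and volumes below are hypothetical placeholders, not actual vendor or AWS rates:

```python
def monthly_api_cost(requests_per_day: int, tokens_per_request: int,
                     price_per_1k_tokens: float) -> float:
    """Pay-per-token pricing: cost scales linearly with usage."""
    return requests_per_day * 30 * tokens_per_request / 1000 * price_per_1k_tokens

def monthly_selfhost_cost(hourly_instance_price: float) -> float:
    """Self-hosting: a fixed compute price, independent of traffic."""
    return hourly_instance_price * 24 * 30

# Hypothetical workload: 50k requests/day, 1k tokens each, $0.002 per 1k tokens
api = monthly_api_cost(50_000, 1_000, 0.002)   # ~$3,000/month, and grows with usage
gpu = monthly_selfhost_cost(0.80)              # ~$576/month, flat regardless of traffic
print(f"API: ${api:,.0f}/month vs self-hosted: ${gpu:,.0f}/month")
```

Once traffic pushes the per-token bill past the instance price, self-hosting wins on cost alone, before you even account for the data-control benefits.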
What Tools Do You Need? (and Some Other Considerations)
I’d assume that you’re familiar with AWS and EKS, so we’ll focus here on the different components required to serve an LLM model.
The main areas we need to consider are Compute, Inference, and Model.
Compute
When setting up LLM inference, the GPU — specifically its type and quantity — is the main resource you need to consider. This is because the entire model is loaded into the GPU’s memory (VRAM), and all the LLM calculations are performed on the GPU.
To estimate the amount of VRAM needed, check this guide or follow this simple rule of thumb: multiply the model’s number of parameters (in billions) by two for the base requirement, then add an additional 20% to cover caching and overhead. For example, to serve a model with 7 billion parameters, you would need approximately 17 GB of VRAM (7 x 2 x 1.2 = ~16.8 GB) on one or multiple GPUs.
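The rule of thumb is easy to turn into a quick sanity check. A minimal sketch, assuming FP16/BF16 weights (2 bytes per parameter) and the 20% overhead factor described above:

```python
def estimate_vram_gb(params_billion: float, bytes_per_param: int = 2,
                     overhead: float = 0.2) -> float:
    """Rough VRAM estimate: model weights plus ~20% for KV cache and overhead."""
    return params_billion * bytes_per_param * (1 + overhead)

# A 7B-parameter model in FP16 needs roughly 16.8 GB of VRAM
print(round(estimate_vram_gb(7), 1))  # → 16.8
```

Quantized models (8-bit or 4-bit, for example) shrink the bytes-per-parameter figure accordingly, which is why they fit on much smaller GPUs.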
Inference
For the serving framework, we’ll use vLLM, an open-source framework designed to serve LLM models with an OpenAI-compatible API server. vLLM supports continuous batching, making it ideal for handling multiple concurrent requests and high loads. Additionally, vLLM supports distributed serving if we need to run a model over multiple GPUs or nodes. It uses Ray as the backend for distributed serving, another open-source framework for running large-scale ML applications.
While vLLM supports most features available in the OpenAI API, there are some exceptions that are still works in progress. The vLLM team is actively developing and rapidly releasing new features, so I recommend keeping track of their GitHub page to stay updated on the latest advancements.
Model
Hundreds of models, from foundation models to more specific, fine-tuned versions designed to tackle specific problems, are available on Hugging Face. Think of Hugging Face as the “GitHub” of AI and ML applications—a key place to find just about any model or dataset you might need.
For a good starting point in comparing and evaluating these models, check out this popular LLM leaderboard.
When choosing a model, don’t forget to check the licensing. Some models, like Mistral under the Apache license or Phi under the MIT license, are fully open-source. However, many come with semi-commercial licenses. Reviewing these terms is key to ensuring they fit your legal and operational plans.
Bringing It All Together (Demo Time)
All the code for this demo is available on my GitHub, which you can access here.
We’ll use the Mistral 7B Instruct v0.2 model for this demo, which is completely open-source under the Apache license. We’ll run it on an AWS g6.xlarge instance on a spot basis, typically costing less than 15 cents per hour. This instance is equipped with an Nvidia L4 GPU with 24 GB of VRAM, which comfortably fits our model based on the VRAM estimation rule we discussed.
During the demo, we will deploy a VPC in the us-west-2 region; an EKS cluster with Karpenter running on Fargate; two Karpenter node classes, one for GPU nodes and one for standard nodes; the Nvidia device plugin, which exposes the GPUs to Kubernetes; and Prometheus and Grafana for monitoring. All of these resources will be set up using Terraform.
Regarding costs, running this demo is expected to be about 30–40 cents per hour. This includes the charges for the NAT gateway, EKS control plane, and all nodes, including the GPU-equipped nodes.
0. Prerequisites
Before diving into the demo, make sure you have the following ready:
- AWS Account — You’ll need an AWS account with sufficient permissions to set up the resources detailed in the demo, including VPCs, EKS clusters, and more.
- AWS credentials — Ensure your credentials are correctly configured in your local environment.
- Terraform — You should have Terraform installed on your machine. Terraform will be used to provision and manage the AWS resources required for the demo.
- Kubectl — As we manage Kubernetes resources, ensure that Kubectl is installed.
- Hugging Face access token — Follow this guide to generate an API access token to pull the model from Hugging Face.
1. Underlying Infrastructure
1. Open your terminal and clone the repository by running the following:
git clone https://github.com/eliran89c/self-hosted-llm-on-eks
2. Change into the directory:
cd self-hosted-llm-on-eks
3. (Optional) Adjust the Terraform code to tailor the setup to your specific requirements if needed.
4. Initialize Terraform and apply the Terraform configuration to deploy the infrastructure (deploying an EKS cluster should take up to 30 minutes):
terraform init
terraform apply
5. Set up Kubectl to interact with your newly created EKS cluster:
aws eks update-kubeconfig --region us-west-2 \
--name self-hosted-llm \
--alias self-hosted-llm
6. Check that Karpenter and CoreDNS are running:
kubectl get pods --all-namespaces
You should see the Karpenter and CoreDNS pods in the Running state.
7. Ensure that the Karpenter node classes are correctly in place:
kubectl get ec2nodeclasses.karpenter.k8s.aws
The output should list the available node classes.
2. Deploy vLLM and serve the model
1. Go to the model page on Hugging Face and accept the model terms and conditions.
2. Create a secret with your HuggingFace API access token:
kubectl create secret generic huggingface-token \
--from-literal=token=<your_hugging_face_token>
Replace <your_hugging_face_token> with your actual Hugging Face API access token.
3. (Optional) Review the deployment file, specifically the deployment args section. If necessary, modify the engine arguments to suit your specific requirements better. You can view a full list of all available engine arguments here.
4. Deploy vLLM:
kubectl apply -f vllm.yaml
5. To enable Prometheus to scrape metrics from vLLM, deploy a ServiceMonitor:
kubectl apply -f serviceMonitor.yaml
6. After deploying vLLM, it typically takes 2–3 minutes to download and load the model into the GPU. You can check the logs directly to monitor what’s happening during this initialization phase.
First, verify the pod is running:
kubectl get pods
Follow the logs with the following:
kubectl logs -f -l app=vllm
When the model is loaded and ready, the logs will show that the API server has started and is listening for requests.
7. Open a new terminal and set up port forwarding to interact with the OpenAI-compatible API endpoint on port 8000:
kubectl port-forward svc/vllm 8000:8000
8. Now that everything is set up, test the LLM by sending a query via a standard OpenAI curl command. Here’s an example:
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistralai/Mistral-7B-Instruct-v0.2",
"messages": [{"role": "user", "content": "What is the capital of France?"}]
}'
The response should be a standard OpenAI-style chat completion containing the model’s answer.
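If you prefer Python over curl, the same request body can be built and sent with nothing but the standard library. A minimal sketch (the payload mirrors the curl example above; the URL assumes the port-forward from the previous step is still active):

```python
import json
import urllib.request

def build_chat_request(model: str, prompt: str) -> bytes:
    """Build the JSON body for an OpenAI-style /v1/chat/completions call."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return json.dumps(payload).encode()

body = build_chat_request("mistralai/Mistral-7B-Instruct-v0.2",
                          "What is the capital of France?")
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=body,
    headers={"Content-Type": "application/json"},
)
# Uncomment to send the request while the port-forward is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, the official OpenAI Python client also works if you point its base URL at http://localhost:8000/v1.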
3. Setting up a simple chat application to interact with the model
1. Create a new Python virtual environment by running:
python3 -m venv .venv
2. Activate the virtual environment:
source .venv/bin/activate # On Linux or MacOS
.venv\Scripts\activate # On Windows
3. Install the necessary Python packages by running:
pip install -r requirements.txt
These packages include the OpenAI Python client for API requests and Gradio for web interface creation.
4. Start the application by running:
python chat.py
5. Once the application is running, open a web browser and go to http://localhost:7860/
Model Monitoring
When it comes to LLMs, several important metrics help us monitor and measure the model’s latency and throughput. These metrics are vital for optimizing performance and ensuring the model operates efficiently. Below are the primary metrics to consider:
Time to First Token (TTFT) — This metric measures the time from submitting a request to the model until the first token of the response is generated. It’s a critical indicator of the model’s initial responsiveness, which is particularly important in user-facing applications where response time impacts user experience.
Time Per Output Token (TPOT) — Similar to the above, this metric tracks the time it takes to generate each subsequent token after the first. It helps assess how efficiently the model generates content once it has started, offering insight into its throughput performance.
Prompt/Generation Tokens per Second — This metric measures the number of tokens the model processes or generates per second. It’s an essential metric for assessing the model’s throughput capacity. High rates indicate a more efficient model that can handle more input or produce more content in less time.
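The definitions above can be computed directly from token timestamps. A minimal sketch (the function and field names are illustrative, not vLLM’s actual metric names):

```python
def latency_metrics(request_time: float, token_times: list) -> dict:
    """Derive TTFT, average TPOT, and generation throughput from timestamps.

    request_time: when the request was submitted (seconds)
    token_times:  when each output token was emitted (seconds, ascending)
    """
    ttft = token_times[0] - request_time
    total = token_times[-1] - request_time
    # Average time per token for every token after the first
    tpot = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {
        "ttft_s": ttft,
        "tpot_s": tpot,
        "tokens_per_s": len(token_times) / total,
    }

# 4 tokens: the first after 0.5 s, then one every 0.1 s
print(latency_metrics(0.0, [0.5, 0.6, 0.7, 0.8]))  # TTFT 0.5 s, TPOT ~0.1 s, ~5 tokens/s
```

In practice you don’t compute these by hand; vLLM tracks them for you, as described next.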
vLLM exports these metrics (and more) via the /metrics endpoint, for which we have already configured a scraper in Prometheus. Now, let’s set up a dashboard in Grafana to visualize these metrics and better understand our model’s performance in real time.
Setting up a Grafana dashboard to monitor LLM metrics
1. To access the Grafana dashboard through your browser, you first need to port-forward the Grafana pod. From your terminal, enter the following command:
kubectl port-forward -n kube-prometheus-stack \
service/kube-prometheus-stack-grafana 8080:80
2. Open a web browser and navigate to http://localhost:8080. The login page should appear. Log in with the username admin and the default password prom-operator.
3. Once logged in, click the “+” icon on the top right bar and select “Import dashboard.”
4. Upload the JSON file named grafana-dashboard.json from the root folder of the GitHub repository and click “Import.”
5. A dropdown filter at the top left of the dashboard allows you to select the specific model you want to monitor.
Tear down the environment
When you’re ready to remove the setup and release the resources, follow these steps:
1. Remove the vLLM deployment:
kubectl delete -f vllm.yaml
2. Now, use Terraform to destroy all the created infrastructure resources. Run the following command:
terraform destroy
Conclusion
In this post, we’ve walked through setting up a self-hosted LLM that offers significant benefits, particularly cost savings and data control. This setup is especially beneficial when you don’t necessarily require the most advanced models, such as GPT-4, and when smaller, less resource-intensive models will suffice.
Please note that the demo we ran is not intended for production use. Continuous monitoring of the cluster's health is crucial for implementing this setup in a production environment. Additionally, implementing ingress and scaling policies is essential to manage load and maintain service availability effectively.
For use cases that demand larger models that need to run across several nodes, I highly recommend using KubeRay (Ray operator). KubeRay significantly eases the scaling and management of complex distributed systems. If there’s interest, I’m ready to dive deeper into leveraging KubeRay for large-scale deployments in a future post—just let me know in the comments if that’s something you’d like to see!
