JARK Stack for Generative AI, Part 6
Scalable Model Inference
Table of Contents
This series is composed of six parts:
- Part 1: Introduction
- Part 2: Build the JARK Stack with Terraform
- Part 3: Autoscaling with Karpenter
- Part 4: Parallel Computing with KubeRay
- Part 5: Generative AI on JARK
- Part 6: Scalable Model Inference (this article)
Inference: Crucial Transition from Training to Production
After numerous iterations of training and fine-tuning, an AI/ML model is finally ready for production release. At this stage, the primary focus shifts from model development to using the model to perform inference for customers. For many organizations, it is during inference that they start to see a return on investment, making it a critical phase for revenue generation. Furthermore, the resources utilized for inference, particularly GPUs, represent a significant portion of the infrastructure costs associated with AI/ML operations. As such, it becomes imperative for companies to identify and implement a truly scalable solution capable of managing the demands of inference infrastructure efficiently.
Leveraging KubeRay’s RayService for Scalable AI Inference
Ray Serve is a prominent AI library built atop the Ray framework, designed to provide scalable model inference APIs. It integrates seamlessly with Kubernetes through the KubeRay operator, which introduces a custom resource definition known as RayService. A RayService combines Ray Serve applications and the underlying RayCluster into a single resource, simplifying the management and scaling of inference workloads.
Taking advantage of KubeRay’s autoscaling capabilities can pave the way for developing a truly scalable end-to-end inference API service. In this article, we will build upon the fine-tuned model from part 5 by orchestrating an auto-scaling API service on Amazon Elastic Kubernetes Service (EKS). We’ll demonstrate the concurrent serving of both the original stable diffusion model and our DreamBooth fine-tuned version, showcasing the flexibility of the system without the need for any additional code.
The Python Application Code
It is remarkable how a succinct piece of Python code can serve inference on two different stable diffusion models with such ease. The required code is contained within a single file, stable_diffusion.py, which comes in at under 70 lines. Among these, only a few critical lines underpin its functionality:
model_path = os.environ.get("MODEL_PATH", "/data/tmp/model-tuned")
This crucial line instructs the program to load the stable diffusion model from a directory specified by the MODEL_PATH environment variable. As we will explore later, this variable is conveniently configured via the RayService definition.
@app.get(
"/imagine",
responses={200: {"content": {"image/png": {}}}},
response_class=Response,
)
async def generate(self, prompt: str, img_size: int = 512):
Here we see the FastAPI framework in action, implementing a GET endpoint at /imagine which returns the dynamically generated image in PNG format. By leveraging FastAPI, we make image generation accessible with minimal effort, highlighting the simplicity of the API design.
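To make this concrete, here is a minimal client sketch (not part of the repository) that calls the endpoint with the Python requests library. It assumes the Serve service is reachable at localhost:8000, which is how we will expose it later via kubectl port-forward; the output filename is arbitrary.

# Minimal client sketch: call the /imagine endpoint and save the PNG.
# Assumes the Serve service is reachable at localhost:8000 (for example via
# the kubectl port-forward shown later in this article).
import requests

resp = requests.get(
    "http://localhost:8000/original/imagine",
    params={"prompt": "a dog on paddle board in Moraine Lake, Alberta", "img_size": 512},
    timeout=600,  # the first request may have to wait for a GPU worker to start
)
resp.raise_for_status()

with open("dog.png", "wb") as f:
    f.write(resp.content)
print(f"wrote {len(resp.content)} bytes to dog.png")

Note that requests takes care of percent-encoding the prompt for us, the same way a browser does.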
Understanding the RayService Configuration
The inference.yml file is the Kubernetes manifest for deploying the RayCluster and orchestrating our Ray Serve applications. This manifest is divided into two main sections.
In the first section, the file defines configurations for two Ray Serve applications:
spec:
  serveConfigV2: |
    applications:
      - name: original_stable_diffusion
        route_prefix: /original
        import_path: 05_dreambooth_finetuning.rayservice.stable_diffusion:entrypoint
        runtime_env:
          working_dir: "https://github.com/jinzishuai/jark-demo/archive/master.zip"
          env_vars:
            MODEL_PATH: "/home/ray/efs/src/jark-demo/05_dreambooth_finetuning/data/model-orig/models--stabilityai--stable-diffusion-2-1/snapshots/5cae40e6a2745ae2b01ad92ae5043f95f23644d6"
        ...
This excerpt demonstrates the setup for the original_stable_diffusion application, including its routing details and environment configuration. Here, route_prefix determines the REST API's entry point (/original/imagine), while MODEL_PATH specifies the model's location on disk.
      - name: tuned_dreambooth
        route_prefix: /tuned
        ...
The tuned_dreambooth application mirrors the configuration of original_stable_diffusion, with alterations to its route_prefix and MODEL_PATH. This ensures the fine-tuned model is accessible via a different REST API endpoint (/tuned/imagine).
The import_path and working_dir settings direct KubeRay to download the specified archive of the code repository (here, a zip of the master branch), unpack it into a temporary working directory, and deploy the entrypoint object defined in 05_dreambooth_finetuning/rayservice/stable_diffusion.py.
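For context, that entrypoint object follows the standard Ray Serve deployment-graph pattern. The sketch below is a simplified approximation, not the actual contents of stable_diffusion.py; the deployment names APIIngress and StableDiffusionV2 match what we will later see in the Ray dashboard, but the real file contains the full model-loading and image-generation logic.

# Simplified sketch of the Ray Serve pattern behind "entrypoint";
# the real stable_diffusion.py differs in detail.
from fastapi import FastAPI
from ray import serve

app = FastAPI()

@serve.deployment(ray_actor_options={"num_gpus": 1})
class StableDiffusionV2:
    def __init__(self):
        # The diffusion pipeline would be loaded here from MODEL_PATH.
        ...

    def generate(self, prompt: str, img_size: int = 512):
        # Run the pipeline and return the generated image.
        ...

@serve.deployment(num_replicas=1)
@serve.ingress(app)
class APIIngress:
    def __init__(self, diffusion_model_handle):
        self.handle = diffusion_model_handle
    # The /imagine GET route shown earlier is defined on this class.

# KubeRay imports this symbol (the value of import_path) and deploys the graph.
entrypoint = APIIngress.bind(StableDiffusionV2.bind())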
The second section of the manifest presents the Ray cluster configuration, advancing the setup we previously used by introducing autoscaling capabilities:
  rayClusterConfig:
    rayVersion: "2.9.0"
    enableInTreeAutoscaling: true
    ...
With enableInTreeAutoscaling set to true, an autoscaler sidecar container is added to the head node pod. This container manages scaling operations, adjusting the number of active worker nodes within the defined limits:
    workerGroupSpecs:
      - replicas: 0
        minReplicas: 0
        maxReplicas: 2
With this configuration, the cluster can scale between 0 and 2 worker pods, adapting dynamically to the workload demands.
Running the RayService Demonstration
To initiate the demo, we simply deploy the RayService using the inference.yml manifest:
╭─ ~/src/jinzishuai/jark-demo/05_dreambooth_finetuning/rayservice main *1 ?5 ·································································· base ⎈ jark-stack 08:15:25
╰─❯ kubectl apply -f inference.yml
rayservice.ray.io/dreambooth-inference created
This instructs Kubernetes to instantiate all required resources within the ray-gpu namespace:
╰─❯ kubectl -n ray-gpu get rayservice,raycluster,service,pod,nodes -o wide
NAME AGE
rayservice.ray.io/dreambooth-inference 2m29s
NAME DESIRED WORKERS AVAILABLE WORKERS STATUS AGE HEAD POD IP HEAD SERVICE IP
raycluster.ray.io/dreambooth-inference-raycluster-x6k2n ready 2m27s 100.64.195.137 172.20.170.198
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE SELECTOR
service/dreambooth-inference-head-svc ClusterIP 172.20.115.77 <none> 10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP 2m6s app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=dreambooth-inference-raycluster-x6k2n,ray.io/identifier=dreambooth-inference-raycluster-x6k2n-head,ray.io/node-type=head
service/dreambooth-inference-raycluster-x6k2n-head-svc ClusterIP 172.20.170.198 <none> 10001/TCP,8265/TCP,52365/TCP,6379/TCP,8080/TCP,8000/TCP 2m28s app.kubernetes.io/created-by=kuberay-operator,app.kubernetes.io/name=kuberay,ray.io/cluster=dreambooth-inference-raycluster-x6k2n,ray.io/identifier=dreambooth-inference-raycluster-x6k2n-head,ray.io/node-type=head
service/dreambooth-inference-serve-svc ClusterIP 172.20.101.124 <none> 8000/TCP 2m6s ray.io/cluster=dreambooth-inference-raycluster-x6k2n,ray.io/serve=true
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/dreambooth-inference-raycluster-x6k2n-head-49dlg 2/2 Running 0 2m28s 100.64.195.137 ip-100-64-177-12.ec2.internal <none> <none>
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node/ip-100-64-177-12.ec2.internal Ready <none> 73m v1.28.5-eks-5e0fdde 100.64.177.12 <none> Amazon Linux 2 5.10.205-195.804.amzn2.x86_64 containerd://1.7.2
node/ip-100-64-32-74.ec2.internal Ready <none> 73m v1.28.5-eks-5e0fdde 100.64.32.74 <none> Amazon Linux 2 5.10.205-195.804.amzn2.x86_64 containerd://1.7.2
With no GPU nodes spun up yet, we establish two port-forward connections to reach the REST API and the Ray dashboard:
kubectl -n ray-gpu port-forward service/dreambooth-inference-head-svc 8265
kubectl -n ray-gpu port-forward service/dreambooth-inference-serve-svc 8000
Before commencing any tasks, the Ray cluster should show zero worker nodes. The Ray dashboard at http://localhost:8265/ will confirm this, revealing only the head node running the APIIngress processes.

We can also check the Serve tab and confirm that the StableDiffusionV2 deployment has 0 replicas at this time.

Now let's perform an inference on the original stable diffusion model by visiting the URL http://localhost:8000/original/imagine?prompt="a dog on paddle board in Moraine Lake, Alberta". Note that the browser will convert this URL into something like http://localhost:8000/original/imagine?prompt=%22a%20dog%20on%20paddle%20board%20in%20Moraine%20Lake,%20Alberta%22
The browser will spin for a while because, behind the scenes, Ray is requesting a new worker pod and Karpenter is fulfilling that request with a new g5.xlarge instance. We can check what's happening with:
╰─❯ kubectl -n ray-gpu get pod,nodeclaim,nodes -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/booth-inference-raycluster-x6k2n-worker-gpu-group-x64bz 0/1 Init:0/1 0 2m10s <none> ip-100-64-99-95.ec2.internal <none> <none>
pod/dreambooth-inference-raycluster-x6k2n-head-49dlg 2/2 Running 0 16m 100.64.195.137 ip-100-64-177-12.ec2.internal <none> <none>
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
nodeclaim.karpenter.sh/gpu-vcvx6 g5.xlarge us-east-1c ip-100-64-99-95.ec2.internal True 2m9s on-demand gpu gpu
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node/ip-100-64-177-12.ec2.internal Ready <none> 87m v1.28.5-eks-5e0fdde 100.64.177.12 <none> Amazon Linux 2 5.10.205-195.804.amzn2.x86_64 containerd://1.7.2
node/ip-100-64-32-74.ec2.internal Ready <none> 87m v1.28.5-eks-5e0fdde 100.64.32.74 <none> Amazon Linux 2 5.10.205-195.804.amzn2.x86_64 containerd://1.7.2
node/ip-100-64-99-95.ec2.internal Ready <none> 86s v1.28.5-eks-5e0fdde 100.64.99.95 <none> Amazon Linux 2 5.10.192-183.736.amzn2.x86_64 containerd://1.7.2
The Ray Serve dashboard also shows the upscaling in progress:

In a few minutes, we should see an image similar to this in the browser:

We can try again with different prompts, and we'll get responses very quickly.
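For anyone who prefers measuring this from code rather than eyeballing the browser, a small hypothetical script like the sketch below (again assuming the port-forward to localhost:8000 is still active, and using made-up prompts) makes the warm-request latency easy to see:

# Hypothetical timing script: once a GPU worker is up, repeated requests
# with different prompts return relatively quickly.
import time
import requests

prompts = [
    "a dog on paddle board in Moraine Lake, Alberta",
    "a dog wearing sunglasses on a beach",
]
for prompt in prompts:
    start = time.time()
    resp = requests.get(
        "http://localhost:8000/original/imagine",
        params={"prompt": prompt},
        timeout=600,
    )
    resp.raise_for_status()
    print(f"{prompt!r}: {time.time() - start:.1f}s, {len(resp.content)} bytes")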
Similarly, we can activate the fine-tuned model by visiting http://localhost:8000/tuned/imagine?prompt="a [v] dog on paddle board in Moraine Lake, Alberta"

At this point, our infrastructure has scaled out to two g5.xlarge GPU instances in both the EKS cluster and the Ray cluster.
╰─❯ kubectl -n ray-gpu get pod,nodeclaim,nodes -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
pod/booth-inference-raycluster-m549l-worker-gpu-group-tmzv6 1/1 Running 0 7m15s 100.64.131.195 ip-100-64-203-155.ec2.internal <none> <none>
pod/booth-inference-raycluster-m549l-worker-gpu-group-w566w 1/1 Running 0 19m 100.64.75.253 ip-100-64-115-5.ec2.internal <none> <none>
pod/dreambooth-inference-raycluster-m549l-head-x96q2 2/2 Running 0 19m 100.64.224.97 ip-100-64-177-12.ec2.internal <none> <none>
NAME TYPE ZONE NODE READY AGE CAPACITY NODEPOOL NODECLASS
nodeclaim.karpenter.sh/gpu-6bjpv g5.xlarge us-east-1c ip-100-64-115-5.ec2.internal True 19m on-demand gpu gpu
nodeclaim.karpenter.sh/gpu-c9s8f g5.xlarge us-east-1d ip-100-64-203-155.ec2.internal True 7m14s on-demand gpu gpu
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
node/ip-100-64-115-5.ec2.internal Ready <none> 18m v1.28.5-eks-5e0fdde 100.64.115.5 <none> Amazon Linux 2 5.10.192-183.736.amzn2.x86_64 containerd://1.7.2
node/ip-100-64-177-12.ec2.internal Ready <none> 151m v1.28.5-eks-5e0fdde 100.64.177.12 <none> Amazon Linux 2 5.10.205-195.804.amzn2.x86_64 containerd://1.7.2
node/ip-100-64-203-155.ec2.internal Ready <none> 6m38s v1.28.5-eks-5e0fdde 100.64.203.155 <none> Amazon Linux 2 5.10.192-183.736.amzn2.x86_64 containerd://1.7.2
node/ip-100-64-32-74.ec2.internal Ready <none> 151m v1.28.5-eks-5e0fdde 100.64.32.74 <none> Amazon Linux 2 5.10.205-195.804.amzn2.x86_64 containerd://1.7.2
So far, we've seen how the RayService automatically scales out as load increases. When the load subsides and the service idles, the RayService automatically scales back down, terminating its worker pods; shortly afterwards, Karpenter removes the provisioned GPU instances.
Note: To maintain a seamless user experience in production, it's advisable to keep a minimum number of worker pods active, for example by setting minReplicas to 1 or more in the workerGroupSpecs. This helps mitigate the latency associated with new pod initialization, particularly when dealing with large Docker images.
Summary: Scalability with KubeRay and Karpenter on EKS
This article showcased the impressive scalability achievable with KubeRay and Karpenter on EKS when applied to AI model inferencing. The seamless orchestration between these tools underlines the potential for responsive and cost-effective scaling in demanding AI workloads. I hope that the insights shared here have illuminated a path forward for leveraging cloud resources efficiently and effectively in the realm of machine learning operations.
Series Wrap-Up: Reflecting on the JARK Stack for Generative AI Journey
With this final installment, we bring to a close the “JARK Stack for Generative AI” series. Throughout our journey, we have navigated the intricate landscape of cutting-edge AI model deployment, capitalizing on the strengths of the JARK stack within the AWS ecosystem.
Out of curiosity, I reviewed my AWS bill for the duration of these explorations, and it revealed some unexpected insights. In a span of just over a week, a period marked by the active creation and destruction of AWS infrastructure, including GPU instances, the total remained below $100 USD. Intriguingly, it was not the GPU instances but the NAT gateway's network traffic that topped the expenditures, consuming over 30% of the overall budget. This finding points to potential cost-saving strategies, such as downsizing Docker images and transitioning to the Amazon Elastic Container Registry (ECR), to tamp down networking costs.

Yet the key takeaway extends beyond the numbers. This experience stands as a testament to the economic agility made possible by dynamic cloud resource management. Considering that keeping just a pair of g5.xlarge GPU instances running would typically cost upwards of $300 USD per week, our frugal footprint underscores the savings we were able to tap into.
This series was never intended to serve as a comprehensive dissection of every technical substrate but rather as a starting point — an appetizer to whet the appetite and spark intrigue. It was designed to prompt deeper dives and further investigations into this robust technological stack.
As we conclude, your insights and experiences are invaluable — I warmly welcome any commentary, critique, or reflections on the blog posts or code shared throughout this series. Let’s continue to learn, discover, and innovate together in the ever-expanding domain of generative AI.
