Deploy your own “ChatGPT”

Giacomo Vianello
19 min read · Apr 4, 2023


Deploy an open-source equivalent to ChatGPT on your laptop, then transform it into a highly-available application using Kubernetes

(generated with Bing Images)

Table of contents

Introduction
Run a ChatGPT equivalent using Docker
Kubernetes
Basic idea
K8s terminology
The basic unit of computation: the Pod
Structure of a k8s cluster
Hands-on
Minikube or your own cluster
Installing minikube
Don’t be afraid
Our first pod
Define an auto-scaling application
Testing the autoscaling
Deploying our “ChatGPT” to Kubernetes
Opening a shell into a running pod/container
Conclusions

IMPORTANT: make sure you upgrade Docker to the latest version and, if you already have minikube, upgrade that too. Otherwise you will get an error when pulling the Docker image we are using.

Introduction

You might already be familiar with containers and technologies like Docker. If not, think of containers as isolated environments that package an application’s code, runtime, libraries, and configurations, ensuring that the application runs consistently across different computing environments.

In this blog, we will see how to run an app locally using Docker (we will play with an open-source ChatGPT equivalent). We will then see how easy it is to deploy an app at scale on Kubernetes. We will initially play with a minimal web server and see how Kubernetes can auto-scale it to stay highly available even under heavy traffic. Finally, we will bring everything together and deploy our open-source ChatGPT equivalent to Kubernetes.

Before we get started, if you don’t have Docker installed already, install Docker Engine or Docker Desktop, or even an open-source equivalent like Podman.

Run a ChatGPT equivalent using Docker

We can deploy a running, easy-to-use open-source equivalent to ChatGPT with this simple docker command:

docker run -it -p 7860:7860 \
registry.hf.space/olivierdehaene-chat-llm-streaming:cpu-6728f78 \
python app.py

This uses Hugging Face’s awesome new Docker integration and is a local deployment of this Hugging Face Space.

Docker will initially have to download quite a bit of data, so be patient. After everything is complete you should see a message like this:

Running on local URL:  http://0.0.0.0:7860

You can now put that URL in your browser and you will see a Gradio app like this:

Just put your prompt in the Input form and click Run, and you’re ready to go!

This however is only good for your personal use. What if you want to make it available to your team, or your customers?

Consider the license before using any of this material for commercial applications!

Kubernetes

Running a container or an app locally is fun and useful for testing and personal use. But if you need resources that aren’t available on your laptop, or if you want to share your app with your team or your customers, you need to deploy it on a platform that can run your application robustly. Enter Kubernetes.

Kubernetes, often abbreviated as K8s, is an open-source container orchestration platform that automates the deployment, scaling, and management of containerized applications in a heterogeneous cluster of computing nodes. Examples of applications are websites, databases, machine learning models (endpoints), machine learning pipelines, inference jobs …

If you are an experiential learner, feel free to skip directly to the Hands-on section down below and come back to this later.

Kubernetes (which means “helmsman” in Greek) builds on the container concept and provides a robust platform for managing containers at scale. Some of its key features include:

  1. Cluster management: Kubernetes groups multiple machines (nodes) into a cluster, providing high availability and fault tolerance. A cluster typically consists of one or more master nodes, which control the overall state of the cluster, and multiple worker nodes, which run the actual containerized applications.
  2. Container orchestration: Kubernetes automates the deployment, scaling, and management of containerized applications. It can automatically distribute containers across the nodes in the cluster, restart failed containers, and scale applications up or down based on demand. For example, if a website is receiving a lot of users at once, K8s will scale it up (i.e., provide more computing resources) to support that demand. When the peak usage is over, K8s will scale the website back down to save resources.
  3. Load balancing and service discovery: Kubernetes can automatically expose applications running in containers to the outside world or other containers within the cluster. It can also perform load balancing, which means distributing network traffic or workload evenly across multiple servers to prevent any single server from becoming a bottleneck, ensuring optimal performance and availability of the application.
  4. Rolling updates and rollbacks: Kubernetes allows you to perform rolling updates of your applications, ensuring that new versions are rolled out gradually, with no downtime. If an update fails, Kubernetes can automatically roll back to the previous version.
  5. Storage orchestration: Kubernetes can automatically mount different types of storage systems, such as local storage, network storage, or cloud-based storage, to containers as needed.
  6. Self-healing: Kubernetes monitors the health of nodes and containers and takes action when it detects issues. For example, if a container fails, Kubernetes can restart it automatically. If a node becomes unresponsive, Kubernetes can reschedule the containers running on that node to other nodes in the cluster.
  7. Configuration and secret management: Kubernetes allows you to store and manage sensitive data, such as passwords and API keys, as secrets. These secrets can then be securely mounted into containers and accessed by applications as needed.

While there are other technologies with similar characteristics, K8s has become the de-facto standard for container orchestration.

Basic idea

Kubernetes differentiates between the desired state of an application and the infrastructure needed to make that happen. The desired state is defined by the user, while the infrastructure is handled automatically by K8s.

Users specify things like container requirements, communication needs, storage, computing, and scaling limits, and leave the implementation to Kubernetes. For example, a user might specify a web application composed of 3 components that need to communicate with each other: a website, a database, and a chatbot, each with its own requirements. Moreover, the application needs to be exposed to the internet so that users can interact with it. K8s will find the optimal way to deploy these containers on the existing infrastructure so that all requirements are satisfied. It will also provision the storage and the credentials, put the networking in place so that the containers can communicate with each other, and expose the application externally.

The user can modify the requirements while the application is already deployed. K8s will gradually modify the state of the running application to match the new desired state, avoiding any downtime.
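As a small, hypothetical sketch of this declarative workflow: suppose my-app.yaml is the manifest of a Deployment (a term we will define in the next section) that currently asks for 1 replica. Bumping the desired state to 3 replicas is just an edit to the manifest plus a re-apply, and Kubernetes reconciles the running cluster with the new desired state. The file name my-app.yaml and the name my-app are placeholders for illustration; we will do this for real in the hands-on section.

> kubectl apply -f my-app.yaml              # re-apply the edited manifest (replicas: 1 -> 3)
> kubectl rollout status deployment/my-app  # watch K8s converge to the new state, with no downtime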

K8s terminology

Let’s define the most common objects that you are going to interact with within the context of K8s:

  1. Containers: Isolated environments that package an application’s code, dependencies, and configurations, ensuring consistent behavior across different platforms.
  2. Pods: The smallest and simplest units in Kubernetes, consisting of one or more containers working together and sharing storage and network resources.
  3. Deployments: High-level abstractions that represent the desired state of your application, automating the scaling, updating, and rollback of containerized applications.
  4. Jobs: A high-level abstraction that represents a finite, one-time task or a batch of work. Jobs ensure that a specified number of Pods complete their work, and then terminate. Unlike Deployments, which are intended for long-running, continuously available applications, Jobs are designed for tasks that run to completion and then exit, such as batch processing or data transformation.
  5. StatefulSet: A high-level abstraction designed to manage stateful applications, which require stable network identities and persistent storage across restarts and rescheduling. StatefulSets are ideal for applications that require stable network identities, consistent data storage, and strict ordering, such as databases, message brokers, or distributed data processing systems.
  6. Services: A stable network interface that abstracts away the underlying Pod IPs, enabling load balancing, service discovery, and access to applications within a cluster or externally.
  7. Volume: Represents a data directory for reading and/or writing. Volumes can be ephemeral (they disappear when a pod finishes or is stopped) or persistent. They can also be local (accessible only by containers within the same pod) or shared across pods and even across nodes.
  8. Namespace: An abstraction that separates the cluster into isolated units. It is common to deploy different applications in different namespaces.
  9. ConfigMaps & Secrets: Used to inject configuration data and sensitive values into containers by keeping their definitions separate from pod specs (a minimal example follows this list).
  10. Ingress: A set of rules that manage external access to services within a cluster, typically handling HTTP(S) traffic routing, load balancing, and SSL/TLS termination.
  11. Autoscaler: Kubernetes adopts horizontal autoscaling. This means that when a pod approaches its resource limits (for example on CPU load), the autoscaler creates a new replica (a copy of the pod) and deploys it. The load is then distributed between the replicas. It is common for cloud Kubernetes clusters (for example EKS, GKE, and AKS) to also have a cluster autoscaler capable of creating new cloud instances (nodes) on demand, effectively giving you access to a potentially unlimited amount of resources (be careful with the cost!). Keep in mind that creating a new node/instance can take several minutes in these cases. Also be careful to understand the trade-offs between using on-demand or reserved instances.
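To make item 9 (ConfigMaps & Secrets) a bit more concrete, here is a minimal, hypothetical sketch of a Secret and a Pod that reads it as an environment variable. The names my-secret, demo-pod, and API_KEY are made up for illustration; ConfigMaps work the same way via configMapKeyRef, just for non-sensitive configuration.

apiVersion: v1
kind: Secret
metadata:
  name: my-secret            # hypothetical name, for illustration only
type: Opaque
stringData:
  API_KEY: "not-a-real-key"  # stored base64-encoded by Kubernetes
---
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
  - name: demo-container
    image: busybox
    command: ['sh', '-c', 'echo "API_KEY is set" && sleep 3600']
    env:
    - name: API_KEY          # exposed to the container as an environment variable
      valueFrom:
        secretKeyRef:
          name: my-secret
          key: API_KEY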

For details, please refer to the Kubernetes documentation.

The basic unit of computation: the Pod

The Pod is the fundamental building block that represents the smallest deployable unit of computation. A Pod encapsulates one or more containers, usually closely related, that work together and share the same network namespace and storage resources. This means that containers within a Pod can communicate with each other using ‘localhost’ and access shared volumes for data exchange. While each container runs in isolation with its file system, runtime, and dependencies, Pods allow these containers to work in concert, simplifying the orchestration of multi-container applications. In a Kubernetes cluster, Pods are ephemeral and can be rescheduled or replaced when needed, making them suitable for stateless applications or temporary tasks. To ensure stability and persistence for stateful applications, higher-level abstractions like StatefulSets and/or persistent storage mechanisms are used.

When a Pod is created, the Kubernetes scheduler selects an appropriate node based on various factors, such as resource requirements, node availability, and user-defined constraints. The scheduler evaluates each node’s available resources, such as CPU and memory, ensuring that the chosen node has enough capacity to run the Pod. Additionally, user-defined constraints, such as node labels or node affinity rules, influence the scheduler’s decision, allowing more granular control over Pod placement. Once the scheduler determines the most suitable node, the Pod is deployed on that node and begins executing its containers.
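As an illustrative sketch (not used later in this tutorial), the pod below declares resource requests and a nodeSelector constraint: the scheduler will only place it on a node that has enough spare CPU and memory and that carries a disktype=ssd label, a label you would have to add to a node yourself.

apiVersion: v1
kind: Pod
metadata:
  name: scheduling-demo          # hypothetical pod, for illustration only
spec:
  nodeSelector:
    disktype: ssd                # only nodes labeled disktype=ssd are eligible
  containers:
  - name: demo-container
    image: busybox
    command: ['sh', '-c', 'sleep 3600']
    resources:
      requests:
        cpu: 250m                # the scheduler needs a node with this much free CPU...
        memory: 64Mi             # ...and this much free memory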

Structure of a k8s cluster

A Kubernetes cluster is a group of interconnected machines, or nodes, that work together to orchestrate containerized applications. The cluster is composed of two types of nodes: those belonging to the control plane (also known as the master nodes) and the worker nodes. The control plane manages the overall state of the cluster, handles configuration data, and schedules workloads, while the worker nodes host the actual containerized applications (Pods).

Schematic representation of a minimal Kubernetes cluster. Here for simplicity, we have a Control Plane made of only one node, but in practice, a K8s cluster can have multiple nodes in the Control Plane. This is needed for high availability.

Hands-on

There’s no better way to understand things than by trying them out!

Minikube or your own cluster

In this tutorial, we will use Minikube, a nice tool that allows you to create a fully-fledged Kubernetes cluster on your laptop. This gives us a nice sandbox to try things out without spending any money on cloud resources (nor time on the configuration).

However, if you have access to a normal Kubernetes cluster (or you want to test one out for free) you can use that as well. Just skip the minikube commands in this tutorial and make sure you have installed and configured a recent version of kubectl to work with your cluster.

Installing minikube

(skip this if you already have a k8s cluster)

First, we need to install Minikube by following the instructions for your system (however, DO NOT start the cluster yet!).

You will also need kubectl, which is the command-line tool we will use to interact with Kubernetes. You can use the one that comes with Minikube; it can be executed as minikube kubectl. However, to avoid having to write that every time, we can create an alias:

> alias kubectl="minikube kubectl --"

You might want to add that line to your shell init script (or remember to redefine the alias for every new shell).

We can now start a simulated multi-node cluster with the following command:

> minikube start --nodes 2 -p multinode-demo
> minikube profile multinode-demo # set default minikube profile

This will run for a few minutes: minikube needs to download a few docker images and do some setup. After it’s done, let’s check if things are ready:

> kubectl get nodes   # get the status of the nodes in the cluster
NAME STATUS ROLES AGE VERSION
multinode-demo Ready control-plane 59s v1.24.3
multinode-demo-m02 NotReady <none> 23s v1.24.3

We see here that the second node is not yet ready. Let’s give it a bit more time; after a while you should see this:

> kubectl get nodes
NAME STATUS ROLES AGE VERSION
multinode-demo Ready control-plane 92s v1.24.3
multinode-demo-m02 Ready <none> 56s v1.24.3

We now have two nodes up and running. If you want to see the details of a node, just run kubectl describe node [node name].

Hint: in general, the kubectl tool has a very consistent syntax. For example, in the same way that kubectl get nodes lists the nodes and kubectl describe node [node name] gives you the details, kubectl get pods and kubectl describe pod [pod name] list and describe the pods, kubectl get deployments and kubectl describe deployment [deployment name] the deployments, and so on.
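For example, once we have some pods and deployments running (using names we will create later in this tutorial), the pattern looks like this:

> kubectl get pods                               # list the pods
> kubectl describe pod hello-world-pod           # details of one pod
> kubectl get deployments                        # list the deployments
> kubectl describe deployment webapp-deployment  # details of one deployment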

Don’t be afraid

The following YAML definitions might seem a lot to take in. Do not worry if you don’t understand every single detail, but do read them from beginning to end. The YAML syntax is pretty self-explanatory and you should get an intuition of what is happening. We will call out the important details.

Our first pod

Let’s define a minimal “hello world” pod using the YAML syntax and save it into hello-world-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: hello-world-pod
spec:
  containers:
  - name: hello-world-container
    image: busybox
    command: ['sh', '-c', 'echo Hello, World! && sleep 3600']

We are defining a pod named hello-world-pod with only one container. This container starts a shell and runs the command echo Hello, World!, then sleeps for 3600 seconds to keep the pod running and give us a chance to try a few things.

Let’s go ahead and submit that to the cluster using the apply command. The apply command takes a definition of a new state and applies it to the cluster. Immediately after the command, K8s will change the state of the cluster to match the new desired state. In this case, we’re just adding a pod:

> kubectl apply -f hello-world-pod.yaml

We can now see the list of active pods:

> kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-world-pod 0/1 ContainerCreating 0 5s

After a while we should see that the pod is running:

> kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-world-pod 1/1 Running 0 70s

We can then see its log:

> kubectl logs hello-world-pod
Hello, World!

As before, you can also use the command kubectl describe pod hello-world-pod to get more details about the pod, including the node on which the pod is running. At the end of the output, you can see the log of the events belonging to this pod. This is often useful to see what is happening before the container(s) start executing:

> kubectl describe pod hello-world-pod
...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 4m2s default-scheduler Successfully assigned default/hello-world-pod to multinode-demo-m02
Normal Pulling 4m2s kubelet Pulling image "busybox"
Normal Pulled 3m59s kubelet Successfully pulled image "busybox" in 3.398927802s
Normal Created 3m58s kubelet Created container hello-world-container
Normal Started 3m58s kubelet Started container hello-world-container

We can now stop and delete the pod:

> kubectl delete pod hello-world-pod

Define an auto-scaling application

We will now define a Deployment and test its autoscaling capabilities. Minikube does not enable autoscaling support out of the box, so we first need to enable the Metrics Server add-on, the in-cluster service that collects resource utilization metrics so that K8s can decide when to scale our deployment up and down:

> minikube addons enable metrics-server

Let’s now define our Deployment, i.e., an application that is meant to be always on.

For the moment, we want to explore the autoscaling capabilities of Kubernetes, so we will deploy the NGINX web server in a pod with one container. Later we will deploy the open-source “ChatGPT” in the same way.

In our definition, we will also declare the needed CPU resources as 100m, i.e., 100 millicores, or 0.1 CPUs. Let’s save this into webapp-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webapp-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: webapp
  template:
    metadata:
      labels:
        app: webapp
    spec:
      containers:
      - name: webapp-container
        image: nginx
        resources:
          limits:
            cpu: 100m
          requests:
            cpu: 100m

Here we are defining a Deployment containing 1 replica by default. It contains a template with the description of our pod and its metadata. The structure of the template is identical to the structure of the pod we declared earlier. It is called a “template” because this pod definition will be used over and over as the cluster scales our application up or down. As part of our container definition, we are also adding the specification of the needed resources (limits and requests). Requests are used for scheduling decisions, while limits set an upper bound on resource consumption for a container within a pod. In the case of the CPU limit, if the container exceeds it, it gets throttled. If there were a RAM limit, exceeding it might result in the pod getting killed.
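As an aside, memory can be specified in the same way. The snippet below is an illustrative variant of the container section of the template above (we will not apply it in this tutorial); note the different consequence of exceeding a memory limit:

      containers:
      - name: webapp-container
        image: nginx
        resources:
          requests:
            cpu: 100m
            memory: 64Mi     # used by the scheduler to pick a node with enough free RAM
          limits:
            cpu: 100m        # exceeding this: the container gets throttled
            memory: 128Mi    # exceeding this: the container gets OOM-killed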

Let’s also define the auto-scaling behavior of the application. This defines how K8s will react if the load on the application increases or decreases, so that the application supports the workload without failing or slowing down. We accomplish this by creating a HorizontalPodAutoscaler (HPA). Save the following YAML in webapp-hpa.yaml:

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: webapp-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: webapp-deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

Note how we are specifying the deployment we defined earlier as the target of our autoscaling (using scaleTargetRef). We also declare that we want at minimum 1 replica and at maximum 10, which means that we will get at most 10 copies of our application running concurrently. We are declaring that the criterion for scaling up or down is CPU utilization: if the pods in the deployment use more than 50 percent of the CPU resources they have requested, a new replica will be created unless we have already reached the maximum (scale up). Similarly, if the usage drops below that, one of the replicas will be shut down (scale down). Tuning the target value is important to reach a good trade-off between resource utilization and application responsiveness. Since auto-scaling is not immediate, increasing the value might result in an application that responds slowly under load while new replicas are created. Decreasing the value, however, means that we will have underutilized hardware resources (which can be expensive).
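As a side note, the same HPA could also be created imperatively with kubectl autoscale, which is handy for quick experiments (the YAML file above is preferable for anything you want to keep under version control):

> kubectl autoscale deployment webapp-deployment --cpu-percent=50 --min=1 --max=10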

Finally, we need to declare that we want to expose our NGINX web server to the external world (outside of the cluster), so we can reach it, for example, from our browser. We do this by defining a Service of type LoadBalancer that exposes our deployment to external traffic and spreads the incoming requests across the existing replicas, so that each replica receives a similar load. Let’s call the file webapp-service.yaml:

apiVersion: v1
kind: Service
metadata:
  name: webapp-service
spec:
  selector:
    app: webapp
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: LoadBalancer

Note how we use a selector that matches the app: webapp label of our pod template, so the load is distributed across the replicas of our deployment.

Let’s now apply the 3 changes:

> kubectl apply -f webapp-deployment.yaml
> kubectl apply -f webapp-hpa.yaml
> kubectl apply -f webapp-service.yaml

We can now verify that our deployment, HPA, and service are all there and running (note that it might take a few minutes for everything to be ready):

> kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
webapp-deployment 1/1 1 1 11m

> kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
webapp-hpa Deployment/webapp-deployment 1%/50% 1 10 1 14m

> kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 8h
webapp-service LoadBalancer 10.105.204.116 <pending> 80:32432/TCP 4m30s

In the case of webapp-service we can see that the EXTERNAL-IP is <pending>. This is because minikube does not come with a cloud load balancer that can provision an external IP. For minikube we can instead run minikube service [service name] --url, which will print a URL we can copy and paste into the browser:

> minikube service webapp-service --url
http://192.168.49.2:32432

NOTE: of course your URL will look different, so copy/paste yours
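Alternatively, minikube provides a tunnel command: while it keeps running in a separate terminal, it assigns reachable external IPs to LoadBalancer services, so EXTERNAL-IP no longer shows <pending> (it may ask for your sudo password to set up the network routes):

> minikube tunnel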

Testing the autoscaling

Let’s save the URL in an environment variable:

> export NGINX_URL=$(minikube service webapp-service --url)

To test our autoscaling we will use hey, a tool for sending HTTP requests in bulk. It can be installed on Linux with apt install hey or snap install hey, and on Mac with brew install hey. Let’s send 100,000 requests to our application with a concurrency of 100, which should push the CPU utilization above the 50% target and trigger the scale-up.

NOTE: your results will vary from mine, as they depend on the compute power of your system and on its load. You might need to adjust the numbers in the hey command line to match the power of your laptop

Execute this:

> hey -n 100000 -c 100 $NGINX_URL > report.txt &

This will run in the background and save the results to report.txt. While the requests keep coming, let’s observe the behavior of our autoscaler. We will use kubectl with the --watch option (abbreviated to -w) to periodically check our HPA:

> kubectl get -w hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
webapp-hpa Deployment/webapp-deployment 0%/50% 1 10 1 34m
webapp-hpa Deployment/webapp-deployment 96%/50% 1 10 1 35m
webapp-hpa Deployment/webapp-deployment 96%/50% 1 10 2 35m
webapp-hpa Deployment/webapp-deployment 100%/50% 1 10 2 36m
webapp-hpa Deployment/webapp-deployment 59%/50% 1 10 2 37m
webapp-hpa Deployment/webapp-deployment 59%/50% 1 10 3 37m
webapp-hpa Deployment/webapp-deployment 28%/50% 1 10 3 38m
webapp-hpa Deployment/webapp-deployment 0%/50% 1 10 3 39m
webapp-hpa Deployment/webapp-deployment 0%/50% 1 10 3 42m
webapp-hpa Deployment/webapp-deployment 0%/50% 1 10 2 43m
webapp-hpa Deployment/webapp-deployment 0%/50% 1 10 2 43m
webapp-hpa Deployment/webapp-deployment 0%/50% 1 10 1 44m

... Ctrl+C

Observing the TARGETS column, we can see that the average CPU utilization goes all the way up to 100%. Consequently, K8s scales up our application and REPLICAS increases to 3. Then, when the requests finish, the CPU load drops back down and REPLICAS returns to 1. Note that there is a latency between the CPU usage changing and replicas being created or removed. This is because the autoscaler runs periodically (not continuously), and spinning up a replica also takes a bit of time (depending on the complexity of your application).
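If you want to see the same scaling event from the pods’ point of view, you can run this in a second terminal while hey is hammering the service (the -l flag filters by the app=webapp label of our pod template):

> kubectl get pods -l app=webapp -w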

Now let’s look at the report from hey, which describes the experience the users of our app would have had:

> cat report.txt
Summary:
Total: 210.9319 secs
Slowest: 1.7004 secs
Fastest: 0.0001 secs
Average: 0.2008 secs
Requests/sec: 474.0866

Total data: 61500000 bytes
Size/request: 615 bytes

Response time histogram:
0.000 [1] |
0.170 [47054] |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
0.340 [30913] |■■■■■■■■■■■■■■■■■■■■■■■■■■
0.510 [13146] |■■■■■■■■■■■
0.680 [3966] |■■■
0.850 [3302] |■■■
1.020 [1180] |■
1.190 [245] |
1.360 [143] |
1.530 [41] |
1.700 [9] |


Latency distribution:
10% in 0.0004 secs
25% in 0.0007 secs
50% in 0.1984 secs
75% in 0.3005 secs
90% in 0.5005 secs
95% in 0.6020 secs
99% in 0.9000 secs

Details (average, fastest, slowest):
DNS+dialup: 0.0000 secs, 0.0001 secs, 1.7004 secs
DNS-lookup: 0.0000 secs, 0.0000 secs, 0.0000 secs
req write: 0.0000 secs, 0.0000 secs, 0.0051 secs
resp wait: 0.1949 secs, 0.0000 secs, 1.7001 secs
resp read: 0.0057 secs, 0.0000 secs, 0.8978 secs

Status code distribution:
[200] 100000 responses

We can see that all requests returned an HTTP status of 200 (which means success). On my system, the maximum latency was 1.7 s, but for most requests it was around 0.2 s. K8s managed to keep the application alive despite the huge spike in traffic!

Deploying our “ChatGPT” to Kubernetes

Now that we understand the basic concepts of Kubernetes, we can easily deploy our “ChatGPT” equivalent. Let’s go back to the definition of our deployment and change a few details. Let’s save this new file as llm-deployment.yaml:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-deployment
spec:
  replicas: 1
  selector:
    matchLabels:
      app: llm
  template:
    metadata:
      labels:
        app: llm
    spec:
      containers:
      - name: llm-container
        image: registry.hf.space/olivierdehaene-chat-llm-streaming:cpu-6728f78
        command: [ "python", "app.py" ]
        resources:
          limits:
            cpu: 1000m
          requests:
            cpu: 1000m

Here we have changed the name of the deployment to llm-deployment, the labels to llm, the Docker image to the image containing our app (from Hugging Face), and the resources to one entire CPU. We have also added a command section specifying which command should be executed in the container (in this case, python app.py). Note that the command needs to be split into a list.

We can now update our HPA to point to our new llm app (let’s save it in llm-hpa.yaml):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: llm-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-deployment
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

and our service to point to our llm app as well (let’s save it in llm-service.yaml):

apiVersion: v1
kind: Service
metadata:
  name: llm-service
spec:
  selector:
    app: llm
  ports:
  - protocol: TCP
    port: 7860
    targetPort: 7860
  type: LoadBalancer

We’re now ready to apply all these new changes (we can do it in one go):

> kubectl apply -f llm-deployment.yaml -f llm-hpa.yaml -f llm-service.yaml

After a while we should see that our service and our deployments are ready:

> kubectl get deployments
NAME READY UP-TO-DATE AVAILABLE AGE
llm-deployment 1/1 1 1 23m

> kubectl get hpa
NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS AGE
llm-hpa Deployment/llm-deployment <unknown>/50% 1 10 1 14m

> kubectl get services
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 27m
llm-service LoadBalancer 10.97.182.103 <pending> 7860:30678/TCP 15m

As before, because we are using minikube we need to do one additional step (which we wouldn’t need with a real k8s cluster):

> minikube service llm-service --url
http://192.168.49.2:30678

We can now point our browser to that address (yours might be different!) and here we go, our open-source ChatGPT is now deployed on Kubernetes!

Opening a shell into a running pod/container

Sometimes looking at logs and using kubectl describe is not enough to understand what is happening. In these cases, if your application/job is still running, you might want to open a shell into a running container to see what is going on.

Right now you should have one or two running pods:

> kubectl get pods
NAME READY STATUS RESTARTS AGE
hello-world-pod 1/1 Running 1 (54m ago) 114m
webapp-deployment-645f69c6d6-rgvw2 1/1 Running 0 55m

Let’s open a shell into the only existing web app pod and inspect the environment variables:

> kubectl exec -it webapp-deployment-645f69c6d6-rgvw2 -- /bin/sh

# ls
bin boot dev docker-entrypoint.d docker-entrypoint.sh etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
# env
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT=tcp://10.96.0.1:443
HOSTNAME=webapp-deployment-645f69c6d6-rgvw2
HOME=/root
PKG_RELEASE=1~bullseye
TERM=xterm
KUBERNETES_PORT_443_TCP_ADDR=10.96.0.1
NGINX_VERSION=1.23.3
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
KUBERNETES_PORT_443_TCP_PORT=443
NJS_VERSION=0.7.9
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_PORT_443_TCP=tcp://10.96.0.1:443
KUBERNETES_SERVICE_HOST=10.96.0.1
PWD=/
# exit

NOTE: if there is more than one container in a pod, you must specify the container name using the -c flag
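For example, if our pod had a second container called sidecar (a hypothetical name), the command would become:

> kubectl exec -it webapp-deployment-645f69c6d6-rgvw2 -c sidecar -- /bin/sh

Finally, once you are done experimenting, you can tear everything down using the same manifests you applied, and optionally delete the minikube cluster itself:

> kubectl delete -f llm-deployment.yaml -f llm-hpa.yaml -f llm-service.yaml
> kubectl delete -f webapp-deployment.yaml -f webapp-hpa.yaml -f webapp-service.yaml
> minikube delete -p multinode-demo   # removes the local cluster entirely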

Conclusions

In conclusion, we’ve had some fun deploying an open-source “ChatGPT”, first by using Docker directly and then on Kubernetes. We have seen how K8s can make our applications robust and highly available, and explored in particular its autoscaling capabilities.

We have just begun to uncover the power of Kubernetes by exploring its basics. But there’s so much more to learn in this vast ecosystem. With advanced features like persistent storage, ingress controllers, and custom resource definitions, the possibilities are endless.

Kubernetes will play a central role in the upcoming AI revolution, so don’t stop here! Dive deeper into what Kubernetes can do for your application (AI and Machine Learning or any other application), and happy orchestrating!


Giacomo Vianello

Principal Data Scientist at Cape Analytics. Formerly a Research Scientist @ Stanford working in high-energy Astrophysics. www.linkedin.com/in/giacomovianello