Demystifying the Process of Building a Ray Cluster

Chris Rempola · Published in Sage Ai · 11 min read · Aug 24, 2023

Introduction

At Sage AI, we’re pushing the boundaries of technology by building, experimenting with, and leveraging generative AI to innovate financial applications. Among the diverse array of tools available to us, one has caught our attention and is particularly noteworthy: Ray. Our colleague Yu-Cheng Tsai has recently explored its capabilities in a blog post, Fine-Tuning Large Language Models: A Guide into Distributed Parallel Training with DeepSpeed, Ray and Instruction Following.

How did we arrive there? What steps did we take, what challenges did we face, and what valuable lessons did we learn while building the Ray Cluster from an infrastructure perspective?

In this blog post, we will detail our journey. We’ll dive into our experiences, the planning and design, the implementation, and the lessons learned, giving you an inside look into the construction of a Ray cluster.

So what exactly is Ray?

According to Ray’s documentation, Ray is an open-source unified framework for scaling AI and Python applications, including machine learning. It provides the compute layer for parallel processing so that you don’t need to be a distributed systems expert.

Ray provides the user with useful abstractions for tackling non-trivial distributed systems problems such as data consistency, fault tolerance, synchronization, distributed compute and storage, and even some of the practical aspects of deploying Ray applications. Under the hood, Ray distributes and schedules your tasks across multiple nodes, handles node failures, and takes care of other aspects of distributed computing. To put it simply, Ray abstracts away many of the complexities associated with distributed computing.

Planning & Design

KubeRay Operator

Our infrastructure consists of Kubernetes clusters in Amazon EKS, which host our AI applications. The recommended method for installing Ray onto a Kubernetes cluster is through KubeRay, a Kubernetes Operator. The KubeRay Operator allows you to create and manage Ray clusters in a Kubernetes-native way by defining clusters as a custom resource called a RayCluster. This RayCluster resource describes the desired state of a Ray cluster. The KubeRay Operator manages the cluster lifecycle, autoscaling, and monitoring functions. Hence, we use the KubeRay Operator for our Ray cluster installation.

KubeRay Operator managing the Ray Cluster lifecycle.

As the above image depicts, the KubeRay Operator is responsible for bringing up the pods that make up a Ray cluster. A Ray Cluster is composed of the following components:

  • Head Node Pod — Every Ray cluster has a running Head Node Pod, which acts as the management controller and hosts the Ray driver processes that run Ray jobs. The Head Node Pod also contains the cluster’s GCS (Global Control Store), the cluster metadata store that holds information about the nodes, their resources, and placement groups. The worker nodes communicate with the GCS frequently via gRPC.
  • Worker Node Pods — The worker nodes run the user code in Ray tasks and actors. In addition to distributed scheduling, they also contribute to the storage and distribution of Ray objects in cluster memory.
  • Optional Autoscaler — The Ray Autoscaler works alongside the Kubernetes Cluster Autoscaler. The Ray Autoscaler is responsible for creating Ray pods, but it is still the Kubernetes Cluster Autoscaler’s responsibility to provision the underlying nodes that those pods can be scheduled on. (A minimal RayCluster manifest illustrating these components follows this list.)
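
To make these components concrete, here is a minimal, illustrative RayCluster manifest. It is a sketch rather than our production configuration: the image placeholder, group name, and resource values are examples, and enableInTreeAutoscaling deploys the optional Ray Autoscaler as a sidecar on the head pod.

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster-example
spec:
  enableInTreeAutoscaling: true      # deploys the optional Ray Autoscaler alongside the head
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
    template:
      spec:
        containers:
          - name: ray-head
            image: <RAY_IMAGE>       # placeholder for your Ray image
            resources:
              requests:
                cpu: "2"
                memory: 4Gi
  workerGroupSpecs:
    - groupName: workers             # example group name
      replicas: 1
      minReplicas: 1
      maxReplicas: 3
      rayStartParams: {}
      template:
        spec:
          containers:
            - name: ray-worker
              image: <RAY_IMAGE>
              resources:
                requests:
                  cpu: "4"
                  memory: 8Gi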

Handling Authentication

Before we started, another question we had was: “How does Ray handle authentication to the Ray API?” At present, Ray provides the option to configure TLS authentication for its gRPC channels. Once this is implemented, a correct set of credentials — in this case, a client certificate — is required to connect to the Ray head and submit jobs. This not only adds an extra layer of security, but also ensures that data exchanged among the various processes (client, head, workers, autoscaler) is encrypted.
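
For reference, this is roughly what enabling TLS looks like on the Ray containers, using the environment variables from Ray’s TLS documentation. The certificate paths are placeholders; in our setup they point at the volume populated by the certificate-generating init containers described later.

env:
  - name: RAY_USE_TLS
    value: "1"
  - name: RAY_TLS_SERVER_CERT
    value: /etc/ray/tls/tls.crt    # placeholder path to this node's certificate
  - name: RAY_TLS_SERVER_KEY
    value: /etc/ray/tls/tls.key    # placeholder path to this node's private key
  - name: RAY_TLS_CA_CERT
    value: /etc/ray/tls/ca.crt     # placeholder path to the CA certificate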

Multi-tenancy

While Ray offers a robust framework for AI and Python applications, it has some gaps when it comes to multi-tenancy in production. An important one in our case is that Ray does not natively support access controls, which means any submitted job has full access to the Ray cluster and all the resources within it.

We deploy multi-tenant clusters and establish isolation and security through the use of namespaces. Following this model, we deploy a Ray cluster in each tenant namespace. This setup provides resource isolation via namespaces, which is thankfully supported by the KubeRay Operator: you can deploy the operator in a single namespace, such as a “system” namespace, and it will watch for requested Ray clusters and deploy each one into the namespace in which it was requested.
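
As an illustration of that model (the namespace names here are hypothetical), each tenant namespace carries its own RayCluster resource, while the single KubeRay Operator in the system namespace reconciles them all:

# The spec is the same shape as a normal RayCluster; only the namespace
# (and any tenant-specific sizing) differs between tenants.
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster
  namespace: tenant-a      # one Ray cluster per tenant namespace
---
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster
  namespace: tenant-b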

Ray Autoscaler

We opted to use the Ray Autoscaler to manage costs. Bringing up nodes, particularly GPU nodes, can be costly if they’re left idle. The Ray Autoscaler automatically scales out nodes when jobs are submitted and terminates them once they are no longer running any jobs. This allows us to set both replicas and minReplicas for the workers to 0, ensuring that only the head node pod runs continuously.
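
In practice, that scale-to-zero setup looks roughly like the following worker group spec (the group name and maxReplicas are illustrative), with in-tree autoscaling enabled on the RayCluster:

spec:
  enableInTreeAutoscaling: true
  workerGroupSpecs:
    - groupName: gpu-workers       # illustrative group name
      replicas: 0                  # no workers until a job is submitted
      minReplicas: 0               # allow scale-to-zero when the cluster is idle
      maxReplicas: 4               # illustrative upper bound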

The trade-off is the overhead time it takes to bring up a node, but this is acceptable for us as we’ve prioritized reducing resource waste and costs. We did experience some issues with the Ray Autoscaler initially, which I’ll elaborate on in the “Lessons Learned” section.

It’s important to keep in mind that Ray is a rapidly evolving framework. Aspects that are true at the time of this writing may change soon. For instance, I’ve heard in Ray’s Slack channel that a refactor of the Ray Autoscaler is in progress.

Implementation

Deploying the KubeRay Operator

When it comes to deploying the KubeRay Operator on a Kubernetes cluster, you can do so using Helm. The installation process is straightforward and requires minimal configuration. In our values.yaml, we only override a few fields, such as image, which lets us pull images from our own in-house registry. We also override the securityContext field to add security and access controls for the running pod.

values:
  image:
    repository: <CUSTOM_REPO>
    tag: v0.5.2
  securityContext:
    allowPrivilegeEscalation: false
    capabilities:
      drop:
        - ALL
    runAsNonRoot: true

Creating a Ray Cluster

Once the KubeRay Operator has been deployed, you are now ready to create a Ray cluster by defining a RayCluster resource. However, since we are enabling TLS authentication, there are some prerequisites we need to set up first:

  • Create a CA certificate: You will need to generate a private key and self-signed certificate for a CA. If you are running cert-manager in your cluster, this can be easily accomplished by creating the following Certificate resource.
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: raycluster-ca
spec:
  commonName: raycluster-ca
  isCA: true
  issuerRef:
    name: selfsigning
    kind: ClusterIssuer
  secretName: raycluster-ca-cert

Once the CA is created, you’re ready to generate individual private keys and self-signed certificates for the Ray head and workers. This process is handled by shell scripts stored as a ConfigMap, obtainable from the provided YAML. Within that YAML, you’ll see the shell scripts designated as gencert_head.sh and gencert_worker.sh.

When a head or worker pod is created, the scripts are executed in an init container. This dynamically generates a self-signed certificate, which is shared with the main container via a volume. An essential point here, which may prompt the question “Why not also generate the certs via cert-manager?”, is that the certificates need to include the POD_IP in their alt_names, since the pod IPs of the head and workers are also used for communication. There isn’t an easy way to accomplish this with cert-manager, so we adhered to the documented procedure for generating the certificates.
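
As a rough sketch of that pattern (the image, volume names, and mount paths are placeholders; the real manifest is in the YAML referenced above), the head pod runs the generation script in an init container, injects the pod IP, and writes the resulting certificate into a volume shared with the Ray container:

initContainers:
  - name: generate-tls-cert
    image: <IMAGE_WITH_OPENSSL>          # placeholder: any image with openssl installed
    command: ["/bin/sh", "-c", "/etc/gen/tls/gencert_head.sh"]
    env:
      - name: POD_IP                     # injected so the script can add it to alt_names
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
    volumeMounts:
      - mountPath: /etc/ray/tls          # certificate output, shared with the Ray container
        name: ray-tls
      - mountPath: /etc/gen/tls          # gencert_head.sh mounted from the ConfigMap
        name: gen-tls-script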

We’re now prepared to create a Ray cluster. Using the same YAML, we can deploy a Ray cluster equipped with both Autoscaling and TLS features enabled. Our configuration largely mirrors the original, but with the following modifications:

  • Set {num-cpus: “0”} on the head node pod. This prevents any jobs or workloads from being scheduled on the head node pod. As a result, we can tailor the head node pod’s resources to its own processes rather than to workloads, enhancing efficiency and reducing waste. This is particularly beneficial as your cluster expands, given that the GCS and other head processes may generate significant network load.
headGroupSpec:
  serviceType: ClusterIP # Options are ClusterIP, NodePort, and LoadBalancer
  rayStartParams:
    dashboard-host: "0.0.0.0"
    num-cpus: "0"
  • We also added Fluent-bit as a sidecar container for log collection, a process I’ll cover in more depth in the ‘Lessons Learned & Gotchas: Logging’ section.
- name: fluentbit
  image: <IMAGE_SOURCE>
  resources:
    requests:
      cpu: 100m
      memory: 128Mi
    limits:
      cpu: 125m
      memory: 160Mi
  volumeMounts:
    - mountPath: /tmp/ray
      name: ray-logs
    - mountPath: /fluent-bit/etc/fluent-bit.conf
      subPath: fluent-bit.conf
      name: fluentbit-config

You can find more YAML samples here: https://github.com/ray-project/kuberay/tree/master/ray-operator/config/samples

Once you have the full YAML constructed, you can apply it to your cluster and observe as the operator provisions your Ray cluster! Once the head node pod is running, a good way to check that you have a healthy cluster is to open up the Ray Dashboard.

Lessons Learned & Gotchas

Support for a Mixed-node Type Architecture

Our ML practitioners run workloads on a variety of EC2 instance types. For instance, fine-tuning a Large Language Model (LLM) is executed on AWS P3 instances, which house Nvidia Tesla V100 GPUs. However, we also utilize other GPU-backed EC2 instance types, as well as the typical CPU-only nodes.

During testing, we discovered a lack of support for a mixed-node type architecture within worker groups defined in a single cluster. For example, consider a scenario where you have multiple worker groups for GPU nodes, specifically P3 and G4 EC2 instance types, both possessing different GPU processors. If you want to specifically target the P3 instances running the V100 GPUs only, current support for this seems inadequate.

Consequently, we decided to create a separate Ray cluster for each specific EC2 node type we needed to run workloads on. This setup doesn’t add much overhead since the KubeRay Operator manages all clusters. This approach proved to be simpler: we have a dedicated cluster for each type of workload, a CPU-only cluster, a V100 GPU cluster, and a T4 GPU cluster.
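
For each of these clusters, the worker pod template pins pods to the intended instance type using standard Kubernetes scheduling. A sketch for the V100 cluster follows; the label value, taint, and GPU count are examples and depend on how your node groups are labeled and tainted.

workerGroupSpecs:
  - groupName: v100-workers
    replicas: 0
    minReplicas: 0
    maxReplicas: 4
    rayStartParams: {}
    template:
      spec:
        nodeSelector:
          node.kubernetes.io/instance-type: p3.2xlarge   # example P3 instance size
        tolerations:
          - key: nvidia.com/gpu                          # assumes GPU nodes carry this taint
            operator: Exists
            effect: NoSchedule
        containers:
          - name: ray-worker
            image: <RAY_GPU_IMAGE>                       # placeholder GPU-enabled Ray image
            resources:
              limits:
                nvidia.com/gpu: 1                        # one V100 per worker pod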

Conflict Between Ray Autoscaler and GitOps

If you’re operating Kubernetes clusters, there’s a fair chance you’re running GitOps tooling such as Flux or ArgoCD. We’ve discovered that, currently, the Ray Autoscaler and GitOps reconciliation conflict with each other.

Here’s what happens: in our repository, we’ve set the workerGroupSpecs replicas to 0. When the autoscaler alters the replicas to the desired state (for instance, scaling out to 1), Flux reverts the replicas back to the committed state, which is 0. Consequently, the workers start up and are then terminated mid-job with the error message “The actor is dead because its node has died”.

However, it’s worth noting that there’s a proposed fix for this issue in the pipeline. A PR appears to be underway that aims to make the replicas field optional, which should resolve this conflict. Another workaround is to disable reconciliation on the RayCluster resource. For Flux, this can be done with the following annotation:

apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: raycluster
  annotations:
    kustomize.toolkit.fluxcd.io/reconcile: "disabled"
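
If you’re on ArgoCD instead of Flux, a comparable workaround is to tell the Application that manages the RayCluster to ignore the autoscaler-managed replica counts. This is a sketch assuming the standard ignoreDifferences mechanism; the Application name and the omitted fields are placeholders.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: raycluster-app                        # hypothetical Application name
spec:
  # source, destination, and syncPolicy omitted for brevity
  ignoreDifferences:
    - group: ray.io
      kind: RayCluster
      jsonPointers:
        - /spec/workerGroupSpecs/0/replicas   # ignore the replica count the autoscaler mutates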

Support for Rolling Upgrades

As of this writing, if you make changes to the RayCluster resources, you must manually terminate the head node pod for changes to take effect. The KubeRay Operator does not currently support rolling upgrades, although this feature is slated for inclusion in the v0.6.0 roadmap.

Logging

We utilize Datadog for our monitoring and logging needs. By default, the Datadog agent only tails the container’s stdout and stderr in pods. For Ray, you will only see the startup of the Ray runtime in the output of the head pod, as well as the Autoscaler logs. However, you won’t see the system component logs such as those from the raylet, gcs_server, dashboard, and others. This is because Ray writes these logs to files in the /tmp/ray/session_*/logs directory, not to stdout.

These logs are crucial for troubleshooting issues. If you’re running the autoscaler, where nodes frequently start and stop, the logs from terminated nodes disappear from the Ray Dashboard. Therefore, it’s particularly important to have these logs available in Datadog or your chosen log-persistence solution.

Since the component logs are not written to stdout, you’ll need a way to redirect them there. Initially, we considered Datadog’s option to collect logs from a file, which you can configure via a pod annotation. However, this would require a shared hostPath volume for both the head node pod and the Datadog agent pod so that the agent could mount the path and read from it. That solution would couple Ray and Datadog too closely, so we decided against it. Instead, we opted for the documented solution of running Fluent-bit as a sidecar container.
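
A sketch of that sidecar’s configuration, based on the example in Ray’s documentation: a ConfigMap carrying a fluent-bit.conf that tails Ray’s log directory and echoes everything to stdout, where any log agent can pick it up. The Tag and Refresh_Interval values here are illustrative.

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentbit-config
data:
  fluent-bit.conf: |
    # Tail Ray's component log files and write them to the container's stdout
    [INPUT]
        Name tail
        Path /tmp/ray/session_latest/logs/*
        Tag ray
        Refresh_Interval 5
    [OUTPUT]
        Name stdout
        Match *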

Dependencies

For development and experimentation, Ray has a concept of runtime environments for installing dependencies dynamically at runtime. This works fine; however, you may reach a point where installing packages starts taking a long time, especially the large ML-specific packages. The next step is to bake the dependencies into the cluster in advance with a custom image that has them pre-installed. However, you need to consider that bumping dependency versions can potentially break downstream consumers. What we ended up doing was creating a global base image containing only the large, common dependencies that take a long time to install, such as the CUDA toolkit and the NVIDIA libraries.
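
For the runtime-environment route, dependencies can be declared per job. A sketch (the package list and working directory are illustrative) of a runtime_env.yaml that could be passed at job submission time, e.g. via ray job submit --runtime-env runtime_env.yaml:

# runtime_env.yaml (illustrative)
pip:
  - pandas==2.0.3          # example pinned packages installed on the cluster at runtime
  - scikit-learn==1.3.0
working_dir: ./my_job      # hypothetical local directory uploaded to the cluster with the job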

Head Node OOM (Out of Memory) Issues

During our testing phase, we encountered several issues related to the head node pod running out of memory (OOMing). Here are some important points to consider in relation to these findings:

  • Right-size your head node pod. Determine appropriate resources based on baseline data from your monitoring system. Keep in mind that if the head node pod OOMs and fails, the entire Ray cluster fails with it, resulting in a loss of all cluster-level data. We will be considering strategies to make the head node fault-tolerant as we move towards productionization.
  • Occasionally, a job may terminate without an explicit OOM error from the pod. This behavior can be attributed to the memory monitor run by the raylet process on each node. The monitor checks memory usage at regular intervals and compares it to a configurable threshold; if usage exceeds that threshold, the raylet terminates the corresponding task to free up memory and avert a total Ray cluster shutdown (see the sketch after this list). We observed this scenario during checkpoint syncing, a process that preserves the model’s state during training and would deplete all the memory on the head node. Uploading checkpoints to Amazon S3 instead is an enhancement worth exploring as we progress towards productionization.
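
If you need to adjust that behavior, the memory monitor is configurable through environment variables on the Ray containers. A sketch follows; the values shown are the documented defaults, and raising the threshold only delays, rather than prevents, OOM pressure.

env:
  - name: RAY_memory_usage_threshold
    value: "0.95"               # fraction of node memory at which the raylet starts killing tasks
  - name: RAY_memory_monitor_refresh_ms
    value: "250"                # how often the monitor checks usage; setting 0 disables it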

Conclusion

Building a Ray cluster has been a rewarding journey for our team. Despite confronting a handful of challenges, these experiences have turned into invaluable lessons and laid a solid foundation, and we are excited about the next steps toward productionization.

Our goal in sharing these insights and learnings is to support those in the community who might be considering developing and deploying a Ray cluster. We hope that our experiences will make that process smoother for you.

References

Ray Documentation: https://docs.ray.io/en/latest/index.html

Ray GitHub: https://github.com/ray-project

Deploying the KubeRay Operator via Helm: https://docs.ray.io/en/latest/cluster/kubernetes/getting-started.html#deploying-the-kuberay-operator

Ray Slack channel: https://www.ray.io/community. The Ray team members have been extremely helpful in answering my questions and tracking down the bugs I found. Shout-out and thanks to the Ray team!
