Scale AI on Ray on Vertex AI: Let’s get it started

Ivan Nardini
Google Cloud - Community
8 min read · Dec 18, 2023
Figure 1 — Image from author

At Next 2023, Google Cloud announced Ray on Vertex AI, a managed service that provides scalability for AI and Python applications using Ray.
As a Vertex AI developer, you may be wondering:

  • What is Ray?
  • Why should I use Ray on Vertex AI?
  • How can I use Ray to train and deploy AI/ML models on Vertex AI?

This article introduces Ray and aims to answer those questions. By the end, you will know how to set up a Ray cluster on Vertex AI and how to run a simple distributed training job using Ray on Vertex AI.

What is Ray

Ray is a flexible distributed computing framework for the Python data science community. Distributed Python is not new, and Ray is not the first framework in this space; some of the most common distributed frameworks are Spark, Dask and Celery. So why should you consider Ray?

According to the Ray documentation, there is no single, unified distributed framework in the ML ecosystem. The figure below gives you an idea of just how varied the landscape is.

Figure 2 — Image updated from Ray’s talk given at UC Berkeley DS100

For each step of the ML process, each distributed framework is specialised and has its own approach to task parallelism, scheduling, resource management, and more. This can make it difficult to scale an ML application, as it requires developers to understand and manage different distributed patterns and programming interfaces.

In this era of LLMs and GenAI, scaling a machine learning application is a complex and challenging task that requires significant effort and specialised expertise. To get a sense of how challenging it can be to distribute an ML workload, take a look at this article about how Google Cloud scaled the largest distributed LLM training job on a TPU GKE cluster.

Due to these challenges, Ray aims to provide a simple, general and unified Python distributed framework that is flexible enough to scale any machine learning workload.

Figure 3 — Image updated from Ray’s talk given at UC Berkeley DS100

This is possible because Ray is organised as a three-layer distributed computing framework. In particular, you have:

  • Ray Clusters, a flexible set of worker nodes connected to a head node, capable of auto-scaling based on application needs and agnostic to the underlying infrastructure (Kubernetes, AWS, GCP, and Azure).
  • Ray Core, a low-level, general-purpose distributed computing API that lets you scale any Python application and accelerate ML workloads.
  • Ray AI Libraries, domain-specific libraries for ML engineers and scientists that simplify common tasks like training, serving, and tuning models.

Looking into Ray Core, assuming you have a Ray cluster up and running, distributing Python functions and classes is quite intuitive. Below you can see a simple example of distributing a function in Ray.

import ray

ray.init()

@ray.remote
def hello_world():
    return "hello world"

print(ray.get(hello_world.remote()))

With .init(), you initiate a session with the Ray cluster. Then you define the function you want to distribute and turn it into a Ray task by wrapping it with the @ray.remote decorator. With .remote(), you execute the task asynchronously as a separate process (worker) on the Ray cluster. Finally, the result is stored in the distributed object store, and you retrieve it with .get().

As you can see, in just three steps you are able to distribute your function with Ray. Of course, there are other fundamental APIs to learn, but this gives you an idea of how Ray makes distributed computing simpler.
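
Distributing a Python class works the same way. Decorating a class with @ray.remote turns it into a Ray actor, a stateful worker process whose methods you invoke with .remote(). Below is a minimal sketch; the Counter class is an illustrative example, not taken from any library.

import ray

ray.init()

@ray.remote
class Counter:
    """A stateful Ray actor that keeps a running count."""

    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

# Instantiate the actor on the cluster and call its method remotely
counter = Counter.remote()
print(ray.get([counter.increment.remote() for _ in range(3)]))  # [1, 2, 3]

Because the actor keeps its state between calls, the three remote calls return 1, 2 and 3.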

Built on top of Ray Core, the Ray AI Runtime (AIR) libraries offer an easy-to-use API for distributing AI workloads, well integrated with the broader AI ecosystem. In this way, Ray minimizes the complexity of distributing the most common AI/ML workloads across the entire end-to-end ML process. The figure below shows how, with the Ray AI Runtime (AIR) libraries, you can distribute data preprocessing, training, tuning and serving of an XGBoost model.

Figure 4 — Image source
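
To give a flavour of how little code the AIR libraries require, below is a minimal, hypothetical Ray Tune sketch; the objective function and search space are illustrative only and are not part of the XGBoost example shown in the figure.

from ray import tune
from ray.air import session

def objective(config):
    # Toy objective: report one score per sampled hyperparameter configuration
    session.report({"score": (config["x"] - 3) ** 2})

# Sample 20 configurations and keep the one with the lowest score
tuner = tune.Tuner(
    objective,
    param_space={"x": tune.uniform(0, 10)},
    tune_config=tune.TuneConfig(metric="score", mode="min", num_samples=20),
)
results = tuner.fit()
print(results.get_best_result().config)

The same Tuner can also wrap an AIR trainer such as XGBoostTrainer, which is how tuning composes with the distributed training shown later in this article.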

Now that you know what Ray is and how it can help you distribute your AI/ML workloads, you may wonder: why use Ray on Vertex AI?

Why Ray on Vertex AI

Although Ray makes ML distributed computing more accessible, deploying Ray to Kubernetes or VMs can get tricky quickly. Self-hosting your own Ray cluster involves setting up and managing compute, storage, network and security resources. Once deployed, you also have to set up utilities such as a notebook environment like JupyterLab and the Ray OSS dashboard for logging and monitoring.

Figure 5 — Image inspired by Ray on Vertex documentation

Ray on Vertex AI is a simpler way to get started with Ray for running your ML distributed computing workloads, without needing to become a DevOps engineer. The integration also provides access to well-established MLOps components of the Vertex AI platform, including feature stores, datasets, pipelines, the model registry and model monitoring, as well as other Google Cloud services like BigQuery and Cloud Storage.

Ray on Vertex AI also offers a friendly interface to create and configure Ray clusters using the Vertex AI Python SDK. Once the cluster is created you can then use the same open source Ray code to write programs and develop applications efficiently.

Figure 6 — Image from author

Let’s see how to create a Ray cluster and start using Ray on Vertex AI to run a simple ML workload.

Get started with Ray on Vertex AI

On Vertex AI, you can create a Ray cluster using either the Google Cloud console or the Vertex AI Python SDK.

Looking at the Vertex AI SDK, you can use a default provisioning request or specify the number of cluster nodes, machine types, and accelerators needed to provision a Ray cluster on Vertex AI. The following code shows an example of Ray cluster provisioning with the Vertex AI SDK.

import vertex_ray
from vertex_ray import Resources
from google.cloud import aiplatform as vertex_ai

vertex_ai.init(project=project_id, location=region, staging_bucket=bucket_uri)

# Define the head and worker node pools of the Ray cluster
head_node_type = Resources(
    machine_type="n1-standard-8",
    node_count=1,
)

# a2-highgpu-1g machines come with one NVIDIA A100 GPU each
worker_node_types = [Resources(
    machine_type="a2-highgpu-1g",
    node_count=2,
    accelerator_type="NVIDIA_TESLA_A100",
    accelerator_count=1,
)]

# Run a Ray cluster provisioning job
ray_cluster_resource_name = vertex_ray.create_ray_cluster(
    head_node_type=head_node_type,
    worker_node_types=worker_node_types,
    python_version='3_10',
    ray_version='2_4',
    cluster_name=your_cluster_name,
    network=your_network_name,
)

With the create_ray_cluster method, you seamlessly provision a Ray cluster with multiple worker nodes (2) and multiple GPUs (one A100 per worker, two in total). Below you can see the Ray cluster in the Ray on Vertex AI UI.

Figure 7 — Image from author

You can also use the Vertex AI Python SDK to get, list and delete Ray clusters. Here you can see how to list the clusters and get information about a specific one; a sketch of deleting a cluster follows right after the output.

ray_clusters = vertex_ray.list_ray_clusters()
ray_cluster_resource_name = ray_clusters[0].cluster_resource_name
print(ray_cluster_resource_name)

#projects/your-project/locations/your-region/persistentResources/your-ray-cluster-name

ray_cluster = vertex_ray.get_ray_cluster(ray_cluster_resource_name)

# Assuming your Ray session is connected to this cluster, you can inspect its resources
print(ray.cluster_resources())

# Ray cluster resources
# {'CPU': xxx,
#  'GPU': xxx,
#  'accelerator_type:A100': xxx,
#  ...
#  'object_store_memory': xxx}
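
And when you no longer need the cluster, you can tear it down from the same SDK. The call below is a sketch based on the SDK's delete_ray_cluster method; check the Ray on Vertex AI reference for the exact signature.

# Delete the Ray cluster on Vertex AI once you are done with it
vertex_ray.delete_ray_cluster(ray_cluster_resource_name)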

Now that you have the Ray cluster you are able to distribute any ML workload you need.

With the Ray on Vertex AI SDK, you can develop your ML code in the IDE you prefer, connect to the Ray cluster on Vertex AI, and execute the code remotely using the Ray Jobs API. Depending on your network settings, you can submit a Ray job using Python, the Ray Jobs CLI, or the public Ray dashboard address.

In the following example, you can see how to distribute a simple XGBoost training job on a Ray cluster on Vertex AI with Python.

First, you develop your ML application as a Python script in any IDE.

# libraries
import ray
from ray.runtime_env import RuntimeEnv
from ray.air.config import RunConfig
from ray.air import CheckpointConfig, ScalingConfig
from ray.tune import SyncConfig
from ray.train.xgboost import XGBoostTrainer
from ray.train.xgboost import XGBoostCheckpoint

# ray on vertex ai
from google.cloud import aiplatform as vertex_ai
import vertex_ray
from vertex_ray.predict import xgboost as vertex_xgboost

# init ray cluster
# (the packages listed in requirements.txt are installed on the cluster workers)
runtime_env = RuntimeEnv(
    pip="requirements.txt"
)
ray.init(runtime_env=runtime_env)

# xgboost config
# (experiment_run_args holds the training arguments parsed elsewhere in the script)
xgboost_config = {
    "objective": experiment_run_args.objective,
    "eval_metric": [experiment_run_args.eval_metric],
    "eta": experiment_run_args.eta,
    # ...
}

# scaling config
scaling_config = ScalingConfig(
    num_workers=2,
    use_gpu='GPU' in ray.cluster_resources(),
    resources_per_worker={
        "CPU": 4,
        "GPU": 1
    }
)

# run config
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=None
    ),
    sync_config=SyncConfig(
        upload_dir=experiment_run_args.training_dir
    ),
    name=experiment_run_id,
)

# ray trainer
# (train_dataset and valid_dataset are Ray Datasets prepared earlier in the script)
trainer = XGBoostTrainer(
    scaling_config=scaling_config,
    run_config=run_config,
    label_column="tip_bin",
    params=xgboost_config,
    datasets={"train": train_dataset, "valid": valid_dataset},
)

trainer.fit()

It is important to highlight that the only changes required to run the same code on Ray on Vertex AI are importing the aiplatform library and adding the vertex_ray:// prefix to the Ray cluster resource name.
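
For example, connecting a script to the managed cluster would look roughly like the sketch below; the resource name is the placeholder used earlier in this article, and the exact call is best checked against the Ray on Vertex AI documentation.

import ray
import vertex_ray  # importing vertex_ray enables the vertex_ray:// address scheme
from google.cloud import aiplatform as vertex_ai

# Connect to the managed Ray cluster by prefixing its resource name
ray.init(
    address="vertex_ray://projects/your-project/locations/your-region/persistentResources/your-ray-cluster-name"
)

Once you have the Python script, you initiate the job client and submit the job as below.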

import time

from google.cloud import aiplatform as vertex_ai
import ray
from ray.job_submission import JobSubmissionClient
from ray.job_submission import JobStatus

# The client points at the Ray cluster address
client = JobSubmissionClient(address=ray_cluster_address)

# Submit the training script together with its working directory
job_id = client.submit_job(
    submission_id=submission_id,
    entrypoint="python3 train.py",
    runtime_env={"working_dir": working_dir}
)

# Poll the job status until it reaches a terminal state
while True:
    job_status = client.get_job_status(job_id)
    if job_status == JobStatus.SUCCEEDED:
        print("Job succeeded!")
        break
    elif job_status == JobStatus.FAILED:
        print("Job failed!")
        break
    else:
        print("Job is running...")
        time.sleep(10)

# Job is running...
# Job is running...
# Job is running...
# ...
# Job succeeded!
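
If you prefer to stay in Python rather than switching to the dashboard, you can also pull the job's logs through the same Ray Jobs client; a minimal sketch:

# Fetch the logs of the submitted job
print(client.get_job_logs(job_id))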

And you can monitor the training job using the OSS Ray Dashboard on Ray on Vertex AI. Below you can see the task breakdown, which provides the profiling associated with the Ray training job above. To learn more about the Ray Dashboard, check out the official documentation.

Figure 8 — Image from author

Conclusion

This article briefly introduced Ray, showed how to set up a Ray cluster on Vertex AI, and ran a simple distributed ML training job using Ray on Vertex AI. By now, you know that Ray lets you scale any ML workload by providing a simple, general and unified Python distributed framework. And thanks to Ray on Vertex AI, you can get access to a Ray cluster in minutes and run your ML applications with minimal changes.

Scale AI on Ray on Vertex AI Series

This article is part of the Scale AI on Ray on Vertex AI series, where you learn how to scale your AI and Python applications using Ray on Vertex AI. Follow me, as more exciting content is coming your way.

Thanks for reading

I hope you enjoyed the article. If so, please clap or leave your comments. Also let’s connect on LinkedIn or X to share feedback and questions 🤗

Big kudos to Ed Muthiah, Karen Lin and Amy Wu for their feedback.
