Tips on installing and maintaining a Ray cluster

Published in Juniper AI/ML Team · Jan 14, 2021

Authors: Sabyasachi Mukhopadhyay, Subhabrata Banerjee, Pooja Ayanile, Divyank Garg, Ajit Patankar

Although the installation of a Ray cluster is fairly straightforward, there is still a learning curve involved in getting to a robust installation. The goal of this blog post is to facilitate that learning process.

While Ray is primarily intended to run on a cluster of multiple machines, it can also be used on a single machine; in that case it will try to use all the available cores on that machine, and one can still get some performance improvement. We suggest trying the single-machine approach before moving on to a multi-machine cluster.

Ray installation on a local machine

Ray on a local machine can be installed and tested using the following procedure:

  1. Install the Ray module using pip install ray
  2. Start Ray from the command line: ray start --head
  3. Open a Jupyter notebook or a Python shell, import the Ray library, and initialize it with ray.init()
  4. Use the libraries provided by Ray to run machine learning workloads in parallel (see the sketch after this list)
  5. Stop Ray in the Jupyter notebook or Python shell using ray.shutdown()
  6. Stop the Ray services from the command line: ray stop --force
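
As a minimal illustration of steps 3 to 5, the hypothetical snippet below initializes Ray on the local machine, runs a trivial remote function across the available cores, and then shuts Ray down (the function and argument names are ours, used only for illustration):

import ray

ray.init()  # step 3: initialize Ray on the local machine

@ray.remote
def square(x):
    # A trivial task; Ray schedules many such tasks across the available cores.
    return x * x

futures = [square.remote(i) for i in range(8)]
print(ray.get(futures))  # [0, 1, 4, 9, 16, 25, 36, 49]

ray.shutdown()  # step 5: stop Ray from the notebook or shell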

Note: there may be VPN-related connectivity issues when setting up Ray on a local machine. One example of such an error is as follows:

Error: redis.exceptions.ConnectionError: Error 60 connecting to 10.X.Y.Z:<port>. Operation timed out. This issue can be resolved by invoking Ray init as follows:

import ray.services
ray.services.get_node_ip_address = lambda: '127.0.0.1'
ray.init()

Ray installation on multiple machines

There are two methods for installing Ray on a multi-machine cluster. The first is to use the Ray Cluster Launcher on GCP, AWS, Azure, or even Kubernetes. The second is to manually connect all the virtual machines and form a Ray cluster.

Important points in this context are as follows:

  • Ray Nodes: A Ray cluster consists of a head node and a set of worker nodes. The head node needs to be started first, and the worker nodes are given the address of the head node to form the cluster.
  • Ports: Ray processes communicate via TCP ports. When starting a Ray cluster, either on premises or in the cloud, it is important to open the right ports so that Ray functions correctly. By default, the head node listens on port 6379 and the Ray dashboard on port 8265.
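
If it is unclear whether these ports are reachable from a worker VM, a quick check like the one below can help (a minimal sketch; 10.X.Y.Z is a placeholder for the head node IP, and the ports assume the defaults mentioned above):

import socket

def port_open(host, port, timeout=3):
    # Returns True if a TCP connection to host:port succeeds within the timeout.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.settimeout(timeout)
        return s.connect_ex((host, port)) == 0

for port in (6379, 8265):  # default Ray head port and dashboard port
    print(port, 'reachable' if port_open('10.X.Y.Z', port) else 'blocked')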

Ray cluster use case

In our use case we installed a Ray cluster on 5 VMs, as shown in Figure 1. To start this cluster, Ray needs to be installed on all the VMs so that they can communicate with each other. While this may seem trivial, in practice it requires automation scripts if the number of machines in the cluster is large or dynamic. It is also important to make sure that the Ray version is the same on all the machines; otherwise a version mismatch error is raised and the worker cannot connect to the head node. Likewise, the other Python libraries used by the tasks should be present on all the VMs so that the worker nodes can perform their work.

To address this problem, the best practice is to create a virtual environment on the head node VM, install all the required libraries in it, and save a requirements.txt file. That environment can then be recreated on the other VMs from the requirements file, so the same versions of all libraries are available on all the machines.
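
One way to confirm that the replication worked is to check the Ray version on every node before starting the cluster. The snippet below is a hypothetical helper; the host list and the use of passwordless SSH are our assumptions, not part of the original setup:

import subprocess

HOSTS = ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4', '10.0.0.5']  # assumed VM addresses

def remote_ray_version(host):
    # Runs a one-line Python command on the remote VM over SSH and returns its output.
    cmd = ['ssh', host, 'python -c "import ray; print(ray.__version__)"']
    return subprocess.check_output(cmd, text=True).strip()

versions = {host: remote_ray_version(host) for host in HOSTS}
print(versions)
assert len(set(versions.values())) == 1, 'Ray versions differ across the cluster'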

Figure 1: Sample Ray Cluster

Steps for Ray Multi-Machine Cluster Installation

  1. Install the same Python version on all the VMs in the cluster
  2. Install the Ray module and the other packages at the same versions in each VM's virtual environment by using the requirements file
  3. Turn off the firewall on the head node first, and then start the Ray services. The firewall needs to be stopped on each VM before starting Ray, otherwise the worker nodes fail to connect because of the firewall: systemctl stop firewalld
  4. Start Ray as the head node on one of the VMs; Ray prints the address and port it started on along with a Redis password: ray start --head
  5. Join the other worker nodes to the cluster with ray start --address='<head node ip>:<head port>' --redis-password='<pwd>' (see the verification sketch after this list)
  6. To view the Ray Dashboard from your local machine, set up SSH tunneling to the head node: ssh -N -f -L localhost:8265:localhost:8265 <user_name>@<head_node_ip>
  7. To stop Ray on the worker nodes or the head node, use the following command to stop the Ray services: ray stop --force
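
Once the workers have joined, it is worth verifying from the head node that the cluster sees all of them. The snippet below is a rough sketch assuming a Ray 1.x-style driver; the password placeholder mirrors the one printed by ray start --head:

import ray

# Connect the driver to the running cluster (run this on the head node).
# address='auto' picks up the locally started head; replace '<pwd>' with the
# Redis password printed by "ray start --head".
ray.init(address='auto', _redis_password='<pwd>')

print(ray.nodes())              # one entry per head/worker node that has joined
print(ray.cluster_resources())  # total CPUs/GPUs/memory visible to the cluster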

Conclusions

In this post we have presented the gist of installing and configuring a Ray cluster in a private cloud environment. This approach provides a good path towards a larger cluster in a public cloud. Important lessons in this process are as follows:

  • Automation scripts to create consistent environments across all nodes are important.
  • Network security policies can cause issues during installation and in production.

In subsequent blogs, we will discuss various use cases, such as migrating legacy Python code to Ray actors and remote functions to take advantage of parallel computation. We will also address important aspects of model training such as model selection and distributed training. A distinguishing feature of our analysis is performance benchmarking for each use case.

Reference to our previous blog:

Ray: Distributed Computing Framework for AI/ML applications (link: https://medium.com/juniper-team/ray-distributed-computing-framework-for-ai-ml-applications-4b40617be4a3)
