A Step-by-Step Guide to Scaling Your First Python Application in the Cloud
This blog post shows you how easy it is to get started writing distributed Python applications. Normally, before tackling a project like this, you must learn a lot about how to deploy, manage, and scale distributed applications. This post will be much simpler, because Ray makes it easy to write, manage, and scale distributed applications.
While you might want to learn more about Ray from the documentation, the TL;DR is that Ray is a fast and simple framework for distributed applications in Python.
Ray can replace standard Python multiprocessing with a single line of code, allow you to easily scale up scikit-learn, or let you build your own parameter server for distributed machine learning. You might also be interested in some of Ray’s core libraries: Ray Tune for scalable hyperparameter tuning, RaySGD for distributed training, and RLlib for scalable reinforcement learning.
Whatever the case, this post will show you how to build and deploy a simple distributed Python (Ray) application on your laptop and in the cloud. Over the course of this post, you’ll learn how to:
- build a simple Ray application (it’s just a Python script),
- run that application on your laptop or desktop,
- start a Ray cluster in the cloud, and
- run your Ray application in the cloud.
Let’s go ahead and get started with the simplest Ray app, just to kick the tires.
A Simple Ray app
Go ahead and create a new folder on your laptop and let’s get started. You’ll need to install Ray as well.
pip install ray
With that complete, let’s write a simple app that adds 1 to whatever value we pass to our function. Save the following code in a file named step_1.py.
This simple Ray application is just a Python script. The app has a single function that sleeps for one second while adding 1 to the input value. You’ll notice that this function has a Python decorator, ray.remote. This decorator marks the function as a Ray task, which enables this arbitrary Python function to be executed asynchronously.
Running on our local machine
Running this application is simple; it’s just like running any other Python script on our local machine.
wget https://gist.githubusercontent.com/anabranch/60b62279477ee741de32e263b9c7ac67/raw/2cc5f13cbe6ff3cb8cfb5a1320177f7850468ad7/step_1.py
python step_1.py
Here’s the output when I run this script (excluding the base Ray logging).
Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 5.235158
Normally, this Python function would run serially. With Ray, however, all of these tasks run in parallel (because the ray.remote decorator makes the function a Ray task). Now that you’ve got that up and working, let’s move on to the cloud version!
To learn how to write more sophisticated Ray applications, please see the documentation. In this example we’re going to focus on deployment rather than on application code.
Running on the Cloud
What makes Ray so powerful is that it makes it dead simple to write programs that scale from a single machine to many machines, without any distributed systems expertise. We’ll run on AWS for this example, but Ray can launch clusters on AWS, Azure, GCP, and Kubernetes. We’ll assume that you already have boto3 installed and AWS CLI credentials set up (here’s a quick tutorial if you don’t).
To run on the cloud, we must specify some important details. Luckily, Ray keeps this very lightweight and doesn’t require too much from us. The resources we are going to specify are:
- the name of our cluster,
- the number of workers we’ll be running,
- the cloud provider, and
- any setup commands that should run on the node upon launch.
Cluster YAMLs
To run your Ray cluster, you must specify the resource requirements in a cluster.yaml file. The file doesn’t need to be named cluster.yaml, but that’s the convention for Ray applications.
Here’s our “scaffold” cluster specification. This is just about the minimum that you’ll need to specify to launch your cluster. For more information about cluster YAML files, see a large example here.
Here is our cluster YAML file.
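The actual file is in the gist we download below. A minimal sketch of what it plausibly contains follows; the field names are standard Ray autoscaler keys, but the region and setup commands here are illustrative assumptions, not copied from the gist:

```yaml
# cluster.yaml -- a minimal, illustrative Ray cluster spec.
cluster_name: basic-ray

# 0 means no dedicated worker nodes; tasks still run on the head node.
max_workers: 0

provider:
    type: aws
    region: us-west-2   # assumed region for illustration

# Arbitrary shell commands run on each node after launch.
setup_commands:
    - pip install -U ray
    - echo "hello from setup" > /tmp/some_file.txt
```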
Notice that setup_commands allows us to specify arbitrary setup commands. This helps you make sure all the nodes in your cluster are configured correctly.
Starting the cluster
Let’s start our cluster! We could actually start the cluster and submit a command (or kick off a training run) in a single step, but we’ll do things the slower way to explain each distinct step.
The first thing to do is start up our cluster. Note: this will use AWS resources, and you will be charged for them when you run this command.
wget https://gist.githubusercontent.com/anabranch/818102592419a8c752f0ef2e96f0be74/raw/377d747fd0c9fdc0f1903305dcd66cdafbf86257/cluster.yaml
ray up cluster.yaml
This will start the machines in the cloud, install your dependencies, run any setup commands that you have, configure the Ray cluster automatically, and prepare you to scale your distributed system! This example shows simple, automatic cluster configuration; you can also configure clusters manually.
Updating a cluster
In the above example, you can see that we specified max_workers as 0. Ray will still run tasks on the head node, but having no dedicated worker nodes isn’t necessarily ideal. We can resolve this by simply changing the value to 1 and executing ray up cluster.yaml again to update the cluster. You can update most settings in the cluster YAML file this way. However, you shouldn’t change the cloud provider specification (e.g., from AWS to GCP on a running cluster) or the cluster name (as this will just start a new cluster and orphan the original one).
Running Commands
After ray up finishes running, it’s time to actually run the application on the cluster! Sometimes you might want to run arbitrary shell commands on the head node, check the result of some process, or just see what’s going on in the cluster. We do that with the exec command.
ray exec cluster.yaml 'echo "hello world"'
Running Scripts
Additionally, we can run scripts with ray submit. This command allows us to submit scripts to run on our cluster, and we’ll be able to see their output as well. Let’s do that with our current script.
Ray submit
Now we can submit our script to the cluster.
ray submit cluster.yaml step_1.py
…
Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 7.94613
Looking at the elapsed time, we can see that this runs slower than on our local machine… by several seconds! The reason is that we haven’t actually told the script to run on the cluster. To do that, we need to make one simple code change: when we initialize Ray, we must connect to the running Ray cluster. To do so, change ray.init() to ray.init(address='auto'). Now if we run it again, the time ends up comparable to my local machine, because the number of CPUs on my local machine matches that of the cloud instance.
ray submit cluster.yaml step_1.py
…
Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 5.020565
You’ll see variations in performance here depending on your instance type and your local machine. To see a noticeable difference, just scale up the resources (or the work).
Syncing Files to the Cluster
Now, more often than not, your script is going to reference other files. Let’s refactor our awesome script to be more modular (prepare yourself for a contrived example).
Let’s define a file named utils.py with the following contents.
Now let’s “refactor” our script into a new file, step_2.py, with the following contents.
Now, we can run that on our local machine. Here are the results:
wget https://gist.githubusercontent.com/anabranch/827fb6415d83472bfe60007992a57843/raw/08aa66c6ca56e95a58b7eb376a03c83c69216d9f/utils.py
wget https://gist.githubusercontent.com/anabranch/1c90a3dc4ba5a4b438f1848f0dba0b18/raw/04b2462ed4fef7593968490734be0c94b578d8b9/step_2.py
python step_2.py
Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 0.200679
Now let’s change ray.init() to ray.init(address='auto') again and try running it with ray submit.
ray submit cluster.yaml step_2.py
…
Traceback (most recent call last):
  File "/home/ubuntu/step_2.py", line 5, in <module>
    from utils import adder, timer
ModuleNotFoundError: No module named 'utils'
…
*GASP* it doesn’t work! That’s because Ray doesn’t know where to find the utils module on the cluster. The fix is quite simple: we just need to use ray rsync-up to copy our files onto the cluster. The Ray CLI provides this ability out of the box. We specify the cluster, the local file (or directory), and the remote file (or directory).
ray rsync-up cluster.yaml utils.py utils.py
This places the files in the remote directory on the head node of the cluster (if you sync a directory, you can’t exclude individual files in the current version). You can add the -A flag to rsync-up to every node in the cluster. That isn’t necessary in our example, because Ray will serialize our function and ship it around the cluster, but you might need it with more complex libraries. In general, I’d recommend using a distributed file system or file mounts if possible.
After we run the above command, ray submit will work just fine.
ray submit cluster.yaml step_2.py
…
Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 0.002071
We’re not limited to syncing files up to the cluster; we can sync them down just as easily. For instance, you might have noticed that our setup_commands put a file in /tmp. Let’s rsync that file down.
ray rsync-down cluster.yaml /tmp/some_file.txt some_file.txt
Using SSH to log into the cluster
Lastly, you might want to log into the cluster and poke around. We do so by running the following:
ray attach cluster.yaml
Now you’re logged into the machine and you should see the following (or something similar) when you list the files.
$ ls
src step_2.py step_1.py utils.py ...
Shutting down the Cluster
Now that we’ve taken a tour of some of the ways you can run commands on the cluster, let’s go ahead and tear it down. To do so, simply run:
ray down cluster.yaml
This will stop the nodes in the cluster but won’t fully terminate them, which is great if you plan on picking your work back up a bit later. If you don’t want to keep the nodes around for the long term, you’ll need to explicitly terminate them in the AWS console or via the AWS CLI.
Conclusion
This post provided a basic tour of running Ray applications. We ran them on our local machine, on the cluster, and did so with a single file as well as multiple files. In a future post, we’ll cover some of the more advanced ways of running commands and setting up our projects.
To see the other things that you can do with a cluster, simply run ray --help!
If you have any questions, feedback, or suggestions, please join our community through Discourse or Slack. If you would like to see how Ray is being used throughout industry, consider joining us at Ray Summit. If you’re interested in working on Ray and helping to shape the future of distributed computing, we’re hiring at Anyscale!