A Step-by-Step Guide to Scaling Your First Python Application in the Cloud

Bill Chambers
Distributed Computing with Ray
8 min read · May 6, 2020
Sunset near O’Hare Airport (© Dean Wampler, 2010)

This blog post shows you how easy it is to get started writing distributed Python applications. Normally, to follow a post like this, you would first need to learn a lot about how to deploy, manage, and scale distributed applications. This post is much simpler, because Ray makes it easy to write, manage, and scale distributed applications.

While you might want to learn more about Ray from the documentation, the TL;DR is that Ray is a fast and simple framework for distributed applications in Python.

Ray can replace standard Python multiprocessing with a single line of code, let you easily scale up scikit-learn, or build your own parameter server for distributed machine learning. You might also be interested in some of Ray’s core libraries: RayTune for scalable hyperparameter tuning, RaySGD for distributed training, and RLlib for scalable reinforcement learning.

The Ray Ecosystem

Whatever the case, this post will show you how to build and deploy a simple distributed Python (Ray) application on your laptop and in the cloud. Over the course of this post, you’ll learn how to:

  • build a simple Ray application (it’s just a Python script),
  • run that application on your laptop or desktop,
  • start a Ray cluster in the cloud, and
  • run your Ray application in the cloud.

Let’s go ahead and get started with the simplest Ray app, just to kick the tires.

A Simple Ray app

Go ahead and open a new folder on your laptop and let’s get started. You’ll need to install Ray as well.

pip install ray

With that complete, let’s write a simple app that adds 1 to whatever value we pass to our function. Put the following code in a file named step_1.py.

Our simple Ray application is just a Python script. It has a single function that sleeps for one second while adding one to the input value. Note that the function carries a Python decorator, ray.remote. This marks the function as a Ray task, which allows this arbitrary Python function to be executed asynchronously.

Running on our local machine

Running this application is simple: on our local machine, it runs just like any other Python script.

wget https://gist.githubusercontent.com/anabranch/60b62279477ee741de32e263b9c7ac67/raw/2cc5f13cbe6ff3cb8cfb5a1320177f7850468ad7/step_1.py
python step_1.py

Here’s the output when I run this script (excluding the base Ray logging).

Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 5.235158

Normally, these function calls would run serially. However, because we made the function a ray.remote task, Ray runs all of the tasks in parallel. Now that you’ve got that up and working, let’s move on to the cloud version!

To learn more about sophisticated Ray applications, please see the documentation. In this example we’re going to focus on deployment rather than on writing more sophisticated applications.

Running on the Cloud

What makes Ray so powerful is that it makes it dead simple to write programs that scale from a single machine to many machines, without any distributed systems expertise. We’ll run on AWS for this example, but Ray can launch clusters on AWS, Azure, GCP, and Kubernetes. We’ll assume that you already have boto3 installed and AWS CLI credentials set up (here’s a quick tutorial if you don’t).

To run on the cloud, we must specify some important details. Luckily, Ray keeps this very lightweight and doesn’t require too much from us. The resources we are going to specify are:

  • the name of our cluster,
  • the number of workers we’ll be running,
  • the cloud provider, and
  • any setup commands that should run on the node upon launch.

Cluster YAMLs

To run your Ray cluster, you specify its resource requirements in a cluster.yaml file. The file doesn’t have to be named cluster.yaml, but that’s the convention for Ray applications.

Here’s our “scaffold” cluster specification. This is just about the minimum you’ll need to specify to launch your cluster. For more information about cluster YAML files, see the large example in the Ray documentation.
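The embedded spec isn’t reproduced in this text; a minimal sketch might look like the following (the cluster name, region, and exact setup commands are illustrative assumptions; the original lives in the gist fetched in the next section):

```yaml
# cluster.yaml (a sketch; values here are assumptions)
cluster_name: simple-cluster   # changing this later starts a brand new cluster

max_workers: 0                 # we'll bump this to 1 later with `ray up`

provider:
    type: aws
    region: us-west-2

setup_commands:
    # Arbitrary shell commands run on each node at launch.
    - pip install -U ray
    - echo "hello from setup" > /tmp/some_file.txt
```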

Notice that setup_commands allows us to specify arbitrary setup commands. This can help you make sure that you’re configuring all the nodes in your cluster correctly.

Starting the cluster

Let’s start our cluster! We could actually start the cluster and submit a command or a training run in a single step, but we’ll do things the slower way so each distinct step is clear.

The first thing to do is start up our cluster. Note that this command launches AWS resources, and you will be charged for them.

wget https://gist.githubusercontent.com/anabranch/818102592419a8c752f0ef2e96f0be74/raw/377d747fd0c9fdc0f1903305dcd66cdafbf86257/cluster.yaml
ray up cluster.yaml

This will start up the machines in the cloud, install your dependencies, run any setup commands you specified, configure the Ray cluster automatically, and prepare you to scale your distributed system. This example shows simple, automatic cluster configuration; you can also configure clusters manually.

Updating a cluster

In the above example, you can see that we specified max_workers to be 0. Work will still run on the head node of the cluster, but this isn’t necessarily ideal. We can resolve this by simply changing the value to 1 and executing ray up cluster.yaml again to update the cluster. You can update most settings in the cluster YAML file this way. However, you shouldn’t change the cloud provider specification (e.g., switch from AWS to GCP on a running cluster) or the cluster name (as that will just start a new cluster and orphan the original one).

Running Commands

After ray up finishes, it’s time to actually run our application on the cluster! Sometimes you might want to run arbitrary shell commands on the head node, check the result of some process, or just see what’s going on in the cluster. We do that with the exec command.

ray exec cluster.yaml 'echo "hello world"'

Running Scripts

Additionally, we can run whole scripts on the cluster with ray submit. We can do that with our current script, and we’ll see its output as well.

Ray submit

Now we can submit our script to the cluster.

ray submit cluster.yaml step_1.py

Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 7.94613

Now looking at the elapsed time, we can see that this is running slower than on our local machine… by several seconds! The reason is that we haven’t actually told the script to run on the cluster. To do that, we have to make a simple code change: when we initialize Ray, we must connect to the Ray cluster. To do so, change ray.init() to ray.init(address='auto'). Now if we run it again, the time ends up comparable to my local machine, because the number of CPUs on my local machine matches the cloud instance.

ray submit cluster.yaml step_1.py

Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 5.020565

You’ll see variations in performance here depending on your instance types as well as your local machine. Scale up your resources (or the amount of work) and the difference becomes noticeable.

Syncing Files to the Cluster

Now, more often than not, you’re going to want a script that references other files. Let’s refactor our awesome script to be more modular (prepare yourself for a contrived example).

Let’s define a file named utils.py with the following contents.
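The gist contents aren’t reproduced in this text; a sketch of utils.py might look like this (adder and timer are the names step_2.py imports; the function bodies here are assumptions):

```python
# utils.py (a sketch; the real version is in the gist fetched below)
import time

def adder(value):
    # Add one to the input. No sleep this time, which is why the
    # elapsed times below are so much lower than before.
    return value + 1

def timer(start_time):
    # Seconds elapsed since start_time.
    return time.time() - start_time
```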

Now let’s “refactor” our script to become a new file step_2.py that has the following contents.

Now, we can run that on our local machine. Here are the results:

wget https://gist.githubusercontent.com/anabranch/827fb6415d83472bfe60007992a57843/raw/08aa66c6ca56e95a58b7eb376a03c83c69216d9f/utils.py
wget https://gist.githubusercontent.com/anabranch/1c90a3dc4ba5a4b438f1848f0dba0b18/raw/04b2462ed4fef7593968490734be0c94b578d8b9/step_2.py
python step_2.py
Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 0.200679

Now let’s change ray.init() to become ray.init(address='auto') again and try to run it with Ray submit.

ray submit cluster.yaml step_2.py

Traceback (most recent call last):
  File "/home/ubuntu/step_2.py", line 5, in <module>
    from utils import adder, timer
ModuleNotFoundError: No module named 'utils'

*GASP* it doesn’t work!!! That’s because Ray doesn’t know where to look for the utils module. The fix is quite simple: we just need to use rsync-up to get our files onto the cluster, an ability the Ray CLI provides out of the box. To do so, we specify the cluster config, the local file (or directory), and the remote file (or directory).

ray rsync-up cluster.yaml utils.py utils.py

This places all the files in the local directory (you can’t exclude individual files in the current version) into the remote directory on the head node of the cluster. You can add the -A flag to rsync-up to all nodes in the cluster. That isn’t necessary in our current example, because Ray will serialize our function around the cluster, but you might need it with more complex libraries. In general, I’d recommend using a distributed file system or file mounts if possible.

After we run the above command, ray submit will work just fine.

ray submit cluster.yaml step_2.py

Return Value: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Elapsed Time: 0.002071

I’m not limited to syncing files up to the cluster; I can sync them down just as easily. For instance, you might have noticed that one of our setup_commands put a file in /tmp. Let’s rsync that file down.

ray rsync-down cluster.yaml /tmp/some_file.txt some_file.txt

Using SSH to log into the cluster

Lastly, you might want to log into the cluster and poke around. We do so by running the following:

ray attach cluster.yaml

Now you’re logged into the machine and you should see the following (or something similar) when you list the files.

$ ls
src step_2.py step_1.py utils.py ...

Shutting down the Cluster

Now that we’ve toured some of the ways you can run commands on the cluster, let’s go ahead and tear it down. To do so, simply run

ray down cluster.yaml

At this point, this will stop the nodes in the cluster but won’t fully tear them down, which is great if you plan on picking your work back up a bit later. If you don’t want to keep the nodes around for the long term, you’ll need to terminate them explicitly, either in the AWS console or via the AWS CLI.

Conclusion

This post provided a basic tour of running Ray applications. We ran them on our local machine, on the cluster, and did so with a single file as well as multiple files. In a future post, we’ll cover some of the more advanced ways of running commands and setting up our projects.

To see the other things that you can do to a cluster, simply run ray --help!

If you have any questions, feedback, or suggestions, please join our community through Discourse or Slack. If you would like to see how Ray is being used throughout industry, consider joining us at Ray Summit. If you’re interested in working on Ray and helping to shape the future of distributed computing, we’re hiring at Anyscale!
