How to Scale Python on Every Major Cloud Provider
Best practices for cloud computing with Python
In the words of Pywren creator Eric Jonas, “The cloud is too damn hard!”
Once you’ve developed a Python application on your laptop and want to scale it up in the cloud (perhaps with more data or more GPUs), the next steps are unclear, and unless you have an infrastructure team that’s already set it up for you, “Just use Kubernetes” is not so simple.
So you’ll choose one of AWS, GCP, or Azure and navigate over to the management console to sift through instance types, security groups, spot prices, availability zones, instance limits, and so on. Once you’ve got all of that sorted out and managed to rent some virtual machines, you’ve still got to figure out how to actually run your application on them. How exactly does my Python script divide up the work across a 10-machine cluster? At this point, you may try to rewrite your application using PySpark, mpi4py, or Celery.
If that fails, you’ll build a brand new distributed system, as Uber and Lyft recently did with Fiber and Flyte. Whichever path you take, you’re rewriting your application, building a new distributed system, or both, all to speed up a Python script in the cloud.
We’re going to walk through how to use Ray to launch clusters and scale Python on any of these cloud providers with just a few commands.
The workflow breaks down into three steps. Step 1 is to develop your Python application. This usually happens on your laptop, but you can do it in the cloud if you prefer. Step 2 is to launch a cluster on your cloud provider of choice. Step 3 is to run your application wherever you want (on your laptop or in the cloud).
Setup
First, install some Python dependencies.
pip install -U ray \
    boto3 \
    azure-cli \
    azure-core \
    google-api-python-client
Next, configure credentials for the cloud provider of your choice. You can skip this step if you’re already set up to use your cloud provider from the command line.
- AWS: Configure your credentials in `~/.aws/credentials` as described in the boto docs.
- Azure: Log in using `az login`, then configure your credentials with `az account set -s <subscription_id>`.
- GCP: Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable as described in the docs.
The Ray cluster launcher is driven by a YAML config file. In it you can specify everything: setup commands to run, instance types to use, files to rsync, autoscaling strategies, and so on.
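For illustration, a minimal AWS config might look something like the following sketch. The field names follow the Ray cluster launcher schema, but the specific values (cluster name, instance type, region) are placeholders, not recommendations.

```yaml
# Hypothetical minimal cluster config; values are illustrative.
cluster_name: basic-ray
min_workers: 2      # keep two worker nodes running
max_workers: 2

provider:
    type: aws
    region: us-west-2

auth:
    ssh_user: ubuntu

head_node:
    InstanceType: m5.large
worker_nodes:
    InstanceType: m5.large

# Commands run on each node after it launches.
setup_commands:
    - pip install -U ray
```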
You can fetch complete, ready-to-use example configs, with additional configurable fields, as follows.
# AWS
wget https://gist.githubusercontent.com/robertnishihara/a9ce1d0d434cd52dd9b07eb57f4c3476/raw/dafd4c6bd26fe4599dc3c6b05e80789188b3e2e5/aws-config.yaml
# Azure
wget https://gist.githubusercontent.com/robertnishihara/a9ce1d0d434cd52dd9b07eb57f4c3476/raw/dafd4c6bd26fe4599dc3c6b05e80789188b3e2e5/azure-config.yaml
# GCP
wget https://gist.githubusercontent.com/robertnishihara/a9ce1d0d434cd52dd9b07eb57f4c3476/raw/1846cbf971d1cd708b3d29d9ae50ad882fbaac50/gcp-config.yaml
You’ll need to make a few minor modifications to the above config files:
- Azure: Replace `~/.ssh/id_rsa` and `~/.ssh/id_rsa.pub` with the appropriate key files.
- GCP: Set the appropriate `project_id`.
Step 1: Create a Python Application
Define a simple Python script that we want to scale up: it launches many small tasks and counts how many run on each machine in the cluster.
You can fetch and run a slightly more complete version of the example as follows.
# Fetch the example.
wget https://gist.githubusercontent.com/robertnishihara/a9ce1d0d434cd52dd9b07eb57f4c3476/raw/4313660c0bd40f8bd909f70c1e0abc4be8584198/script.py

# Run the script.
python script.py
You’ll see the following output.
This cluster consists of
1 nodes in total
16.0 CPU resources in total
Tasks executed
10000 tasks on XXX.XXX.X.XXX
Now run the same script on a Ray cluster. These instructions are for AWS; to use Azure or GCP, simply replace aws-config.yaml with azure-config.yaml or gcp-config.yaml in both of the commands below.
Step 2: Start the Cluster
# AWS
ray up -y aws-config.yaml
# Azure
ray up -y azure-config.yaml
# GCP
ray up -y gcp-config.yaml
Step 3: Run the Script on the Cluster
# AWS
ray submit aws-config.yaml script.py
# Azure
ray submit azure-config.yaml script.py
# GCP
ray submit gcp-config.yaml script.py
You’ll see the following output.
This cluster consists of
3 nodes in total
6.0 CPU resources in total
Tasks executed
3561 tasks on XXX.XXX.X.XXX
2685 tasks on XXX.XXX.X.XXX
3754 tasks on XXX.XXX.X.XXX
If it says there is only 1 node, then you may need to wait a little longer for the other nodes to start up.
In this case, the 10,000 tasks were run across three machines.
You can also connect to the cluster and poke around with one of the following commands.
# Connect to the cluster (via ssh).

# AWS
ray attach aws-config.yaml
# Azure
ray attach azure-config.yaml
# GCP
ray attach gcp-config.yaml
Shut Down the Cluster
Don’t forget to shut down your cluster when you’re done!
# AWS
ray down -y aws-config.yaml
# Azure
ray down -y azure-config.yaml
# GCP
ray down -y gcp-config.yaml
Further Details
Want to add support for a new cloud provider? Just extend the NodeProvider class, which usually takes 200–300 lines of code. Take a look at the implementation for AWS.
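To give a feel for the interface, here is a toy in-memory stand-in, not a real backend and not subclassing the actual `ray.autoscaler.node_provider.NodeProvider` base class. The method names mirror the provider interface; the bodies just track nodes in a dict so the shape of each operation is clear.

```python
import uuid

class InMemoryNodeProvider:
    """Toy node provider: 'launches' nodes as dict entries.

    A real provider would also implement methods such as external_ip
    and internal_ip, and would call the cloud's API in each method.
    """

    def __init__(self, provider_config, cluster_name):
        self.provider_config = provider_config
        self.cluster_name = cluster_name
        self._nodes = {}  # node_id -> {"tags": {...}, "running": bool}

    def create_node(self, node_config, tags, count):
        """Launch `count` nodes described by node_config, labeled with tags."""
        for _ in range(count):
            node_id = uuid.uuid4().hex[:8]
            self._nodes[node_id] = {"tags": dict(tags), "running": True}

    def non_terminated_nodes(self, tag_filters):
        """Return IDs of running nodes whose tags match all tag_filters."""
        return [
            node_id for node_id, node in self._nodes.items()
            if node["running"]
            and all(node["tags"].get(k) == v for k, v in tag_filters.items())
        ]

    def node_tags(self, node_id):
        """Return the tag dict for a node."""
        return self._nodes[node_id]["tags"]

    def terminate_node(self, node_id):
        """Shut a node down."""
        self._nodes[node_id]["running"] = False
```

The autoscaler drives a provider through exactly this kind of loop: create nodes, list the non-terminated ones matching certain tags, and terminate the ones it no longer needs.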
- Learn more about the Ray cluster launcher.
- Learn more about Ray from the documentation, GitHub, and Twitter.
- Learn more about how to scale machine learning training, hyperparameter tuning, model serving, and reinforcement learning.
If you have any questions, feedback, or suggestions, please join our community through Discourse or Slack! If you would like to see how Ray is being used throughout industry, consider joining us at Ray Summit!