Creating Transient Clusters with Cloudera Director
Cloudera Director is a great tool for spinning up CDH clusters in AWS/GCE, and one that I use almost daily, as I’m sure others do as well. It does a good job of automating resource provisioning, service deployment, and management of the clusters in its purview, but I have yet to see a really good end-to-end demonstration of how it could be used to spin up transient/ephemeral clusters for batch processes: jobs where you need cluster resources only for a particular task, after which you can copy the result somewhere persistent (S3, for example) and then destroy the instances.
Using the Cloudera Director Python client, and borrowing from, modifying, and adding to some of the example code available on GitHub, I have created an example workflow that does the following:
- Spins up a cluster with 1 master, 1 gateway, and the number of data/Spark nodes specified in a configuration file
- Copies your JAR file (the job) to the home directory of the user that created the cluster, once the cluster is up
- Creates an HDFS directory at /user/username, owned by that user
- Executes the job using spark-submit
- Runs a postscript (in my example, one that copies the entire user directory/output to S3)
- Deletes the cluster that was created
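To make the on-cluster steps in the list above more concrete, here is a minimal sketch of the shell commands they boil down to. It is not the code from my script. The function name `build_job_commands`, the `ec2-user` username in the usage note, and the use of `sudo -u hdfs` for the HDFS steps are assumptions made for illustration.

```python
import shlex


def build_job_commands(username, jar, jar_class, job_args, postscript):
    """Return the shell commands for the on-cluster steps, in order.

    Hypothetical helper: it only builds command strings, it does not
    run anything or talk to Cloudera Director.
    """
    home = "/user/{}".format(username)
    quoted_args = " ".join(shlex.quote(a) for a in shlex.split(job_args))
    return [
        # Create the user's HDFS home directory (assumes an 'hdfs' superuser).
        "sudo -u hdfs hdfs dfs -mkdir -p {}".format(shlex.quote(home)),
        # Hand ownership of that directory to the user.
        "sudo -u hdfs hdfs dfs -chown {} {}".format(
            shlex.quote(username), shlex.quote(home)
        ),
        # Submit the Spark job with its main class and arguments.
        "spark-submit --class {} {} {}".format(
            shlex.quote(jar_class), shlex.quote(jar), quoted_args
        ).rstrip(),
        # Run the postscript, for example to copy results to S3.
        "bash {}".format(shlex.quote(postscript)),
    ]
```

Calling `build_job_commands("ec2-user", "spark-pi.jar", "org.apache.spark.examples.SparkPi", "100 10", "copy_results.sh")` returns four command strings that could then be run on the gateway node over SSH.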
The entire process takes about 20 minutes. Once you’ve configured the config.ini file with the required information and populated the postscript (copy_results.sh) with your AWS/S3 information, you can run a Spark job (.jar) and copy the results to S3 with a command like this:
python ephemeral-spark-submit.py --admin-username admin --admin-password password --server http://ec2-yourdirectorinstance.compute-1.amazonaws.com:7189 --cm CM01 --environment CDH5_5 --jar spark-pi.jar --jarclass org.apache.spark.examples.SparkPi --args "100 10" --script copy_results.sh cluster.ini
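For reference, the flags in the command above could be parsed with something like the following. This is a sketch based only on that command line, not the actual parser from the script. The help strings, defaults, and which flags are required are my assumptions, and `build_parser` and `config_file` are hypothetical names.

```python
import argparse


def build_parser():
    """Build a parser for the flags shown in the example command."""
    parser = argparse.ArgumentParser(
        description="Run a Spark job on a transient cluster (illustrative sketch).",
        allow_abbrev=False,  # keep --jar and --jarclass strictly separate
    )
    parser.add_argument("--admin-username", default="admin",
                        help="Cloudera Director admin user")
    parser.add_argument("--admin-password", default="admin",
                        help="Cloudera Director admin password")
    parser.add_argument("--server", required=True,
                        help="Cloudera Director URL, including the port")
    parser.add_argument("--cm", required=True,
                        help="name of the Cloudera Manager deployment")
    parser.add_argument("--environment", required=True,
                        help="name of the Director environment")
    parser.add_argument("--jar", required=True,
                        help="path to the Spark job JAR")
    parser.add_argument("--jarclass", required=True,
                        help="main class inside the JAR")
    parser.add_argument("--args", default="",
                        help="quoted string of arguments passed to the job")
    parser.add_argument("--script",
                        help="postscript to run after the job")
    parser.add_argument("config_file",
                        help="cluster configuration file (.ini)")
    return parser


if __name__ == "__main__":
    print(build_parser().parse_args())
```

Quoting the value of `--args` keeps the job arguments together as one string, which the script can split later before handing them to spark-submit.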
The code/script could also easily be modified to include a pre-script that, for example, copies files to the cluster before running the JAR.
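As a rough idea of what that modification might look like, the sketch below runs an optional pre-script, then the job, then an optional postscript. The function `run_pipeline` and the `copy_inputs.sh` pre-script are hypothetical, and the sketch runs commands locally with `subprocess` rather than on the cluster, so treat it as an outline of the ordering only.

```python
import subprocess


def run_pipeline(job_cmd, pre_script=None, post_script=None,
                 run=subprocess.check_call):
    """Run an optional pre-script, the job, then an optional postscript.

    `run` is injectable so the ordering can be checked without executing
    anything; by default each step runs through subprocess.check_call,
    which raises CalledProcessError on a non-zero exit and stops the rest.
    """
    steps = []
    if pre_script:
        steps.append(["bash", pre_script])   # e.g. copy input files first
    steps.append(list(job_cmd))              # the spark-submit invocation
    if post_script:
        steps.append(["bash", post_script])  # e.g. copy results to S3
    for step in steps:
        run(step)
    return steps
```

For example, `run_pipeline(["spark-submit", "spark-pi.jar"], pre_script="copy_inputs.sh", post_script="copy_results.sh")` would run the three steps in that order and stop at the first one that fails.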
You can download the code here:
Also check out the Cloudera Director Python client here: