A Flexible PySpark Job (Spark Job in Python) Script Template
Sep 9, 2016
I rarely write Spark jobs in Scala unless I'm forced to by some configuration limitation in the Spark cluster.
This is a concise Spark job template in Python that I distilled after writing many inflexible ones. The idea behind the template is that it should (a minimal sketch follows the list):
- Support multiple functions in a single script, and run them either in sequence (like a pipeline) or individually
- Handle arguments passed through spark-submit
- Support run-time tracking (a crude implementation)
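Here is a minimal sketch of what such a template might look like, assuming the Spark 2.x SparkSession API. The task names (count_rows, show_sample), the "all" pipeline keyword, and the app name are hypothetical placeholders for your own functions:

```python
import sys
import time

from pyspark.sql import SparkSession


def count_rows(spark, args):
    """Hypothetical task: count lines in a text file (args[0] = input path)."""
    df = spark.read.text(args[0])
    print("row count: %d" % df.count())


def show_sample(spark, args):
    """Hypothetical task: print a few rows from the same input."""
    df = spark.read.text(args[0])
    df.show(5, truncate=False)


# Registry of runnable tasks; the special name "all" runs the whole pipeline.
TASKS = {
    "count_rows": count_rows,
    "show_sample": show_sample,
}
PIPELINE = ["count_rows", "show_sample"]


def run_task(spark, name, args):
    """Run one task with crude wall-clock timing."""
    start = time.time()
    TASKS[name](spark, args)
    print("task '%s' finished in %.1f s" % (name, time.time() - start))


if __name__ == "__main__":
    task = sys.argv[1]         # FUNCTION_OR_TASK_TO_RUN
    other_args = sys.argv[2:]  # OTHER_ARGS, passed through to each task
    spark = SparkSession.builder.appName("flexible_job").getOrCreate()
    if task == "all":
        for name in PIPELINE:
            run_task(spark, name, other_args)
    else:
        run_task(spark, task, other_args)
    spark.stop()
```

Keeping the task registry in a plain dict keeps the dispatch logic to a few lines, and adding a new function to the script is a one-line change.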
As with any PySpark job, the script is launched with spark-submit. The invocation looks like this:

```
spark-submit PYTHON_JOB_SCRIPT_PATH FUNCTION_OR_TASK_TO_RUN OTHER_ARGS
```
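For example, with the hypothetical task names from the sketch above (the script name and input path are likewise placeholders), running a single task or the whole pipeline would look like:

```
spark-submit flexible_job.py count_rows /data/events.txt
spark-submit flexible_job.py all /data/events.txt
```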