A Flexible PySpark Job (Spark Job in Python) Script Template

rbahaguejr
1 min read · Sep 9, 2016


I rarely write Spark jobs in Scala unless I'm forced to by some configuration limitation of the Spark cluster.

This is a concise Spark job template in Python that I distilled after writing many scripts that were not very flexible. The idea behind this template (a minimal sketch follows the list below) is that it should be able to:

  1. Support multiple functions in a single script and run these functions either in sequence (similar to a pipeline) or individually
  2. Handle arguments passed through spark-submit
  3. Support run-time tracking (a crude implementation)
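
Here is a minimal sketch of what such a script can look like. It is only an illustration of the idea, not the original template: the task names load_data and summarize, the app name flexible_job, and the positional-argument handling are assumptions, and it targets the Spark 2.x SparkSession API.

import sys
import time

from pyspark.sql import SparkSession


def load_data(spark, args):
    # Hypothetical task: read a text file whose path is passed after the task name.
    df = spark.read.text(args[0]) if args else spark.range(10).toDF("value")
    df.show()


def summarize(spark, args):
    # Hypothetical task: print a simple row count.
    print("rows: %d" % spark.range(100).count())


# Ordered registry of tasks; "all" runs them in sequence, like a pipeline.
TASKS = [
    ("load_data", load_data),
    ("summarize", summarize),
]


def run(name, func, spark, args):
    # Crude run-time tracking: wall-clock time per task.
    start = time.time()
    func(spark, args)
    print("%s finished in %.2f s" % (name, time.time() - start))


if __name__ == "__main__":
    # spark-submit passes everything after the script path in sys.argv[1:].
    task = sys.argv[1] if len(sys.argv) > 1 else "all"
    extra_args = sys.argv[2:]

    spark = SparkSession.builder.appName("flexible_job").getOrCreate()

    if task == "all":
        for name, func in TASKS:
            run(name, func, spark, extra_args)
    else:
        run(task, dict(TASKS)[task], spark, extra_args)

    spark.stop()

Keeping the tasks in an ordered list makes the pipeline order explicit, while the dict lookup lets spark-submit pick any single task by name.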

Of course, spark-submit is used to run the script.

The spark-submit invocation looks like this:

spark-submit PYTHON_JOB_SCRIPT_PATH FUNCTION_OR_TASK_TO_RUN OTHER_ARGS
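
For example, if the sketch above were saved as job.py, a single task or the whole pipeline could be run like this (the file name, task names, and input path are hypothetical):

spark-submit job.py load_data /data/input.txt
spark-submit job.py summarize
spark-submit job.py all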
