Nice! Thanks for posting this.
Karan Talati

Hey Karan,

Thanks for the response :)

Regarding your approach, what do you mean by “traditional python packaging”? If I understand correctly, you’re putting all your shared logic in a pip package, which you still need to install with pip install -t, as I showed in my post.

The approach in the post doesn't contradict this.
We keep most of our common business logic (the logic that isn't specific to PySpark jobs but is shared across our systems and services) in a shared pip package too.

We then add that package as a pip dependency of the jobs, which aim to be as slim as possible.

The shared folder is used for PySpark-specific code that is shared across the jobs.
(For example: logging, common generic RDD transformations, a base Context that all the jobs' contexts derive from, etc.)

So basically, what you’re describing fits this model perfectly:

  • Keep all your shared logic in a separate pip package
  • Separate jobs by having each in its own module
  • Use the shared folder for code that is PySpark-specific but usable by the various jobs
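
To make the last point concrete, here is a rough sketch of what a base context in the shared folder and a job context deriving from it might look like. All the names here (BaseJobContext, WordCountContext, count_words, the counters) are made up for illustration and are not the actual code from the post. It is plain Python with no PySpark import so it runs anywhere; in a real job the word counting would be an RDD transformation instead of a loop.

```python
from collections import Counter


class BaseJobContext:
    """Hypothetical base context that would live in the shared folder."""

    def __init__(self, job_name):
        self.job_name = job_name
        # Simple named counters; a real PySpark job might back these
        # with accumulators instead (assumption, not shown here).
        self.counters = Counter()

    def inc_counter(self, name, value=1):
        self.counters[name] += value


class WordCountContext(BaseJobContext):
    """Hypothetical job-specific context that would live in the job's own module."""

    def __init__(self):
        super().__init__("wordcount")
        self.stop_words = {"the", "a"}


def count_words(lines, context):
    """Count words in an iterable of lines, skipping the context's stop words."""
    counts = Counter()
    for line in lines:
        for word in line.lower().split():
            if word in context.stop_words:
                context.inc_counter("skipped")
                continue
            counts[word] += 1
    return counts
```

The idea is that anything every job needs (counters, logging setup) sits in the base class in the shared folder, while each job's module only adds what is specific to it.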