Scripting PySpark Dataframes
Reproducing and transporting dataframes by generating plain Python scripts
Developing Spark applications means dealing with Spark DataFrames. These objects are in-memory data structures, accessible via various APIs but scoped to the local runtime. Sometimes we need to use them outside that environment. Scripting a dataframe, that is, generating the collection of Python commands that fully reproduces the object, is a possible and occasionally preferred solution.
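The idea can be sketched as a small helper that turns a dataframe's rows and column names into a standalone Python script. This is a minimal illustration, not the full technique: `script_dataframe` is a hypothetical name, in practice `rows` would come from `df.collect()` and `columns` from `df.columns` on the real Spark DataFrame, and a faithful version would also serialize the `StructType` schema to preserve types.

```python
def script_dataframe(rows, columns, var_name="df"):
    """Return Python source code that rebuilds a small dataframe.

    `rows` is a list of plain tuples (as returned by df.collect(),
    converted to tuples) and `columns` a list of column names.
    Assumption: values have unambiguous Python literals (repr round-trips).
    """
    lines = [
        "from pyspark.sql import SparkSession",
        "",
        "spark = SparkSession.builder.getOrCreate()",
        f"data = {rows!r}",
        f"columns = {columns!r}",
        f"{var_name} = spark.createDataFrame(data, columns)",
    ]
    return "\n".join(lines)

# Example: script two captured rows into a reproducible snippet.
script = script_dataframe(
    rows=[(1, "alice"), (2, "bob")],
    columns=["id", "name"],
)
print(script)
```

The generated snippet can be pasted into any environment with PySpark installed; running it recreates the same two-row dataframe from scratch.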
The use cases
The most common example is debugging a production issue. Think of a data processing framework built around Spark. It runs in a production environment, and you spot that one of the workflows fails. Further checks show a flaw in the processing logic. It is handy to identify the few rows on which the processing crashes, build a dataframe from them, and then transfer it to your development environment.
The transported dataframe might then be used for the following:
- Debugging the issue in the development/feature environment
- Creating a unit test that prevents this kind of error from recurring
- Creating default sample data when a new development/feature environment must be deployed
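The unit-test use case can be sketched as follows. Everything here is invented for illustration: `normalize_name` stands in for whatever processing logic crashed in production, and the two rows represent data captured there via the scripting approach (a `None` in the name column being the culprit).

```python
# Rows captured from production and reproduced via the scripted snippet.
# Hypothetical fixture: row 2 carries the None that crashed the pipeline.
data = [(1, " Alice "), (2, None)]

def normalize_name(name):
    """Placeholder for the fixed processing logic: the original version
    called .strip() unconditionally and failed on None."""
    return name.strip().lower() if name is not None else ""

def test_normalize_handles_none():
    # Replaying the captured rows guards against a regression.
    result = [normalize_name(name) for _, name in data]
    assert result == ["alice", ""]

test_normalize_handles_none()
```

Because the fixture is generated code rather than a binary dump, it can live directly in the test suite and be reviewed like any other source file.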