In this article, we will cover the Spark Job pattern, why it matters to split the reader/writer definitions from the Spark logic, and how Spark's transform API can take our code's readability to another level.
This article is an extension of my previous Spark+AI Summit talk at SF (ref: http://bit.ly/afranzi-spark-summit-talk).
The Spark Job pattern aims to remove all external dependencies from what really matters, our interaction with the data.
Inside the Spark Job should remain only the logic dedicated to transforming, modifying, altering, and processing the data. Everything else should be handled outside the SparkJob.
That means our SparkJob should interact with SparkReaders and SparkWriters that are agnostic to the data engines they read from and write to. The Spark computation shouldn’t care about Redis, S3 paths, or Kafka topics.
Having this approach allows us to focus on the Spark computation, delaying the I/O challenge for our future doppelgangers.
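To make the pattern concrete, here is a minimal sketch of the idea in Python. The names (SparkJob, SparkReader, SparkWriter) follow the article; the DataFrame is stood in by a plain list of row dicts and the transform rule is hypothetical, so the sketch runs without a Spark cluster.

```python
from abc import ABC, abstractmethod


class SparkReader(ABC):
    """Base trait: the SparkJob depends only on this interface."""
    @abstractmethod
    def read(self):
        """Return a DataFrame (here: a plain list of row dicts)."""


class SparkWriter(ABC):
    """Base trait: the SparkJob depends only on this interface."""
    @abstractmethod
    def write(self, df):
        """Persist a DataFrame."""


class SparkJob:
    """Owns only the data logic; all I/O is injected from outside."""
    def __init__(self, reader, writer):
        self.reader = reader
        self.writer = writer

    def transform(self, df):
        # The only place where data logic lives (a hypothetical rule).
        return [{**row, "score": row["clicks"] * 2} for row in df]

    def run(self):
        self.writer.write(self.transform(self.reader.read()))
```

The SparkJob never learns whether `read()` hit Mongo or a JSON fixture, which is exactly what makes it easy to test.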
Why does it matter to split the readers/writers definition from the Spark logic?
By splitting the definition of our Readers and Writers from the logic itself, we are dividing the Spark challenge and making Spark more approachable.
We are dividing the challenge into three pieces:
- From where are we reading? — Database, File System, Queue…
- What do we do with the data? — DataFrame transforms
- Where do we store our results? — Database, File System, Queue…
Then, our job is to focus on each piece individually: develop and test the readers on their own, while we develop and test the data transformations without caring about the reading source or the writing destination.
Defining Spark Readers & Spark Writers
We can inherit from the SparkReader class and then override the reading part to serve our needs.
The next example shows how we provide a SparkMongoReader that extends SparkReader. We can then instantiate it and use it in the SparkJobs as a SparkReader trait.
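A sketch of that reader hierarchy, assuming hypothetical constructor parameters: in real code `read()` would use the Spark Mongo connector, while here a plain client object keeps the example runnable.

```python
from abc import ABC, abstractmethod


class SparkReader(ABC):
    """Base trait: SparkJobs depend only on this interface."""
    @abstractmethod
    def read(self):
        """Return a DataFrame."""


class SparkMongoReader(SparkReader):
    """Everything Mongo-specific lives here: client, database, collection.

    In a real job, read() would build the DataFrame through the Spark
    Mongo connector; a plain dict-like client keeps this sketch runnable.
    """
    def __init__(self, client, database, collection):
        self.client = client
        self.database = database
        self.collection = collection

    def read(self):
        return self.client[self.database][self.collection]
```

The SparkJob only ever sees the `SparkReader` type, so swapping Mongo for anything else never touches the Spark logic.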
The Spark Writers will follow the same approach as the Spark Reader trait.
We define a SparkWriter trait and then inherit from it in a SparkJsonWriter.
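A matching sketch for the writer side, with a hypothetical `path` parameter: in real code `write()` would simply call Spark's JSON sink, while here it writes one JSON document per line so the example runs standalone.

```python
import json
from abc import ABC, abstractmethod


class SparkWriter(ABC):
    """Base trait: SparkJobs depend only on this interface."""
    @abstractmethod
    def write(self, df):
        """Persist a DataFrame."""


class SparkJsonWriter(SparkWriter):
    """Everything destination-specific lives here.

    In a real job, write() would delegate to Spark's JSON sink; writing
    JSON lines by hand keeps this sketch runnable without a cluster.
    """
    def __init__(self, path):
        self.path = path

    def write(self, df):
        with open(self.path, "w") as f:
            for row in df:  # one JSON document per line
                f.write(json.dumps(row) + "\n")
```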
Both examples are quite simple; however, they allow the Spark Job to forget where we are reading from or where we are writing the data.
So, if we decide to start writing Parquet instead of JSON files, or to publish the data to Redis, we only need to change the writer outside the Spark Job, without modifying any Spark logic.
Besides, it’s much easier to test Spark jobs with this approach, since you don’t need to mock any Redis or Kafka: you can provide a FixtureSparkReader that reads from your JSON fixtures, or a DataFrameSparkReader that returns a DataFrame instantiated previously in your unit test.
Also, we can provide an AssertiveSparkWriter that, given an expected DataFrame, checks it against the one we are about to write, without writing to any specific destination.
Our Data Scientists used to fear Spark tests, since they had to mock the external databases. Now they can forget about mocking databases or file systems: they just use the FixtureSparkReader or the DataFrameSparkReader together with the AssertiveSparkWriter to validate their jobs.
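The three test helpers can be sketched in a few lines. The class names follow the article; the implementations below are plain-Python stand-ins (in real code they would implement the SparkReader/SparkWriter traits and deal in Spark DataFrames).

```python
import json


class FixtureSparkReader:
    """Test reader: loads rows from a JSON-lines fixture file."""
    def __init__(self, fixture_path):
        self.fixture_path = fixture_path

    def read(self):
        with open(self.fixture_path) as f:
            return [json.loads(line) for line in f]


class DataFrameSparkReader:
    """Test reader: returns a DataFrame built directly in the unit test."""
    def __init__(self, df):
        self.df = df

    def read(self):
        return self.df


class AssertiveSparkWriter:
    """Test writer: compares against an expected DataFrame instead of
    writing to any destination."""
    def __init__(self, expected):
        self.expected = expected

    def write(self, df):
        assert list(df) == list(self.expected), f"{df} != {self.expected}"
```

A test then wires a fixture reader and an assertive writer into the job under test, and no external system is involved.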
Transforms & readability
The Spark API provides a transform method to define DataFrame transformations.
Given a DataFrame and a set of params, it transforms the DataFrame and returns the modified one. This allows us to chain multiple transformations, ending up with a Transformation Pipeline.
In terms of readability, it simplifies the code and makes the sequence of transformations applied to the data much easier to follow.
In the compute_ratings_matrix example, it’s easier to understand how the data is being transformed: first it computes the rating factors, then the relevance scores, and finally the indexes for users and activities.
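The chaining pattern can be sketched as follows. The step names mirror compute_ratings_matrix; their bodies are placeholders, and the DataFrame is a plain list of row dicts so the example runs without Spark (in PySpark the pipeline would read `df.transform(a).transform(b).transform(c)`).

```python
from functools import reduce


# Each step has the same shape: DataFrame in, DataFrame out.
def with_rating_factors(df):
    return [{**row, "rating_factor": row["rating"] * 0.5} for row in df]


def with_relevance_scores(df):
    return [{**row, "relevance": row["rating_factor"] + 1.0} for row in df]


def with_user_activity_idx(df):
    return [{**row, "idx": i} for i, row in enumerate(df)]


def compute_ratings_matrix(df):
    # Chain the steps into a transformation pipeline, equivalent to
    # Spark's df.transform(...).transform(...).transform(...).
    steps = [with_rating_factors, with_relevance_scores, with_user_activity_idx]
    return reduce(lambda acc, step: step(acc), steps, df)
```

Because each step is a standalone function, it can be unit tested on its own and reordered or replaced without touching the rest of the pipeline.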
Docs: PySpark — SQL
Spark provides a huge set of built-in functions that we can use to interact with our DataFrames: from exploding arrays or maps to collecting lists or sets into the same field. It has date functions, math functions, string functions, etc.
We recommend keeping the PySpark SQL documentation in your bookmarks, since it’s quite useful and contains plenty of examples. The Scala and Python APIs share the same methods, so even if you are using the Scala API, we recommend checking the Python docs, since they are easier to understand than the Scala ones.
Also, for machine learning related stuff check PySpark — ML.
Since not everything can be achieved with the predefined functions, Spark allows us to create our own User Defined Functions (UDFs).
UDFs should be defined on their own and never inside another method, so that we can unit test them without worrying about their surrounding context.
In case we need to predefine some external attributes that are not present in the DataFrame, UDFs can be defined as partial functions.
The following UDFs are partial functions where we set up the Personality mapping to be used.
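A sketch of that idea, with a hypothetical Personality mapping: the pure function is defined on its own (and is trivially unit-testable), and `functools.partial` freezes the external attribute that is not present in the DataFrame.

```python
from functools import partial

# Hypothetical external attribute, not present in the DataFrame.
PERSONALITY_MAPPING = {"E": "Extrovert", "I": "Introvert"}


def map_personality(mapping, code):
    """Pure function defined on its own: easy to unit test in isolation."""
    return mapping.get(code, "Unknown")


# Freeze the mapping into a single-argument function. In PySpark, this
# partial would then be wrapped with pyspark.sql.functions.udf(...) and
# applied to a column.
personality_udf = partial(map_personality, PERSONALITY_MAPPING)
```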
So, it would be great if you could share with us what you think about this Spark jobs approach. There are some iterations planned with our Data Scientists to improve the Spark experience, while we close the gap between the Spark engineering and data science facets.
Links Of Interest
- Spark Syntax — This is a public repo documenting all of the “best practices” of writing PySpark code. This mainly focuses on the Spark DataFrames and SQL library.
- Spark Quinn — PySpark methods to enhance developer productivity.
- Spark Daria — Essential Scala Spark extensions and helper methods.
- Spark Testing Base — Python & Scala testing library by Holden Karau.
- Spark Fast Tests — Scala testing library by Matthew Powers. Faster than Spark Testing Base, but without streaming support.