Transforming Spark Datasets using Scala transformation functions

Oliver Mascarenhas
Published in Analytics Vidhya · Oct 1, 2019

There have been a few times where I've chosen to use Spark as an ETL tool for its ease of use when it comes to reading and writing Parquet, CSV, or XML files.

Reading any of these file formats takes just one line of Spark code (once you have the required dependencies in place, of course).
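For example, reads might look like the following sketch. The paths are placeholders, and XML support assumes the spark-xml package is on the classpath:

// Each read is a single line once a SparkSession named `spark` exists.
// Paths are placeholders; XML support comes from the spark-xml package.
val parquetDf = spark.read.parquet("/data/flights.parquet")
val csvDf = spark.read.option("header", "true").csv("/data/flights.csv")
val xmlDf = spark.read.format("xml").option("rowTag", "flight").load("/data/flights.xml")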

Intent

Most of the reference material available online for transforming Datasets points to calling createOrReplaceTempView() to register the Dataset/DataFrame as a temporary view, then performing SQL operations on it.
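For context, that approach looks roughly like this (the view name, columns, and filter are only illustrative):

// The commonly documented route: register a temp view, then query it with SQL.
schedules.createOrReplaceTempView("schedules")
val result = spark.sql("SELECT flightNumber, departureTime FROM schedules WHERE departureTime > 1569300000")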

While this may be fine for most use cases, there are times when it feels more natural to use Spark's Scala transformation functions instead, especially if you already have pre-existing transformation logic written in Scala, ReactiveX, or even Java 8.

If this sounds like something you’d like to do, then please read on.

Getting Started

Let's assume we want to transform a Dataset of flight schedules into a Dataset of flight information, i.e. turn a Dataset[FlightSchedule] into a Dataset[FlightInfo].

Domain objects and the transformation function

We’ll assume we have the following domain objects and transformation function that converts a FlightSchedule object into a FlightInfo object.
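A minimal sketch of what these could look like follows; the exact field names and the duration calculation are my assumptions, only the two type names and existingTransformationFunction come from the post:

// Input and output domain objects (field names are illustrative).
case class FlightSchedule(flightNumber: String, departureTime: Long, arrivalTime: Long)
case class FlightInfo(flightNumber: String, duration: String)

// Pre-existing, plain-Scala transformation logic: derive a flight duration.
def existingTransformationFunction(schedule: FlightSchedule): FlightInfo = {
  val hours = (schedule.arrivalTime - schedule.departureTime) / 3600
  FlightInfo(schedule.flightNumber, s"$hours hrs")
}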

Creating the Spark session and reading the input Dataset

Creating the input Dataset is kept simple for brevity.
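Here is a minimal sketch, assuming local execution and a hard-coded Seq standing in for a real data source (the sample values are made up):

import org.apache.spark.sql.{Dataset, Encoders, SparkSession}

val spark = SparkSession.builder()
  .appName("flight-schedules")
  .master("local[*]") // assumption: run locally for the example
  .getOrCreate()

// A hard-coded Seq keeps the example self-contained; a real job would
// read Parquet/CSV/XML here instead.
val schedules: Dataset[FlightSchedule] = spark.createDataset(Seq(
  FlightSchedule("GOI-134", 1569315038, 1569319183),
  FlightSchedule("AI-123", 1569290498, 1569298178)
))(Encoders.product[FlightSchedule])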

Defining the Encoder and the Spark transformation

This is where things start to get interesting. To transform a Dataset[FlightSchedule] into a Dataset[FlightInfo], Spark needs to know how to "encode" your case class. Omitting this step will give you the following dreadful compile-time error.

Error:(34, 18) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._
Support for serializing other types will be added in future releases.
schedules.map(s => existingTransformationFunction(s))

An Encoder[T] is used to convert a JVM object or primitive of type T to and from Spark SQL's internal row representation. Since the Dataset.map() method requires an encoder to be passed as an implicit parameter, we'll define an implicit value.
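A sketch of that implicit, together with the transformation expressed as a Dataset-to-Dataset function (the helper name makeFlightInfo is mine):

import org.apache.spark.sql.{Dataset, Encoder, Encoders}

// Dataset.map() will pick this up as its implicit Encoder[FlightInfo] parameter.
implicit val flightInfoEncoder: Encoder[FlightInfo] = Encoders.product[FlightInfo]

// The Spark transformation: apply the existing Scala function to every row.
def makeFlightInfo(schedules: Dataset[FlightSchedule]): Dataset[FlightInfo] =
  schedules.map(s => existingTransformationFunction(s))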

Transforming the Dataset

The only thing left to do is call the transform method on the input Dataset. The entire program is included below, along with calls to show() so that we can see the results.
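Putting the pieces together, a complete runnable sketch under the assumptions above (field names, sample values, and the makeFlightInfo helper are mine, not necessarily the original post's):

import org.apache.spark.sql.{Dataset, Encoder, Encoders, SparkSession}

object FlightTransform {

  case class FlightSchedule(flightNumber: String, departureTime: Long, arrivalTime: Long)
  case class FlightInfo(flightNumber: String, duration: String)

  // Pre-existing, plain-Scala transformation logic.
  def existingTransformationFunction(s: FlightSchedule): FlightInfo = {
    val hours = (s.arrivalTime - s.departureTime) / 3600
    FlightInfo(s.flightNumber, s"$hours hrs")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("flight-schedules")
      .master("local[*]")
      .getOrCreate()

    // Explicit encoders instead of importing spark.implicits._
    implicit val scheduleEncoder: Encoder[FlightSchedule] = Encoders.product[FlightSchedule]
    implicit val flightInfoEncoder: Encoder[FlightInfo] = Encoders.product[FlightInfo]

    val schedules: Dataset[FlightSchedule] = spark.createDataset(Seq(
      FlightSchedule("GOI-134", 1569315038, 1569319183),
      FlightSchedule("AI-123", 1569290498, 1569298178)
    ))

    // Expressed as Dataset => Dataset so it composes with .transform()
    def makeFlightInfo(ds: Dataset[FlightSchedule]): Dataset[FlightInfo] =
      ds.map(s => existingTransformationFunction(s))

    schedules.show(false)
    schedules.transform(makeFlightInfo).show(false)

    spark.stop()
  }
}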

References

  1. Spark SQL Encoders

Also posted on: https://olivermascarenhas.com/2019-09-25-how-to-transform-a-spark-dataset/
