Different approaches to manually create Spark DataFrames

Matthew Powers
May 22, 2017

This blog post explains the Spark and spark-daria helper methods to manually create DataFrames for local development or testing.

We’ll demonstrate why the createDF() method defined in spark-daria is better than the toDF() and createDataFrame() methods from the Spark source code.

toDF()

toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark implicits.

import spark.implicits._

The toDF() method can be called on a sequence object to create a DataFrame.

val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")

someDF has the following schema.

root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)

toDF() is limited because the column types and nullable flags cannot be customized: Spark infers both from the Scala types. In this example, the number column is inferred as a non-nullable integer and the word column as a nullable string.
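Wrapping values in Option is a partial workaround: Spark then infers a nullable column, though the column type itself still cannot be chosen. A minimal sketch (nullableDF is an illustrative name):

// Option values make the inferred number column nullable, but the
// column type (integer here) is still inferred, not specified.
val nullableDF = Seq(
  (Some(8), "bat"),
  (None, "mouse")
).toDF("number", "word")

nullableDF has the following schema.

root
 |-- number: integer (nullable = true)
 |-- word: string (nullable = true)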

The import spark.implicits._ statement can only be run inside a class or method body, after the SparkSession has been instantiated, as sketched below. All imports should be at the top of the file, before the class definition, so toDF() encourages bad Scala coding practices.
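Here is a minimal sketch of where the import is forced to live (ExampleSpec is an illustrative class name):

import org.apache.spark.sql.SparkSession

class ExampleSpec {

  val spark: SparkSession = SparkSession
    .builder()
    .master("local")
    .getOrCreate()

  // The import must come after spark is defined, so it cannot sit
  // at the top of the file with the other imports.
  import spark.implicits._

  val someDF = Seq((8, "bat")).toDF("number", "word")

}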

toDF() is suitable for local testing, but production-grade code that's checked into master should use a better solution.

createDataFrame()

The createDataFrame() method addresses the limitations of the toDF() method and allows for full schema customization and good Scala coding practices.

Here is how to create someDF with createDataFrame().

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// the rows of data
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

// an explicit schema: column name, type, and nullable flag
val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)
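someDF now has the following schema. Note that the number column is nullable this time, which toDF() could not express.

root
 |-- number: integer (nullable = true)
 |-- word: string (nullable = true)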

createDataFrame() provides the functionality we need, but the syntax is verbose. Our test files will become cluttered and difficult to read if createDataFrame() is used frequently.

createDF()

createDF() is defined in spark-daria and allows for the following terse syntax.

// createDF() is added to SparkSession by the spark-daria import
// (setup instructions below).
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
import org.apache.spark.sql.types.{IntegerType, StringType}

val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)

createDF() creates readable code like toDF() and allows for full schema customization like createDataFrame(). It’s the best of both worlds.

Big shout out to Nithish for writing the advanced Scala code to make createDF() work so well.

Including spark-daria in your projects

The spark-daria README provides the following project setup instructions.

  1. Add the sbt-spark-package plugin to your application. The spark-daria releases are maintained in Spark Packages.

  2. Update your build.sbt file:

spDependencies += "mrpowers/spark-daria:0.5.0"

  3. Import the spark-daria code into your project:

import com.github.mrpowers.spark.daria.sql.SparkSessionExt._

Closing Thoughts

I want to help build a vibrant Spark open source community and collaborate on third party libraries that make Spark developers more productive.

Please submit pull requests, raise issues, or send me feature requests so we can continue improving spark-daria!
