This blog post explains the Spark and spark-daria helper methods to manually create DataFrames for local development or testing.
We’ll demonstrate why the createDF() method defined in spark-daria is better than the createDataFrame() methods from the Spark source code.
toDF() provides a concise syntax for creating DataFrames and can be accessed after importing Spark implicits.
The toDF() method can be called on a sequence object to create a DataFrame.
val someDF = Seq(
  (8, "bat"),
  (64, "mouse"),
  (-27, "horse")
).toDF("number", "word")
someDF has the following schema.
root
 |-- number: integer (nullable = false)
 |-- word: string (nullable = true)
toDF() is limited because the column type and nullable flag cannot be customized. In this example, the number column is not nullable and the word column is nullable.
The import spark.implicits._ statement can only be run inside a class definition, once the SparkSession is available. All imports should be at the top of the file before the class definition, so toDF() encourages bad Scala coding practices.
toDF() is suitable for local testing, but production-grade code that’s checked into master should use a better solution.
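Here is a minimal runnable sketch of the import constraint described above, assuming a local SparkSession. The object name, app name, and master setting are illustrative and not from the spark-daria or Spark docs. It also reads the inferred nullable flags from the schema programmatically instead of from printed output.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ToDFSketch {
  // Builds the example DataFrame with toDF(); the session is supplied by the caller.
  def build(spark: SparkSession): DataFrame = {
    // The implicits are members of the SparkSession instance, so this import
    // can only appear where a stable `spark` value is already in scope.
    import spark.implicits._
    Seq((8, "bat"), (64, "mouse"), (-27, "horse")).toDF("number", "word")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")     // assumption: local mode, as in a test run
      .appName("toDF-sketch") // hypothetical app name
      .getOrCreate()

    val someDF = build(spark)
    // Column types and nullable flags were inferred; toDF() offered no way to set them.
    someDF.schema.fields.foreach { f =>
      println(s"${f.name}: ${f.dataType.simpleString}, nullable = ${f.nullable}")
    }
    spark.stop()
  }
}
```

Passing the session into build() keeps the instance-bound import inside one small method, which is about as tidy as this approach gets.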
The createDataFrame() method addresses the limitations of the toDF() method and allows for full schema customization and good Scala coding practices.
Here is how to create someDF with createDataFrame():
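import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Rows are untyped, so the schema below supplies the column names, types, and nullable flags.
val someData = Seq(
  Row(8, "bat"),
  Row(64, "mouse"),
  Row(-27, "horse")
)

val someSchema = List(
  StructField("number", IntegerType, true),
  StructField("word", StringType, true)
)

val someDF = spark.createDataFrame(
  spark.sparkContext.parallelize(someData),
  StructType(someSchema)
)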
createDataFrame() provides the functionality we need, but the syntax is verbose. Our test files will become cluttered and difficult to read if createDataFrame() is used frequently.
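To illustrate what schema customization buys, the sketch below declares number as nullable and stores a null in it, while marking word as non-nullable. The object name, the build() method, and the row containing null are hypothetical additions for illustration. The createDataFrame() call has the same form as the one above.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object CreateDataFrameSketch {
  def build(spark: SparkSession): DataFrame = {
    val someData = Seq(
      Row(8, "bat"),
      Row(null, "mouse") // hypothetical row with a missing number
    )

    // Unlike toDF(), each flag here is chosen explicitly rather than inferred.
    val someSchema = StructType(List(
      StructField("number", IntegerType, nullable = true),
      StructField("word", StringType, nullable = false)
    ))

    spark.createDataFrame(spark.sparkContext.parallelize(someData), someSchema)
  }
}
```

With toDF(), an Int column built from plain tuples is inferred as non-nullable, so this combination of flags could not be expressed directly.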
createDF() is defined in spark-daria and allows for the following terse syntax.
val someDF = spark.createDF(
  List(
    (8, "bat"),
    (64, "mouse"),
    (-27, "horse")
  ), List(
    ("number", IntegerType, true),
    ("word", StringType, true)
  )
)
createDF() creates readable code like toDF() and allows for full schema customization like createDataFrame(). It’s the best of both worlds.
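For readers curious how a helper with this shape could sit on top of createDataFrame(), here is a rough sketch. It is not spark-daria's actual source: the object name, the implicit class, and the createDFSketch method are all hypothetical, and the approach (an implicit class that extends SparkSession) is just one plausible way to get the spark.createDF(...) call style.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DataType, StructField, StructType}

object SparkSessionSketch {
  // Hypothetical extension: adds a createDF-style method to any SparkSession in scope.
  implicit class SparkSessionMethods(spark: SparkSession) {
    def createDFSketch[T <: Product](
        rows: List[T],
        fields: List[(String, DataType, Boolean)]
    ): DataFrame = {
      // Turn each tuple into an untyped Row.
      val rowData = rows.map(r => Row.fromSeq(r.productIterator.toList))
      // Turn each (name, type, nullable) triple into a StructField.
      val schema = StructType(fields.map { case (name, dataType, nullable) =>
        StructField(name, dataType, nullable)
      })
      spark.createDataFrame(spark.sparkContext.parallelize(rowData), schema)
    }
  }
}
```

After import SparkSessionSketch._, a call such as spark.createDFSketch(List((8, "bat")), List(("number", IntegerType, true), ("word", StringType, true))) would read much like the createDF() example. Because this import refers to a top-level object rather than a SparkSession instance, it can sit at the top of the file.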
Including spark-daria in your projects
The spark-daria README provides the following project setup instructions.
1. Add the sbt-spark-package plugin to your application. The spark-daria releases are maintained in Spark Packages.
2. Update your
spDependencies += "mrpowers/spark-daria:0.5.0"
3. Import the spark-daria code into your project:
I want to help build a vibrant Spark open-source community and collaborate on third-party libraries that make Spark developers more productive.
Please submit pull requests, raise issues, or send me feature requests so we can continue improving spark-daria!