Hacking with Spark DataFrame

Happy data science

(λx.x)eranga
Effectz.AI
3 min read · May 19, 2019


RDD vs DataFrame

Resilient Distributed Dataset (RDD) is an immutable distributed dataset which is partitioned across the cluster nodes. RDD was released with Spark 1.0 and is the fundamental data structure of Spark. RDD provides compile-time type safety. The downside of RDD is that it consumes a large amount of memory and involves garbage-collection overhead, since it keeps data as in-memory JVM objects.

DataFrame is an immutable distributed collection of data. DataFrame was released with Spark 1.3. Unlike an RDD, a DataFrame organizes data into named columns. It gives us a view of the data as columns with column names and type information, much like a relational database table. Spark improves the performance of DataFrame by using custom memory management and optimized execution plans. The downside of DataFrame is that it does not provide compile-time type safety, so it can cause runtime errors. The DataFrame API also feels less programmatic and provides more SQL-like features.

Even though DataFrame/Dataset provide an abstraction in front, behind the scenes all the computation happens with RDDs. We can go from a DataFrame to an RDD via its rdd method. We can also convert an RDD to a DataFrame (if the RDD is in a tabular format) via the toDF method.
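A minimal sketch of going back and forth, assuming an existing SparkSession named spark and sample column names of my own choosing:

import spark.implicits._

// DataFrame built from a Seq of tuples
val df = Seq(("d1", "login"), ("d2", "logout")).toDF("Device", "Event")

// DataFrame -> RDD[Row] via the rdd method
val rowRdd = df.rdd

// RDD of tuples -> DataFrame via toDF (needs spark.implicits._)
val backToDf = rowRdd.map(r => (r.getString(0), r.getString(1))).toDF("Device", "Event")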

DataFrame with Scala

In my previous post I introduced Spark, the Spark architecture, and examples with RDD and Scala. In this post I'm gonna show various types of analytics that can be performed with the Spark DataFrame. All the source code related to this post is available in GitLab. Please clone the repository and follow along with the post.

Sbt dependency

I'm using IntelliJ IDEA as my IDE to work with Scala Spark applications. First I need to create an sbt project and add the Spark dependencies to the build.sbt file. To work with DataFrame we need the spark-sql dependency. Following is the build.sbt dependency file.
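The repository contains the exact file; roughly, it looks like the sketch below (the Spark and Scala versions here are assumptions, adjust them to match the repository):

name := "spark-dataframe-example"
version := "1.0"
scalaVersion := "2.11.12"

// spark-sql brings in the DataFrame/Dataset API; spark-core is the base dependency
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.4.0",
  "org.apache.spark" %% "spark-sql"  % "2.4.0"
)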

Spark session

First we need to create a SparkSession. Following is the way to create a SparkSession with the session builder.
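A minimal sketch; the app name and the local master are assumptions for running on a single machine:

import org.apache.spark.sql.SparkSession

// build (or reuse) a SparkSession
val spark = SparkSession.builder()
  .appName("spark-dataframe-example")
  .master("local[*]")
  .getOrCreate()

// needed for toDF, the $"column" syntax etc. in the examples below
import spark.implicits._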

Load DataFrame

A DataFrame can be loaded from .CSV files. To do that, first we need to define the schema (fields and types) for the .CSV file. Then we can load the .CSV data into a DataFrame. In the following example I'm creating a DataFrame from the uber.csv file, which is located in the resources directory.
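A sketch of the idea; the exact columns of uber.csv live in the repository, here I assume the dt, lat, lon, base layout of the referenced MapR uber dataset:

import org.apache.spark.sql.types._

// schema (fields and types) for the .CSV file
val uberSchema = StructType(Seq(
  StructField("dt",   TimestampType, nullable = true),
  StructField("lat",  DoubleType,    nullable = true),
  StructField("lon",  DoubleType,    nullable = true),
  StructField("base", StringType,    nullable = true)
))

// load the .CSV file from the resources directory into a DataFrame
val uberDf = spark.read
  .option("header", "true")
  .schema(uberSchema)
  .csv("src/main/resources/uber.csv")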

Create DataFrame

A DataFrame can be created from a Scala Seq. Following is the way to do that. It uses the toDF function, which is available from import spark.implicits._.
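A small sketch with hypothetical Device, Event and Time records (the same field names are used in the later examples):

import java.sql.Timestamp

val events = Seq(
  ("d1", "login",  Timestamp.valueOf("2019-05-19 10:00:00")),
  ("d2", "login",  Timestamp.valueOf("2019-05-19 10:03:00")),
  ("d1", "logout", Timestamp.valueOf("2019-05-19 10:07:00"))
)

// toDF comes from import spark.implicits._
val eventDf = events.toDF("Device", "Event", "Time")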

printSchema()

We can view the schema of the DataFrame with the printSchema() function. It shows the fields in the DataFrame and their respective data types.
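For the eventDf sketched above, it would print something like this:

eventDf.printSchema()

// root
//  |-- Device: string (nullable = true)
//  |-- Event: string (nullable = true)
//  |-- Time: timestamp (nullable = true)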

show()

To view the data in a DataFrame we can use the show() function. We can pass the number of rows that need to be shown.
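For example:

// show the first 20 rows (the default)
eventDf.show()

// show only the first 2 rows, without truncating long values
eventDf.show(2, truncate = false)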

describe()

The describe() function can be used to view a summary of the DataFrame. It shows metrics like count, mean, stddev, min and max for the data in the DataFrame.
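For example, on the numeric columns of the uber DataFrame sketched earlier:

// count, mean, stddev, min, max for the selected columns
uberDf.describe("lat", "lon").show()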

select()

Selecting specific fields can be done with the select() function. The following example selects the Device, Event and Time fields from the DataFrame.
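A sketch using the eventDf defined above:

// select only the Device, Event and Time columns
eventDf.select("Device", "Event", "Time").show()

// column expressions work as well
eventDf.select($"Device", $"Event").show()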

filter()

Filtering can be done with the filter() function. We can pass multiple filter conditions to it. Following is the way to do that.
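A sketch using the eventDf defined above:

// single condition
eventDf.filter($"Event" === "login").show()

// multiple conditions combined with && (or ||)
eventDf.filter($"Event" === "logout" && $"Device" === "d1").show()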

sort()

The sort() function can be used to sort the records in the DataFrame based on a field. The following examples sort the DataFrame based on the Event field.
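For example:

// ascending order (the default)
eventDf.sort("Event").show()

// descending order
eventDf.sort($"Event".desc).show()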

distinct()

The distinct() function can be used to get the distinct values of single or multiple columns in a DataFrame. Following are some examples.
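For example:

// distinct rows across all columns
eventDf.distinct().show()

// distinct values of a single column
eventDf.select("Device").distinct().show()

// distinct combinations of multiple columns
eventDf.select("Device", "Event").distinct().show()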

groupBy()

The groupBy() function can be used to group records with similar field values. Functions like count(), sum(), avg(), min() and max() can be combined with groupBy() to add a new column to the DataFrame.
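A sketch using the eventDf defined above:

import org.apache.spark.sql.functions._

// number of events per device
eventDf.groupBy("Device").count().show()

// other aggregations can be combined via agg
eventDf.groupBy("Device").agg(max("Time").as("lastSeen")).show()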

window()

The window() function can be combined with groupBy() to group the records into specific time windows. The following example groups the records in the DataFrame based on the Device field with a 5 minute time window.
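A sketch using the eventDf defined above:

import org.apache.spark.sql.functions.window

// group events per device into 5 minute windows based on the Time field
eventDf
  .groupBy($"Device", window($"Time", "5 minutes"))
  .count()
  .show(truncate = false)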

arrayType

A DataFrame can contain ArrayType fields. The array_contains function can be used to filter the elements of ArrayType fields. Following is an example which creates a DataFrame with ArrayType fields and filters it with array_contains.
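A sketch with hypothetical tag data:

import org.apache.spark.sql.functions.array_contains

// DataFrame with an ArrayType column
val tagDf = Seq(
  ("d1", Seq("iot", "sensor")),
  ("d2", Seq("mobile")),
  ("d3", Seq("iot", "gateway"))
).toDF("Device", "Tags")

// keep only the rows whose Tags array contains "iot"
tagDf.filter(array_contains($"Tags", "iot")).show(truncate = false)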

structType

A DataFrame can contain StructType nested objects. The following example shows the operations that we can do with StructType objects and lists.
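A sketch with hypothetical nested data built from case classes (define the case classes at the top level so Spark can derive encoders for them):

// case class fields become nested StructType columns
case class Location(lat: Double, lon: Double)
case class DeviceInfo(device: String, location: Location, tags: Seq[String])

val deviceDf = Seq(
  DeviceInfo("d1", Location(6.93, 79.84), Seq("iot", "sensor")),
  DeviceInfo("d2", Location(7.29, 80.63), Seq("mobile"))
).toDF()

// nested fields are addressed with dot notation
deviceDf.select($"device", $"location.lat", $"location.lon").show()
deviceDf.filter($"location.lat" > 7.0).show()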

Reference

  1. https://www.linkedin.com/pulse/apache-spark-rdd-vs-dataframe-dataset-chandan-prakash
  2. https://medium.com/rahasak/hacking-with-apache-spark-f6b0cabf0703
  3. https://hortonworks.com/tutorial/dataframe-and-dataset-examples-in-spark-repl/
  4. https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-1/
  5. http://allaboutscala.com/big-data/spark/
