Reference: https://spark.apache.org/

Start Your Journey with Apache Spark — Part 2

Process structured data using Spark SQL/DataFrame

Neeraj Bhadani
Expedia Group Technology
4 min read · Sep 17, 2019


This is part 2 of the blog series “Start Your Journey with Apache Spark”. In this part, we will discuss “Spark SQL/DataFrames” (I recommend you read part 1 first). You can also refer to part 3.

Logically, DataFrames are similar to relational tables or DataFrames in Python/R, with lots of optimizations behind the scenes. There are various ways we can create DataFrames: from collections, HIVE tables, relational tables, and RDDs.

Similar to the SparkContext (sc) we used with RDDs, we have a SparkSession, which is the entry point for the DataFrames API. When we launch spark-shell or pyspark, a SparkSession is already available as “spark”. Otherwise, we can create one as shown below.
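
A minimal sketch of creating a SparkSession (the application name is arbitrary, and enableHiveSupport is only needed if you plan to work with HIVE tables):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder \
        .appName("StartYourJourneyPart2") \
        .enableHiveSupport() \
        .getOrCreate()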

  • Create a DataFrame based on a HIVE table
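
For example, assuming a HIVE table named “db_name.employee” exists (the name is hypothetical):

    # Read an existing HIVE table into a DataFrame
    df = spark.table("db_name.employee")

    # Equivalently, via a SQL query
    df = spark.sql("SELECT * FROM db_name.employee")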

Basic operations on DataFrames

  • Count the number of rows
  • Access the names of columns in the DataFrame
  • Access the DataType of columns within the DataFrame
  • Check how Spark stores the schema of the DataFrame
  • Print the schema of the DataFrame in a hierarchical manner
  • Display the contents of the DataFrame

(By default only 20 records are displayed; however, you can specify the number of records to be displayed as the first argument.)
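
A quick sketch of these operations on a DataFrame “df” (assuming it has already been created as above):

    df.count()        # number of rows
    df.columns        # column names as a Python list
    df.dtypes         # (column name, data type) pairs
    df.schema         # the StructType Spark uses to store the schema
    df.printSchema()  # schema printed as a tree (hierarchical view)
    df.show()         # displays the first 20 rows by default
    df.show(5)        # displays only the first 5 rows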

  • Select particular columns from the DataFrame
  • Filter the rows based on some condition. Let’s try to find the rows with id = 1. There are different ways to specify the condition.

Note: In order to use the “col” function, we need to import it first.
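
A sketch of selecting columns and filtering rows (the “name” column is an assumption for illustration):

    from pyspark.sql.functions import col

    # Select particular columns
    df.select("id", "name").show()

    # Different ways to express the same filter condition
    df.filter(df["id"] == 1).show()
    df.filter(col("id") == 1).show()
    df.filter("id = 1").show()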

  • Drop a particular column

Note: This operation will not drop the column from the “df” DataFrame because DataFrames are immutable in nature. However, it will return a new copy of the DataFrame without that column.
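
For example (assuming a hypothetical “name” column):

    # Returns a new DataFrame without the "name" column; df itself is unchanged
    df_new = df.drop("name")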

  • Aggregations

We can use the groupBy function to group the data and then use the “agg” function to perform aggregation on grouped data.
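
A minimal sketch, assuming hypothetical “dept” and “salary” columns:

    from pyspark.sql.functions import count, avg

    # Group by department, then aggregate within each group
    df.groupBy("dept") \
      .agg(count("id").alias("employee_count"),
           avg("salary").alias("avg_salary")) \
      .show()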

  • Sorting

Sort the data based on “id”. By default, sorting will be done in ascending order.

Sort the data in descending order.
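
For example, using the “id” column:

    from pyspark.sql.functions import col

    # Ascending order (the default)
    df.sort("id").show()

    # Descending order
    df.sort(col("id").desc()).show()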

  • Derived Columns

We can use the “withColumn” function to derive a new column based on existing columns. For example, we can derive an age column based on the expression “current_year - birth_year”.
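
A sketch, assuming the DataFrame has a “birth_year” column and that the current year is supplied as a literal:

    from pyspark.sql.functions import col, lit

    current_year = 2019  # assumption, just for illustration
    df_with_age = df.withColumn("age", lit(current_year) - col("birth_year"))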

  • Joins

We can perform various types of joins on multiple DataFrames. For example, let’s try to join two DataFrames, df1 and df2, based on the “id” column.

By default an inner join will be performed. However, we can perform various other joins like “left_outer”, “right_outer”, “full_outer”, etc. by passing the join type as the third argument. For example, for a left outer join we can say:
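
A sketch of both joins (assuming df1 and df2 each contain an “id” column):

    # Inner join (the default)
    df_inner = df1.join(df2, df1["id"] == df2["id"])

    # Left outer join, specified as the third argument
    df_left = df1.join(df2, df1["id"] == df2["id"], "left_outer")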

  • Executing SQL-like queries

We can perform data analysis by writing SQL-like queries as well. In order to do so, we need to register the DataFrame as a Temporary View.

Now we can execute SQL-like queries as below:
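
(The view name “df_view” below is arbitrary.)

    # Register the DataFrame as a temporary view
    df.createOrReplaceTempView("df_view")

    # Query it with SQL
    spark.sql("SELECT id, name FROM df_view WHERE id = 1").show()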

  • Saving the DataFrame as a HIVE Table
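
A minimal sketch (the table name is hypothetical):

    df.write.saveAsTable("db_name.employee")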

We can also set the “mode” argument to “overwrite”, “append”, “error”, etc.

For example, if we want to overwrite the existing HIVE table we can use:
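
(As before, the table name here is just a placeholder.)

    df.write.mode("overwrite").saveAsTable("db_name.employee")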

Note: By default, the operation will save the DataFrame as a HIVE managed table.

  • Saving the DataFrame as a HIVE External table (see the sketch after this list)
  • Create a DataFrame from a CSV file
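
To save the DataFrame as an external table, we can specify a “path” option so that HIVE does not manage the underlying data; the path and table name below are hypothetical:

    df.write \
      .option("path", "/user/data/employee_ext") \
      .saveAsTable("db_name.employee_ext")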

We can create a DataFrame using a CSV file and can specify various options like separator, header, schema, inferSchema, and so on. Let’s say we have a CSV file delimited by “|” which has a header in it, and we would like to generate the schema automatically.
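
A sketch, assuming a hypothetical file path:

    df_csv = spark.read \
        .option("sep", "|") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .csv("/path/to/employee.csv")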

  • Save a DataFrame as a CSV file

Now let’s say we need to save the DataFrame back to a CSV file after performing our analysis. We can do the following:
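
(The output path below is just a placeholder.)

    df.write \
      .mode("overwrite") \
      .option("header", "true") \
      .csv("/path/to/output")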

  • Create a DataFrame from a relational table

We can read the data from relational databases using a JDBC URL.
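
A sketch, assuming a MySQL database; the URL, table name, and credentials are placeholders, and the appropriate JDBC driver must be available on the classpath:

    df_jdbc = spark.read \
        .format("jdbc") \
        .option("url", "jdbc:mysql://db-host:3306/mydb") \
        .option("dbtable", "employee") \
        .option("user", "username") \
        .option("password", "password") \
        .load()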

  • Save the DataFrame as a relational table

We can save the DataFrame as a relational table using a JDBC URL.
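
Similarly, a sketch for writing back to the database (same placeholder connection details as above):

    df.write \
      .format("jdbc") \
      .option("url", "jdbc:mysql://db-host:3306/mydb") \
      .option("dbtable", "employee_backup") \
      .option("user", "username") \
      .option("password", "password") \
      .mode("append") \
      .save()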

I hope you have enjoyed part 2 of this series. In part 3 we will discuss advanced DataFrame operations. You can find the notebooks on GitHub here.

You may also be interested in some of my other posts on Apache Spark.

