Learn PySpark Dataframes With FIFA World Cup & Superheroes Dataset

Published in

Edureka

7 min readJul 12, 2018

Dataframes is a buzzword in the Industry nowadays. People tend to use it with popular languages used for Data Analysis like Python, Scala, and R. So, why is it that everyone is using it so much? Let’s understand this with our article. In this article I’ll be covering the following topics:

What are Dataframes?
Why we need Dataframes?
Features of Dataframes
PySpark Dataframe Sources
Dataframe Creation
Pyspark Dataframes with FIFA World Cup & Superheroes Dataset

What are Dataframes?

Dataframes generally refers to a data structure, which is tabular in nature. It represents Rows, each of which consists of a number of observations. Rows can have a variety of data formats (Heterogeneous), whereas a column can have data of the same data type (Homogeneous). Data frames usually contain some metadata in addition to data; for example, column and row names.

We can say that Dataframes are nothing, but 2-Dimensional Data Structure, similar to an SQL table or a spreadsheet. Now let’s move ahead with this article and understand why exactly we need Pyspark Dataframe?

Why do we need Dataframes?

1.Processing Structured and Semi-Structured Data

Dataframes are designed to process a large collection of structured as well as Semi-Structured data. Observations in Spark DataFrame are organized under named columns, which helps Apache Spark to understand the schema of a DataFrame. This helps Spark optimize execution plan on these queries. It can also handle Petabytes of data.

2.Slicing and Dicing

Data frame APIs usually supports elaborate methods for slicing-and-dicing the data. It includes operations such as “selecting” rows, columns, and cells by name or by number, filtering out rows, etc. Statistical data is usually very messy and contain lots of missing and wrong values and range violations. So a critically important feature of data frames is the explicit management of missing data.

3.Data Sources

DataFrame has support for a wide range of data format and sources, we’ll look into this later on in this article. They can take in data from various sources.

4.Support for Multiple Languages

It has API support for different languages like Python, R, Scala, Java, which makes it easier to be used by people having different programming backgrounds.

Features of Dataframes

Dataframes are Distributed in Nature, which makes it Fault Tolerant and Highly Available Data Structure.
Lazy evaluation is an evaluation strategy which holds the evaluation of an expression until its value is needed. It avoids repeated evaluation. Lazy evaluation in Spark means that the execution will not start until an action is triggered. In Spark, the picture of lazy evaluation comes when Spark transformations occur.
Dataframes are Immutable in nature. By immutable I mean that it is an object whose state cannot be modified after it is created. But we can transform its values by applying a certain transformation, like in RDDs.

PySpark Dataframe Sources

Dataframes in Pyspark can be created in multiple ways:

Data can be loaded in through a CSV, JSON, XML or a Parquet file. It can also be created using an existing RDD and through any other database, like Hive or Cassandra as well. It can also take in data from HDFS or the local file system.

Dataframe Creation

Let’s move forward with this article and understand how to create Dataframes.

We’ll create Employee and Department instances.

from pyspark.sql import *
 
Employee = Row("firstName", "lastName", "email", "salary")
 
employee1 = Employee('Basher', 'armbrust', 'bash@edureka.co', 100000)
employee2 = Employee('Daniel', 'meng', 'daniel@stanford.edu', 120000 )
employee3 = Employee('Muriel', None, 'muriel@waterloo.edu', 140000 )
employee4 = Employee('Rachel', 'wendell', 'rach_3@edureka.co', 160000 )
employee5 = Employee('Zach', 'galifianakis', 'zach_g@edureka.co', 160000 )
 
print(Employee[0])
 
print(employee3)
 
department1 = Row(id='123456', name='HR')
department2 = Row(id='789012', name='OPS')
department3 = Row(id='345678', name='FN')
department4 = Row(id='901234', name='DEV')

Next, we’ll create a DepartmentWithEmployees instance from the Employee and Departments

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5]) departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4]) departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3]) departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

Let’s create our Dataframe from the list of Rows

departmentWithEmployees1 = Row(department=department1, employees=[employee1, employee2, employee5])
departmentWithEmployees2 = Row(department=department2, employees=[employee3, employee4])
departmentWithEmployees3 = Row(department=department3, employees=[employee1, employee4, employee3])
departmentWithEmployees4 = Row(department=department4, employees=[employee2, employee3])

Pyspark Dataframes Example 1: FIFA World Cup Dataset

Here we have taken FIFA World Cup Players Dataset. We are going to load this data which is in CSV format into a dataframe and then we’ll learn about the different transformations and actions that can be performed on this dataframe.

Reading Data from CSV file

Let’s load the data from a CSV file. Here we are going to use the spark.read.csv method to load the data into a dataframe fifa_df. The actual method is spark.read.format[csv/json].

fifa_df = spark.read.csv("path-of-file/fifa_players.csv", inferSchema = True, header = True) 
fifa_df.show()

Schema of Dataframe

To have a look at the schema ie. the structure of the dataframe, we’ll use the printSchema method. This will give us the different columns in our dataframe along with the data type and the nullable conditions for that particular column.

fifa_df.printSchema()

Column Names and Count (Rows and Column)

When we want to have a look at the names and a count of the number of Rows and Columns of a particular Dataframe, we use the following methods.

fifa_df.columns //Column Names 
fifa_df.count() // Row Count 
len(fifa_df.columns) //Column Count

37784 
8

Describing a Particular Column

If we want to have a look at the summary of any particular column of a Dataframe, we use the describe method. This method gives us the statistical summary of the given column, if not specified, it provides the statistical summary of the dataframe.

fifa_df.describe('Coach Name').show() fifa_df.describe('Position').show()

Selecting Multiple Columns

If we want to select particular columns from the dataframe, we use the select method.

fifa_df.select('Player Name','Coach Name').show()

Selecting Distinct Multiple Columns

fifa_df.select('Player Name','Coach Name').distinct().show()

Filtering Data

In order to filter the data, according to the condition specified, we use the filter command. Here we are filtering our dataframe based on the condition that Match ID must be equal to 1096 and then we are calculating how many records/rows are there in the filtered output.

fifa_df.filter(fifa_df.MatchID=='1096').show() fifa_df.filter(fifa_df.MatchID=='1096').count() //to get the count

Filtering Data (Multiple Parameters)

We can filter our data based on multiple conditions (AND or OR)

fifa_df.filter((fifa_df.Position=='C') && (fifa_df.Event=="G40'")).show()

Sorting Data (OrderBy)

To sort the data we use the OrderBy method. By Default, it sorts in ascending order, but we can change it to descending order as well.

fifa_df.orderBy(fifa_df.MatchID).show()

PySpark Dataframes Example 2: Superheros Dataset

Loading the data

Here we will load the data in the same way as we did earlier.

Superhero_df = spark.read.csv("path-of file/superheros.csv", inferSchema = True, header = True) 
Superhero_df.show(10)

Filtering the Data

Superhero_df.filter(Superhero_df.Gender == 'Male').count() //Male Heros Count 
Superhero_df.filter(Superhero_df.Gender == 'Female').count() //Female Heros Count

Grouping the Data

GroupBy is used to group the dataframe based on the column specified. Here, we are grouping the dataframe based on the column Race and then with the count function, we can find the count of the particular race.

Race_df = Superhero_df.groupby("Race")\
.count()\
.show()

Performing SQL Queries

We can also pass SQL queries directly to any dataframe, for that we need to create a table from the dataframe using the registerTempTable method and then use the sqlContext.sql() to pass the SQL queries.

Superhero_df.registerTempTable('superhero_table')
sqlContext.sql('select * from superhero_table').show()

sqlContext.sql(‘select distinct(Eye_color) from superhero_table’).show()

sqlContext.sql('select distinct(Eye_color) from superhero_table').count()

sqlContext.sql(‘select max(Weight) from superhero_table’).show()

If you wish to check out more articles on the market’s most trending technologies like Artificial Intelligence, DevOps, Ethical Hacking, then you can refer to Edureka’s official site.

Do look out for other articles in this series which will explain the various other aspects of PySpark.

1. PySpark Tutorial
2. RDDs in PySpark
3. PySpark MLlib Tutorial
4. PySpark Programming