Spark Select and Select-expr Deep Dive

somanath sankaran
Nov 30, 2019 · 3 min read

This is the third post in my Spark deep dive series, following the deep dive on reading CSV files.

This post covers select and selectExpr in PySpark:

  1. Select and alias columns
  2. Flexible selectExpr (for Hive people)
  3. Leveraging Python power (list comprehension) with select

Step1: Creating Input DataFrame

We will create df using the read.csv method of the SparkSession.
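A minimal sketch of this step, assuming a hypothetical file name `titanic.csv` with a header row (the post uses `Fare` and `Survived` columns, but the exact file is not named here). The helper names `load_csv` and `demo` are illustrative only. Without `inferSchema`, Spark reads every CSV column as a string, which is consistent with the string-to-int cast discussed later in the post.

```python
def load_csv(spark, path):
    """Read a CSV file that has a header row into a DataFrame.

    inferSchema is left off, so every column is loaded as a string.
    """
    return spark.read.csv(path, header=True)


def demo(path="titanic.csv"):  # hypothetical file name
    # Imported lazily so load_csv can be defined without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("select_deep_dive").getOrCreate()
    df = load_csv(spark, path)
    df.printSchema()  # column names and types
    df.show(5)        # first five rows
    return df
```

Calling `demo()` in an environment with PySpark installed returns the `df` used in the rest of the post.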

Step2: Select in DF

As per the documentation, df.select will accept:

1. A list of strings

2. A list of Columns

3. An expression

4. “*” (star)

1. List of strings:

We can pass column names as a list of Python string objects.

2. List of Columns:

We can import the col function from pyspark.sql.functions and pass a list of Column objects built with it.

4. Star (“*”):

The star syntax selects all the columns, similar to select * in SQL.
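The forms listed in this step can be sketched side by side. This assumes `df` is the DataFrame from Step 1 and that it has `Survived` and `Fare` columns; `select_variants` is a hypothetical helper name. Here `df["Fare"]` builds a Column object, much like `col("Fare")` from pyspark.sql.functions would.

```python
def select_variants(df):
    """Select Survived and Fare in each of the forms described above."""
    by_name = df.select(["Survived", "Fare"])            # list of strings
    by_column = df.select([df["Survived"], df["Fare"]])  # list of Column objects
    by_expression = df.select(df["Fare"] * 70)           # a Column expression
    all_columns = df.select("*")                         # same as SELECT *
    return by_name, by_column, by_expression, all_columns
```

Each returned value is a new DataFrame, so calling `.show()` on any of them displays the selected columns.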

Step3: Select with Alias:

One common use case is to do some manipulation and assign the result to a new DataFrame instead of just calling show.

For example, we will multiply Fare by 70 to convert it from US dollars to Indian rupees (INR) and name the new column Fare_INR.

Here we are selecting all the columns and adding a new column, Fare_INR.
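A small sketch of the alias step, assuming `df` has a `Fare` column. The helper name `add_fare_inr` and its `rate` parameter are hypothetical; 70 is the conversion rate used in the example above.

```python
def add_fare_inr(df, rate=70):
    """Return every original column plus Fare converted to INR."""
    fare_inr = (df["Fare"] * rate).alias("Fare_INR")
    return df.select("*", fare_inr)
```

Writing `df_inr = add_fare_inr(df)` keeps the result as a new DataFrame rather than only displaying it.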

Flexible SelectExpr and alias column

If you are a SQL/Hive user, as I am, and you miss the CASE statement in Spark, don't worry: selectExpr comes to the rescue.

1. selectExpr is useful for flexible SQL statements and for adding fields
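As an illustration of a CASE statement inside selectExpr, here is a sketch that assumes `df` has a `Survived` column holding 0 or 1. The helper name and the `Survived_Label` column are made up for this example.

```python
def add_survived_label(df):
    """Keep all columns and add a label derived with a SQL CASE expression."""
    return df.selectExpr(
        "*",
        "CASE WHEN Survived = 1 THEN 'yes' ELSE 'no' END AS Survived_Label",
    )
```

Each argument to selectExpr is a SQL expression string, so the alias is written with `AS` inside the string.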

2. Using all the built-in Hive functions, such as length
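A possible example with `length`, assuming the DataFrame has a string column called `Name` (an assumption; that column is not mentioned elsewhere in this post). `add_name_length` is a hypothetical helper name.

```python
def add_name_length(df):
    """Keep all columns and add the character length of the Name column."""
    return df.selectExpr("*", "length(Name) AS Name_Length")
```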

3. Casting data types is easy with selectExpr

Here we are casting the dtype of Survived from string to int.
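The cast can be sketched as below, assuming `Survived` was read as a string column. The helper name is hypothetical, and `Fare` is included only to show that other columns can be listed next to the cast.

```python
def cast_survived_to_int(df):
    """Return Survived as an int column alongside Fare."""
    return df.selectExpr("cast(Survived as int) AS Survived", "Fare")
```

Inspecting `cast_survived_to_int(df).dtypes` is one way to check the resulting column types.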

4. Adding constants with selectExpr

One common use case is to add constant fields such as current_date, which can be done easily with selectExpr.
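A sketch of adding constant fields. The column names `Load_Date` and `Source` and the literal `'titanic'` are illustrative assumptions, not taken from the original notebook.

```python
def add_constants(df):
    """Keep all columns and append a load date and a fixed source tag."""
    return df.selectExpr(
        "*",
        "current_date() AS Load_Date",  # same date on every row
        "'titanic' AS Source",          # string literal; hypothetical value
    )
```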

3. Leveraging Python power (list comprehension) with select

Since select accepts a list, we can use a list comprehension to select certain columns. Say the user has given a list of columns to use when manipulating the df.

We can use a list comprehension (or a generator expression unpacked with *) with an if clause to select the required columns.
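One way this could look, assuming the user's list is a plain Python list of column names. Both helper names are hypothetical: the first keeps only the requested columns that exist in `df`, the second keeps everything except them.

```python
def select_requested(df, requested):
    """Select the columns of df that appear in the user-supplied list."""
    wanted = set(requested)
    return df.select([c for c in df.columns if c in wanted])


def drop_requested(df, unwanted):
    """Select every column of df except the ones in the user-supplied list."""
    skip = set(unwanted)
    return df.select([c for c in df.columns if c not in skip])
```

Because the comprehension iterates over `df.columns`, the output keeps the DataFrame's own column order, and names that are not in the DataFrame are ignored instead of raising an error.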

Github Link: https://github.com/SomanathSankaran/spark_medium/tree/master/spark_csv

Next Tutorial : https://medium.com/@somanathsankaran/spark-group-by-and-filter-deep-dive-5326088dec80

Please send me the Spark topics you would like me to cover, along with any suggestions for improving my writing :)

Learn and let others Learn!!

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com

Written by somanath sankaran, Big Data Developer interested in python and spark
