Exploring Spark UDFs

Exploring Spark User Defined functions with example

Published in

Analytics Vidhya

3 min readJul 30, 2019

Hey Fellas,

Many of us would have faced serious issues while performing lambda functions in spark Dataframes and in RDDs. Even though lambda functions are very much easy and useful at the sametime, doing some tasks using lambda functions is bit cumbersome sometimes.

So, if you all are bit stuck when applying lambda functions, let me introduce you all to the concept of UDFs(User Defined Functions) in Spark, it does something like…..

Correct, spark UDFs are basically passing a python function to be used in Spark Dataframes or RDDs(bit orthodox I reckon), interesting isn’t it!

Let me help you explore more in this concept by an example. I have dataset which consists occupational data as shown in figure.

So, I want to get count of occupations in a particular age group of range 10, i.e. 10–20, 20–30,… So to get the age group we will be using a Spark UDF function.

Let me start by reading the dataset…

userDF = spark.read.option(“header”, “true”) \
 .option(“delimiter”, “;”) \
 .option(“inferSchema”, “true”) \
 .csv(“user.csv”) \
 .select(‘userid’,’age’,’gender’,’occupation’)

So, now we have the dataset, lets quickly make a python function using if-else conditional chain to divide ages.

def age_group(age):
 if age < 10:
 return ‘0–10’
 elif age < 20:
 return ‘10–20’
 elif age < 30:
 return ‘20–30’
 elif age < 40:
 return ‘30–40’
 elif age < 50:
 return ‘40–50’
 elif age < 60:
 return ‘50–60’
 elif age < 70:
 return ‘60–70’
 elif age < 80:
 return ‘70–80’
 else:
 return ‘80+’

Now that our python function is ready, we will make a spark UDF function and pass the python function in it.

from pyspark.sql.functions import udf
agegroup_udf = udf(lambda z: age_group(z))
userDFAgeGroup = userDF.withColumn(‘AgeGroup’,agegroup_udf(F.col(‘age’)))

So, basically we will get agegroup_udf function which now can be passed in Dataframe to perform operations on its columns. So, I have applied the UDF function on age column to get AgeGroup (so obvious!).

Lets check our newly born dataframe’s health

Hallelujah, we got the AgeGroup column by applying UDF, wasn’t it simple? Now lets aggregate the data based on our newly added column and get the final result.

userDF_count = userDFAgeGroup.groupBy(F.col(‘AgeGroup’),F.col(‘occupation’)) \
 .agg(F.count((F.col(‘occupation’))).alias(‘count’))

Let me show you the end result I got.

Phew, finally we reached our landmark and got the count per each age group and occupation. The uses and benefits of UDFs are endless as this example just opening a small window in the world of UDFs.

Hope you liked my article, if you want to checkout full code checkout my github link below:

https://github.com/AbidMerchant/Pyspark_udf_partition

That’s all Folks! Ta Ta

Exploring Spark UDFs

Exploring Spark User Defined functions with example

Written by Abid Merchant