Three mistakes could be made with PySpark

Three common mistakes are easy to make in your PySpark projects. They may confuse your colleagues and make you pull your hair out.

Published in CodeX · 4 min read · Jun 5, 2021

After one crazy week of working on a Databricks project, I made a lot of mistakes and learned a lot from them. Here are some tips on how to avoid the mistakes I made.

Use Concatenated Spark SQL string in functions

With PySpark, we can query a Spark DataFrame with either Spark SQL or the DataFrame DSL (domain-specific language).

The Spark SQL way:

# create a temporary view from the Spark DataFrame
sdf.createOrReplaceTempView('sdf_view')
# define your SQL query as a string
sql_string = "select * from sdf_view"
# execute the Spark SQL query
result_df = spark.sql(sqlQuery=sql_string)

With the DataFrame DSL, you can query the data without creating any views, almost like what you would do with a Pandas DataFrame.

result = sdf.select("column1", "column2")

I got a requirement to build a function that accepts a list of parameter pairs and queries the data dynamically. Something like this:

select * from sdf_view 
where  
    condition1 =…
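
To make the requirement concrete, here is a minimal sketch of what such a function might look like when written with string concatenation, the approach this section's heading names. The function name `build_query`, the second column `condition2`, and the values `'a'` and `'b'` are hypothetical and used only for illustration; they are not from the original project.

```python
from typing import List, Tuple


def build_query(view_name: str, conditions: List[Tuple[str, str]]) -> str:
    """Concatenate (column, value) pairs into a Spark SQL string.

    Hypothetical helper: every value is inserted as-is inside single quotes.
    """
    sql_string = f"select * from {view_name}"
    # One "column = 'value'" clause per parameter pair
    clauses = [f"{column} = '{value}'" for column, value in conditions]
    if clauses:
        sql_string += " where " + " and ".join(clauses)
    return sql_string


# Hypothetical parameter pairs
print(build_query("sdf_view", [("condition1", "a"), ("condition2", "b")]))
# prints: select * from sdf_view where condition1 = 'a' and condition2 = 'b'
```

The returned string could then be passed to spark.sql(). Note that the values are pasted into the string unchanged, so a value that itself contains a single quote would produce a broken query.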


Written by Andrew Zhu (Shudong Zhu)