Three Mistakes You Could Make with PySpark

Three common mistakes you can make in your PySpark projects. These mistakes may confuse your colleagues and make you pull your hair out.

Andrew Zhu (Shudong Zhu)
CodeX


Image by my boy, Charles Zhu

After one crazy week of working on a Databricks project, I made a lot of mistakes and hence learned a lot. Here are some tips to share on how to avoid the mistakes I made.

Use concatenated Spark SQL strings in functions

With PySpark, we can query a Spark DataFrame with either Spark SQL or the DataFrame DSL (domain-specific language).

The Spark SQL way:

# create a view from the Spark DataFrame
sdf.createOrReplaceTempView('sdf_view')
# define your SQL query as a string
sql_string = "select * from sdf_view"
# execute the Spark SQL and get back a DataFrame
result_df = spark.sql(sql_string)

With the DataFrame DSL, you can query the data without creating any views, much like you would with a Pandas DataFrame.

result = sdf.select("column1", "column2")
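
The DSL can express where conditions too, still without any temp view, using filter (or its alias where). For example (the column name and value here are just placeholders):

from pyspark.sql import functions as F

# filter rows where column1 equals a given value, then pick two columns
result = sdf.filter(F.col("column1") == "some_value").select("column1", "column2")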

I got a requirement to build a function that accepts a list of parameter pairs to query the data dynamically. Something like this:

select * from sdf_view 
where
condition1 =…
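
A minimal sketch of such a function, built by concatenating the condition pairs into one Spark SQL string, could look like this. The function name query_by_conditions and the (column, value) pair format are my own assumptions for illustration:

# a sketch, assuming conditions arrive as (column, value) pairs
def query_by_conditions(spark, conditions):
    # build "column = 'value'" clauses and join them with "and"
    where_clause = " and ".join(
        f"{col} = '{val}'" for col, val in conditions
    )
    sql_string = f"select * from sdf_view where {where_clause}"
    # run the concatenated SQL against the temp view created earlier
    return spark.sql(sql_string)

# usage, with made-up column names and values:
# result_df = query_by_conditions(spark, [("condition1", "a"), ("condition2", "b")])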
