Three mistakes could be made with PySpark

Three common mistakes could be made in your Spark PySpark projects. Those mistakes may confuse your colleagues and also make you pull your hair out.

Andrew Zhu (Shudong Zhu)
CodeX
Published in
4 min readJun 5, 2021

--

Image by my boy, Charles Zhu

After one crazy week of working on a Databricks project, I made a lot of mistakes and hence learned a lot. Here are some tips to share on how to make those mistakes I made.

Use Concatenated Spark SQL string in functions

With PySpark, we can either query a Spark Dataframe with Spark SQL or DataFrame DSL(domain-specific language).

The Spark SQL way:

# create a view from spark dataframe 
sdf.gud.createOrReplaceTempView('sdf_view')
# define you sql query as a string
sql_string = "select * from sdf_view"
# execute the spark SQL
result_df = spark.sql(sqlQuery = sql_string)

With Dataframe DSL, you can query the data without creating any views, almost like what you did with Pandas Dataframe.

result = sdf.select("colomn1","column2")

I got a requirement to build a function that will accept parameter pairs list to query data dynamically. Something like this:

select * from sdf_view 
where
condition1 = value 1
<and condition2 = value 2>
...

The conditions lines in whereclause are growing, which means the logic expression is undefined until the real parameters are passing in.

I thought Spark SQL is more used for Data Analysts and should be easier to transplant our existing Kusto code to PySpark code. So, I decided to use Spark SQL instead of the Dataframe DSL style.

To achieve the dynamic filters, I have to create a Spark SQL template.

sql_string = """
select * from sdf_view
where 1 = 1
<sub_conditions>
"""

Then, use a Python loop to replace the <sub_conditions> based on input parameters. An input parameter is a dictionary object.

for k,v in input_dict.items():
sql_string = sql_string + \
" and array_contains(array_column,'"+k+":"+v+"')"

--

--