spark dataframes select vs withcolumn

Deepa Vasanthkumar
3 min readJan 25, 2022

--

In this blog, we try to compare the pyspark methods select and withcolumn.

It is saying that, we need to use df.select than df.withColumn, unless the transformation is involved only for few columns. Why??

That is, In situations where we need to call withcolumn repeateadly, better to a single dataframe.select for that transformation.

Reason being : — DataFrames are immutable hence we cannot change anything directly on it. So every operation on DataFrame results in a new Spark DataFrame. So as many times, withColumn is called repeateadly, we creating a new dataframe on each such operation.


df.withColumn(“salary”,df2(“salary”).cast(“Integer”))
df.withColumn(“copied”,df2(“salary”)* -1)
df.withColumn(“salaryDBL”,df2(“salary”)*100)

✔ df.select(df.salary.cast(IntegerType).as(“salary”), (df.salary * -1).alias(‘copied’), (df.salary * 100).alias(‘salaryDBL’))

To elaborate this further, we use withcolumn for adding new column or any other transformation involving column. This is a handy transformation, if we need to alter schema to add/modify for fewer columns.

Usage:

This can be used to change the datatype of column

df1.withColumn(“newID”,col(“id”).cast(“Integer”))

This can be used to update existing column

df = df1.withColumn(“id”, col(“id”) +”-1000")

This can be used to add a new column

df = df1.withColumn(“temp”, col(“id”) +”-1000")

Spark Documentation:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame

this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.

withcolumn, can be to an extent, rewritten using select like

df1.withColumn(“name”,col(“firstname”)) ➡️ df1.select(“*”, (df1.firstname).alias(“name”))

📌 📌 A word of caution while using select ♟ : #select doesnt validate whether the column is already present in that dataframe.

Given a dataframe df1

This below statement works for both select and withcolumn. However behaviour is different.

#withColumn
df5= df1.withColumn(“lastname”,col(“firstname”))
df5.show()

withcolumn replaced already existing column with newer values.

#select

df6 = df1.select(“*”, (df1.firstname).alias(“lastname”))
df6.show()

Now — we see that there are two columns with same name, with different values and now this dataframe is kind of ambiguous, no further select or drop on these columns can be performed, as it leads to AnalysisException.

Example :

📌 📌 📌 Use #select when you already know the columns (schema) ahead and that there will not be any duplication.

So eventually each call to withColumn method calls withColumns which validates the columns for duplication and then translates the method to select.

Essentially, then each withColumn withColumns (to check schema for duplication) + select(replacedAndExistingColumns ++ newColumns : _*)

Thank you for supporting and don’t forget to 👏👏

--

--