Spark DataFrames: select vs withColumn

Deepa Vasanthkumar
3 min read · Jan 25, 2022

In this blog, we compare the PySpark methods select and withColumn.

It is often said that we should prefer df.select over df.withColumn, unless the transformation involves only a few columns. Why?

That is, in situations where we would otherwise call withColumn repeatedly, it is better to use a single DataFrame.select for those transformations.

Reason being: DataFrames are immutable, so we cannot change anything on them directly. Every operation on a DataFrame therefore results in a new Spark DataFrame. So each time withColumn is called, we create yet another new DataFrame.


df.withColumn("salary", df["salary"].cast("integer"))
df.withColumn("copied", df["salary"] * -1)
df.withColumn("salaryDBL", df["salary"] * 100)

✔ df.select(df.salary.cast("integer").alias("salary"), (df.salary * -1).alias("copied"), (df.salary * 100).alias("salaryDBL"))
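
To see the difference for yourself, you can compare the query plans of the two versions. Below is a minimal sketch, assuming a SparkSession named spark and a small sample DataFrame with a salary column (the sample data and names are only for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("select-vs-withcolumn").getOrCreate()

# hypothetical sample data
df = spark.createDataFrame([(1, 1000), (2, 2000)], ["id", "salary"])

# Repeated withColumn: each call returns a brand-new DataFrame,
# adding another Project node to the analyzed plan
chained = (df
           .withColumn("salary", col("salary").cast("integer"))
           .withColumn("copied", col("salary") * -1)
           .withColumn("salaryDBL", col("salary") * 100))
chained.explain(extended=True)

# Single select: all three expressions in one projection
# (note: only the selected columns are kept)
projected = df.select(
    col("salary").cast("integer").alias("salary"),
    (col("salary") * -1).alias("copied"),
    (col("salary") * 100).alias("salaryDBL"),
)
projected.explain(extended=True)

In the extended output, the analyzed plan of the withColumn chain shows one Project per call, while the select version shows a single Project. The Catalyst optimizer usually collapses adjacent projections later, but each withColumn call still triggers plan analysis on the driver, which adds up when many columns are transformed this way.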

To elaborate further, withColumn is used to add a new column or to apply a transformation to an existing column. It is a handy method when we only need to add or modify a few columns in the schema.

Usage:
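
As a quick reference, here is a minimal sketch of common withColumn usage, assuming a DataFrame df with name and salary columns (the column names and constants are only for illustration):

from pyspark.sql.functions import col, lit

# add a constant column
df = df.withColumn("country", lit("IN"))

# change a column's type
df = df.withColumn("salary", col("salary").cast("double"))

# derive a new column from an existing one
df = df.withColumn("bonus", col("salary") * 0.1)

# rename a column (withColumnRenamed, a related method)
df = df.withColumnRenamed("name", "employee_name")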
