Performing operations on multiple columns in a Spark DataFrame with foldLeft

Scala's foldLeft method iterates over a data structure while threading an accumulator through each step, which makes it a great way to run a series of operations on a Spark DataFrame. For example, foldLeft can eliminate all whitespace in multiple columns or convert all the column names in a DataFrame to snake_case.

foldLeft is great when you want to perform similar operations on multiple columns. Let’s dive in!

If you’re using the PySpark API, see this blog post on performing multiple operations in a PySpark DataFrame.
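
Before we get to DataFrames, here's a tiny pure-Scala example of foldLeft, just to show how it threads an accumulator value through a collection. This is exactly the pattern we'll use below, with a DataFrame as the accumulator.

val sum = List(1, 2, 3).foldLeft(0) { (acc, n) =>
  acc + n // acc starts at the seed value 0 and carries the running total
}
// sum: Int = 6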

Eliminating whitespace from multiple columns

Let’s create a DataFrame and then write some code that removes all the whitespace from every column.

import spark.implicits._ // assumes a SparkSession is in scope as `spark`

val sourceDF = Seq(
  (" p a b l o", "Paraguay"),
  ("Neymar", "B r asil")
).toDF("name", "country")

import org.apache.spark.sql.functions.{col, regexp_replace}

val actualDF = Seq(
  "name",
  "country"
).foldLeft(sourceDF) { (memoDF, colName) =>
  memoDF.withColumn(
    colName,
    regexp_replace(col(colName), "\\s+", "")
  )
}

actualDF.show()
+------+--------+
|  name| country|
+------+--------+
| pablo|Paraguay|
|Neymar|  Brasil|
+------+--------+
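
sourceDF is the initial accumulator value, and each pass through foldLeft returns a new DataFrame that is fed into the next pass. The foldLeft call above expands to the following chained withColumn calls:

val step1 = sourceDF.withColumn(
  "name",
  regexp_replace(col("name"), "\\s+", "")
)
val step2 = step1.withColumn(
  "country",
  regexp_replace(col("country"), "\\s+", "")
) // step2 is equivalent to actualDF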

We can improve this code by using the DataFrame#columns method and the removeAllWhitespace function defined in spark-daria.

import com.github.mrpowers.spark.daria.sql.functions.removeAllWhitespace // import path may vary by spark-daria version

val actualDF = sourceDF
  .columns
  .foldLeft(sourceDF) { (memoDF, colName) =>
    memoDF.withColumn(
      colName,
      removeAllWhitespace(col(colName))
    )
  }

Remember to use spark-daria for generic data wrangling like removing whitespace from a string. Generic data wrangling functions limit code complexity and yield code that’s more readable.
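
For reference, removeAllWhitespace is a thin wrapper around regexp_replace. Here's a minimal sketch along the lines of the spark-daria function (consult the library source for the exact signature):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.regexp_replace

// sketch of spark-daria's removeAllWhitespace
def removeAllWhitespace(col: Column): Column = {
  regexp_replace(col, "\\s+", "")
}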

snake_case all columns in a DataFrame

It’s easier to work with DataFrames when all the column names are in snake_case, especially when you're writing SQL. Let’s use foldLeft to convert all the columns in a DataFrame to snake_case.

val sourceDF = Seq(
  ("funny", "joke")
).toDF("A b C", "de F")

sourceDF.show()
+-----+----+
|A b C|de F|
+-----+----+
|funny|joke|
+-----+----+

val actualDF = sourceDF
  .columns
  .foldLeft(sourceDF) { (memoDF, colName) =>
    memoDF.withColumnRenamed(
      colName,
      colName.toLowerCase().replace(" ", "_")
    )
  }

actualDF.show()
+-----+----+
|a_b_c|de_f|
+-----+----+
|funny|joke|
+-----+----+

Wrapping foldLeft operations in custom transformations

We can wrap foldLeft operations in custom transformations to make them easy to reuse across multiple DataFrames. Let’s create a custom transformation for the code that converts all the DataFrame column names to snake_case.

import org.apache.spark.sql.DataFrame

def snakeCaseColumns(df: DataFrame): DataFrame = {
  df.columns.foldLeft(df) { (memoDF, colName) =>
    memoDF.withColumnRenamed(colName, toSnakeCase(colName))
  }
}

def toSnakeCase(str: String): String = {
  str.toLowerCase().replace(" ", "_")
}

val sourceDF = Seq(
  ("funny", "joke")
).toDF("A b C", "de F")

val actualDF = sourceDF.transform(snakeCaseColumns)

actualDF.show()
+-----+----+
|a_b_c|de_f|
+-----+----+
|funny|joke|
+-----+----+

The snakeCaseColumns custom transformation can now be reused on any DataFrame. By the way, this transformation is already defined in spark-daria.
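
Custom transformations also chain cleanly with the transform method. With a second transformation in scope (someOtherTransformation below is a hypothetical placeholder), you could write:

sourceDF
  .transform(snakeCaseColumns)
  .transform(someOtherTransformation)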

See this blog post if you’d like more background information on custom DataFrame transformations in Spark.

Next steps

If you’re still uncomfortable with the foldLeft method, try the Scala collections CodeQuizzes. You should understand foldLeft in Scala before trying to apply foldLeft in Spark.

Whenever you’re applying a similar operation to multiple columns in a Spark DataFrame, reach for foldLeft. It will cut the redundancy out of your code and keep complexity down. Try to wrap your foldLeft calls in custom transformations to make beautiful, reusable functions! 😉
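
As a parting example, here's a hypothetical trimColumns transformation (the name and column list are made up for illustration) that combines both ideas: a foldLeft over a list of columns, wrapped in a reusable custom transformation.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, trim}

// hypothetical helper: trims leading/trailing whitespace in the given columns
def trimColumns(colNames: Seq[String])(df: DataFrame): DataFrame = {
  colNames.foldLeft(df) { (memoDF, colName) =>
    memoDF.withColumn(colName, trim(col(colName)))
  }
}

sourceDF.transform(trimColumns(Seq("name", "country")))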