Get the best performance for PySpark jobs using parallelize

Anup Moncy · Published in Data Engineering · 2 min read · Oct 24, 2023

I have sometimes seen speedups of more than 25x when operations use parallelize. This does depend on the other workloads on the cluster, but the difference is still significant.

TL;DR:

If you have an array of n values that need to be passed, one at a time, to a function, the typical approach is a for loop that iterates over the values and calls the function once per value. That executes the function n times sequentially. parallelize lets us run those function calls in parallel across the cluster, which speeds up execution.

Details:

pyspark.SparkContext.parallelize is a method on SparkContext that creates a Resilient Distributed Dataset (RDD) from a local Python collection. This lets Spark distribute the data across multiple nodes instead of depending on a single node to process it.
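For example, here is a minimal sketch (assuming a SparkContext is already available as sc, as it is on a Databricks cluster) showing a local list being split into partitions that can be processed in parallel:

# Distribute a small local list across the cluster
rdd = sc.parallelize([1, 2, 3, 4, 5])

# Number of partitions the list was split into (depends on cluster defaults)
print(rdd.getNumPartitions())

# The elements grouped by partition
print(rdd.glom().collect())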

Explanation with examples:

Without parallelize:

A typical example that invokes a function once for each of n values:

# Input list
my_list = [1, 2, 3, 4, 5]

# Function execution: call the function once per value, sequentially
output = []
for value in my_list:
    output.append(execute_a_complex_function(value))

# Print the output
print(output)

As you can see, the for loop executes the function n times, where n is the number of values in the list, and the calls run one after another.

With parallelize (one option to demonstrate the approach):

import pyspark

# Use all local cores ('local' alone would give a single worker thread);
# on Databricks the SparkContext is already available as sc.
sc = pyspark.SparkContext('local[*]')

# Input list
my_list = [1, 2, 3, 4, 5]

# New function execution: apply the function to each element in parallel
mapped_rdd = sc.parallelize(my_list).map(lambda element: execute_a_complex_function(element))
output_list = mapped_rdd.collect()

# Print the output
print(output_list)

This code creates an RDD from the list my_list and then applies the lambda function to each element of the RDD in parallel. The results of the map are then collected back to the driver and printed to the console.

To use this code, simply replace execute_a_complex_function with your own function.
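The same pattern also works when the function takes more than one argument: parallelize a list of tuples and unpack them inside map. A hedged sketch, with a hypothetical two-argument function process_pair:

# Hypothetical two-argument function applied to a list of parameter pairs
params = [(1, "a"), (2, "b"), (3, "c")]

results = sc.parallelize(params).map(lambda p: process_pair(p[0], p[1])).collect()
print(results)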

Note:

This is a powerful option to ensure large-volume processing doesn’t run serially and take hours, but please be aware that, for the shorter time the job runs, it will consume more of the cluster’s memory and compute.
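One way to keep that resource spike under control (a sketch using the standard numSlices argument of parallelize, not something from the example above) is to cap the number of partitions, which caps how many tasks Spark creates for the stage:

# Limit the list to 2 partitions, so the stage runs at most 2 tasks at a time
mapped_rdd = sc.parallelize(my_list, numSlices=2).map(execute_a_complex_function)
output_list = mapped_rdd.collect()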

Also, on some Databricks clusters I worked on, parallelize only worked when the called function did not itself use Spark, because the SparkContext is only available on the driver. A complex spark.sql call or a write to a table inside the parallelized function failed with an error:

“Could not serialize object: RuntimeError: ….. …. SparkContext can only be used on the driver, not in code that runs on workers ….. “

The option that worked for me here was to move the Spark-dependent logic out of the function that needed parallelize, keeping only plain Python inside it (see the sketch below).
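As a hedged illustration of that workaround (the transform and table name here are hypothetical), the parallelized function contains only plain Python, and anything that needs the SparkSession runs afterwards on the driver:

# Plain-Python work only inside the parallelized function
def pure_python_transform(value):
    return value * 10  # hypothetical stand-in for the complex logic

# 1. Run the plain-Python work in parallel on the workers
results = sc.parallelize(my_list).map(pure_python_transform).collect()

# 2. Back on the driver, use Spark for the parts that need it
#    (assumes the SparkSession is available as spark, as on Databricks)
df = spark.createDataFrame([(r,) for r in results], ["result"])
df.write.mode("append").saveAsTable("my_schema.my_results")  # hypothetical table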
