Data Engineering

A short publication intended to discuss what's what in Data Engineering

Anup Moncy in Data Engineering

Oct 18, 2023

Troubleshoot Spark/Pyspark performance issues

Steps to help troubleshoot common performance issues in Spark/PySpark jobs, taking EMR/Databricks as an example. Of course, all of this applies only after confirming that there is no change in the data trend or volume.

TL;DR

Get best performance for PySpark jobs using Parallelize

I have sometimes seen speedups of more than 25x when operations use parallelize. This does depend on the other workloads on the cluster, but the difference is still significant.
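To make the idea concrete, here is a minimal sketch, assuming "parallelize" refers to `SparkContext.parallelize` and that the work can be split into independent items. The names `process_item`, `run_serial`, and `run_parallel` are hypothetical and exist only for illustration; the workload is a stand-in, not the author's actual job.

```python
from typing import Iterable, List


def process_item(n: int) -> int:
    """Hypothetical per-item workload: sum of squares of the integers below n."""
    return sum(i * i for i in range(n))


def run_serial(items: Iterable[int]) -> List[int]:
    """Baseline: process every item one after another on the driver."""
    return [process_item(n) for n in items]


def run_parallel(items: Iterable[int], num_slices: int = 8) -> List[int]:
    """Distribute the items as an RDD and process them on the executors."""
    # Imported lazily so the serial baseline works without PySpark installed.
    from pyspark.sql import SparkSession

    # Reuses the active session on EMR/Databricks, or starts a local one.
    spark = SparkSession.builder.appName("parallelize-sketch").getOrCreate()

    # num_slices (an assumed default of 8) controls how many partitions,
    # and therefore how many parallel tasks, the items are split into.
    rdd = spark.sparkContext.parallelize(list(items), numSlices=num_slices)

    # map() runs process_item on the executors; collect() brings the
    # results back to the driver in the original item order.
    return rdd.map(process_item).collect()
```

Calling `run_parallel(items)` should return the same list as `run_serial(items)`. Any speedup depends on how expensive each item is, how many executor cores are free, and what else is running on the cluster, so it is worth timing both versions on your own workload.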