I have seen sometimes even more that 25x speed when operations are using parallize. This does depend on the other workloads on the cluster. Still the difference is significant
Steps to help troubleshoot common performance issues in Spark/Pyspark jobs taking EMR/Databricks as example. Of-coarse all these after reviewing there is no change in the data trend or volume.
TL/DR
Most data platforms have evolved and so has the way we write SQLs. Although the basic fundamentals remain the same. I thought it would be useful to write about not so talked about SQL features.