The 5S Spark Optimization Series, Part 2: Tackling Skew Optimization for Balanced Excellence!
Content Catalog
- Introduction
- What is Skew?
- What are the issues with Skew?
- What causes Skew to occur?
- Solutions to Skew issues
1. Salting
2. Adaptive Query Execution (AQE)
3. In-memory Partitioning
4. Bucketing
- Conclusion
Introduction
In this article, we will dive deep into the Skew component of the 5S Optimization Framework for Spark. If you have not yet read my previous article on the 5S Optimization Framework Overview, I strongly encourage you to check it out for a better understanding of the framework and its overall approach to optimizing Spark applications.
The 5S Optimization Framework For Spark: Overview
What is Skew?
In distributed data processing, data skew occurs when the data being processed is not evenly distributed across partitions, so the worker nodes receive unequal amounts of work. The typical symptom is one partition holding significantly more data than the others, which means the task (and the worker node) that processes it does far more work than its peers.
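To make the definition concrete, here is a minimal plain-Python sketch (no Spark required) that simulates hash partitioning and measures how uneven the resulting partitions are. The function names `partition_sizes` and `skew_ratio`, the customer keys, and the choice of 8 partitions are all hypothetical, and `zlib.crc32` is only a stand-in for whatever hash function the engine actually uses.

```python
import zlib
from collections import Counter


def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    counts = Counter()
    for key in keys:
        # crc32 is stable across runs (Python's built-in hash() for str is not).
        # It is an assumption here, not the hash function Spark itself uses.
        partition = zlib.crc32(key.encode("utf-8")) % num_partitions
        counts[partition] += 1
    return [counts[p] for p in range(num_partitions)]


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical dataset: one "hot" key accounts for 9,000 of 10,000 records.
keys = ["customer_0"] * 9_000 + [f"customer_{i}" for i in range(1, 1_001)]

sizes = partition_sizes(keys, num_partitions=8)
ratio = skew_ratio(sizes)

print("records per partition:", sizes)
print("skew ratio (max / mean):", round(ratio, 2))
```

Because every record with the same key hashes to the same partition, the hot key forces at least 9,000 records into a single partition while the remaining seven share roughly 1,000 between them. A ratio close to 1.0 would indicate balanced partitions; the further above 1.0 it climbs, the more one task dominates the stage.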
What are the issues with Skew?
- Longer execution time…