The 5S Spark Optimization Series, Part 2: Tackling Skew Optimization for Balanced Excellence!

6 min read · Apr 23, 2023

Content Catalog

Introduction
What is Skew?
What are the issues with Skew?
What causes Skew to occur?
Solutions to Skew issues
1. Salting
2. Adaptive Query Execution
3. In-memory Partitioning
4. Bucketing
Conclusion

Introduction

In this article, we will dive deep into the Skew component of the 5S Optimization Framework for Spark. If you have not yet read my previous article on the 5S Optimization Framework Overview, I strongly encourage you to check it out for a better understanding of the framework and its overall approach to optimizing Spark applications.

The 5S Optimization Framework For Spark: Overview

What is Skew?

In distributed data processing, data skew occurs when the data being processed is not evenly distributed across partitions. Because each partition is handled by a single task, an imbalance in partition sizes means that some tasks, and the worker nodes running them, process significantly more data than others.
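To make the definition concrete, here is a minimal sketch in plain Python (not Spark itself) that simulates hash partitioning and measures how uneven the resulting partitions are. The function names `partition_sizes` and `skew_ratio`, the sample keys, and the choice of 4 partitions are all assumptions made for illustration. Spark's real partitioner uses a different hash function, but the effect of a dominant key is the same.

```python
def partition_sizes(keys, num_partitions):
    """Count how many records land in each partition under hash partitioning."""
    sizes = [0] * num_partitions
    for key in keys:
        # Every record with the same key goes to the same partition.
        sizes[hash(key) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 means balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical data: 1000 records keyed by integer customer ID.
balanced_keys = list(range(1000))                # every key appears once
skewed_keys = [0] * 900 + list(range(1, 101))    # key 0 owns 90% of the records

balanced = partition_sizes(balanced_keys, 4)
skewed = partition_sizes(skewed_keys, 4)

print("balanced:", balanced, "ratio:", skew_ratio(balanced))
print("skewed:  ", skewed, "ratio:", skew_ratio(skewed))
```

In the skewed case, one partition holds most of the records, so the task that processes it runs far longer than the others while the remaining tasks sit idle.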

What are the issues with Skew?

Longer execution time…

Written by Chenglong Wu