The 5S Spark Optimization Series, Part 2: Tackling Skew Optimization for Balanced Excellence!

Chenglong Wu
6 min read · Apr 23, 2023

Introduction

In this article, we will dive deep into the Skew component of the 5S Optimization Framework for Spark. If you have not yet read my previous article on the 5S Optimization Framework Overview, I strongly encourage you to check it out for a better understanding of the framework and its overall approach to optimizing Spark applications.

The 5S Optimization Framework For Spark: Overview

What is Skew?

In distributed data processing, data skew occurs when the data being processed is not evenly distributed across partitions. A few partitions end up much larger than the rest, so the tasks that handle them, and the worker nodes running those tasks, process significantly more data than the others.
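To make the definition concrete, here is a minimal sketch in plain Python (not Spark code) that simulates hash partitioning and measures how uneven the result is. The dataset, the key names, and the `skew_ratio` metric are assumptions for illustration, and `crc32` is only a stand-in for the hash function Spark actually uses.

```python
from zlib import crc32


def partition_sizes(keys, num_partitions):
    """Count rows per partition when each row goes to hash(key) % num_partitions."""
    sizes = [0] * num_partitions
    for key in keys:
        # crc32 is a deterministic stand-in hash, chosen only for this illustration
        sizes[crc32(key.encode()) % num_partitions] += 1
    return sizes


def skew_ratio(sizes):
    """Largest partition divided by the mean partition size (1.0 = perfectly balanced)."""
    mean = sum(sizes) / len(sizes)
    return max(sizes) / mean


# Hypothetical dataset: 1,000 keys with one row each, plus one hot key with 9,000 rows
keys = [f"customer_{i}" for i in range(1000)] + ["hot_customer"] * 9000

sizes = partition_sizes(keys, num_partitions=8)
print("rows per partition:", sizes)
print(f"skew ratio: {skew_ratio(sizes):.1f}")
```

Because every row with the same key lands in the same partition, the hot key alone puts at least 9,000 of the 10,000 rows into one partition, however many partitions we add.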

What are the issues with Skew?

  1. Longer execution time
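Why does skew lengthen execution time? A stage can only finish when its slowest task finishes. The sketch below is a simplified model, not Spark's real scheduler: the `stage_duration` helper, the task times, and the slot count are all hypothetical, and it assumes each free slot simply picks up the next task in order.

```python
import heapq


def stage_duration(task_seconds, num_slots):
    """Wall-clock time of a stage when tasks run on num_slots parallel slots.

    Each task is handed to whichever slot becomes free first.
    """
    slots = [0.0] * num_slots  # time at which each slot becomes free
    heapq.heapify(slots)
    for seconds in task_seconds:
        free_at = heapq.heappop(slots)
        heapq.heappush(slots, free_at + seconds)
    return max(slots)


# Both stages contain 80 seconds of total work (hypothetical numbers)
balanced = [10.0] * 8        # eight equally sized tasks
skewed = [73.0] + [1.0] * 7  # one straggler task plus seven tiny ones

print("balanced stage:", stage_duration(balanced, num_slots=4))
print("skewed stage:  ", stage_duration(skewed, num_slots=4))
```

In this model the total amount of work is identical, but the skewed stage takes as long as its single straggler task while the other slots sit idle.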
