How to Handle Data Skewness in Apache Spark

Ankush Singh
Jun 13, 2023


In the world of big data and distributed computing, Apache Spark is one of the most important frameworks available to data engineers and data scientists. When dealing with large volumes of data, however, it’s crucial to understand a concept that can cripple your Spark application’s performance: data skewness. This blog post will help you understand what data skewness is, what causes it, and how to overcome it to maintain optimal performance in your Spark applications.

What is Data Skewness?

Data skewness in Apache Spark refers to a condition where the data being processed is not distributed evenly across partitions. Ideally, data should be uniformly distributed across all partitions to maximize parallelism and, with it, processing speed. Real-world data is rarely so balanced, though, and when one or a few partitions hold a disproportionately large share of the data compared to the others, the data is skewed.

This imbalance can drastically degrade the performance of your Spark application, leading to longer processing times, inefficient use of resources, and even out-of-memory errors. In the worst case, a single partition with skewed data can slow down the entire Spark job, because the completion time of a Spark stage is determined by its slowest task.

Causes of Data Skewness

There are several potential causes of data skewness in Spark applications:

  1. Skewed Data Distribution: Real-world data is often distributed unevenly. Some keys occur far more often than others (hot keys), which skews the distribution of data across partitions. The sketch after this list shows a quick way to spot such keys.
  2. Inadequate Partitioning Strategy: The default partitioning strategy in Spark might not be the most efficient for your specific dataset. For example, the default hash partitioning can cause skewness if many records share keys that hash to the same partition.
  3. Join Operations: When performing join operations, if the join keys are not evenly distributed across the datasets being joined, the resulting shuffle is skewed. This is especially prominent when a large dataset is joined with a small dataset on a non-unique key.
  4. GroupBy Operations: Similar to joins, GroupBy operations can also cause data skewness when certain keys have many more values than others.
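
Before choosing a mitigation, it helps to confirm and locate the skew. Here is a minimal sketch, assuming a DataFrame with a grouping/join column named key (the column name and input path are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.desc

val spark: SparkSession = SparkSession.builder().appName("skew-check").getOrCreate()
import spark.implicits._

val df = spark.read.parquet("/path/to/data") // hypothetical input

// Row count per key, heaviest first; a handful of keys dominating the
// total is a strong sign of skew
df.groupBy("key").count().orderBy(desc("count")).show(20)

// Rows per partition; a large variance across partitions confirms the
// skew has reached the physical layout
df.rdd
  .mapPartitionsWithIndex((idx, it) => Iterator((idx, it.size)))
  .toDF("partition", "rows")
  .show()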

Handling Data Skewness in Apache Spark

Although data skewness can impact your Spark application’s performance significantly, several strategies can help manage and mitigate this issue:

  1. Custom Partitioning: Instead of relying on Spark’s default partitioning strategy, implement a custom one that distributes your data more evenly across partitions. For example, range partitioning can be more effective when dealing with numeric keys.
  2. Salting: Salting appends a random value (the salt) to the key, so that records sharing one hot key are spread across many distinct keys, and therefore across partitions.
  3. Dynamic Partition Pruning: Dynamic partition pruning optimizes join operations by skipping the scan of partitions of the (typically larger) partitioned table that cannot match the join filter. It is enabled by default since Spark 3.0 via the spark.sql.optimizer.dynamicPartitionPruning.enabled setting.
  4. Splitting Skewed Data: Another strategy is to split the skewed data across multiple partitions. This involves identifying the skewed keys and redistributing the data associated with them.
  5. Avoid GroupBy for Large Datasets: When possible, avoid groupByKey-style operations on large datasets with non-unique keys. Alternatives such as reduceByKey, which combines values locally on each partition before shuffling, can be far more efficient; see the sketch after this list.
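
To illustrate the reduceByKey alternative from point 5, here is a minimal word-count-style sketch (the input path and names are illustrative, and spark is a SparkSession as in the examples below):

import org.apache.spark.rdd.RDD

val pairs: RDD[(String, Int)] = spark.sparkContext
  .textFile("/path/to/input") // hypothetical input
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))

// reduceByKey computes partial sums per partition before the shuffle,
// so far less data moves across the network for hot keys
val counts = pairs.reduceByKey(_ + _)

// The groupByKey equivalent below ships every (key, 1) pair across the
// network before summing; avoid it on skewed keys:
// val counts = pairs.groupByKey().mapValues(_.sum)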

How to Handle Data Skewness?

Here are a few practical strategies, with Scala examples, for handling data skewness in Apache Spark:

1. Salting Technique

Salting is a useful technique when certain keys (hot keys) in your data have a high number of occurrences, which results in skewness. The idea is to add a random number (salt) to the key, so that the records with the same key are now spread across multiple keys.

Let's see how to apply salting in Spark with Scala:

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import scala.util.Random

val spark: SparkSession = ...

// Assume we have a DataFrame with data skewness
val skewedDF: DataFrame = ...

// Define a UDF that appends a random salt (0-99) to the key; marking it
// non-deterministic stops Spark from caching or reordering its evaluation
val addSaltUdf = udf((key: String) => key + "_" + Random.nextInt(100)).asNondeterministic()

// Add salt to the key so one hot key becomes up to 100 distinct keys
val saltedDF = skewedDF.withColumn("key", addSaltUdf(col("key")))

// Operations on saltedDF now spread evenly across partitions
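
Salting changes the keys, so a grouped aggregation now needs two stages: first on the salted key, then on the original key once the salt is stripped. A minimal sketch of a per-key count, assuming the "originalKey_salt" format produced above:

import org.apache.spark.sql.functions.{col, regexp_extract}

// Stage 1: aggregate on the salted key; each hot key's load is now
// split across up to 100 tasks
val partialCounts = saltedDF.groupBy("key").count()

// Stage 2: strip the "_salt" suffix and combine the partial results
val finalCounts = partialCounts
  .withColumn("key", regexp_extract(col("key"), "^(.*)_\\d+$", 1))
  .groupBy("key")
  .sum("count")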

2. Custom Partitioning

If the default partitioning strategy in Spark doesn't distribute your data evenly, you can consider implementing a custom partitioning strategy. Let's take an example of using range partitioning:

import org.apache.spark.RangePartitioner
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

val spark: SparkSession = ...

// Assume we have a pair RDD with data skewness
val skewedRDD: RDD[(Int, String)] = ...

// Range partitioner that samples the keys and derives 10 balanced ranges
val customPartitioner = new RangePartitioner(10, skewedRDD)

// Shuffle records into the new ranges
val partitionedRDD = skewedRDD.partitionBy(customPartitioner)

// Operations on partitionedRDD now run over more evenly sized partitions
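
RangePartitioner works by sampling the RDD to choose split points, so the resulting ranges hold roughly equal numbers of records. Note that all records with the same key still land in the same range, so a single extremely hot key remains a bottleneck; in that case, combine this approach with salting or with splitting the skewed keys out, as shown next.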

3. Splitting Skewed Data

If you can identify the skewed keys, you can split the skewed data and non-skewed data into two separate RDDs/DataFrames and process them separately.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

val spark: SparkSession = ...

// Assume we have a DataFrame with data skewness
val skewedDF: DataFrame = ...

// Keys known to carry a disproportionate share of the data
// (e.g. identified with a per-key count as shown earlier)
val skewedKeys = Seq("key1", "key2", "key3")

// Split the data into a skewed part and a well-behaved part
val skewedData = skewedDF.filter(col("key").isin(skewedKeys: _*))
val nonSkewedData = skewedDF.filter(!col("key").isin(skewedKeys: _*))

// Process skewedData and nonSkewedData separately, then recombine
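
A common way to process the two parts separately is in a join: handle the hot keys with a broadcast join so their rows never shuffle, join the rest normally, and union the results. A minimal sketch, where dimDF is an assumed broadcast-sized lookup table (hypothetical):

import org.apache.spark.sql.functions.broadcast

// Assumed small lookup table to join against (hypothetical)
val dimDF: DataFrame = ...

// Hot keys: broadcast the small side so the skewed rows never shuffle
val joinedSkewed = skewedData.join(broadcast(dimDF), Seq("key"))

// Remaining keys: a regular shuffle join is fine once the hot keys are gone
val joinedRest = nonSkewedData.join(dimDF, Seq("key"))

// Recombine into the final result
val joined = joinedSkewed.unionByName(joinedRest)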

Remember, the right strategy to handle data skewness depends on the specifics of your data and application. Applying these strategies correctly can significantly improve the performance of your Spark applications. Keep these strategies in your toolkit for when you next face the challenge of skewed data in Apache Spark.

Read More

  1. What is Catalyst Optimizer
  2. Comparing Data Storage: Parquet vs Arrow

Conclusion

Data skewness is a common issue when dealing with big data in Apache Spark. Understanding what causes it and how to handle it effectively can greatly improve your Spark application’s performance and stability. The above strategies are not exhaustive, and the right solution will depend on the specifics of your data and application.

