Published in The Startup


Troubleshooting Stragglers in Your Spark Application

Stragglers in your Spark Application affect the overall application performance and waste premium resources.

Stragglers are detrimental to the overall performance of Spark applications and lead to resource wastage on the underlying cluster. Therefore, it is important to identify potential stragglers in your Spark job, determine the root cause behind them, and apply fixes or preventive measures.

What is a Straggler in a Spark Application?

A straggler is an exceptionally slow-executing task belonging to a particular stage of a Spark application. (Every stage in Spark is composed of one or more tasks, each computing a single partition out of the total partitions designated for the stage.) A straggler task takes an exceptionally long time to complete compared to the median or average time taken by other tasks in the same stage. There could be multiple stragglers in a Spark job, present either in the same stage or across multiple stages.

How Do Stragglers Hurt?

Even a single straggler in a Spark stage considerably delays the execution of that stage. Delayed execution of one or more stages, in turn, has a domino effect on the overall execution performance of the Spark application.

Further, stragglers can also waste cluster resources when static resources are configured for the job. This is because, until the stragglers finish, the resources already freed by the stage's completed tasks cannot be put to use in the future stages that depend on the straggler-affected stage.

How to Identify Them?

The presence of stragglers is suggested when a stage progress bar (available in the Spark UI) gets stuck near the end. To confirm, you can open the stage-specific details page listing the summary of task metrics for completed tasks in the stage. If the 'Max' duration among completed tasks is exceptionally higher than the 'Median' or '75th percentile' value, it indicates the possibility of stragglers in the corresponding job. Here is an example showing the task metrics snapshot of a straggler-inflicted stage of a job:

In the above snapshot, it is clearly evident that the stage suffered from stragglers, as the 'Max' task duration of 28 min is exceptionally higher than the '75th percentile' value of a meager 14 seconds. Individual stragglers can then be identified from the detailed task description section of the stage-specific page: tasks whose execution duration lies well above the '75th percentile' value can be classified as stragglers for the stage.
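The manual check above can be automated. Here is a minimal sketch (names and thresholds are my own, not Spark's) that flags tasks whose duration far exceeds the rest of the stage, using a median-multiplier criterion similar in spirit to the one Spark speculation uses:

```python
import statistics

def find_stragglers(durations_s, multiplier=1.5, quantile=0.75):
    """Flag suspected straggler tasks in a stage.

    A task is flagged when its duration exceeds both
    `multiplier` x the stage median and the given quantile value.
    Thresholds are illustrative, not official Spark defaults.
    """
    ordered = sorted(durations_s)
    median = statistics.median(ordered)
    q_value = ordered[int(quantile * (len(ordered) - 1))]
    threshold = max(multiplier * median, q_value)
    return [d for d in durations_s if d > threshold]

# 19 tasks finish in 10-14 seconds; one takes 28 minutes (1680 s),
# mirroring the snapshot discussed above.
durations = [10, 11, 12, 12, 13, 13, 13, 14, 14, 14,
             10, 11, 12, 12, 13, 13, 13, 14, 14, 1680]
print(find_stragglers(durations))  # -> [1680]
```

In practice you would feed this with per-task durations pulled from the Spark REST API or the event log rather than hard-coded values.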

What Causes Stragglers?

Here are some of the common reasons for stragglers:

Skewed partitioning: This is the most widespread cause of stragglers. Skewed partitioning results in skewed partitions (skewed with respect to the data size mapped to each partition). If the partition data directly reflects the data to be computed upon, skewed partitions result in a skewed distribution of computation time among the tasks assigned to them, thereby producing stragglers.

Skewed partitioning can arise in multiple scenarios, the most common being repartitioning of a Dataset or RDD on a partitioning key with a skewed distribution.

Skewed partitioning can also result when a bunch of unsplittable files, skewed by size, are read into a Dataset/RDD.
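To make the repartitioning case concrete, here is a small simulation (plain Python, standing in for Spark's hash partitioner) showing how a skewed key distribution maps almost all records to a single partition:

```python
from collections import Counter

def partition_sizes(keys, num_partitions):
    """Simulate hash partitioning: each record lands in
    hash(key) % num_partitions, like Spark's HashPartitioner."""
    sizes = Counter(hash(k) % num_partitions for k in keys)
    return [sizes.get(p, 0) for p in range(num_partitions)]

# A skewed key distribution: 90% of records share one hot key.
keys = ["hot_key"] * 900 + [f"key_{i}" for i in range(100)]
sizes = partition_sizes(keys, 8)
print(max(sizes), min(sizes))  # one partition holds at least 900 records
```

The task computing the oversized partition becomes the straggler, while tasks for the other partitions finish almost immediately.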

Skewed computation: This is another widespread cause of stragglers. If you are doing custom computation on partition data, the computation may get skewed among partitions owing to a dependence on certain attributes or properties of the data residing in each partition, even though the data itself is distributed fairly evenly. Once the computation is skewed, it can again result in stragglers.

Imbalanced/skewed computation can arise in scenarios where the work done in the partition's computing routine is directly proportional to certain properties of the data records in the partition. An example would be a Dataset pointing to a collection of FileWrapper objects, where all FileWrapper objects in one particular partition refer to relatively big files compared to those residing in the other partitions. That partition could therefore give rise to a straggler task.
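The FileWrapper example can be sketched as follows (the `size_mb` attribute and the numbers are hypothetical): record counts per partition are perfectly balanced, yet summing the per-record cost attribute reveals that one partition carries almost all the work.

```python
def partition_work(partitions):
    """Sum a per-record cost attribute (here, wrapped file size)
    for each partition. Balanced record counts can still hide
    badly skewed computation."""
    return [sum(rec["size_mb"] for rec in part) for part in partitions]

# Four partitions, 3 records each: counts are even, but the third
# partition wraps much larger files (illustrative sizes).
parts = [
    [{"size_mb": 5}, {"size_mb": 7}, {"size_mb": 6}],
    [{"size_mb": 4}, {"size_mb": 8}, {"size_mb": 5}],
    [{"size_mb": 900}, {"size_mb": 1200}, {"size_mb": 800}],
    [{"size_mb": 6}, {"size_mb": 5}, {"size_mb": 7}],
]
print(partition_work(parts))  # the third partition dominates
```

A cheap pre-pass like this over the cost-driving attribute can tell you whether to repartition by that attribute before running the expensive computation.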

Slow disk reads/writes: Stragglers can also arise in stages requiring disk reads/writes when some of the corresponding tasks are hosted on a server suffering from slow disk I/O.

Stages involve disk reads/writes when they need to execute shuffle read/write operations or save an intermediate RDD/Dataset to disk.

Higher number of cores per executor: Provisioning a higher number of cores per executor, in the range of 4 to 8, can also sometimes lead to stragglers. This can happen when compute- or resource-heavy tasks execute simultaneously on all the cores of an executor. Such simultaneous execution leads to contention for common resources, such as memory, within the hosting executor, which can trigger performance-deteriorating events such as heavy garbage collection, which in turn can give rise to stragglers in the stage.
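One way to reduce this contention is to provision fewer cores per executor while keeping total parallelism the same via more executors. A hypothetical submit command (application name and resource values are illustrative, not a recommendation):

```shell
# Fewer cores per executor means fewer tasks competing for the
# same executor heap; more executor instances preserve parallelism.
spark-submit \
  --executor-cores 2 \
  --executor-memory 8g \
  --conf spark.executor.instances=20 \
  my_app.py
```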

Possible Remedies:

Fair partitioning: Ensure that data is fairly distributed among all the partitions when the computation intensity is directly related to the size of the data contained in a partition. If there is flexibility in choosing a repartitioning key, go for a record attribute that provides high cardinality and is evenly distributed among the data records. If there is no such flexibility, and the repartitioning key distribution is highly skewed, one can try salting techniques.
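A minimal sketch of the salting idea (plain Python for illustration): appending a random salt to a hot key splits its records across several sub-keys, so they no longer all land in one partition. You then aggregate per salted key first and merge the partial results afterwards.

```python
import random
from collections import Counter

random.seed(42)  # for reproducibility of this sketch

def salted_key(key, num_salts):
    """Append a random salt so records sharing a hot key spread
    across up to `num_salts` sub-keys."""
    return f"{key}#{random.randrange(num_salts)}"

keys = ["hot_key"] * 1000
plain = Counter(keys)                              # 1 group, 1000 records
salted = Counter(salted_key(k, 10) for k in keys)  # ~10 groups of ~100
print(len(plain), len(salted))
```

The trade-off is a second aggregation step to combine the per-salt partials, which is usually far cheaper than a single straggler task processing the whole hot key.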

Fair computation: Also ensure that computation is evenly distributed among partitions when computation intensity is not directly related to the size of data in a partition but depends on a certain attribute or field in each data record.

Increased number of partitions: Increasing the number of partitions can decrease the magnitude of the performance penalty inflicted by stragglers. However, this helps only to a certain extent. (To know more about Spark partitioning tuning, you can refer to "Guide to Spark Partitioning: Spark Partitioning Explained in Depth".)

Turning on the Spark speculation feature: The speculation feature, which actively identifies slow-running tasks and launches speculative duplicate copies of them (the first attempt to finish wins, and the other is killed), can optionally be turned on to deal with stragglers. It helps with stragglers that arise due to a resource crunch on executors or slow disk/network reads/writes on the hosting server. By default, the feature is turned off; one can enable it by setting the Spark config spark.speculation to true. The speculation feature also provides various other knobs (Spark configs) to fine-tune straggler identification and handling. Here is a description of these knobs:

a) spark.speculation.interval (default: 100ms): How often Spark checks for tasks to speculate. This usually does not need tuning.

b) spark.speculation.multiplier (default: 1.5): How many times slower a task is than the median to be considered for speculation. This is used as a criterion for identifying the stragglers and can be tuned based on Spark application behavior.

c) spark.speculation.quantile (default: 0.75): Fraction of tasks which must be complete before speculation is enabled for a particular stage. This decides when speculative execution kicks in for the remaining slow tasks, and it can also be tuned based on Spark application behavior.
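Putting the knobs together, a speculation-enabled submission might look like this (the multiplier and quantile values are illustrative choices, not recommendations):

```shell
# Enable speculation and tune its knobs: only speculate once 90% of
# tasks are done, and only for tasks 2x slower than the median.
spark-submit \
  --conf spark.speculation=true \
  --conf spark.speculation.interval=100ms \
  --conf spark.speculation.multiplier=2 \
  --conf spark.speculation.quantile=0.9 \
  my_app.py
```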

Although speculation seems a ready-made feature to address the straggler menace, try it only after studying your application's behavior. I would rather advise first finding the root cause of the stragglers and fixing it accordingly, because killing and re-launching tasks can, at times, result in inconsistencies in application output.

Here is the task metrics snapshot of the earlier example after the application was reworked to address the straggler problem.

As you can see from the snapshot, the max task duration has now been reduced to 1.7 minutes from the earlier 28 minutes.

If you faced stragglers due to a different reason, please mention it in the comments along with the root cause and the remedy you undertook to resolve them.
