Spark 1.6 java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE

Reason for the error:

This error occurs when your RDD/DataFrame has partitions larger than 2 GB.

It happens because Spark cannot handle shuffle blocks larger than 2 GB: Spark stores each shuffle block as a ByteBuffer, and a ByteBuffer is limited to Integer.MAX_VALUE bytes (~2 GB).
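The 2 GB ceiling is simply Integer.MAX_VALUE expressed in bytes, since a ByteBuffer is indexed by an `int`. A quick sanity check of the arithmetic:

```python
# java.lang.Integer.MAX_VALUE is 2^31 - 1; a ByteBuffer indexed by an int
# therefore tops out just under 2 GiB.
INT_MAX = 2**31 - 1            # 2147483647
limit_gib = INT_MAX / 1024**3  # just under 2 GiB
print(INT_MAX, round(limit_gib, 3))
```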


If you see the above error in Spark 1.6, you can try the following:

  • Increase the number of partitions of your dataset, via spark.sql.shuffle.partitions (for DataFrames) or spark.default.parallelism (for RDDs)
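A minimal sketch of raising the shuffle-partition count (PySpark 1.6 style); the existing SparkContext `sc` and the value 400 are illustrative assumptions, not recommendations:

```python
from pyspark.sql import SQLContext

# Assumes a SparkContext `sc` already exists (e.g. in spark-shell / pyspark)
sqlContext = SQLContext(sc)

# Raise the number of shuffle partitions so each shuffle block stays < 2 GB
sqlContext.setConf("spark.sql.shuffle.partitions", "400")

# spark.default.parallelism applies to RDD operations and is usually set
# at submit time, e.g.:
#   spark-submit --conf spark.default.parallelism=400 ...
```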

  • Try to repartition before the offending operation (to make sure the chosen number of partitions is high enough to keep each block below 2 GB)
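A rough way to pick a "high enough" partition count is to divide the stage's total data size by a target block size comfortably under the 2 GB limit. The 512 MB target below is an illustrative assumption:

```python
import math

def partitions_for(total_size_bytes, target_block_bytes=512 * 1024 * 1024):
    """Rough partition count so each shuffle block stays well under the
    2 GB limit (here targeting ~512 MB per partition)."""
    return max(1, math.ceil(total_size_bytes / target_block_bytes))

# e.g. a ~1 TiB shuffle stage:
n = partitions_for(1 * 1024**4)  # -> 2048 partitions of ~512 MB each
```

You would then call `df.repartition(n)` before the heavy operation.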


  • In Spark 1.6, Datasets/DataFrames, unlike RDDs, cannot use a custom partitioner. You can try to work around this by converting to an RDD, applying a partitioner, and then converting back to a DataFrame
  • Since Spark 1.6 has no custom partitioner for DataFrames, you can also try ordering your DataFrame before repartitioning it
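The RDD round-trip above can be sketched as follows (PySpark 1.6 style); `df`, the key column `"k"`, and the count 400 are illustrative assumptions:

```python
# Key the rows by the column you want to partition on
keyed = df.rdd.map(lambda row: (row["k"], row))

# Hash-partition on that key (a custom Partitioner can be passed via
# partitionFunc if hash partitioning is not enough)
partitioned = keyed.partitionBy(400)

# Drop the keys and convert back to a DataFrame
df2 = sqlContext.createDataFrame(partitioned.values())
```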


  • If you create a HiveContext, you can use the HiveQL DISTRIBUTE BY colX... and CLUSTER BY colX... clauses
hiveContext.sql("select * from partition_df DISTRIBUTE BY column1 SORT BY column1, column2")
  • Spark uses different data structures during shuffle when the number of partitions is > 2000, so if you are close to 2000 you could try going above that threshold. However, do not increase the number of partitions too much, or you will run into other problems (negative active-task counts, too much time spent planning, ...)
  • Try to filter out your data before the step that throws this error
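Filtering before the shuffle can be as simple as dropping rows and columns you do not need; `df`, the column names, and the predicate below are illustrative assumptions:

```python
# Reduce the data volume entering the shuffle: keep only the rows and
# columns the downstream step actually needs.
slim = (df
        .filter(df["event_date"] >= "2016-01-01")
        .select("user_id", "amount"))
```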

Jira issue:
