Solving the Small File Problem in Iceberg Tables

Thomas Cardenas
Ancestry Product & Technology
3 min readAug 29, 2023

--

Photo by Wesley Tingey on Unsplash

The Data Platform team at Ancestry has been maintaining a fully-refreshed 100-billion-row Apache Iceberg table for several months. A previous article documents how that table was able to be maintained from an update perspective. However, there was one issue that was not mentioned. That was the small file problem. When updates occurred, the number of files would explode from three thousand to two million files after merging 500MB of data.

This caused the need to compact after each run which in turn dramatically increased the cost of S3 requests (PUT, COPY, POST, LIST requests). There were three factors that helped to solve this issue: not relying on defaults, setting shuffle partition, and distribution write mode.

The readers of this data were also greatly impacted. There were out-of-memory issues and the cost grew when reading.

Not Relying on Defaults

Iceberg has a table property to specify how big a file should be. This would be property write.target-file-size-bytes that has a default value of 512MB. This was the case for the Data Platform team’s table. When appending or merging data, the files were a couple kilobytes big, though. When it was compacted, it utilized 512MB. The problem was the need to create the 2 million files, first, to then just reduce it back to a couple hundred files.

-- 536870912 bytes = 512MB
ALTER TABLE catalog.db.table
SET TBLPROPERTIES ('write.target-file-size-bytes'='536870912')

After running the above property to explicitly set the value, the files instantly dropped from 2 million to just over ten thousand files from running the merge operation. This was great! It worked for the append method, too.

Setting Spark Configuration

Ten thousand files was still way too many, though. When joining the dataset during the MERGE operation, one option to further decrease the number of files is to set a common Spark configuration: spark.sql.shuffle.partitions. There are two ways that this value is used: the default value of 200 can be used, or, if reading from a file system like S3, Spark will determine the number of partitions. For an example of the latter, a 512MB file might be split into 10 partitions when reading.

spark-submit --conf spark.sql.shuffle.partitions=100 app.jar

Explicitly setting the value to 100 when running a spark-submit command led to the number of files decreasing to just over a few hundred files when the data set was merged.

Write Distribution Mode

Dropping the number of files was a great cost and performance optimization. Unfortunately, even with setting those configurations when merging data, the memory utilization was still too high.

To solve for this, another table property was found to be helpful: write.distribution.mode=hash. The default value is ‘none’ which indicates for Iceberg not to shuffle the data. Doing hash mode led to further drops in memory utilization as well as the number of files.

ALTER TABLE catalog.db.table 
SET TBLPROPERTIES ('write.distribution.mode'='hash')

Conclusion

Solving the small file problem with Iceberg was a tremendous win. This type of issue would likely have gone unnoticed in much smaller datasets. Now the team will be able to review and examine those, as well, to see if further optimizations are possible there to save on additional storage and processing costs.

The reads do not cost much at all now even for a full table scan although it is sluggish. Iceberg supports S3 Acceleration which may resolve this issue. This will be an upcoming feature to test how it affects read performance and costs.

If you’re interested in joining Ancestry, we’re hiring! Feel free to check out our careers page for more info. Also, please see our medium page to see what Ancestry is up to and read more articles like this.

--

--