Dealing with Large gzip Files in Spark

Prasanna Parasurama
2 min read · Sep 15, 2019


I was recently working with a large time-series dataset (~22 TB) and ran into a peculiar issue with large gzipped files and spark dataframes.

The raw data was already on the Hadoop Distributed File System (HDFS), partitioned by year-month-date. Each date had multiple parts, and each part was a gzipped CSV file of roughly 4 GB. The task at hand was to process this raw data and save the result in parquet format:

  1. Cast datatypes properly
  2. Separate time-variant and time-invariant data
  3. Sort data logically
  4. Save dataframes in parquet format

Easy enough to do with spark dataframes… or so I thought.
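For reference, the intended dataframe pipeline looked roughly like the sketch below. The schema, column names, and paths are made up for illustration; the actual dataset's columns aren't part of this post.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructType, StructField, StringType,
                               TimestampType, DoubleType)

spark = SparkSession.builder.appName("process-timeseries").getOrCreate()

# Hypothetical schema -- the real columns aren't shown in the post.
schema = StructType([
    StructField("entity_id", StringType()),
    StructField("event_time", TimestampType()),
    StructField("value", DoubleType()),
    StructField("category", StringType()),  # a time-invariant attribute
])

# 1. Cast datatypes by reading with an explicit schema
df = spark.read.csv(
    "hdfs:///data/raw/year=2019/month=09/day=15/*.csv.gz",
    schema=schema,
    timestampFormat="yyyy-MM-dd HH:mm:ss",
)

# 2. Separate time-variant and time-invariant data
time_variant = df.select("entity_id", "event_time", "value")
time_invariant = df.select("entity_id", "category").dropDuplicates()

# 3. Sort logically and 4. save in parquet format
(time_variant.sort("entity_id", "event_time")
             .write.mode("overwrite")
             .parquet("hdfs:///data/processed/time_variant"))
(time_invariant.sort("entity_id")
               .write.mode("overwrite")
               .parquet("hdfs:///data/processed/time_invariant"))
```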

Problem — Memory Issues with Dataframes

The first naive approach was to process the data and repartition it in one go with spark dataframes, which quickly ran into memory issues. Since gzipped files are not splittable, each ~4 GB part file had to be decompressed and processed by a single task on a single executor. In theory, increasing the executor memory should’ve worked. In practice, the job still ran into memory issues.
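Concretely, this first attempt just tacked a repartition onto the pipeline above (the partition count and paths are, again, hypothetical):

```python
# Naive attempt: decompress, process, and repartition in a single job.
# Each gzipped part is still read by a single task, so the executor
# handling a ~4 GB file can blow past its memory limit.
df = (spark.read.csv(
          "hdfs:///data/raw/year=2019/month=09/day=15/*.csv.gz",
          schema=schema)
      .repartition(400))

df.write.mode("overwrite").parquet("hdfs:///data/processed/all")
```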

Next, I tried using spark dataframes just to repartition the data into smaller chunks and write them back to HDFS, deferring the rest of the processing. This also quickly ran into memory issues.
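That repartition-only attempt was along these lines (same hypothetical paths), and it hit the same memory wall:

```python
# Repartition with dataframes only, deferring the real processing --
# this also failed with memory issues.
(spark.read
      .csv("hdfs:///data/raw/year=2019/month=09/day=15/*.csv.gz")
      .repartition(400)
      .write.mode("overwrite")
      .csv("hdfs:///data/staging/year=2019/month=09/day=15"))
```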

Solution

One solution is to avoid dataframes for the repartitioning step and use RDDs instead: read in the gzipped files as RDDs, repartition them so each partition is small, and save them in a splittable format (for example, snappy).
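A minimal sketch of that repartitioning step with the RDD API, assuming the same hypothetical paths (the partition count is something you’d tune to your data):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("regzip-raw-files").getOrCreate()
sc = spark.sparkContext

# Read the gzipped CSVs as plain text lines (one RDD element per line).
raw = sc.textFile("hdfs:///data/raw/year=2019/month=09/day=15/*.csv.gz")

# Spread the lines across many small partitions.
repartitioned = raw.repartition(400)

# Write back compressed with snappy instead of gzip.
repartitioned.saveAsTextFile(
    "hdfs:///data/staging/year=2019/month=09/day=15",
    compressionCodecClass="org.apache.hadoop.io.compress.SnappyCodec",
)
```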

Once all the files are repartitioned, we can read the snappy partitions back in as spark dataframes and process them as usual.
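Reading the repartitioned data back is then just the usual dataframe read; the earlier (hypothetical) schema applies:

```python
# The compression codec is inferred from the .snappy file extension.
df = spark.read.csv(
    "hdfs:///data/staging/year=2019/month=09/day=15",
    schema=schema,
)
# ...continue with the processing steps from the original pipeline.
```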

This solution worked without any hiccups, though I don’t fully understand why repartitioning worked with RDDs but not dataframes.

Alternate Solutions?

I’d be curious to know:

  • why the memory issue with repartitioning was specific to spark dataframes and not RDDs
  • if there are faster/more efficient solutions to address this issue
