Surprises while splitting CSVs

At carwow, we load all our production Postgres databases into a data warehouse daily; the process is pretty much:

  1. Use Postgres’ COPY ... TO STDOUT to extract the data as a CSV
  2. Split the CSV into roughly evenly-sized, gzipped chunks
  3. Upload these to Amazon S3
  4. Run a COPY INTO to load the data

After a quick Google, I found the standard UNIX split command. Sadly, there are some problems with this:

  1. split doesn’t understand CSVs, so it may split a single database row across two files if any field in that row contains a newline
  2. split doesn’t work on compressed files, and because rows vary in size, splitting by line count produces files with vastly different sizes

To resolve these, I wrote a little class that receives lines streamed from the Postgres COPY and calls a block with each file as it completes.
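A minimal sketch of the idea, assuming a CsvSplitter class name, an add_row entry point and a 50MB target chunk size (check_interval and rotate_files_if_necessary come up again below):

```ruby
require "tempfile"
require "zlib"

# Receives CSV lines one at a time, gzips them into a Tempfile, and yields
# each file to the block once it has grown past max_file_size.
class CsvSplitter
  MEGABYTE = 1024 * 1024

  def initialize(max_file_size: 50 * MEGABYTE, check_interval: 1_000_000, &on_file_complete)
    @max_file_size = max_file_size
    @check_interval = check_interval
    @on_file_complete = on_file_complete
    @rows_written = 0
    open_new_file
  end

  # Called once per line streamed out of the Postgres COPY.
  def add_row(line)
    @gzip.write(line)
    @rows_written += 1
    rotate_files_if_necessary
  end

  # Hand the final, partially-filled chunk to the block.
  def finish
    close_current_file
  end

  private

  def rotate_files_if_necessary
    # Only check the size every check_interval rows: flushing forces buffered
    # data through the compressor, which hurts the ratio and the throughput.
    return unless (@rows_written % @check_interval).zero?

    @gzip.flush
    return if @current_file.size < @max_file_size

    close_current_file
    open_new_file
  end

  def open_new_file
    @current_file = Tempfile.new("csv-chunk")
    @gzip = Zlib::GzipWriter.new(@current_file)
  end

  def close_current_file
    @gzip.finish            # writes the gzip trailer without closing the Tempfile
    @current_file.rewind
    @on_file_complete.call(@current_file)
    @current_file.close!    # delete the temp file once the block is done with it
  end
end
```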

Using it looks roughly like this.
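A sketch, assuming the pg gem’s copy_data / get_copy_data streaming API, a Heroku-style DATABASE_URL, an example users table, and a hypothetical upload_to_s3 helper:

```ruby
require "pg"

connection = PG.connect(ENV.fetch("DATABASE_URL"))

splitter = CsvSplitter.new do |file|
  upload_to_s3(file) # hypothetical helper that pushes the gzipped chunk to S3
end

# Stream the table out of Postgres as CSV, one line at a time.
connection.copy_data("COPY users TO STDOUT WITH (FORMAT csv)") do
  while (line = connection.get_copy_data)
    splitter.add_row(line)
  end
end

splitter.finish
```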

There are two main lessons I learned writing this:

Memory isn’t always faster than disk

Initially, the current_file was a StringIO, not a Tempfile, on the reasoning that it would only grow to ~50MB, and writing to disk is slower. Sadly, that estimate was wrong: memory usage could grow to ~100MB per file or more. After a bit of digging, it turned out this is because GzipWriter buffers data in memory before writing it out. On top of that, the string kept growing over several minutes, which made it difficult for the garbage collector to manage.
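The fix was essentially to change which IO the GzipWriter wraps; in sketch form:

```ruby
require "stringio"
require "tempfile"
require "zlib"

# Before: the compressed output accumulates in a Ruby String, entirely in memory
gzip = Zlib::GzipWriter.new(StringIO.new)

# After: flushed output goes straight to a file on disk instead
gzip = Zlib::GzipWriter.new(Tempfile.new("csv-chunk"))
```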

The huge memory usage of the StringIO approach pushed Ruby into swap, which made the whole process much slower, and it would often crash after running out of memory entirely. Having to attempt the export several times was far slower than it working first time!

Sometimes accuracy doesn’t matter

At first, check_interval was set to 1000 rows. This meant each file was almost exactly 50MB. Increasing the interval to 1 million rows means that chunks are sometimes closer to 60MB than 50MB, but for this use-case it doesn’t really matter.

In rotate_files_if_necessary, we have to call GzipWriter#flush in order to measure how much compressed data has actually been written to the current file, since flushing forces the buffered data to be compressed and written out. Doing this too often hurts the compression ratio and reduces throughput.

In summary, it’s always worth checking your initial assumptions, ideally against real production data. The StringIO vs. Tempfile tradeoff here only applies because Heroku limits memory usage to 512MB in production; in development I have at least 4GB free! Plus, my development databases are far, far smaller than the production ones, so these issues were almost impossible to reproduce locally.

Always test…in production.