Surprises while splitting CSVs
At carwow, we load all our production Postgres databases into a data warehouse daily; the process is pretty much:
COPY FROMto extract the data as a CSV
- Split the CSV into roughly evenly-sized, gzipped chunks
- Upload these to Amazon S3
- Run a
COPY INTOto load the data
After a quick Google, I found the standard UNIX
split command. Sadly, there’s some problems with this:
splitdoesn’t understand CSVs, so it may separate a single database row across 2 files, if any of the contents of that row includes a newline
splitdoesn’t work on compressed files, and rows may be different sizes, so the resulting files may have vastly different sizes
To resolve these, I wrote a little class to receive lines streamed from the Postgres COPY, and call a block with each file as it completes:
The use looks like:
There’s two main lessons I learned writing this:
Memory isn’t always faster than disk
current_file was a
StringIO, not a
Tempfile. The reasoning being that it would only grow to ~50MB, and writing to disk is slower. Sadly, this estimate was wrong — memory usage could grow to ~100MB per file or more. After a bit of digging, it turns out this is because
GzipWriter is buffering the data in memory before writing it out. Plus, the string would grow over several minutes, making it difficult for the garbage collector to manage.
The huge memory usage of
StringIO would cause Ruby to start using swap memory, which made the whole process much slower, and would often crash due to running out of memory. Having to attempt the export several times was much slower than it working first time!
Sometimes accuracy doesn’t matter
check_interval was set to 1000 rows. This meant each file was almost exactly 50MB. Increasing the interval to 1 million rows means that chunks are sometimes closer to 60MB than 50MB, but for this use-case it doesn’t really matter.
rotate_files_if_necessary, we have to call
GzipWriter#flush, in order to measure how much compressed data has been added to the current file. This causes buffered data to be compressed and written out. Doing this too often means the compression ratio is worse, and reduces throughput.
In summary, it’s always worth checking your initial assumptions, and with real production data. The memory vs. tempfile tradeoff here only applies because Heroku limits memory usage to 512MB in production. In development I have at least 4GB free! Plus, my development databases are far far smaller than the production ones, so these issues were almost impossible to reproduce.
Always test…in production.