Compressing FASTQ files for long-term storage

The raw output of NGS experiments should be backed up, and a format-specific compressor like “dsrc2” comes in handy for this

Andrea Telatin
Published in #!/ngs/sh
3 min read · Feb 7, 2019


Let’s skip the introductory comments about the frustration of seeing our disk quota almost full on every cluster we use for NGS data: it’s a well-known story. Even after getting rid of any unnecessary large files (.sam files, anyone?), we will probably still need to gain some extra space.

FASTQ files must be backed up

The definitive long-term storage for FASTQ files is a public repository. Both NCBI and EBI allow easy download of raw reads using command-line tools, so there is no excuse not to upload our data as soon as possible.

At the same time, we often need to keep a local copy for a long time, for several reasons. If this is the case, we had better look for a good compressor for FASTQ files.

Desirable features

  • compression ratio: we want the compressed file to be small, ideally smaller than what standard compression formats like .gz can achieve
  • speed: we can improve compression speed using programs like PIGZ, already discussed here, but shorter times would be even better
  • open source: we want the compressor to be open source and easy to install and run on our Linux systems

I tried “dsrc2” six years ago, and have been satisfied

The FASTQ format is quite simple, and this allows for format-specific compressors. There are plenty of them, and it’s worth considering the alternatives when choosing one. In my opinion, it is good to look for a lossless compressor.

When, in 2014, I was working for a small company, I tried some of the available tools and decided that dsrc2 was a good candidate. The only reason I recommend this particular tool is that I have been using it since then and it has never failed. This is not an extensive test, but at least the tool wasn’t completely abandoned after publication, as often happens to promising tools.

Simple test

Compressing a FASTQ file is as simple as:

dsrc c -tTHREADS input.fastq output.dsrc2
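Since the compressed archive may end up being the only local copy, it can be worth checking that a file survives a compress/decompress round trip before deleting the original. The following is a minimal sketch rather than a tested recipe: it assumes that "dsrc d" is the decompression mode (as listed in the tool's usage message), and both the file names and the roundtrip_ok helper are hypothetical.

```shell
#!/bin/sh
# roundtrip_ok ORIGINAL RESTORED
# Succeeds only if the two files are byte-for-byte identical.
roundtrip_ok() {
    cmp -s "$1" "$2"
}

# Hypothetical file names; this part is skipped when dsrc or the input is missing.
if command -v dsrc >/dev/null 2>&1 && [ -f input.fastq ]; then
    dsrc c -t4 input.fastq input.fastq.dsrc    # compress with 4 threads
    dsrc d input.fastq.dsrc restored.fastq     # assumed decompression mode
    if roundtrip_ok input.fastq restored.fastq; then
        echo "round trip OK"
    else
        echo "round trip FAILED: keep the original file" >&2
    fi
fi
```

Comparing the files with cmp keeps the check independent of the compressor, so the same helper could be reused with any other lossless tool.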

As a small example, a 4.6 GB FASTQ file compressed with dsrc2 becomes 431 MB (in 20 seconds), while with pigz it becomes 853 MB (in 2 minutes 11 seconds).
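To put these sizes in perspective, the compression ratios can be worked out with a couple of lines of shell. This is only an illustrative sketch: the ratio helper is hypothetical, and it assumes the sizes are decimal units, so that 4.6 GB is treated as 4600 MB.

```shell
#!/bin/sh
# ratio ORIGINAL_MB COMPRESSED_MB
# Prints the compression ratio (original / compressed) with one decimal.
ratio() {
    awk -v orig="$1" -v comp="$2" 'BEGIN { printf "%.1f\n", orig / comp }'
}

# Sizes in MB taken from the example, assuming 4.6 GB = 4600 MB.
echo "dsrc2: $(ratio 4600 431)x"   # prints "dsrc2: 10.7x"
echo "pigz:  $(ratio 4600 853)x"   # prints "pigz:  5.4x"
```

In this single test, the dsrc2 archive is roughly half the size of the pigz one.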

Docker container

Want to test it with Docker? If you have some .fastq files in your current directory:

sudo docker run --rm -v $PWD:/data andreatelatin/dsrc2 \
dsrc c /data/test.fq /data/test.fq.dsrc

It’s a single command, but for clarity I split it across two lines: the second line is the actual dsrc command, referring to files in /data rather than their actual location. This works thanks to the -v $PWD:/data parameter, which maps my current directory on the host to the /data directory inside the Docker container.
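The same mapping can be used to compress every .fastq file in the current directory. The sketch below is one possible way to do it, not a tested workflow: the container_path and compress_all helpers and the RUN variable are hypothetical names, and by default the script only prints the docker commands so they can be reviewed first (prepend sudo if your setup requires it, as in the command above).

```shell
#!/bin/sh
# container_path FILE
# Translates a file in the current directory to its path under /data.
container_path() {
    printf '/data/%s\n' "$(basename "$1")"
}

# Prints one docker command per .fastq file; with RUN=1 it executes them instead.
compress_all() {
    for fq in *.fastq; do
        [ -e "$fq" ] || continue   # no match: the glob stays literal, skip it
        in=$(container_path "$fq")
        set -- docker run --rm -v "$PWD":/data andreatelatin/dsrc2 \
            dsrc c "$in" "$in.dsrc"
        if [ "${RUN:-0}" = 1 ]; then
            "$@"
        else
            echo "$@"
        fi
    done
}

compress_all
```

Running the script as is lists the commands; setting RUN=1 in the environment makes it actually start one container per file.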

Final comments

Compressing FASTQ files will probably be an old-fashioned habit soon, as our infrastructures improve and become more robust at handling this transparently and efficiently for us.

A major limitation of dsrc is that it is optimized for Illumina data, and I have never tested it with long-read projects.

Do you have long-term experience with another compression tool for FASTQ files? I’d love to have some feedback!
