Raw sequencing datasets are big but eminently compressible. This article compares some simple ways to compress sequencing data.

For the comparison, I will use three metagenomic samples. Metagenomic samples are harder to compress than human samples because their sequences are less redundant. I will compare five compression techniques. Three of these will be familiar to most readers: gzip, bzip2, and BAM files. Two will be less familiar: blocked-fastqs and blocked-fastqs without quality scores.
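To make the familiar baselines concrete, here is a minimal sketch of measuring gzip and bzip2 sizes with Python's standard library. The helper name `compressed_sizes` and the toy record are hypothetical, chosen only for illustration. The toy data is one record repeated many times, so it compresses far better than a real metagenomic sample would.

```python
import bz2
import gzip


def compressed_sizes(data: bytes) -> dict:
    """Return the size in bytes of the raw, gzip, and bzip2 forms of data."""
    return {
        "raw": len(data),
        "gzip": len(gzip.compress(data)),
        "bzip2": len(bz2.compress(data)),
    }


# Hypothetical toy input: one fastq record repeated 1000 times.
# Real reads are much less redundant than this.
toy_fastq = b"@read1\nACGTACGTAC\n+\nIIIIIIIIII\n" * 1000

sizes = compressed_sizes(toy_fastq)
for name, size in sizes.items():
    print(f"{name}: {size} bytes ({size / sizes['raw']:.1%} of raw)")
```

For real files you would read the bytes from disk instead of building them in memory, and the ratios will depend heavily on the sample.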

A blocked-fastq file is a fastq reorganized so that records of the same type come one after the other. Normally fastqs are grouped in four line…
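The reorganization described above can be sketched in a few lines of Python. This is an illustration under assumptions, not necessarily the exact layout used for the comparison: I assume standard four-line fastq records (header, sequence, separator, quality) and that "blocking" means writing all headers first, then all sequences, then all separators, then all quality lines. The function names `to_blocked` and `from_blocked` are hypothetical.

```python
def to_blocked(fastq_text: str) -> str:
    """Reorder a four-line-per-record fastq so lines of the same type are adjacent.

    Assumed layout: all headers, then all sequences, then all separators,
    then all quality lines.
    """
    lines = fastq_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per fastq record")
    if not lines:
        return ""
    # lines[i::4] picks the i-th line of every record
    blocks = [lines[i::4] for i in range(4)]
    return "\n".join(line for block in blocks for line in block) + "\n"


def from_blocked(blocked_text: str) -> str:
    """Invert to_blocked, restoring the usual record-by-record order."""
    lines = blocked_text.splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four equally sized blocks")
    if not lines:
        return ""
    n = len(lines) // 4  # number of records
    blocks = [lines[i * n:(i + 1) * n] for i in range(4)]
    # zip(*blocks) regroups one line from each block into a record
    return "\n".join(line for record in zip(*blocks) for line in record) + "\n"


fastq = "@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nFFFF\n"
blocked = to_blocked(fastq)
print(blocked)
```

Because the transformation is a pure reordering of lines, it is lossless: `from_blocked(to_blocked(x))` returns the original text for any well-formed four-line fastq.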

David Danko
