Merging Illumina lanes with a Bash script

This tutorial explains how to code a Bash script to perform a simple task with some controls and best practices. A dedicated tool can also do the trick.

Andrea Telatin
#!/ngs/sh
2 min read · Oct 15, 2018

The problem

An Illumina NextSeq 500 splits the output of each sample into four separate files, one per lane, and we simply want to merge them into a single one. (This is also true for HiSeq systems, but there the lanes are physical, so it can be worth keeping the files separate.)

The solution

Illumina sequences are stored in FASTQ files compressed with Gzip. A typical filename looks like Sample1_S1_L001_R1_001.fastq.gz, where the L001 part indicates that this file is the first of a set of four (L001 to L004).

To merge text files we can concatenate them with cat, and the neat thing about Gzipped text-only files is that this applies to them, too! So:

cat File1.gz File2.gz >> File3.gz

Will produce a valid gzipped file.
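
A quick way to convince yourself is the minimal sketch below. It builds two tiny gzipped files in a temporary directory (the read names and sequences are made up for the demo), concatenates them without decompressing, and then checks the result with gzip itself.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Work in a throwaway directory so nothing real is touched
tmp=$(mktemp -d)

# Two tiny gzipped files, one made-up FASTQ record (4 lines) each
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$tmp/File1.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$tmp/File2.gz"

# Concatenate the compressed files directly, without decompressing them
cat "$tmp/File1.gz" "$tmp/File2.gz" > "$tmp/File3.gz"

# gzip -t tests the integrity of the merged archive
gzip -t "$tmp/File3.gz"

# gzip -dc decompresses to stdout: both records should be there (8 lines)
merged_lines=$(gzip -dc "$tmp/File3.gz" | wc -l | tr -d ' ')
echo "Lines in merged file: $merged_lines"
```

The integrity test passes because a gzip file may consist of several compressed members one after another, and decompression simply returns their contents in order.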

The script

We want to specify an output directory, to keep the merged files separate from the raw files.

We want to check whether the output directory already exists and, if it does, exit with an error (our script will use the append operator, so it would be unsafe to skip this check).

We want to make use of Bash's safety net to prevent some errors.

This task is easy enough to be performed with a one-liner, but using a script ensures reproducibility and usually helps avoid errors.
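
For comparison, a one-liner for a single sample could look like the sketch below. It assumes the example sample name used earlier (Sample1_S1) and a hypothetical output directory called merged; the first lines only create small placeholder inputs in a scratch directory so that the snippet runs on its own.

```bash
# Demo setup only: four lanes of one sample, in a scratch directory
cd "$(mktemp -d)"
for lane in 1 2 3 4; do
  printf '@read%s\nACGT\n+\nIIII\n' "$lane" \
    | gzip > "Sample1_S1_L00${lane}_R1_001.fastq.gz"
done
mkdir merged

# The one-liner itself: the glob expands in sorted order (L001 ... L004)
cat Sample1_S1_L00[1-4]_R1_001.fastq.gz > merged/Sample1_S1_L001_R1_001.fastq.gz
```

It works, but it has to be retyped for every sample and read direction, and nothing stops it from silently overwriting an existing file.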

A simple implementation

The script will loop over the files in an input directory (by default the one the user is located in) that have an Illumina-like name, and concatenate each of them into a file with a fixed lane name (L001). This ensures compatibility with programs requiring an Illumina naming scheme.
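
The behaviour described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name merge_lanes, its argument order, and the exact filename pattern (lanes L000 to L009, reads R1/R2, suffix _001.fastq.gz) are assumptions made for the example.

```bash
#!/usr/bin/env bash
# Sketch: merge the lanes of Illumina FASTQ files found in a directory.
# Bash "safety net": stop on errors, on unset variables, on failed pipes
set -euo pipefail

# merge_lanes OUTPUT_DIR [INPUT_DIR]  (hypothetical name and argument order)
merge_lanes() {
  local outdir=${1:?usage: merge_lanes OUTPUT_DIR [INPUT_DIR]}
  local indir=${2:-.}   # default: the directory the user is located in

  # We append with >>, so refuse to write into an existing directory
  if [ -e "$outdir" ]; then
    echo "ERROR: output directory '$outdir' already exists" >&2
    return 1
  fi
  mkdir -p "$outdir"

  local file base merged
  # The glob expands in sorted order, so lanes are appended as L001, L002, ...
  for file in "$indir"/*_L00[0-9]_R[12]_001.fastq.gz; do
    # Without a match the pattern stays literal: treat that as an error
    if [ ! -e "$file" ]; then
      echo "ERROR: no Illumina-like files found in '$indir'" >&2
      return 1
    fi
    base=$(basename "$file")
    # Replace the lane tag with a fixed L001 to keep the Illumina naming scheme
    merged=$(echo "$base" | sed -E 's/_L00[0-9]_(R[12]_001\.fastq\.gz)$/_L001_\1/')
    cat "$file" >> "$outdir/$merged"
  done
}

# Example call (directory names are placeholders):
#   merge_lanes merged/ raw_reads/
```

Because every input file is appended to the output named after its sample and read direction, the R1 and R2 files of a paired-end run end up in two separate merged files, each containing the lanes in order.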

A dedicated tool

SeqFu is a general-purpose FASTQ manipulation utility. One of its functions (seqfu lanes) does what our script does, but it's 10X faster.

Installing SeqFu is easy, as it’s shipped via BioConda:

conda install -y -c conda-forge -c bioconda seqfu

To use it, place all the reads in one directory and give an empty directory as output:

seqfu lanes -o /output-dir/ input/dir/
