In the good old days, shared servers and HPC clusters offered a software shelf through “modules”. This can still be a valid alternative on shared machines.

Miniconda became the most popular package manager among bioinformaticians, with the dedicated “BioConda” channel serving 8,000 packages. Its popularity relies on the ease of installing the manager itself, which does not require administrator privileges.

Conda, however, is essentially a per-user solution, and it’s not easy to set up shared conda installations (and sometimes it’s not even convenient to try).

What is a module?

In simple words, it’s a path that you add to your own $PATH. But that path is easy to add and remove, and it has a nice nickname.
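The idea of adding and removing a path can be sketched with two small shell functions. This is only an illustration of the concept, not how the real module command is implemented: the function names and the directory used in the example are hypothetical, and real modules may also set other environment variables.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (NOT the real "module" command): they only mimic
# the core idea of adding a directory to $PATH and removing it again.

# Prepend a directory to PATH unless it is already there.
path_add() {
  local dir="$1"
  case ":$PATH:" in
    *":$dir:"*) ;;               # already present: nothing to do
    *) PATH="$dir:$PATH" ;;
  esac
}

# Remove every occurrence of a directory from PATH.
path_remove() {
  local dir="$1" new="" entry
  local IFS=':'
  for entry in $PATH; do
    [ "$entry" = "$dir" ] || new="${new:+$new:}$entry"
  done
  PATH="$new"
}

# Example with a made-up directory:
path_add "/opt/demo-tool/bin"
path_remove "/opt/demo-tool/bin"
```

A module system wraps the same kind of operation behind a short name, so you do not have to remember the full directory.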

How to use modules

The commands to use modules are quite straightforward:

A short note on avoiding typing your password at every push

The unsafe way of caching git credentials is:

git config --global user.name "your username"
git config --global user.password "your password"
git config --global credential.helper store

This is a general procedure that can be used for any Git server, but since it stores your password in plain text it can be a security issue, even if your home directory is not accessible by other users.

GitHub makes it easy to use an SSH key instead.

First, generate your key file with:

ssh-keygen -t rsa -b 4096 -C ""

...follow the instructions and note where your file is saved for the following command...

ssh-add ~/.ssh/id_rsa


The raw output of NGS experiments should be backed up, and a format-specific compressor like “dsrc2” comes in handy for this.

Let’s skip the introductory comments about the frustration of seeing our disk quota almost full on every cluster we use for NGS data: it’s a well-known story, and even after clearing out any unnecessary large files (.sam files, anyone?) we probably still need to gain some extra space.

FASTQ files must be backed up

The definitive long-term storage for FASTQ files is a public repository. Both NCBI and EBI allow for easy download of raw reads using command line tools, so there are no excuses not to upload our data as soon as possible.

At the same time, we often need a local…

Two popular formats for storing structured data that are also commonly used in bioinformatics analyses.

Bioinformatics and text files

Most bioinformatics file formats are simple text files, a famous example being the FASTA format to store sequences. Historically, most file formats were proposed ad hoc to address a specific need, resulting in a fragmented universe of formats.

Examples of famous bioinformatics formats are FASTA and FASTQ for sequences, the SAM format to store details of sequence mappings, the VCF format to describe the variants of an individual compared to a reference genome, and the GFF and BED formats to describe features in a genome (e.g. genes, enhancers, binding sites…).


Among the “general purpose” formats commonly used in computer…

This article describes a Bash script to perform a simple task with some controls and best practices.

The problem

The Illumina NextSeq 500 produces four separate files for each sample, and we simply want to merge them into a single one (this is also true for HiSeq systems, but in that case the lanes are physical and it can be good to keep the files separate).

The solution

Illumina sequences are stored in FASTQ files compressed with Gzip. A typical filename is something like Sample1_S1_L001_R1_001.fastq.gz, where L001 indicates that this file is the first of a set of four (L001 to L004).

To merge text files we can concatenate them with cat, and the neat thing about…
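A minimal sketch of the concatenation step, assuming the naming scheme described above. It relies on the documented gzip behaviour that concatenated gzip files decompress to the concatenation of their contents, so no decompression is needed. The function name, its arguments and the output file name are hypothetical choices made for this example.

```bash
#!/usr/bin/env bash
# Hypothetical helper: merge the four lane files of one sample and one
# read direction (R1 or R2) into a single gzipped FASTQ file.
merge_lanes() {
  local prefix="$1" read_tag="$2" lane file
  local out="${prefix}_${read_tag}.fastq.gz"   # output name is an assumption

  # Control: check that all four lane files exist before writing anything.
  for lane in 1 2 3 4; do
    file="${prefix}_L00${lane}_${read_tag}_001.fastq.gz"
    [ -e "$file" ] || { echo "Missing file: $file" >&2; return 1; }
  done

  # The glob expands in sorted order (L001, L002, L003, L004).
  cat "${prefix}"_L00[1-4]_"${read_tag}"_001.fastq.gz > "$out"
  echo "$out"
}
```

With files named as in the example, `merge_lanes Sample1_S1 R1` would write Sample1_S1_R1.fastq.gz and print its name.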

Using multiple threads to compress files but maintaining compatibility with gunzip

The gzip program is incredibly popular among Linux applications, in spite of some limitations. I think it fits the Unix philosophy incredibly well: a single program doing a single thing, and doing it well.

For example, we cannot compress a directory tree with gzip, but only single files. This makes gzip ideal to compress files that we want to use. On the other hand, compressing a whole directory is generally used for archiving reasons where accessing a single file is not the most common scenario.

Gzip libraries are commonly used to allow programs to read compressed files directly, that’s why…

Preventing unbound variables is good, but we need a way to tell if a variable has no content or has never been initialized

A variable is often referred to as a box with a name and a content. A command like the following:

echo Hello $name

Will print Hello, and then… the content of the box named ‘name’. If the box is empty Bash will print nothing, as expected. Bash will print nothing also if the box was never created!

As beginners, we just care about variables containing something, but sooner or later we will need to distinguish among:

  • Undefined variables (variable never created)
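One common way to tell these cases apart is the `${name+x}` parameter expansion, which expands to nothing only when the variable was never created. The sketch below uses the variable `name` from the example above; the function `describe_name` is a hypothetical helper written for this illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report the state of the variable called "name".
describe_name() {
  if [ -z "${name+x}" ]; then
    echo "undefined"        # the box was never created
  elif [ -z "$name" ]; then
    echo "empty"            # the box exists but has no content
  else
    echo "set"              # the box exists and has content
  fi
}

unset name
describe_name    # prints: undefined
name=""
describe_name    # prints: empty
name="World"
describe_name    # prints: set
```

The `${name+x}` form does not trigger an error for unbound variables, so the check still works when they are being rejected (for example with `set -u`).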

The minimum workflow for keeping file revisions under control

When working on a script (or manuscript), version control makes it possible to restore the last working version. Software development, where multiple collaborators are editing different pieces of the same package, pushed the need for version control systems to the highest standards (the most common example is the Linux kernel, an impressive open source collaborative effort that… led to the development of Git itself).

If you use Google Docs, you are already familiar with a user-friendly implementation of version control, as you can revert a set of changes that are no longer needed.

Git is a powerful (that means also: complex…

Sequencing data archives

When you publish a paper based on nucleic acid sequencing, it is mandatory to submit the “raw” data, i.e. all the reads produced without any filtering, to a public repository. The two major repositories for NGS reads are the SRA (Sequence Read Archive, hosted in the USA by the NCBI) and the ENA (European Nucleotide Archive, hosted in Europe at the EBI). They are both very good, yet for this short note we’ll address the former.

They both have an interesting hierarchical data description, so that each “sequencing run” is linked to an experiment (or project) and…

Concepts and simple tools to evaluate the sequence coverage of a short reads dataset aligned against a reference genome.

Sequence coverage (or sequencing depth) refers to the number of reads that include a specific nucleotide of a reference genome. In the screenshot below, a small region of the human genome is shown in a genome browser where the alignment of a re-sequencing experiment was loaded. Each gray arrow in the main area of the window represents a single read, while the highlighted area is a coverage plot, summarizing for each nucleotide how many times it was sequenced in this experiment.

A screenshot of IGV showing a 0.5 kbp genome region. A BAM file with reads aligned is loaded as a track, and IGV will automatically plot a coverage track on top of the alignments.
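The definition of sequence coverage can be illustrated with a toy calculation. This sketch assumes the reads are given as plain "start end" intervals (1-based, inclusive) rather than as a BAM file; the function name and the input format are hypothetical, and real data would normally be handled by a dedicated tool such as `samtools depth`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print "position<TAB>depth" for every position of a
# region, reading intervals from stdin as "start end" (1-based, inclusive).
depth_per_base() {
  local region_start="$1" region_end="$2"
  awk -v s="$region_start" -v e="$region_end" '
    {
      # Count this read on every position it spans, clipped to the region.
      from = ($1 < s) ? s : $1
      to   = ($2 > e) ? e : $2
      for (p = from; p <= to; p++) depth[p]++
    }
    END {
      # Positions never covered are reported with depth 0.
      for (p = s; p <= e; p++) printf "%d\t%d\n", p, depth[p] + 0
    }'
}

# Three toy reads over a region spanning positions 1 to 10.
printf '1 5\n3 8\n4 10\n' | depth_per_base 1 10
```

In this toy example, positions 4 and 5 are included in all three reads, so their depth is 3.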

If the sequencing was performed using paired ends or mate pairs, it’s possible to also evaluate the physical coverage: physical coverage is the number of…

Andrea Telatin

Bioinformatics and Genomics, Norwich UK
