Miniconda became the most popular package manager among bioinformaticians, with the dedicated “BioConda” channel serving 8,000 packages. Its popularity relies on the ease of installation of the manager itself, which does not require administrator privileges.
Conda, however, is essentially a per-user solution, and it’s not easy to make shared conda installations (and sometimes it’s not even convenient to try to do so).
In simple words, it’s a path that you add to your own $PATH. But it’s easy to add and remove that path and it will have a nice nickname.
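The idea of adding and removing a path can be sketched in plain shell. This is only an illustration of the mechanism, not how any module system is actually implemented: the directory name and the two helper functions (path_add, path_remove) are made up for this example.

```bash
# Hypothetical directory containing the executables of one tool (an assumption of this sketch)
tool_dir="/opt/tools/mytool-1.0/bin"

# "Load": put the directory in front of the current PATH
path_add() {
    PATH="$1:$PATH"
}

# "Unload": cut the first occurrence of the directory out of PATH
path_remove() {
    _padded=":$PATH:"
    case "$_padded" in
        *":$1:"*)
            _before=${_padded%%":$1:"*}
            _after=${_padded#*":$1:"}
            _padded="$_before:$_after"
            ;;
    esac
    _padded=${_padded#:}
    PATH=${_padded%:}
}

original_path="$PATH"
path_add "$tool_dir"
echo "First entry after adding: ${PATH%%:*}"
path_remove "$tool_dir"
if [ "$PATH" = "$original_path" ]; then
    echo "PATH is back to its original value"
fi
```

Padding the value with colons on both sides lets the same pattern match the directory whether it sits at the start, in the middle or at the end of the list.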
The commands to use modules are quite straightforward:
The unsafe way of caching git credentials is:
git config --global user.name "your username"
git config --global user.password "your password"
git config --global credential.helper store
This is a general procedure that can be used for any Git server, but since it stores your password in plain text it can be a security issue, even if your home directory is not accessible by other users.
GitHub makes it easy to use an SSH key instead.
First, generate your key file with:
ssh-keygen -t rsa -b 4096 -C "email@example.com"
...follow the instructions and note where your file is saved for the following command...
ssh-add ~/.ssh/id_rsa
Let’s skip the introductory comments about the frustration of seeing our disk quota almost full on any cluster we use for NGS data: it’s a well-known story, and even once we have cleared out any unnecessarily large file (.sam files, anyone?) we probably still need to gain some extra space.
The definitive long-term storage for FASTQ files is a public repository. Both NCBI and EBI allow for easy download of raw reads using command line tools, so there are no excuses not to upload our data as soon as possible.
At the same time, we often need a local…
Most bioinformatics file formats are simple text files, a famous example being the FASTA format to store sequences. Historically, most file formats were proposed ad hoc to address a specific need, resulting in a fragmented universe of formats.
Examples of famous bioinformatics formats are FASTA and FASTQ for sequences, the SAM format to store details of sequence mappings, the VCF format to describe the variants of an individual compared to a reference genome, and the GFF and BED formats to describe features in a genome (e.g. genes, enhancers, binding sites…).
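Because these formats are plain text, standard command line tools are often enough to inspect them. As a small sketch, here is a made-up FASTA snippet (two invented sequences, the first one wrapped over two lines) with the records counted and their lengths computed; real files may need more care, for example with empty lines.

```bash
# A tiny, invented FASTA: header lines start with ">", sequence lines follow
fasta='>seq1 first made-up sequence
ACGTACGT
ACGT
>seq2 second made-up sequence
GGCC'

# Every record has exactly one header line, so counting headers counts sequences
n_seqs=$(printf '%s\n' "$fasta" | grep -c '^>')

# Sequence lines can be wrapped, so sum the line lengths of each record
lengths=$(printf '%s\n' "$fasta" | awk '
    /^>/ { if (name != "") print name, len; name = substr($1, 2); len = 0; next }
    { len += length($0) }
    END { if (name != "") print name, len }
')

echo "Sequences: $n_seqs"
echo "$lengths"
```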
The Illumina NextSeq 500 produces four separate files for each sample, and we simply want to merge them into a single one (this is also true for HiSeq systems, but in that case the lanes are physical and it can be good to keep the files separate).
Illumina sequences are stored in FASTQ files compressed with Gzip. A typical filename is something like Sample1_S1_L001_R1_001.fastq.gz, where L001 indicates that this file is the first of a set of four (L001 to L004).
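One common way to merge the lanes, sketched here under some assumptions, relies on the fact that concatenated gzip files still form a valid gzip file, so there is no need to decompress anything. The four lane files below are tiny made-up examples created in a temporary directory, and the name of the merged file is arbitrary.

```bash
# Work in a throwaway directory with four tiny, made-up lane files
workdir=$(mktemp -d)
for lane in 1 2 3 4; do
    printf '@read%s\nACGT\n+\nIIII\n' "$lane" \
        | gzip > "$workdir/Sample1_S1_L00${lane}_R1_001.fastq.gz"
done

# Concatenate the compressed files as they are (the glob expands in lane order)
cat "$workdir"/Sample1_S1_L00[1-4]_R1_001.fastq.gz > "$workdir/Sample1_S1_R1.fastq.gz"

# Each FASTQ record takes four lines, so four reads should give 16 lines
merged_lines=$(gzip -dc "$workdir/Sample1_S1_R1.fastq.gz" | wc -l | tr -d ' ')
echo "Lines in merged file: $merged_lines"

rm -rf "$workdir"
```

With paired-end data the same command would be repeated for the R2 files, keeping the lanes in the same order so that the two merged files stay in sync.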
The gzip program is incredibly popular among Linux applications, in spite of some limitations. I think it fits the Unix philosophy very well: a single program, doing a single thing and doing it well.
For example, we cannot compress a directory tree with gzip, but only single files. This makes gzip ideal for compressing files that we still want to use. On the other hand, compressing a whole directory is generally done for archiving, where accessing a single file is not the most common scenario.
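The difference can be seen in a short sketch. The files are invented and live in a temporary directory; tar is used here as the usual companion for archiving a directory tree, which is an assumption of this example rather than something the text prescribes.

```bash
# A made-up project directory with two small files
workdir=$(mktemp -d)
mkdir -p "$workdir/project/reads"
printf '>seq1\nACGT\n' > "$workdir/project/ref.fa"
printf '@r1\nACGT\n+\nIIII\n' > "$workdir/project/reads/sample.fq"

# gzip works on single files: ref.fa is replaced by ref.fa.gz
gzip "$workdir/project/ref.fa"

# The compressed file can still be read as a stream, without unpacking it on disk
first_line=$(gzip -dc "$workdir/project/ref.fa.gz" | head -n 1)
echo "First line: $first_line"

# For a whole directory tree, tar bundles the files and -z compresses the bundle
tar -czf "$workdir/project.tar.gz" -C "$workdir" project

# List the archive content, counting the entries that are not directories
archived_files=$(tar -tzf "$workdir/project.tar.gz" | grep -c -v '/$')
echo "Files in archive: $archived_files"

rm -rf "$workdir"
```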
Gzip libraries are commonly used to allow programs to read compressed files directly, that’s why…
A variable is often referred to as a box with a name and a content. A command like the following:
echo Hello $name
Will print Hello, and then… the content of the box named ‘name’. If the box is empty Bash will print nothing, as expected. Bash will also print nothing if the box was never created!
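This behavior is easy to check. In the sketch below the same string is built three times: with a variable that was never created, with an empty one and with one that has a content (the variable names holding the results are arbitrary).

```bash
unset name                      # the box was never created
never_created="Hello $name"

name=""                         # the box exists, but it is empty
empty="Hello $name"

name="World"                    # the box contains something
filled="Hello $name"

# Brackets make the (invisible) empty expansions easier to spot
echo "[$never_created] [$empty] [$filled]"
```

The first two results are identical: from the output alone there is no way to tell a missing variable from an empty one.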
As beginners, we just care about variables containing something, but sooner or later we will need to distinguish between:
When working on a script (or manuscript), version control makes it possible to restore the last working version. Software development, where multiple collaborators edit different pieces of the same package, pushed version control systems to the highest standards (the most common example is the Linux kernel, an impressive open source collaborative effort that… led to the development of Git itself).
If you use Google Docs, you are already familiar with a user-friendly implementation of version control, as you can revert a set of changes that are no longer needed.
When you publish a paper based on nucleic acid sequencing it is mandatory to submit the “raw” data, i.e. all the reads produced without any filtering process, to a public repository. The two major repositories for NGS reads are the SRA (Sequence Read Archive, hosted in the USA by the NCBI), and the ENA (European Nucleotide Archive, hosted in Europe by the EBI). They are both very good, yet for this short note we’ll address the former.
They both have an interesting hierarchical data description, so that each “sequencing run” is linked to an experiment (or project) and…
Sequence coverage (or sequencing depth) refers to the number of reads that include a specific nucleotide of a reference genome. In the screenshot below a small region of the human genome is shown in a genome browser, where the alignment of a re-sequencing experiment was loaded. Each gray arrow in the main area of the window represents a single read, while the highlighted area is a coverage plot, summarizing for each nucleotide how many times it was sequenced in this experiment.
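The definition can be turned into a toy calculation. In this sketch three made-up reads are described only by their start and end position on the reference (1-based, inclusive), and the depth of each position is the number of reads spanning it. Real alignments would come from a SAM or BAM file and would be processed with dedicated tools.

```bash
# Made-up alignments: one read per line, as start and end position on the reference
reads='1 5
3 8
4 6'

# For every read, add one to the depth of each position it spans,
# then print "position <tab> depth" up to the last covered position
coverage=$(printf '%s\n' "$reads" | awk '
    { for (pos = $1; pos <= $2; pos++) depth[pos]++; if ($2 > last) last = $2 }
    END { for (pos = 1; pos <= last; pos++) printf "%d\t%d\n", pos, depth[pos] }
')
echo "$coverage"
```

Positions 4 and 5 are spanned by all three reads, so their depth is 3, while the two ends of the region are covered by a single read.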
If the sequencing was performed using paired ends or mate pairs, it’s also possible to consider the physical coverage, that is, the number of…
Bioinformatics and Genomics, Norwich UK