Scripts for Bioinformaticians

A collection of re-usable bash/C++ scripts for bioinformatics processing

In this article, we will present you with a set of handy scripts you can use to pre-process your sequencing data.

Image by James Osborne from Pixabay

Data Conversion

A major use case in preprocessing is converting data between FASTA and FASTQ. FASTA is usually more portable in terms of file size (Nearly half of the size of FASTA).

# single line fasta
awk '{if(NR%4==1){print ">" substr($1, 2)}else if(NR%4==2){print}}' FILE.fastq > FILE.fasta
# folded fasta format
awk '{if(NR%4==1){print ">" substr($1, 2)}else if(NR%4==2){print}}' FILE.fastq | fold > FILE.fasta
# Convert FASTA to single line FASTA
awk '{if(NR==1 && $0 ~ /^>/){print}else if (NR!=1 && $0 ~ /^>/){print "\n" $0}else {printf $0}} END {printf "\n"}' FILE.fasta > NEW_FILE.fasta

Renaming FASTQ Reads

awk '/^@EXISTING_NAME_REGEX/{print "@TARGET_NAME_" ++i; next}{print}' FILE.fastq > NEW_FILE.fastq

Reading GZ Files

Some experiments only need a few set of reads. In such a case, it makes no sense for one to extract the whole file and head it. But we can do something like this.

zcat FILE.tar.gz | head -nNO_READS > FILE.fastq

Obtaining Read Lengths from FASTQ

In order to plot histograms of reads lengths and other experiments such as filtering short reads, we can do the following. Note that we need these kinds of pre-processing with long-read applications.

awk "{{ if(NR%4==2) print length($1) }}" READS.fastq> LENGTHS.txt

Separating Paired-end Reads into two files

# Separate forward readsawk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=0 a=3 s="/1" READS.fastq > R1.fastq# Separate reverse readsawk 'c-->0;$0~s{if(b)for(c=b+1;c>1;c--)print r[(NR-c+1)%b];print;c=a}b{r[NR%b]=$0}' b=0 a=3 s="/2" READS.fastq > R2.fastq

Reverse Complement of K-mer

In this computation, we encode k-mers in the form of binary. A-00, C-01, T-10 G-11 using the following method.

CHAR >> 1 & 3;

We use the above method in functions as below. Here we pass a string reference and the interval to compute k-mer of the desired length. However, since we use u_int64_t we are limited to 32-mer. For anything beyond this, try to use bitsets or boolean vectors.

Then we use the following set of bit operations to obtain the reverse complement.

Computing K-mer Histograms

We can extend the above functions to get the k-mer histograms for a sequence, which is very frequently used in Bioinformatics analyses.

In the first function, we obtain all the possible k-mers for a given k value. This is to make the computation much faster by having to update only a lookup table.

In the second function, we iterate the entire string and record the lowest complement in the copy of a lookup table. We do the copying so that we can re-use the same original lookup table in a multi-threaded scenario.

Conclusion

In this article, we discuss several tweaks that we use in our algorithms and research work. Feel free to use them in your work to make efficient computations.

Thanks for reading. Cheers!

--

--

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store