A gentle introduction to bíogo: part II

In part I of this gentle introduction to bíogo we looked at some basic functionality provided by the library, such as sequence input and output and the manipulation of the objects to access sequence and quality data. However today’s bioinformatics is all about analysis of high throughput sequencing datasets in the context of existing information (such as annotations). Hence, in part II we will explore the biogo/hts library dealing with the SAM/BAM format and also the handling of features in core bíogo. At the end of the tutorial we will end up with a small tool which could be potentially useful in practice.

Setting up the environment

Before heading into the actual Go action we need to do some more setup of the environment and simulation of input data. First we need to install some tools via miniconda, which we already installed in part I. Also we need to install biogo/hts itself using the go get tool:

$ conda install samtools bedtools
$ go get -v github.com/biogo/hts/...

Next, we need to simulate some alignments from our genome fasta created previously using the already installed wub tools and convert them to BAM format:

$ simulate_sequencing_simple.py -n 5000 -m 200 -s alignments.sam genome.fas > /dev/null
$ samtools view -Sb alignments.sam > alignments.bam
$ samtools flagstat alignments.bam
5000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
5000 + 0 mapped (100.00% : N/A)
...

Finally, we will create a three-column BED file using bedtools:

$ echo -e "seq_0\t5000\n" > genome.g
$ bedtools makewindows -w 200 -s 500 -g genome.g > test.bed
$ head -4 test.bed
seq_0 0 200
seq_0 500 700
seq_0 1000 1200
seq_0 1500 1700
...

Above we have created a BED file with non-overlapping features of size 200, repeating at intervals of 500. We have the input files, so get ready to write some Go.

Reading records from a BAM file

There is already a nice example at the GitHub page of the biogo/hts library of how to read BAM files and summarising the records, but we will start with a simpler example which just reads the records from a BAM and writes out the names of aligned reads.

$ mkdir read_bam
$ cd read_bam

It is fairly simple to understand what is going on: we opened the input file and used it to create a new BAM reader, which we used in the loop to print out the names of aligned reads. The bamReader.Read() method returns a pointer to a sam.Record struct which looks like this:

type Record struct {
Name string
Ref *Reference
Pos int
MapQ byte
Cigar Cigar
Flags Flags
MateRef *Reference
MatePos int
TempLen int
Seq Seq
Qual []byte
AuxFields AuxFields
}

So information like the name of the read or the position is readily accessible.

Reading features

Now let us switch back to the core bíogo library and see how can we parse the BED file which we created above and print out the properties of the resulting features.

$ cd ..
$ mkdir read_bed
$ cd read_bed

Again, it is not difficult to figure out what the code does. The bedReader.Read() method returns an object satisfying the feat.Feature interface, which has a Location() method giving us the chromosome of the feature. Also, feat.Feature has a feat.Range interface embedded in it, which gives us the methods for getting the start and end of the feature.

Counting feature overlaps: fragtool

Now it is time to put the two above examples together and write a small tool which calculates the “fragment coverage” of the features defined in a BED file by alignments stored in a BAM file. To achieve this we will store the features from the BED file in a map data structure mapping them to integers. Then we will parse the BAM file and for each read we iterate over all features and increment the count if the position of the read overlaps with the current feature.

$ cd ..
$ cp -r read_bam fragtool
$ cd fragtool

We build the tool and try it out on our test data:

$ go build
$ ./fragtool ../alignments.bam ../test.bed
seq_0 1000 1200 219
seq_0 2000 2200 211
seq_0 4500 4700 159
seq_0 0 200 211
seq_0 500 700 204
seq_0 1500 1700 214
seq_0 2500 2700 233
seq_0 3000 3200 190
seq_0 3500 3700 219
seq_0 4000 4200 212

This small tools is functional, however it does something inefficient: for each read it will iterate over all features when checking for overlap. It would be more ideal if we had a data structure similar to the GenomicArrayOfSets implemented in the HTSeq Python library which would allow us to ask for the set of features overlapping with a certain position.

Where to Go next

This is the end of our short exploration of bíogo. For the reader interested in more I would recommend reading the official documentation starting with the following sections: alphabet, feat, seq, seq/linear. There are also a lots of example applications of varying complexity making use of bíogo at biogo/examples.

The bíogo library is already quite feature-rich (many thanks to the developers for that) and Go is getting traction. Hopefully this will lead to its wider adoption and many more tutorials like this.