A gentle introduction to bíogo: part I

As a bioinformatician I mostly use Python to solve day-to-day data analysis problems. However, Go (also known as Golang) is pretty much my favourite programming language these days. I have used it for one of my previous projects (rlsim, an RNA-seq library simulator) and still occasionally use it for more heavyweight projects.

There is an insightful post on Hacker Noon on the beauty and advantages of Go, so I will not spend much time on advocacy. What appeals the most to me in Go is type safety and its simplicity. Go has everything you missed from C (such as maps, methods, and arguably garbage collection), without the features which make programming in C dangerous. This is no mistake, as Go was conceived as successor to C by Robert Griesemer, Rob Pike, and Ken Thompson (inventor of B, which is the predecessor of C).

Go is an emerging language, so its library ecosystem is not as diverse as in the case of Python. Still, it already has a fairly mature bioinformatics library — the bíogo project, lead by Dan Kortschak. Back when I was working on rlsim the library was less mature and I did not need most of its features, so I decided not to depend on it. Now that it came of age I reconsidered using the library and I thought I would share the process of exploring its basic functionality in a form of a tutorial. Maybe it will increase the appetite of newbie gophers and bioinformaticians for more advanced uses of bíogo. Even if you are just starting with bioinformatics it might be a good experience experimenting with with a compiled, statically typed language.

This tutorial assumes familiarity with the basics of Go, so I recommend having a go with A Tour of Go and maybe reading Effective Go. It also assumes basic knowledge of Unix shell and access to a Linux box.

Setting up the environment

One can install Go according to the official instructions, but we will do it through miniconda, as this will allow us to install other tools used during this tutorial. So the first step of the journey is to download and install miniconda (preferably with Python 3) according to these instructions. When done with that, we need to add some channels by issuing:

$ conda config --add channels conda-forge
$ conda config --add channels bioconda

Now we are ready to install Go and wub, a Python package which we use to generate data serving as input to our programs.

$ conda install go
$ pip install git+https://github.com/nanoporetech/wub.git

We are almost done setting up the environment, all we need is installing bíogo itself using the go get tool.

$ go get -v github.com/biogo/biogo/...

Now that we are done setting up the environment, we will simulate a small input “genome” and some reads in fastq format using tools from the wub package (they should now be in your path). Keep the genome fasta around, as we will make a use of it in part II.

$ simulate_sequences.py -n 1 -m 5000 genome.fas
$ simulate_sequencing_simple.py -n 1000 -m 200 genome.fas reads.fq

We end up with a fastq file (reads.fq) containing a thousand reads having a mean length of 200.

$ head -4 reads.fq 
@r0_seq_0_1495_1807_+/q9/s5/d31/i5
TCTTACTTACGCGTCATCCTCCCTCATTCCCAGACATACGTACTGGTAAGCCAGCATCTCACCCATGT
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
....

Reading a fastq file

So let us jump in and write a Go program to read the entries of the fastq file and print out the read identifiers. We will store our Go source files in a subdirectory which will end up being the name of the executable built by the go build tool.

$ mkdir fastq_read; cd fastq_read

Now use your favourite text editor and create a main.go file with the following contents:

Then build and run the executable by issuing:

$ go build
$ ./fastq_read ../reads.fq
r0_seq_0_1495_1807_+/q9/s5/d31/i5
r1_seq_0_1952_2583_-/q10/s8/d40/i13
r2_seq_0_2926_3143_+/q8/s8/d23/i4
r3_seq_0_4187_4324_+/q10/s2/d11/i2
r4_seq_0_1292_1412_+/q10/s3/d7/i1
r5_seq_0_2978_3566_+/q10/s10/d45/i7
...

The example is pretty simple, we performed the following steps:

  • Import the necessary standard and bíogo libraries.
  • Open the file specified as the first command line argument (os.Args[1]) — a real world application would use the flag package instead.
  • In the examples we do not ignore errors, but instead we are dealing with them by aborting right away using panic.
  • We needed to construct a template sequence object for the fastq reader, through a call to linear.NewQSeq(), creating an empty sequence object with DNA alphabet, Sanger quality encoding and an empty ID.
  • Understanding the concept of interfaces is key to getting into Go programming. Here the *os.File returned by os.Open() satisfies the io.Reader interface and the template sequence satisfies the seqio.SequenceAppender interface, so we can pass them to the fastq.NewReader function which will return a pointer to a new fastq reader.
  • In the loop we read the next record by calling reader.Read(), breaking it when we get an end-of-file error.
  • The call to reader.Read() returns an object satisfying the seq.Sequence interface having the following methods:
type Sequence interface {
Feature
At(int) alphabet.QLetter // Return the letter at a specific position.
Set(int, alphabet.QLetter) error // Set the letter at a specific position.
Alphabet() alphabet.Alphabet // Return the Alphabet being used.
RevComp() // Reverse complement the sequence.
Reverse() // Reverse the order of elements in the sequence.
New() Sequence // Return a zero value of the sequence type, with the same alphabet.
Clone() Sequence // Return a copy of the Sequence.
CloneAnnotation() *Annotation // Return a copy of the sequence's annotation.
Slicer
Conformationer
ConformationSetter
}
  • For each sequence object, we call the CloneAnnotation() method to get an annotation structure containing the sequence ID, which we print out to the standard output.

Now what if we want to look at the first 10 bases of each sequence? Let us first copy the source and then modify it to print them out.

$ cd ..
$ cp -r read_fastq/ fastq_read_seq
$ cd fastq_read_seq

Edit main.go so it looks like the following code:

Internally, the linear.QSeq objects (which we gave as template to the fastq reader) store the sequence record as a slice of alphabet.QLetter structs. From the interface definition of seq.Sequence above we can figure out that the At() method can be used to access the QLetter structure corresponding to a position. After doing that all its left is to access the base at ql.L and convert it to a string before printing it out.

Now if we build and run the new program we get the first 10 bases in the output:

$ go build
$ ./fastq_read_seq ../reads.fq
r0_seq_0_1495_1807_+/q9/s5/d31/i5 TCTTACTTAC
r1_seq_0_1952_2583_-/q10/s8/d40/i13 ATTAAACACC
r2_seq_0_2926_3143_+/q8/s8/d23/i4 GAATTACGGT
r3_seq_0_4187_4324_+/q10/s2/d11/i2 CGCGGTTACC
r4_seq_0_1292_1412_+/q10/s3/d7/i1 TCAGCGCTGC
r5_seq_0_2978_3566_+/q10/s10/d45/i7 GTACGGTGTT
r6_seq_0_1878_2426_+/q10/s10/d33/i9 GCAACAGTTC
...

Format conversion

There’s a joke that says that 80% of bioinformatics is format conversions. We have managed to read a fastq file using bíogo, so we are halfway there! Let’s strip our sequences of the quality values by converting them to fasta format.

We will modify the first example, so first we copy it:

$ cd ..
$ cp -r fastq_read fastq_to_fasta
$ cd fastq_to_fasta

The the code for converting fastq to fasta will look like this:

Pretty similar to the first example, but we can notice some changes:

  • Now we import the library for handling input and output in fasta format. We no longer import fmt, as it is not needed. Do not forget to remove it or your program will not build.
  • We open a second file for writing and create a fasta writer object.
  • Then in the loop we write out the record instead of printing out the sequence ID. Simple enough!

Update main.go with the new code, and build the new binary. After building and running the converter we can check whether the output is valid:

$ go build
$ ./fastq_to_fasta ../reads.fq reads.fas
$ head -8 reads.fas
>r0_seq_0_1495_1807_+/q9/s5/d31/i5
TCTTACTTACGCGTCATCCTCCCTCATTCCCAGACATACGTACTGGTAAGCCAGCATCTCACCCATGTTTACTGGATTGC
AGGTATGCTCGTCTTCCCACCAAATAATCATGAGTACTGCTTATGAGGTTAATATGCATGAATATCGATATGACTAATCA
GCTCCCACGACTTAGGCTAGTACTGCCCGCAAAGATTCGGATACAATCTTGATCGGGGTCTAGCAAACAGCGGTTCCCCG
TGAGGTTTGCGATAATAGCAACTATTTAACTCTGACAATGTGGGTG
>r1_seq_0_1952_2583_-/q10/s8/d40/i13
ATTAAACACCAGTCGTTTCGAACTGATCTAATTATGTATGCGTGGTATGCCCGTGGCGAAAGCCTACTCCTCTGACCGCT
TCAACCATTGCATACATCACGCGTTAGCTCTATGGACGGTCAAAATTATCCAACTGGCACTGTCGGGTTCTTCTCATGCA

Calculating the mean quality score

We had some fun with IDs and sequences so far, but what about the quality values? During analysing sequencing data we are often interested in the mean base quality of a read which is handy for filtering. So let us calculate the mean base quality value of the records in our fastq input.

Remember that the records are stored as a slice of QLetter structs, which we can access via the At() method of the seq.Sequence objects — so far so good. One would be tempted then to get the quality values and average them, but that would be the wrong way. So what we will do is convert the quality scores into error probabilities, average those and convert back into quality scores.

$ cd ..
$ cp -r fastq_read mean_qual
$ cd mean_qual

Update main.go with the code below:

$ go build
$ ./mean_qual ../reads.fq
r0_seq_0_1495_1807_+/q9/s5/d31/i5 40
r1_seq_0_1952_2583_-/q10/s8/d40/i13 40
r2_seq_0_2926_3143_+/q8/s8/d23/i4 40
r3_seq_0_4187_4324_+/q10/s2/d11/i2 40
r4_seq_0_1292_1412_+/q10/s3/d7/i1 40
...

After building and running the program we can see that all mean quality values are equal to 40, which is not surprising as we used simulated data which had no other quality scores. Running it on real data would give more diverse output.

The first part of the tutorial ends here. I hope it was fun to experiment with the basics of bíogo.

Click here for part II.