A Julia script to concatenate (FASTA format) multiple sequence alignments

August Guang
Computational Biology Core
3 min readAug 23, 2019

After going to JuliaCon2019 I’ve been inspired to use Julia more in my projects. I had previously tried it out before, but it was right after the transition from 0.6 to 1.0, so I kept running into problems from code and package documentation that was now outdated. I got frustrated and dropped it.

Now a lot of the documentation has caught up, and the core packages in BioJulia have been updated as well, so I’ve picked it back up again.

MSA image taken from Wikipedia; generated with ClustalX

The first task I decided to use Julia for was a script to concatenate multiple sequence alignments that were in FASTA format, such as those outputted by mafft. This is to then run a phylogeny inference program on them, such as RAxML. My coworker Mary McGrath helped tremendously with optimizing it by taking advantage of accessing the data and arrays in FASTA records which is of type UInt8 directly rather than working with their string representations. The code is shown below and can also be found in our BioCookbook repo, which will host Julia scripts we use.

In [1]:

using FASTX
using BioSequences
function addRecords!(dest::FASTX.FASTA.Record, src::FASTX.FASTA.Record)
append!(dest.data, src.data[src.sequence])
dest.filled = UnitRange{Int}(dest.filled.start, dest.filled.stop + src.filled.stop - src.sequence.start + 1)
dest.sequence = UnitRange{Int}(dest.sequence.start, dest.sequence.stop + src.sequence.stop - src.sequence.start + 1)
end
function concatenate(infiles::Vector{String}, outfile::String)
concat = Dict{String, FASTX.FASTA.Record}()
allLabels = Set(String[])
alignLen = 0
alphabet=""
firstrecord=true
for (i, file) in enumerate(infiles)
theseLabels = String[]
if i == 1 # first file will have only new ids
reader = open(FASTX.FASTA.Reader, file)
for record in reader
if firstrecord==true
alignLen += length(record.sequence)
firstrecord=false
alphabet = typeof(FASTX.FASTA.sequence(record))
end
id = FASTX.FASTA.identifier(record)
push!(allLabels, id)
concat[id] = record
end
close(reader)
else # later records need to check for id existance
record = FASTX.FASTA.Record()
open(FASTX.FASTA.Reader, file) do reader
while !eof(reader)
read!(reader, record)
id = FASTX.FASTA.identifier(record)
push!(theseLabels, id)
if id in allLabels
addRecords!(concat[id], record)
else # need to create and back-propogate gaps
push!(allLabels, id)
newRecord = FASTX.FASTA.Record(id,repeat("-",alignLen))
pop!(newRecord.data)
concat[id] = newRecord
addRecords!(concat[id], record)
end
end
alignLen += length(record.sequence)
# assume last sequence will be representative of alphabet
@assert alphabet == typeof(FASTX.FASTA.sequence(record))
end
missingLabels = setdiff(allLabels, theseLabels)
for label in missingLabels
gapRecord = copy(record)
gapRecord.data[gapRecord.sequence] = repeat(UInt8[0x2d], length(gapRecord.sequence))
addRecords!(concat[label], gapRecord)
end
end
end
open(FASTX.FASTA.Writer, outfile) do w
for (key, value) in concat
write(w, value)
end
end
return concat
end

Out[1]:

concatenate (generic function with 1 method)

For me, the main appeal of Julia has been the idea that it can deal with the two-language problem of prototyping and visualizing in one easy-to-use dynamic language like Python or R and rewriting for speed in a lower-level compiled language like C++. This means I’m interested in benchmarking my code against those in other languages.

Benchmarking in Julia is very easy with BenchmarkTools.jl. I took a Python concatenation implementation using BioPython from here and modified it very slightly to benchmark against with timeit. For both I wrote a quick alignment simulation function to generate numFiles alignments and numSeqssequences in each alignment. The actual code for those I won't show, but they can be found in BioCookbook. The results are below.

Julia benchmark

Simulation code is in BioCookbook/src/simulation.jl.

In [2]:

#Pkg.add("http://github.com/compbiocore/BioCookbook.jl")
import BioCookbook
using BenchmarkTools
tmpdir = tempdir()
infiles = BioCookbook.simAlignments(30,100,tmpdir)
tmpfile = mktemp(tmpdir)[1]
@benchmark concatenate($infiles,$tmpfile)

Out[2]:

BenchmarkTools.Trial: 
memory estimate: 8.44 MiB
allocs estimate: 96269
--------------
minimum time: 7.985 ms (0.00% GC)
median time: 8.564 ms (0.00% GC)
mean time: 8.967 ms (5.83% GC)
maximum time: 49.570 ms (83.00% GC)
--------------
samples: 557
evals/sample: 1

In [3]:

# doing some cleanup
for f in infiles
rm(f)
end
rm(tmpfile)

Python benchmark

Concatenation and simulation code is in ‘BioCookbook/python/concatenate.py’.

In [4]:

;python -m timeit -s 'import tempfile; tmpdir=tempfile.TemporaryDirectory(); from python.concatenate import simAlignments; infiles=simAlignments(30,100,tmpdir.name); outf=tempfile.NamedTemporaryFile().name;' 'from python.concatenate import concatenate; concatenate(infiles,outf)'10 loops, best of 3: 47.3 msec per loop

The Julia implementation is about 5x faster which is pretty neat; motivation to continue using Julia in the future. The benchmarks can be reproduced in benchmark.ipynb in BioCookbook, although I didn't set a seed so your results may slightly vary.

NOTE: this post is also cross-posted at our Brown-CCV blog since it is relevant there as well.

--

--