A Julia script to concatenate (FASTA format) multiple sequence alignments

Published in

Computational Biology Core

3 min readAug 23, 2019

After going to JuliaCon2019 I’ve been inspired to use Julia more in my projects. I had previously tried it out before, but it was right after the transition from 0.6 to 1.0, so I kept running into problems from code and package documentation that was now outdated. I got frustrated and dropped it.

Now a lot of the documentation has caught up, and the core packages in BioJulia have been updated as well, so I’ve picked it back up again.

MSA image taken from Wikipedia; generated with ClustalX

The first task I decided to use Julia for was a script to concatenate multiple sequence alignments that were in FASTA format, such as those outputted by mafft. This is to then run a phylogeny inference program on them, such as RAxML. My coworker Mary McGrath helped tremendously with optimizing it by taking advantage of accessing the data and arrays in FASTA records which is of type UInt8 directly rather than working with their string representations. The code is shown below and can also be found in our BioCookbook repo, which will host Julia scripts we use.

In [1]:

using FASTX
using BioSequencesfunction addRecords!(dest::FASTX.FASTA.Record, src::FASTX.FASTA.Record)
  append!(dest.data, src.data[src.sequence])
  dest.filled = UnitRange{Int}(dest.filled.start, dest.filled.stop + src.filled.stop - src.sequence.start + 1)
  dest.sequence = UnitRange{Int}(dest.sequence.start, dest.sequence.stop + src.sequence.stop - src.sequence.start + 1)
endfunction concatenate(infiles::Vector{String}, outfile::String)
  concat = Dict{String, FASTX.FASTA.Record}()
  allLabels = Set(String[])
  alignLen = 0
  alphabet=""
  firstrecord=true
  for (i, file) in enumerate(infiles)
    theseLabels = String[]
    if i == 1 # first file will have only new ids
      reader = open(FASTX.FASTA.Reader, file)
      for record in reader
        if firstrecord==true
          alignLen += length(record.sequence)
          firstrecord=false
          alphabet = typeof(FASTX.FASTA.sequence(record))
        end
        id = FASTX.FASTA.identifier(record)
        push!(allLabels, id)
        concat[id] = record
      end
      close(reader)
    else # later records need to check for id existance
      record = FASTX.FASTA.Record()
      open(FASTX.FASTA.Reader, file) do reader
        while !eof(reader)
          read!(reader, record)
          id = FASTX.FASTA.identifier(record)
          push!(theseLabels, id)
          if id in allLabels
            addRecords!(concat[id], record)
          else # need to create and back-propogate gaps
            push!(allLabels, id)
            newRecord = FASTX.FASTA.Record(id,repeat("-",alignLen))
            pop!(newRecord.data)
            concat[id] = newRecord
            addRecords!(concat[id], record)
          end
        end
        alignLen += length(record.sequence)
        # assume last sequence will be representative of alphabet
        @assert alphabet == typeof(FASTX.FASTA.sequence(record))
      end
      missingLabels = setdiff(allLabels, theseLabels)
      for label in missingLabels
        gapRecord = copy(record)
        gapRecord.data[gapRecord.sequence] = repeat(UInt8[0x2d], length(gapRecord.sequence))
        addRecords!(concat[label], gapRecord)
      end
    end
  end
  open(FASTX.FASTA.Writer, outfile) do w
    for (key, value) in concat
      write(w, value)
    end
  end
  return concat
end

Out[1]:

concatenate (generic function with 1 method)

For me, the main appeal of Julia has been the idea that it can deal with the two-language problem of prototyping and visualizing in one easy-to-use dynamic language like Python or R and rewriting for speed in a lower-level compiled language like C++. This means I’m interested in benchmarking my code against those in other languages.

Benchmarking in Julia is very easy with BenchmarkTools.jl. I took a Python concatenation implementation using BioPython from here and modified it very slightly to benchmark against with timeit. For both I wrote a quick alignment simulation function to generate numFiles alignments and numSeqssequences in each alignment. The actual code for those I won't show, but they can be found in BioCookbook. The results are below.

Julia benchmark

Simulation code is in BioCookbook/src/simulation.jl.

In [2]:

#Pkg.add("http://github.com/compbiocore/BioCookbook.jl")
import BioCookbook
using BenchmarkTools
tmpdir = tempdir()
infiles = BioCookbook.simAlignments(30,100,tmpdir)
tmpfile = mktemp(tmpdir)[1]
@benchmark concatenate($infiles,$tmpfile)

Out[2]:

BenchmarkTools.Trial: 
  memory estimate:  8.44 MiB
  allocs estimate:  96269
  --------------
  minimum time:     7.985 ms (0.00% GC)
  median time:      8.564 ms (0.00% GC)
  mean time:        8.967 ms (5.83% GC)
  maximum time:     49.570 ms (83.00% GC)
  --------------
  samples:          557
  evals/sample:     1

In [3]:

# doing some cleanup
for f in infiles
    rm(f)
end
rm(tmpfile)

Python benchmark

Concatenation and simulation code is in ‘BioCookbook/python/concatenate.py’.

In [4]:

;python -m timeit -s 'import tempfile; tmpdir=tempfile.TemporaryDirectory(); from python.concatenate import simAlignments; infiles=simAlignments(30,100,tmpdir.name); outf=tempfile.NamedTemporaryFile().name;' 'from python.concatenate import concatenate; concatenate(infiles,outf)'10 loops, best of 3: 47.3 msec per loop

The Julia implementation is about 5x faster which is pretty neat; motivation to continue using Julia in the future. The benchmarks can be reproduced in benchmark.ipynb in BioCookbook, although I didn't set a seed so your results may slightly vary.

NOTE: this post is also cross-posted at our Brown-CCV blog since it is relevant there as well.

A Julia script to concatenate (FASTA format) multiple sequence alignments

Julia benchmark

Python benchmark

Written by August Guang