Turn BLAST results into a presence/absence matrix

(Also an introduction to working with tabular data using pandas)

5 min readOct 31, 2017

If you’re interested in the evolution of a gene or particular set of genes, you may want to try out some tools that predict gain/loss events, perhaps to answer the question:

When (or where on a phylogenetic tree) was this character acquired?

You may be aware that you can search for homologous sequences among a group of organisms using BLAST, which you can run on NCBI’s servers or locally on your own.

So let’s say you download the genomes you’re interested in and BLAST your genes of interest against them, returning a table of homologous sequences. How do you go from those results to something you can feed to a gain/loss prediction tool?

Make friends with pandas. Load your results.

While there are many ways you could go about approaching this task, I’ve found it easiest to work with tabular output from BLAST using the pandas library. Assuming you’re working with BLAST output like me, your command to run the search looked something like:

blastn -query genes_of_interest.fa -db rel_genome_DB -outfmt "6 qseqid sseqid stitle pident qcovs length mismatch gapopen qstart qend sstart send qframe sframe frames evalue bitscore qseq sseq" -evalue 0.00001 -task blastn -num_threads 4 -word_size 7 -max_target_seqs 100000 -perc_identity 30 -out ./rel_BLAST_results.txt

I mention this because BLAST tabular output has no column headers, so you’re going to want to grab the list of columns you asked for so that later on, you know what you’re looking at.

Let’s import pandas and load our BLAST results:

import pandas as pdwith open('rel_BLAST_results.txt', 'r') as infile:# Here's where the BLAST command comes in handy
    col_names = ['qseqid', 'sseqid', 'stitle', 'pident', 'qcovs', \ 
        'length', 'mismatch', 'gapopen', 'qstart', 'qend', \
        'sstart, 'ssend', 'qframe', 'sframe', 'frames', 'evalue', \ 
        'bitscore', 'qseq', 'sseq']
    df = pd.read_csv(infile, sep='\t', header=None, names=col_names)

Subject sequence IDs (sseqid) by default looks like ref|Accession# and I’m not a big fan of that, so let’s change it so it just shows the accession number on its own. I’ll first define a function that I can apply across the whole dataframe:

def get_acc_num(str_in):
    acc_num = str_in.split('|')[1]
    return acc_num

Then apply it:

df['sseqid'] = df['sseqid'].apply(get_acc_num)

I’ll also create an empty dataframe to hold my presence/absence matrix.

pa_matrix = pd.DataFrame()

Filter as desired. Load organism metadata.

How good a given BLAST hit has to be in order for you to count that sequence as “present” in that organism is up to you. In my case, you may have noticed that I specified an e-value of 0.00001 or less in my BLAST results in the first place. Immediately after running my BLAST, I also pared my results down to include only the top hit per query-subject pair.

Let’s suppose you hadn’t applied an e-value threshold already. You could easily add that here, either overwriting your original dataframe or creating a new one with your filter(s) applied.

df_fil =  df.drop(df[df.evalue > 0.00001].index)

In addition to filtering your data, you may also need to provide some additional metadata if your BLAST subject IDs don’t correspond directly to your desired “leaves” on your phylogenetic tree. For example, in my case, I want to know if a given gene is present in a given bacterium, but to do that I’m searching both the organisms chromosome and any plasmids it might have. If there’s a good hit to either the chromosome OR the plasmid (or both), I want to designate the gene *present* for the entire organism. You’d also run into this scenario in an organism with multiple chromosomes.

Once again, there are many possible ways to address this issue, but in my case I simply made a dictionary where all the keys are the sseqids that appear in my BLAST results, and all the values are the IDs or names of organisms that represent the leaves on my tree. My dictionary looks something like this:

{'NC_008783.1': 'GCF_000007765.2', 'NC_002971.4': 'GCF_000007765.2', 'NC_004704.2': 'GCF_000007765.2', 'NC_009727.1': 'GCF_000017105.1', 'NC_009726.1': 'GCF_000017105.1', ... }

And I’ll load it using pickle:

import pickleasmbly_dict = pickle.load(open('asmbly_dict.p', 'rb'))

Group results by character and designate presence/absence.

Pandas lets me loop through my BLAST results by groupings of my choosing. In this case, I want to consider my characters (in this case, genes) individually: just because one of the genes is present in a given organism doesn’t mean it’s present in all.

For each gene, I create a dictionary with all my organisms as keys and True/False (present/absent) as my values, then add these to the empty dataframe I created earlier. There may be a more efficient way of doing this, but it works for me! I’m going to go back to just calling my dataframe df, but if you have a separate dataframe with filtered results, don’t forget to use that instead.

# Group results by query
grp = df.groupby(df['qseqid'])

When you loop through a pandas GroupBy object, you can access both the name of your grouping and the slice of your dataframe that falls in that group. So in this case, name is the name of the gene and data is a dataframe of results for just that gene.

for name, data in grp:
    print(name)    pa_dict = {}# Get a list of all the accession numbers that have \
    results for that gene
    present = data['sseqid'].tolist()# Loop through organisms of interest. Mark as 'present' in \
    organism based on sseqid
    for key, value in asmbly_dict.items():
        if key in present:
            pa_dict[value] = True# If it's already been marked present in that organism, it \
    should stay that way, because there might be multiple \
    sseqids per organism. So we need to check to see if it's \
    already there before marking anything 'False.'
        else:
            if value in pa_dict:
                continue
            else:
                pa_dict[value] = False# Add that dictionary to the master dataframe
    pa_matrix[name] = pd.Series(pa_dict)

Write results to desired format (including MSA)

Next we just convert our trues and falses to ones and zeros, then write to file:

pa_matrix = pa_matrix.astype(int)with open('matrix.csv', 'w') as csv:
    pa_matrix.to_csv(csv)

Some programs want your presence/absence matrix as a multiple-sequence alignment (MSA), in fasta format. This looks just like a fasta file that has DNA or RNA in it, but instead of nucleotides it’s all 1s and 0s. Each position represents a different character (e.g. gene).

>A
11101010111000101011111011111
>B1
11111011000100110110000011111
>B2
10110001111001110101010111111
>C
11011000000101111000101000001

In order to get our results in this format, we need to take a couple of extra steps. I’m going to use BioPython because it will write fasta format for us.

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

Next, we join what used to be the contents of separate cells into one long string:

pa_msa = pa_matrix.apply(lambda row: ''.join(map(str, row)), axis=1)

I looped through my dataframe because that was the easiest for me to understand at the time… there’s likely a more efficient method.

with open('msa.fasta', 'w') as msa_out:
    msa_records = []
    for i, val in pa_concat.iteritems():
        seq = Seq(str(val))
        record = SeqRecord(seq, id=i)
        msa_records.append(record)
    SeqIO.write(msa_records, msa_out, 'fasta')

All set! Now go forth and make some trees.