Batch download genomes from NCBI

Get ALL available complete genomes for a given bacterium, starting with just a list of species names.

Downloads from NCBI are pretty easy to get a handle on… if you know what you want.

In this case, I had the following requirements:

  • Find ALL the available genomes for a large list of species, starting only with the species name (as opposed to taxonomy ID or another identifier)
  • Download only those genomes that are complete, and for which the Genbank and RefSeq assemblies are identical (when they’re not, there’s often something fishy going on)
  • Save a separate file for each species
  • Save a record of all the information about the selected assemblies for later reference

Step 1) Get organized

At the end of the day, I often have to pass my data off to someone to look at in an excel workbook. So, might as well create that from the start rather than sifting through my output later. Pandas handles this for me nicely.

import pandas as pd
# Now I'm going to do everything from here forward in my workbook
with open('wkbkname.xlsx', 'wb') as outfile:
writer = pd.ExcelWriter(outfile)
# Let's make an empty dataframe where I'll keep my results
ga_df = pd.DataFrame()
# I also want to store a dict of organisms and associated ids
acc_dict = {}

Also going to import some modules I’ll need later:

import ftplib
from Bio import Entrez, SeqIO

Step 2) Identify the genomes of interest

Fortunately for us, NCBI keeps records of all the genome assemblies in their database on their FTP site. Specifically, there’s a file called an assembly summary for every species, which we can find because using the species name in the path.

First things first, let’s parse our list of species and gets those names in a format we can work with. In this case, my list of species was just a text file with one name on every line in the format Genus species (e.g. Escherichia coli), so I can read that into a list like so:

with open(sp_infile, 'r') as sp_in:
sp_list = list(line.rsplit('\n')[0] for line in sp_in)

We’re going to loop through this list, get the name in a format we want, then grab the file we want from NCBI’s servers.

for species in sp_list:
# Save both the full species name and shorter name     
fulln = species.split(' ')[0] \
+ '_' + species.split(' ')[1]
shortn = species.split(' ')[0][0] \
+ '_' + species.split(' ')[1]
# Create a name for the output file
fn = shortn + '_asummary.txt'
# Grab the assembly summary from NCBI and save a copy
with open(sum_fn, 'wb') as sumfile:
ftp = ftplib.FTP(host='ftp.ncbi.nih.gov', \
user='anonymous', \
passwd='youremail@domain.com')
dir = '/genomes/genbank/bacteria/' + fulln + '/'
ftp.cwd(dir)
ftp.retrbinary('RETR assembly_summary.txt', sumfile.write)
ftp.quit

Next we’ll read the assembly summary into a pandas data frame and figure out what we don’t want.

sum_fn = shortn + '_asummary.txt'
with open(sum_fn, 'rb') as sumfile:
asmbly_df = pd.read_csv(sumfile, \
sep='\t', skiprows=1, header=0)
# Drop if assembly level != Complete Genome
asmbly_df = asmbly_df.drop(asmbly_df[asmbly_df.assembly_level \
!= 'Complete Genome'].index)
# Drop if Genbank and RefSeq assemblies are not identical
asmbly_df = asmbly_df.drop(asmbly_df[asmbly_df.paired_asm_comp != 'identical'].index)

Step 3) Search NCBI for the necessary identifiers

Now we know which assemblies we’re interested in, but that’s actually not the identifier we need to download the genomes. Instead, we’re going to use the assembly id to query NCBI’s database using the ESearch utility, which we can use thanks to the Entrez module from Biopython.

So, let’s import Entrez and let NCBI know we’re coming:

Entrez.email = 'youremail@domain.com'

In this case, I wrote a function that takes an assembly id as input and returns the corresponding accession number. In this case, I’m only interested in the chromosomal genome of a bacterium, not any plasmids that it might have.

def get_accession(assembly_id): 

# First I define my search terms. No plasmids wanted here.
term = assembly_id + ' AND complete genome NOT plasmid[Title]'
# Then I'm going to search the nucleotide database
refseq_id = Entrez.read(Entrez.esearch(db='nucleotide', \
term=term))['IdList']
# Provided that there is a corresponding ID, let's fetch it
if len(refseq_id) == 0:
accession = None
else:
seq_record = Entrez.efetch(db='nucleotide', \
id=refseq_id, retmode='xml')
results = Entrez.read(seq_record)
accession = results[0]['GBSeq_accession-version']
return accession

Ok, since I now have a function that turns assembly IDs into accession numbers, I can apply this within my assembly data frame to keep everything all nice and organized:

# Apply my function across the dataframe
asmbly_df['paired_GB_accession'] = \
asmbly_df['gbrs_paired_asm'].apply(get_accession)
# For later use...
# Append retained records to assembly master dataframe
ga_df = ga_df.append(assembly_df)
# Add to my dictionary of orgs and ids
non_empty = asmbly_df[asmbly_df['paired_GB_accession'].notnull()]
acc_dict['%s' % fulln] = list(non_empty['paired_GB_accession'])

Step 4) Download the genomes!

Finally, I’ll download the genomes using EFetch and write them to file. In my case, I’m doing this species-by-species, in part because I want a separate fasta for every species (but not a separate fasta for every genome).

with open('%s_genomes.fa' % fulln, 'a') as outfile:
for acc in acc_dict['%s' % fulln]:
with Entrez.efetch(db='nucleotide', id=acc, \
rettype='fasta', retmode='text') as handle:
seq_record = SeqIO.read(handle, 'fasta')
SeqIO.write(seq_record, outfile, 'fasta')

When I’m all done (i.e. have looped through all my species of interest), I’ll write the full set of assembly records to file.

ga_df.to_excel(writer, sheet_name='selected_assemblies')
writer.save()

Congrats! You can download *ALL* available genomes for a given bacterial species, starting with only the species name.

One clap, two clap, three clap, forty?

By clapping more or less, you can signal to us which stories really stand out.