How to Download All Bacterial Assemblies from NCBI

A short note on how one might download the complete bacterial assemblies from NCBI for research work

Published in The Computational Biology Magazine, Aug 9, 2020

One of the most important steps in genome analysis is gathering the data required for downstream research. This sometimes requires assembled reference genomes (mostly bacterial), so that we can verify that trained classifiers or detected bins are correct and useful. This is often done with a BLAST search against the candidate reference genomes. If you are going to run many search queries, it is very convenient to have your own BLAST database set up in advance. NCBI Web BLAST might not be a viable option if the project is a long-running one with many experiments.

In this article, we will see how to download the set of all available bacterial references (or assemblies) from either the GenBank or the RefSeq database. The process is not quite straightforward, hence this article dedicated to the task.

Downloading Assembly Information

Assembly metadata is available at NCBI under the URL: ftp://ftp.ncbi.nih.gov/genomes/. All the directories are listed there and can be browsed. In this article, we are concerned with the reference genomes from either the GenBank or the RefSeq database.

In both databases, bacterial references are available under the following paths:

GenBank: ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/
RefSeq: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/

Furthermore, the information regarding individual assemblies is available under:

GenBank: ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
RefSeq: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

Use one of the following commands to download the summary file.

wget ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
# OR
wget ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

The first line of the downloaded file points to a README that describes its columns:

#   See ftp://ftp.ncbi.nlm.nih.gov/genomes/README_assembly_summary.txt for a description of the columns in this file.
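To check which column number holds which field in your own copy of the file, the header can be printed as a numbered list. This is a minimal sketch, assuming the column names sit on the last "#" line before the data rows; the function name list_columns is only illustrative.

```shell
# List each column number next to its name.
# Assumption: the column names are on the last '#' line before the data rows.
list_columns() {
    awk -F '\t' '
        /^#/ { header = $0; next }   # remember the most recent comment line
        { exit }                     # stop at the first data row
        END {
            sub(/^#[ ]*/, "", header)        # drop the leading "#" marker
            n = split(header, names, "\t")
            for (i = 1; i <= n; i++) print i "\t" names[i]
        }
    ' "$1"
}
```

Calling list_columns assembly_summary.txt should then print one line per column, which makes it easy to confirm the column numbers used in the rest of this article.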

A summary of the critical information in the README:

Column 11: “version_status”: one of “latest”, “replaced” or “suppressed”. It is common to go with the “latest” assembly.
Column 12: “assembly_level”: one of “Complete Genome”, “Chromosome”, “Scaffold” or “Contig”. Usually, the “Complete Genome” rows are the important ones for analysis.
Column 14: “genome_rep”: either “Full” or “Partial”. “Full” means the data used for the assembly came from the whole genome (as in WGS assemblies); “Partial” means it came from only part of the genome.
Column 20: “ftp_path”: the path to download the files from. Within the assembly path, you can use the wildcard file ending “*_genomic.fna.gz” to select the FASTA assembly file.
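Before choosing a filter, it can help to see how many assemblies fall under each assembly level. The following is a small sketch, assuming a tab-separated summary file with "#" comment lines at the top; count_by_level is an illustrative name, not part of any NCBI tooling.

```shell
# Count assemblies per assembly level (column 12), largest group first.
count_by_level() {
    awk -F '\t' '
        /^#/ { next }                 # skip comment and header lines
        { count[$12]++ }              # tally each value of assembly_level
        END { for (level in count) print count[level] "\t" level }
    ' "$1" | sort -rn
}
```

Running count_by_level assembly_summary.txt gives a quick sense of how much data each choice of level would pull in.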

In the Bash terminal, you can use the following command to obtain the FTP paths for downloading the references.

awk -F '\t' '{if($12=="Complete Genome") print $20}' assembly_summary.txt > assembly_summary_complete_genomes.txt

Here we keep only the rows whose column 12 is “Complete Genome”. The assembly_summary_complete_genomes.txt file now lists the FTP paths to download the genomes from. Its first 3 lines look like the following:

ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/010/525/GCF_000010525.1_ASM1052v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/365/GCF_000007365.1_ASM736v1
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/007/725/GCF_000007725.1_ASM772v1
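The filter above looks only at column 12. If you also want to restrict on version status and genome representation, the conditions can be combined. This is a sketch under the assumption that the field values are spelled as described in the README; the function name select_ftp_paths is hypothetical, and the check for "na" guards against rows without a usable path.

```shell
# Print FTP paths of assemblies that are latest, complete, and full.
select_ftp_paths() {
    awk -F '\t' '
        /^#/ { next }                          # skip comment and header lines
        $11 == "latest" &&
        $12 == "Complete Genome" &&
        $14 == "Full" &&
        $20 != "na" { print $20 }              # column 20 is ftp_path
    ' "$1"
}
```

For example, select_ftp_paths assembly_summary.txt > assembly_summary_complete_genomes.txt would produce a path list in the same format as the one shown above.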

Downloading the FASTA Files

It is better to first create a folder named references using the following command.

mkdir references

Now you can execute the following command to download the gzipped FASTA files.

for next in $(cat assembly_summary_complete_genomes.txt); do wget -P references "$next/*genomic.fna.gz"; done

The references folder will now contain the gz files for all the assemblies that satisfy the condition specified in the awk command earlier.
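A wildcard on "genomic.fna.gz" can also match companion files such as *_cds_from_genomic.fna.gz where an assembly directory provides them. One alternative is to build the exact file name, since NCBI names the main FASTA file after the last component of the assembly path. The sketch below assumes that naming scheme; genomic_fasta_url and download_genomes are illustrative names, and the -c flag asks wget to resume partially downloaded files.

```shell
# Build the URL of the main genomic FASTA file from an assembly path.
# Assumption: the file is named <last path component>_genomic.fna.gz.
genomic_fasta_url() {
    local dir="${1%/}"                       # drop a trailing slash, if any
    printf '%s/%s_genomic.fna.gz\n' "$dir" "${dir##*/}"
}

# Download one FASTA file per path listed in a text file.
download_genomes() {
    local list="$1" outdir="$2"
    mkdir -p "$outdir"
    while IFS= read -r path; do
        [ -n "$path" ] || continue           # ignore empty lines
        wget -c -P "$outdir" "$(genomic_fasta_url "$path")"
    done < "$list"
}
```

A call such as download_genomes assembly_summary_complete_genomes.txt references would then fetch exactly one file per assembly, and it can be re-run after an interruption.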

For downstream analysis, you will need to either extract the files or gather them into a single file. We suggest that readers have a look at the following article to see how to make a BLAST database.
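Gathering the downloads into a single FASTA file could look like the sketch below. It assumes all the gz files sit in one folder; merge_fasta is a hypothetical helper name, and find is used instead of a shell wildcard so that a very large number of files does not exceed the command-line length limit.

```shell
# Decompress every *.fna.gz file in a folder into one combined FASTA file.
merge_fasta() {
    local indir="$1" out="$2"
    # gzip -dc writes the decompressed data to stdout and keeps the originals
    find "$indir" -name '*.fna.gz' -exec gzip -dc {} + > "$out"
}
```

For instance, merge_fasta references all_references.fna would write every sequence into all_references.fna. If BLAST+ is installed, a command along the lines of makeblastdb -in all_references.fna -dbtype nucl -out bacterial_references could then build a nucleotide database from it; the file and database names here are placeholders.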

How to Use BLAST in Your Research: how to use a Biopython-based BLAST search to support research work (medium.com)

We hope this article will help future researchers obtain the datasets needed for their bacterial studies. Happy reading. Cheers!