How to Download All Bacterial Assemblies from NCBI

A short note on how one might download the complete bacterial assemblies from NCBI for research work

One of the most important steps in genome analysis is gathering the data required for downstream research. This sometimes requires assembled reference genomes (mostly bacterial), so that we can verify that trained classifiers or detected bins are correct and useful. This is often done with a BLAST search against the candidate reference genomes. If you are going to make many search queries, it is very convenient to have your own BLAST database set up in advance. NCBI Web BLAST might not be a viable option if the project is a long-running one with many experiments.

In this article, we will see how to download the set of all available bacterial references (or assemblies) from either the GenBank or the RefSeq database. This wasn't quite straightforward, hence an article dedicated to this particular task.

Image by Arek Socha from Pixabay

Downloading Assembly Information

Assembly metadata is available at NCBI under the URL: ftp://ftp.ncbi.nih.gov/genomes/. All the directories are listed there and can be browsed. Since most current browsers no longer support FTP, the same tree is also served over HTTPS at https://ftp.ncbi.nlm.nih.gov/genomes/. In this article, we are concerned with the reference genomes from either the GenBank or the RefSeq database.

In both databases, bacterial references are available under the following paths:

  • GenBank: ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/
  • RefSeq: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/

Furthermore, the information regarding individual assemblies is available at:

  • GenBank: ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
  • RefSeq: ftp://ftp.ncbi.nih.gov/genomes/refseq/bacteria/assembly_summary.txt

Use the following command to download the summary file.
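A minimal sketch of such a command, assuming wget is installed and that the GenBank summary is the one wanted. The helper name fetch_summary and the variable SUMMARY_URL are made up for this example; swap "genbank" for "refseq" in the URL to fetch the RefSeq summary instead.

```shell
# Assumption: wget is available on the PATH.
# SUMMARY_URL is the GenBank summary path listed above.
SUMMARY_URL="ftp://ftp.ncbi.nih.gov/genomes/genbank/bacteria/assembly_summary.txt"

# fetch_summary URL FILE (hypothetical helper): save URL to FILE.
fetch_summary() {
    wget -O "$2" "$1"
}
```

Calling `fetch_summary "$SUMMARY_URL" assembly_summary.txt` should then save the summary in the current directory.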

Our first step is to download this text file. Its first line points to the README that describes the columns.

The critical information from the README is summarized below:

  • Column 12: “assembly_level”: one of “Complete Genome”, “Scaffold”, “Chromosome” or “Contig”. Usually, the “Complete Genome” rows are the ones that matter for analysis.
  • Column 14: “genome_rep”: either “Full” or “Partial”. “Full” means the assembly is intended to represent the whole genome, “Partial” only a part of it.
  • Column 11: “version_status”: “latest”, “replaced” or “suppressed”. It is common to go with the “latest” assembly.
  • Column 20: “ftp_path”: the path to download the files from. Within the assembly directory, you can use the wildcard file ending “*_genomic.fna.gz” to select the FASTA assembly file.
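Before filtering, it can help to see how the rows are spread across these fields. The following is a small sketch rather than part of the original workflow: it tallies the values of column 12, assuming the summary was saved as a tab-separated file and that its comment lines start with #. The function name count_levels is made up here.

```shell
# count_levels FILE (hypothetical helper): print "assembly_level<TAB>count"
# for every assembly_level value (column 12) found in FILE.
# Lines starting with '#' (README pointer, column header) are skipped.
count_levels() {
    awk -F '\t' '!/^#/ { n[$12]++ } END { for (k in n) printf "%s\t%d\n", k, n[k] }' "$1" | sort
}
```

For example, `count_levels assembly_summary.txt` prints one line per assembly level, which shows how many rows a “Complete Genome” filter would keep.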

In the Bash terminal, you can use the following command to obtain the FTP paths for downloading the references.
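A sketch of such a command, assuming the summary was saved as assembly_summary.txt. It keeps rows whose column 12 is exactly “Complete Genome” and writes only column 20 (the FTP path) to the output file. The function name complete_genome_paths is made up for this example; further conditions on columns 11 or 14 could be added to the same awk pattern.

```shell
# complete_genome_paths FILE (hypothetical helper): print the ftp_path
# (column 20) of every row whose assembly_level (column 12) is "Complete Genome".
complete_genome_paths() {
    awk -F '\t' '$12 == "Complete Genome" { print $20 }' "$1"
}

# Write the selected paths to a file, if the summary has been downloaded.
if [ -f assembly_summary.txt ]; then
    complete_genome_paths assembly_summary.txt > assembly_summary_complete_genomes.txt
fi
```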

Here we chose Column 12 to be “Complete Genome”. The assembly_summary_complete_genomes.txt file now holds the FTP paths to download the genomes from, one per line. You can check its first 3 lines with head -n 3 assembly_summary_complete_genomes.txt.

Downloading the FASTA Files

It is better to create a separate folder for the references using the following command.
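For instance, assuming the folder is simply called references (the name is a choice made for this sketch):

```shell
# Assumption: "references" is the folder name chosen for this sketch.
# -p makes the command succeed even if the folder already exists.
mkdir -p references
```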

Now you can execute the following command to download the gzipped FASTA files.
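One way to do this is sketched below, assuming wget is installed and that the FTP paths sit one per line in a text file. Instead of the wildcard, the sketch builds the exact file name: the genomic FASTA is named after its assembly directory, whereas the “*_genomic.fna.gz” wildcard can also match companion files such as *_cds_from_genomic.fna.gz in directories that contain them. The names genome_url and download_genomes are made up for this example.

```shell
# genome_url FTP_PATH (hypothetical helper): build the URL of the genomic
# FASTA inside an assembly directory: <dir>/<dirname>_genomic.fna.gz
genome_url() {
    dir="${1%/}"    # drop a trailing slash, if any
    printf '%s/%s_genomic.fna.gz\n' "$dir" "${dir##*/}"
}

# download_genomes LIST OUTDIR (hypothetical helper): fetch one gzipped
# FASTA into OUTDIR for every FTP path listed in LIST.
download_genomes() {
    mkdir -p "$2"
    while IFS= read -r path; do
        [ -n "$path" ] || continue    # skip blank lines
        # -c resumes partial downloads, -q keeps the output quiet,
        # -P sets the directory the file is saved in.
        wget -c -q -P "$2" "$(genome_url "$path")"
    done < "$1"
}
```

Running `download_genomes assembly_summary_complete_genomes.txt references` would then fetch the files one by one; because of -c, an interrupted run can be restarted with the same command.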

You will now see a .gz file for every assembly that satisfies the condition specified in the earlier awk command.

You will need to either extract them or gather them into a single file for downstream analysis. We suggest that readers have a look at the following article to see how to make a BLAST database.
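Gathering the files can be sketched as follows, assuming the downloads end in _genomic.fna.gz and that gzip and find are available. The function name merge_fasta is made up for this example. find with -exec is used rather than a shell glob so that a very large number of files does not exceed the command-line length limit; the order in which the genomes are concatenated is not defined.

```shell
# merge_fasta DIR OUT (hypothetical helper): decompress every
# *_genomic.fna.gz file under DIR and concatenate the results into OUT.
merge_fasta() {
    find "$1" -type f -name '*_genomic.fna.gz' -exec gzip -dc {} + > "$2"
}
```

For example, `merge_fasta references all_complete_genomes.fna` leaves the .gz files untouched and writes a single multi-FASTA file next to them.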

We hope this article will help future researchers obtain the datasets needed for their bacterial studies. Happy reading. Cheers!
