How to Download Nucleotide Sequences from NCBI in 3 Minutes

Alice Heiman
3 min read · Jan 16, 2024


The National Center for Biotechnology Information (NCBI) provides a wealth of open genomic and biomedical data.

To download a nucleotide sequence, first navigate to https://www.ncbi.nlm.nih.gov/nucleotide/. In the search box, perform a nucleotide search by entering the scientific name of the species you are interested in.

Using the nucleotide search in the NCBI database.

For example, to search for tulip nucleotide sequences, I would search for the genus Tulipa. In this case, I get over 2800 results. Each sequence is associated with an accession number (a unique identifier given to the sequence) and other information such as the length of the sequence in base pairs (bp).

An example nucleotide sequence for Tulipa.

The actual nucleotide sequences are formatted using two main file types:
- FASTA
- GenBank

A FASTA record starts with a header line beginning with “>”, followed by the accession number and a short description. The nucleotide sequence itself follows on the lines after the header.

An example of a FASTA file.
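
To make the format concrete, here is a minimal sketch of reading FASTA text with plain Python. The function names (parse_fasta, _split_header) and the record shown are made up for illustration, and the accession XX000000.1 is a placeholder rather than a real NCBI record. In practice, Biopython (covered further down) handles this for you.

```python
def _split_header(header, chunks):
    # The identifier is everything up to the first space in the header line;
    # the rest of the line is a free-text description.
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append(_split_header(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are wrapped, so they are joined back together later.
            chunks.append(line)
    if header is not None:
        records.append(_split_header(header, chunks))
    return records


# Placeholder record, not real data.
example = """>XX000000.1 placeholder description
ATGGCCATTG
TAATGGGCCG
"""

for identifier, description, sequence in parse_fasta(example):
    print(identifier, len(sequence))
```

Note that the sequence is usually wrapped over several lines, which is why the sketch joins them before returning.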

In addition to the sequence itself, GenBank files provide extra information, such as where and when the sequence was obtained and by whom.

An example of a GenBank file.

To download the file in either format, click the “Send to” dropdown and choose the following options:
- “Complete Record”
- “File”
- FASTA or GenBank

Download box.

FASTA files will be saved as sequence.fasta while GenBank files will be saved as sequence.gb.
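
If you would rather skip the browser, the same record can, as far as I know, also be requested through NCBI's E-utilities efetch endpoint. The sketch below is not part of the steps above: it only builds the request URL, the function name build_efetch_url is my own, and the accession XX000000.1 is a placeholder that you would swap for a real accession number from your search results.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, file_format="fasta"):
    """Build an efetch URL for one nucleotide record.

    file_format is "fasta" or "gb" (GenBank), mirroring the two
    download options in the "Send to" menu.
    """
    if file_format not in ("fasta", "gb"):
        raise ValueError("file_format must be 'fasta' or 'gb'")
    params = {
        "db": "nuccore",        # the nucleotide database
        "id": accession,
        "rettype": file_format,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


# Placeholder accession, replace with a real one before fetching.
print(build_efetch_url("XX000000.1", "gb"))
```

Opening the printed URL in a browser, or fetching it with urllib.request.urlopen, should return the record as plain text. NCBI asks heavy users of this service to limit their request rate, so this is best kept to a handful of records at a time.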

To load and analyze the sequences with code, we can use Biopython, a popular Python library for bioinformatics.

To install Biopython, run:

pip install biopython

Then, open a new Python file and type the following:

from Bio import SeqIO

# for GenBank files
for seq_record in SeqIO.parse("sequence.gb", "genbank"):
    print(seq_record.id)  # accession number
    print(len(seq_record))  # length of sequence
    print(seq_record.seq)  # sequence

# for FASTA files
for seq_record in SeqIO.parse("sequence.fasta", "fasta"):
    print(seq_record.id)  # accession number
    print(len(seq_record))  # length of sequence
    print(seq_record.seq)  # sequence

The seq_record.seq object contains the actual sequence.
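
Because the sequence can be turned into an ordinary string with str(seq_record.seq), simple statistics need nothing beyond plain Python. As a small sketch, here is one way to compute GC content (the fraction of bases that are G or C). The function name gc_content and the short test string are my own and are not taken from an NCBI record.

```python
def gc_content(sequence):
    """Return the fraction of bases in the sequence that are G or C."""
    # str() lets this accept either a plain string or a Biopython Seq object.
    sequence = str(sequence).upper()
    if not sequence:
        return 0.0  # avoid dividing by zero on an empty sequence
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


# Made-up 10-base example: 3 G + 2 C out of 10 bases.
print(gc_content("ATGGCCATTG"))
```

Inside the loops above you would call it as gc_content(seq_record.seq). Ambiguous bases such as N are counted in the length but not as G or C, so sequences with many of them will give a lower value.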

For further analysis, you can keep using Biopython, for example to transcribe and translate sequences. For more information on Biopython, visit https://biopython.org/.
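
As a rough sketch of what transcription and translation look like, the snippet below uses Biopython's Seq object. The short DNA string is a toy coding sequence chosen for illustration, not something downloaded from NCBI. With a real record you would start from seq_record.seq instead, and you would need to know where the coding region begins for the translation to be meaningful.

```python
from Bio.Seq import Seq

# Toy coding sequence (39 bases, so it splits evenly into 13 codons).
coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")

# Transcription: the coding strand with every T replaced by U.
messenger_rna = coding_dna.transcribe()

# Translation with the standard genetic code; "*" marks a stop codon.
protein = coding_dna.translate()

print(messenger_rna)  # AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
print(protein)        # MAIVMGR*KGAR*
```

By default translate() reads through stop codons and marks them with "*". Passing to_stop=True stops at the first one instead.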

And that’s it!


Thank you for reading!

Cover photo by Sangharsh Lohakare on Unsplash.
