How to Download Nucleotide Sequences from NCBI in 3 Minutes
The National Center for Biotechnology Information (NCBI) provides a wealth of open genomic and biomedical data.
To download a nucleotide sequence, first navigate to https://www.ncbi.nlm.nih.gov/nucleotide/. In the search box, perform a nucleotide search by entering the scientific name of the species you are interested in.
For example, to search for tulip nucleotide sequences, I would search for the genus Tulipa. In this case, I get over 2800 results. Each sequence is associated with an accession number (a unique identifier given to the sequence) and other information such as the length of the sequence in base pairs (bp).
The actual nucleotide sequences are formatted using two main file types:
- FASTA
- GenBank
A FASTA record starts with a header line beginning with “>”, followed by the accession number and a short description. The nucleotide sequence itself follows on the lines after the header.
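To make the FASTA layout concrete, here is a minimal sketch of a reader written in plain Python. The `read_fasta` helper and the two example records (including their `XX00000…` accessions) are made up for illustration and are not real NCBI entries. In practice, a library parser is the more robust choice.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example records (assumed for illustration only).
example = """>XX000001.1 Hypothetical example sequence one
ATGGCCATTGTAATG
GGCCGCTGA
>XX000002.1 Hypothetical example sequence two
TTGACA
"""

for header, sequence in read_fasta(example):
    accession = header.split()[0]  # first token of the header line
    print(accession, len(sequence))
```

The accession number is simply the first whitespace-separated token after the “>”, and wrapped sequence lines are joined back into one string.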
GenBank files contain the sequence plus additional information, such as where and when the sample was collected and by whom.
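As a rough illustration of how a GenBank flat file is laid out, the sketch below pulls a few top-level fields and the sequence out of a tiny record. The record itself is invented (a hypothetical accession and placeholder values), and `read_genbank_fields` is an assumed helper name. It ignores indented continuation lines and the feature table, so treat it as a simplification rather than a full parser.

```python
def read_genbank_fields(text):
    """Extract top-level keyword lines and the sequence from one GenBank record."""
    fields = {}
    sequence = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # "//" marks the end of a record
        if in_origin:
            # Sequence lines: a position number followed by blocks of bases.
            sequence.extend(part for part in line.split() if not part.isdigit())
        elif line.startswith("ORIGIN"):
            in_origin = True  # the sequence section starts after this line
        elif line[:1].isalpha():
            # Top-level keywords start in the first column.
            keyword, _, value = line.partition(" ")
            fields[keyword] = value.strip()
    fields["sequence"] = "".join(sequence).upper()
    return fields


# Invented record for illustration only (not a real NCBI entry).
example_gb = """LOCUS       XX000001                  24 bp    DNA     linear   PLN 01-JAN-2000
DEFINITION  Hypothetical example record.
ACCESSION   XX000001
SOURCE      Tulipa sp.
ORIGIN
        1 atggccattg taatgggccg ctga
//
"""

record = read_genbank_fields(example_gb)
print(record["ACCESSION"], record["SOURCE"], len(record["sequence"]))
```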
To download the file in either format, click the “Send to” dropdown and choose the following options:
- “Complete Record”
- “File”
- FASTA or GenBank
FASTA files will be saved as sequence.fasta while GenBank files will be saved as sequence.gb.
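If you already know the accession number, the same download can also be scripted instead of clicked. The sketch below is one possible approach using NCBI's E-utilities `efetch` endpoint, not the method described above. The helper names and the `XX000001.1` accession are placeholders, and `download_record` is defined but not called here because it needs network access.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession, rettype="fasta"):
    """Build an efetch URL for one nucleotide record.

    rettype is "fasta" for FASTA or "gb" for GenBank flat-file text.
    """
    params = {
        "db": "nuccore",      # the Nucleotide database
        "id": accession,
        "rettype": rettype,
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def download_record(accession, filename, rettype="fasta"):
    """Fetch one record over the network and save it to filename."""
    with urlopen(build_efetch_url(accession, rettype)) as response:
        text = response.read().decode("utf-8")
    with open(filename, "w") as handle:
        handle.write(text)
    return filename


# Placeholder accession; replace it with one from your own search results.
print(build_efetch_url("XX000001.1"))
```

Calling `download_record("<accession>", "sequence.fasta")` would then produce a file comparable to the one saved from the website.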
To load and analyze the sequences with code, we can use Biopython, a popular Python library for bioinformatics.
To install Biopython, run:
pip install biopython
Then, open a new Python file and type the following:
from Bio import SeqIO

# for GenBank files
for seq_record in SeqIO.parse("sequence.gb", "genbank"):
    print(seq_record.id)  # accession number
    print(len(seq_record))  # length of sequence
    print(seq_record.seq)  # sequence

# for FASTA files
for seq_record in SeqIO.parse("sequence.fasta", "fasta"):
    print(seq_record.id)  # accession number
    print(len(seq_record))  # length of sequence
    print(seq_record.seq)  # sequence
Each seq_record is a SeqRecord, and its seq_record.seq attribute is a Seq object that holds the actual sequence.
For further analysis, you can continue using Biopython, for example to transcribe and translate the sequence. For more information on Biopython, visit https://biopython.org/.
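Biopython's Seq objects provide `transcribe()` and `translate()` methods for these steps. To show what they do conceptually, here is a plain-Python sketch that assumes the input is the coding strand and uses the standard genetic code. The function names and the short example sequence are made up for illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace every T with U."""
    return dna.upper().replace("T", "U")


def translate(dna):
    """Translate coding-strand DNA codon by codon; '*' marks a stop codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    protein = []
    for i in range(0, usable, 3):
        # Codons with ambiguous bases (such as N) are reported as 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


# Made-up coding sequence for illustration.
coding_dna = "ATGGCCATTGTAATGGGCCGCTGA"
print(transcribe(coding_dna))
print(translate(coding_dna))
```

To apply this to a downloaded record, you could pass `str(seq_record.seq)` in place of `coding_dna`.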
And that’s it!
Thank you for reading!
Cover photo by Sangharsh Lohakare on Unsplash.