Genbank Connect

Published in

Algorithms For Life

6 min readDec 14, 2017

In a previous article, I mentioned that I had written a script that obtained FASTA sequences from the GenBank database. In this article, I’ll be going over my approach to solving this problem, the code I wrote, and the rationale behind my problem solving/coding decisions.

The GenbankConnect Script - Version 1

Whenever I write code to solve a problem, my aim is to start simple. This enables me to determine an effective solution as quickly as possible. The first thing I do is to clearly write out the problem.

As you can see from the writing above, the problem is defined in one to two lines. Any time I’ve struggled with writing code, nine times out of ten, it’s because I wasn’t sure what the problem was.

With the problem defined, I move onto listing possible solutions. With the exception of the last solution, I try to keep it as realistic as possible :)

When coming up with solutions, I try to choose an option that produces a result in the shortest amount of time or with the least amount of work. This is usually done by writing out pros/cons or by visualizing the steps needed to implement a solution successfully.

Having chosen my solution I used that to produce a list of steps to get to my end goal, a list of FASTA sequences

Determine libraries to import
Store accession numbers and id’s
Produce a URL that takes an accession number along with GenBank info
Loop through each accession number and make a request to GenBank with that URL
Get the DNA sequence associated with that accession number
Format the DNA sequence to FASTA if necessary
Save FASTA sequences to a text file

Alright. So first step is figure out what libraries I need to import. Since I’m making HTTP requests to GenBank, a library to handle that would be great. In this case the Requests library fit the bill.

import requests

As far as I could tell that was really all I needed so on to step two! Storing the accession numbers and id’s.

The above image shows the data I was working with which was stored in a CSV file. In this case I was only concerned with column one, the accession number and column three, the origin of that fire ant sequence. I chose four rows at random and stored the data in columns one and four in an array. All four arrays were stored in a larger array.

acc_values = [["JF778997", "andorraMadriu"], ["JF779064", "austriaAchenkirch"], ["JF778998", "belgiumBryssel"], ["JF778877", "bulgariaVitosha"]]

With the accession numbers and locations stored the next step was to have a URL that Requests could use to connect to GenBank. Before writing any code I tried inputting an accession number — JF778997 into GenBank.

GenBank result for accession number JF778997

The URL generated was “https://www.ncbi.nlm.nih.gov/nuccore/JF778997”, which was a page that contained the associated DNA sequence. However, by using that URL with Requests, all the text on the page would be returned. Not very helpful and a lot of work to parse. Clicking the FASTA button wasn’t a lot of help either since I’d get the sequence for sure but a lot of extraneous text as well.

FASTA page for accession number JF778997

Luckily my google-fu is pretty good and it turns out that GenBank provides specific URL’s for getting different types of information!

Entrez Programming Utilities Help

E-utilities Introduction

www.ncbi.nlm.nih.gov

From the code below you can see that URL returns a sequence as text. Perfect! I did a quick test using Requests and got the associated FASTA sequence.

url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=JF778997&rettype=fasta&retmode=text"requests.get(url).text

Step four on my list was to loop through the array storing all the accession numbers to get a single one. Each time the for loop ran, it had to grab the next accession number and update the URL with it.

To accomplish this the URL was broken into two parts and then concatenated with the new accession number at the start of every iteration of the loop.

genbank_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id="for i,j in acc_values:                                                                 
    url = genbank_url + i + "&rettype=fasta&retmode=text"

From there, the URL was passed to Requests which connected to GenBank and returned the DNA sequence associated with the accession number in the URL.

def getPage(url):                        
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    return soup.stringresponse = str(getPage(url))

With the for loop taking care of steps four and five, step six was to format the DNA sequence.

newFasta = ">" + j + " " + header[1:] + "\n" + "".join(response.splitlines()[1:])
                                                                  fastas.append(str(newFasta))

Once formatted, the sequence was stored in an array. After all the accession numbers had been stored they were all written to a text file.

writeFile = open("hicksFastaData.txt", 'w')                       for i in fastas:                        
    writeFile.write(i + "\n\n")writeFile.close()

And this is the output!

Text file containing formatted FASTA sequences

While this was exciting this was only for four accession numbers. I still had 396 more to go through. As fun as copy pasting all the values into an array would be (#sarcasm) reading directly from the CSV file was a lot simpler.

f = open("", "r")                       
reader = csv.reader(f)

The code above takes the name of the CSV file that contains the accession numbers for all 400 fire ant samples. From there I stored each row in an array, similar to the storage method we used in our smaller example.

#stores every row in the csv as an array                       acc_info = []
for row in reader:
    acc_info.append([row])

The last step was to only get the info I needed and store it in another array. In this case I was only concerned with the accession number, the unique id of each sample and the city where the Fire Ant was taken. These values were stored in columns one, two and three respectively. Columns two and three were concatenated together.

#stores the accession numbers and the unique identifier + city of each fire ant sample                       
acc_values = [];                       
for data in acc_info:
    acc_values.append([data[0], data[1] + "_" + data[2].split(",")[0]])

Some you might notice that a lot of this could have been done in one step, which is definitely more efficient. But my approach really helped me to troubleshoot issues. If I had done everything in one step it would have been harder to pinpoint any issues that arose.

GenBankConnect Version 2. This now reads from a CSV.

Once that was done, the script was completed! Everything else worked the same so nothing else had to be changed.

This was a really fun piece of code to write. If you enjoyed reading this feel free to check out more of my articles.

Algorithms For Life

Making Cool Things

medium.com

And drop by my website to leave me a message!

Roshan Noronha

roshannoronha.com

Thanks for reading!

Genbank Connect

Having chosen my solution I used that to produce a list of steps to get to my end goal, a list of FASTA sequences

Entrez Programming Utilities Help

E-utilities Introduction

Algorithms For Life

Making Cool Things

Roshan Noronha

Written by Roshan Noronha