Open Biopipeline

Published in

The Startup

6 min readMay 30, 2020

What Is Open Biopipeline

Open Biopipeline is an open source bioinformatics tool for general and broad purposes, specifically the goal of this project is to develop a tool for researchers or students to use in order to investigate genes. This tool combines several popular existing bioinformatic tools, such as BLAST, KEGG, GO, and Protein Atlas, by consolidating them into one singular location. By using this tool you are able to retrieve: function gene annotation, protein function, clinical relevance, specific patient case information, and much more. The aim of this article is to guide you through some of the specific functions that Open Biopipeline has to offer as well as explain the demand and need for such a product.

The motivation of this project is: to help researchers conduct very easy gene analysis, help students better understand the process of building pipelines, and consolidate several tools into one.

What is Open Biopipeline for?

Open Biopipeline is designed to do and help with the following items:

Investigate target molecules for potential therapeutics
Discover protein function and its expression in clinical cases
Expand and illustrate the specific pathway of genes using KEGG analysis
Provide a foundation for building your own custom pipeline
Enable real-time analysis for nucleotide sequences
Identify upstream and downstream molecules for drug targeting

What problem does Open Biopipeline solve?

Typically analysis in complex distributed experiments take time and computation of several desired outputs take time and will require the process to be repeated at some point. By using this specific tool you circumvent the need to preform biological analysis on a case by case basis.

For example, for an analysis that depends on 6 services where each service takes ~10 min to analyze, read and get ready, here is what you can expect:

1 service takes 10 min
6 services take 60 min
Setup of Open Biopipeline takes 5 min
Open Biopipeline cuts analysis time significantly

Reality is generally worse so individually accessing each tool can take longer, whereas Open Biopipeline is consistent, free and quick.

Refer to figure 5 for a sample pipeline that can, and has been, built with this package.

Some Features of Open Biopipeline

Basic Local Alignment Search (BLAST) and Pairwise Sequence Alignment

def blast():
    for file in os.listdir('fasta'):
         filename = os.fsdecode(file)
         if filename.endswith(".fasta"):
             print(filename)
             fasta_file.append(filename)    record = SeqIO.read('fasta/'+fasta_file[0], format="fasta")    result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)    blast_result = open("results.xml", "w")
    blast_result.write(result_handle.read())
    blast_result.close()
    result_handle.close()

Figure 1. Block of code describing a BLAST search for folder of fasta files

Kyoto Encyclopedia of Genes and Genomes (KEGG)

def get_pathway(protein_name):
    items = scrape('https://www.kegg.jp/kegg-bin/search_pathway_text?map=map&keyword='+protein_name+'&mode=1&viewImage=true','img')
    url_ext = []
    for i in range(len(items)):
        #print(i,items[i])
        url_ext.append(items[i].parent['href'])
        #print(url_ext[i])images = scrape('https://www.kegg.jp'+url_ext[0],'img')urllib.request.urlretrieve('https://www.kegg.jp'+images[2]['src'], 'pathway_img/'+protein_name+"_pathway.jpg")return 'https://www.kegg.jp'+images[2]['src']

Figure 2. Code that webscrapes online database for cell signal cascade

**Figure 3.** Image of cell signal cascade for VEGFA gene (result from figure 2)

The above diagram can be become incredibly useful when researching an entire genome, since several protein can be queried at once. This particular tool utilizes a database from KEGG which is able to display complex pathways quickly and efficiently, in addition to saving it to your local system. Moreover the output from this tool can be combined with inputs to other tools so as to more information. For example, by passing in a target gene name (one that can be obtained from the pathway image) the function can then pull relevant protein annotation and even relevant clinical data.

Protein Atlas

def protein_atlas_parser(xml_file):
    root = ET.parse(xml_file).getroot()for i in range(len(root)):
        try:
            prot_name = root[i][0].text
            tissue_expr = root[i][20][3][0].text
            desc = root[i][7][1].get('description')
            path_expr =  root[i][8][0].text        except:
            print("\nAn error has been thrown, please handle\n")print(i,'Protein Name:', prot_name,'\n  Tissue expression summary: ',tissue_expr,'\n  Description: ',desc,'\n  RNA Cancer Specificity:', path_expr)

Figure 4. XML parser function for clinical data and patient case information (only displays cancer relevance)

Example Use Case of Open Biopipeline for Investigating Cancer Therapeutic Target / Tutorial

**Figure 5.** Flow chart diagram of bioinformatic pipeline, displaying flow of input/outputs

Starting with an unknown sequence of nucleotides the program is fed into the BLAST algorithm to pairwise match the nucleotides for a specific gene. Refer to gene_items.md in my GitHub repository for the gene name and accession number of the example genes given. The BLAST function looks for a directory called fasta and attempts to sequence all of them and saves them as an xml file for parsing (current code only saves first fasta file). If you have nucleotide sequence as only .txt then there is convert_to_fasta.py file inside of the txt folder.

>>> from blast import * 
>>> blast()

Then parse the xml file that has just been generated.

>>> from xml_parser import blast_parser
>>> blast_parser(results.xml)
'VEGFA'
>>> VEGFA = 'VEGFA'

The next step is to understand the gene/protein function by using the UniProt database. By inputting a gene name, we can look for the specific amino acid sequence AND the protein function. You can see here that we don’t use gene ontologogy (GO) directly in our analysis since functional annotation is not only partially given by UniProt, but one can retrieve GO information from UniProt links.

>>> from fetch_uniprot_metadata import *
>>> gene_entry, genes = protein_entry(VEGFA)
>>> protein_function = protein_function(gene_entry) # saves function as string
>>> amino_acid_seq = protein_AASeq(gene_entry) # amino acid sequence

Next, we preform pathway analysis on the targeted gene so one can identify the cellular response of the specific cell cascade. This helps us gain understanding of the upstream and downstream molecules; moreover, this can aid in therapeutic drug research.

>>> from get_kegg_pathway import get_pathway
>>> get_pathway(VEGFA)

Refer to figure 3 for the pathway interactions of the vascular endothelial growth factor A (VEGFA) gene.

Lastly, we pass through the gene name into Protein Atlas which gives us valuable and relevant information on the protein expression in patients. This proves very desirable since one can now look at specific RNA cancer expression or the description of the protein within the clinical context.

>>> from protein_atlas import get_atlas_xml
>>> get_atlas_xml(VEGFA)

This downloads an xml file in a new folder called protein_atlas under another new folder called xml which then needs to be parsed. The naming convention of the files is as follows: gene_protein_atlas_data.xml where gene is the name of the target gene.

>>> from xml_parser import protein_atlas_parser
>>> protein_atlas_parser('xml/protein_atlas/VEGFA_protein_atlas_data.xml')0 Protein Name: VEGFA
  Tissue expression summary:  Most cancers showed strong cytoplasmic immunoreactivity. Lymphomas were in general moderately stained.
  Description:  Antibody staining mainly consistent with RNA expression data. At least one protein variant secreted, tissue location of RNA and protein might differ and correlation is complex.  
  RNA Cancer Specificity: Low cancer specificity
...

Above is a sample output of the code that has been run.

When all these tools are used in conjunction, one can build a bioinformatic pipeline, specifically this case study was following the schematic shown in figure 6 (skipping the GO portion).

Overall, these function are powerful tools that can significantly cut down the time to query each tool individually and serves as an excellent tool for the introduction of both bioinformatics and python.

What Design Principles Underlie Open Biopipeline?

Open Biopipeline works by:

Expediting biological analysis and consolidating relevant gene information to one space
Using NCBI BLAST algorithm to pairwise match nucleotide
Using UniProt to fetch gene metadata
Web scraping through KEGG and Protein Atlas
Relevant clinical information as xml file
Storing all important input information to one single place

Please read my https://github.com/gitUmaru for an example of results and a better look at the code.