Using the NCBI BLAST Interface For Identifying Homologous Sequences In Differentiating Proteins Inside Arabidopsis thaliana

Karthik Mittal
Published in Analytics Vidhya
7 min read · Apr 12, 2021
An estimated 5–100 million species live on planet Earth! Source: Discovery

It’s extraordinary how many species of organisms currently live on Earth. Estimates run upwards of 10 million species, spread across the world’s different biomes. This biodiversity is astonishing, but it makes organization and classification difficult.

Finding genetic similarities between organisms can be extremely difficult, as a researcher has to compare enormous amounts of sequence data for each individual species. Factor in the sheer number of species living on Earth and the relatively slow runtime of older search algorithms, and in many cases a search for similar sequences becomes impractical.

Inside the NCBI BLAST algorithm search page. Credit: widdowquin.github.io

Luckily, with the BLAST algorithm, the NCBI offers a way of discovering regions of local similarity between sequences from different species quickly and at low computational cost. This functionality is incredibly useful for researchers who want to search through large genetic sequences but do not have the computational power to do so themselves.

In this article, I’ll be talking about the fundamentals of navigating the NCBI platform and how I used the NCBI BLAST functionality to make intriguing inferences about Arabidopsis thaliana, specifically its beta-catenin repeat protein.

Why Arabidopsis thaliana?

Credit: iNaturalist

Arabidopsis thaliana, more commonly referred to as thale cress, is a common starting point for introductory scientific research because of the accessibility and simplicity of its genome. It grows extremely quickly, has a small genome (~114.5 Mb), and has been studied extensively.

I focused on beta-catenin repeat proteins because of their importance for the proliferation and differentiation of cells. They play a significant role in stem cell renewal in Arabidopsis thaliana.

Being interested in the field of stem cells, I wanted to go further: find sequences homologous to the beta-catenin repeat protein and see whether similar features appear in other, distinct organisms.

A visual representation of the beta-catenin repeat protein. Credit: Journal of Cell Science

Identifying these homologs could point the way toward increasing or reducing cell proliferation in this plant through similar proteins that are not native to Arabidopsis thaliana. Building on this, similar proteins that are absent from thale cress could help pinpoint where its functionality relies mainly on external proteins.

This use of foreign proteins could have immense implications for genetic research in the field; therefore, seeing how accessible the NCBI BLAST platform was, I decided to experiment and look for similar proteins.

Overview of NCBI

Credit: UC Berkeley Library Update

The NCBI hosts one of the largest repositories of biological data, containing genetic information from all forms of life on Earth. Entire genomes are organized and cross-referenced for easy accessibility to scientists.

If you know what you are looking for and what you want to get out of the results, the NCBI can be an extremely useful tool for computational biologists interested in formulating novel ideas in the field.

The NCBI uses accession numbers, which are unique identifiers for particular sequence records, to identify the sequences in its databases. An accession number doesn’t change even if the record is modified at the submitter’s request. Instead, a version number (incrementing by one with each update) is appended to the accession number to track the history of the record in NCBI. An example is GA1020304.2, where the accession number is followed by ‘.2’ (the second version).
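To make the accession.version convention concrete, here is a minimal Python sketch that splits an identifier into its stable part and its version number. The names Accession and split_accession are my own illustration, not part of any NCBI tool, and the sketch assumes the version is always a plain integer after a single dot.

```python
from typing import NamedTuple, Optional


class Accession(NamedTuple):
    base: str               # stable identifier, e.g. "NP_001318308"
    version: Optional[int]  # None when no ".N" suffix is present


def split_accession(text: str) -> Accession:
    """Split 'ACCESSION.VERSION' into its stable part and its version number."""
    base, dot, suffix = text.strip().partition(".")
    if not base:
        raise ValueError("empty accession")
    if not dot:
        # No version suffix at all, e.g. "NP_001318308".
        return Accession(base, None)
    if not suffix.isdigit():
        raise ValueError(f"version must be an integer, got {suffix!r}")
    return Accession(base, int(suffix))


print(split_accession("GA1020304.2"))
print(split_accession("NP_001318308"))
```

Keeping the version separate makes it easy to check whether two identifiers point at the same record in different revisions.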

What you’re looking for can be searched through NCBI’s database by this accession number or by gene keywords (e.g. ubiquitin-protein ligase).

For this article, we’ll be focusing on the beta-catenin repeat protein from Arabidopsis thaliana (thale cress). Note that this particular protein and species can be swapped for any protein in the NCBI database. This protein has accession number NP_001318308.
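I did everything in this article through the website, but the same record can also be requested programmatically through NCBI's E-utilities. As a rough sketch, the snippet below only builds an EFetch URL for the accession above and does not send any request. The helper name efetch_fasta_url is hypothetical, and choosing db=protein is my assumption based on the NP_ prefix.

```python
from urllib.parse import urlencode

# Base address of the E-utilities EFetch endpoint.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "protein") -> str:
    """Build (but do not send) an EFetch URL that asks for a FASTA record."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return f"{EFETCH_URL}?{urlencode(params)}"


url = efetch_fasta_url("NP_001318308")
print(url)
```

Pasting the printed URL into a browser should return the record as plain FASTA text, assuming the accession is still live.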

Snapshot of the beta-catenin repeat protein genomic region. Source

What I find so intriguing about NCBI is that it shows the genomic region encoding that specific protein, including information like the exons, the conserved domains, and even the gene’s location in the genome.

The most commonly used format for these nucleotide sequences is the GenBank flat file format, which contains information such as the taxonomy from the NCBI database and the sequence itself.

Implementing the BLAST Algorithm

Note: If you’re not already familiar with the BLAST algorithm’s inner workings, then I highly suggest that you check out my previous article, explaining the BLAST algorithm and its role in string matching.

Terminology

Before going over the implementation of the BLAST algorithm, it’s important to note how sequences can be deemed as “similar”, meaning that they share a significant number of nucleotides with one another.

Simplistic visual representation of the different sequence types. Credit: Wikipedia

Homologous sequences are sequences related through common ancestry. Orthologous sequences are homologs separated by a past speciation event (when an evolutionary lineage splits), and paralogous sequences are homologs separated by a past duplication event.

From the query sequence entered by the user, BLAST finds HSPs, or high-scoring segment pairs: subsequence matches between the query sequence and a database sequence.

BLAST checks for local alignment (over a portion of the sequence, not the whole) rather than global alignment. It reports sequence similarity, from which homology is inferred; on its own it does not tell orthologous sequences apart from paralogous ones, and homology in general is what interests most scientists.
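To show what "local" means in practice, here is a small sketch of the Smith-Waterman algorithm, the exact dynamic-programming method for local alignment. This is not what BLAST runs internally (BLAST uses a faster seed-and-extend heuristic that approximates it), and the scoring values (+2 match, -1 mismatch, -2 gap) are arbitrary choices of mine.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (Smith-Waterman, linear gaps)."""
    best = 0
    prev = [0] * (len(b) + 1)  # scores for the previous row of the table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 lets an alignment start anywhere: this is the
            # step that makes the alignment local instead of global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# "ACGT" sits inside the longer string, so only that region is scored.
print(local_alignment_score("ACGT", "TTACGTTT"))  # 8
print(local_alignment_score("AAAA", "CCCC"))      # 0
```

The flanking T's in the second sequence cost nothing, which is exactly why a short query can still score well against a long database sequence.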

Steps for BLAST

Another simplistic method of using BLAST. Credit: Guide on the Side

On the Search NCBI page, search your accession number (in our case, NP_001318308) against all databases. Then click the gene link to visualize the mRNA and protein tracks for that protein.

Use the “Run BLAST” link in the “Analyze this sequence” part of the webpage. Fill in the requisite information depending on what you want to get out of this search and click the BLAST button to test it out!

What the BLAST query page looks like.

For my particular project, I made a few notable changes:

  • I changed the default Database to Other (nr/nt) since our sequence is non-human (the thale cress being a small flowering plant).
  • I changed Program Selected/Optimized to Somewhat similar sequences (to specifically use the BLASTN algorithm).
  • I entered a FASTA sequence at the top instead of the usual accession number. This is simply the nucleotide sequence itself in plain text, and it is handy when you only want to search with a specific part of the sequence.
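For anyone who hasn't seen FASTA before: it is a plain-text format with a header line starting with ">" followed by one or more lines of sequence. A minimal parser might look like the sketch below. The function name parse_fasta and the toy record are made up for illustration; the record is not the real thale cress sequence.

```python
from typing import Dict


def parse_fasta(text: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its sequence, joined across lines."""
    records: Dict[str, str] = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = ""
        elif header is None:
            raise ValueError("sequence data found before any '>' header line")
        else:
            # Sequence lines are concatenated and normalized to upper case.
            records[header] += line.upper()
    return records


# Hypothetical toy record, not the real thale cress sequence.
example = """>toy_query made-up nucleotide fragment
ACGTACGTAC
ggttaacc
"""
print(parse_fasta(example))
```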

The BLAST HSPs were quantified in three different ways, but I mainly focused on the E value, or expect value: the number of distinct alignments with a score at least as high as the one observed that would be expected to occur simply by chance. This means the E value falls exponentially as the score rises. The formula is E = m · n · 2^(-S), with S being the bit score (the normalized alignment score), m being the length of the query sequence, and n being the total length of the database. For more information, check this link.
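To get a feel for the formula, here is a minimal Python sketch of E = m · n · 2^(-S). The function name expect_value and the example numbers are my own illustration, not NCBI code, and the real calculation also applies corrections (such as effective sequence lengths) that this ignores.

```python
def expect_value(query_len: int, db_len: int, bit_score: float) -> float:
    """E = m * n * 2^(-S), with S taken to be the bit score."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Each extra bit of score halves the expected number of chance hits.
print(expect_value(100, 1000, 10))       # 97.65625
print(expect_value(100, 1000, 11))       # 48.828125

# With a very high score the value underflows to exactly 0.0 in floating point.
print(expect_value(1000, 10**9, 2000))   # 0.0
```

The last line hints at why a results table can show an E value of 0.0: the true value is not literally zero, just too small to display.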

Project Results

BLASTN output descriptions for Arabidopsis thaliana beta-catenin repeat protein.

As the picture above shows, multiple sequences had E values of 0.0. Lower E values correspond to higher scores, and BLAST displays extremely small E values as 0.0, so these matches are essentially never expected by chance. That suggests sequences that are nearly identical to the query over the aligned region.

Many of the sequences found above were already closely related to Arabidopsis thaliana; however, a few came from more distant evolutionary relatives, with distinct differences, even though they had E values of zero.

These specific homologous sequences are currently a point of research that I’m exploring as I strive to go deeper and discover whether the sequences found by the BLAST algorithm have any functional significance.
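If you would rather sift through hits in code than by eye, a sketch like the one below filters a hit table by E value. It assumes the 12 default tab-separated columns of command-line BLAST+ tabular output (-outfmt 6), which is not the web page I used. The names Hit, parse_tabular_hits and significant_hits, the 1e-10 cutoff, and every row of data are hypothetical placeholders, not my actual results.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    subject_id: str
    percent_identity: float
    evalue: float
    bit_score: float


def parse_tabular_hits(text: str) -> List[Hit]:
    """Parse the 12 default tab-separated columns of BLAST+ tabular output."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        cols = line.split("\t")
        if len(cols) < 12:
            raise ValueError(f"expected 12 columns, got {len(cols)}")
        # Columns: 1 = subject id, 2 = % identity, 10 = E value, 11 = bit score.
        hits.append(Hit(cols[1], float(cols[2]), float(cols[10]), float(cols[11])))
    return hits


def significant_hits(hits: List[Hit], max_evalue: float = 1e-10) -> List[Hit]:
    """Keep hits at or below the cutoff, strongest (highest bit score) first."""
    kept = [h for h in hits if h.evalue <= max_evalue]
    return sorted(kept, key=lambda h: h.bit_score, reverse=True)


# Hypothetical rows with placeholder identifiers; these are NOT real BLAST results.
rows = [
    ["query1", "subject_A", "100.000", "500", "0", "0", "1", "500", "1", "500", "0.0", "924"],
    ["query1", "subject_B", "91.200", "250", "22", "0", "10", "259", "40", "289", "3e-90", "335"],
    ["query1", "subject_C", "78.000", "50", "11", "0", "5", "54", "7", "56", "0.002", "42.1"],
]
table = "\n".join("\t".join(r) for r in rows)

for hit in significant_hits(parse_tabular_hits(table)):
    print(hit.subject_id, hit.evalue, hit.bit_score)
```

Sorting by bit score rather than E value keeps a stable order among hits that all display as 0.0.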

Sequence to sequence output alignments for the beta-catenin repeat protein.

TL;DR

  • The NCBI hosts a wide variety of genetic information where specific proteins can be accessed by researchers through accession numbers.
  • My project mainly focused on Arabidopsis thaliana, specifically the beta-catenin repeat protein and how the BLAST algorithm can be used to find homologous sequences.
  • Using the NCBI BLAST functionality, homologous sequences were found. Surprisingly, some sequences with an E value of zero (a highly significant, near-perfect alignment) did not belong to Arabidopsis thaliana.
  • Further research is being done on these sequences to see whether they can be implemented as a substitute for the beta-catenin repeat protein.

Additional Resources

Hi! I am a 16-year-old currently interested in the fields of machine learning and biotechnology. If you are interested in seeing more of my content and what I publish, consider subscribing to my newsletter! Check out my March newsletter here! Also, check out my LinkedIn and Github pages. If you’re interested in personal mindsets or just stuff in general, sign up for a chat using my Calendly.
