Introduction to Bioinformatics

The stepping stones to deciphering the secrets of life

Published in The Research Nest

8 min read · Oct 26, 2023

All images are created by the author using DALLE3 or Midjourney

From the discovery of DNA to modern sequencing technologies, numerous tools and techniques are at our disposal to decode biological mysteries.

Three key opportunities here:

Dive into past data to find patterns that might have been missed.
Invent new methods to handle and interpret the growing data more effectively.
Explore correlations between different datasets to uncover novel insights.

But what exactly is bioinformatics?

Imagine you have a massive library with billions of books, and you’re searching for specific information, like why some people have blue eyes or how certain plants resist pests. Searching by hand would take lifetimes.

Bioinformatics is like having a super-fast librarian with a computer. This librarian can swiftly search through all the books, find the needed information, and even predict where new information might be found.

Here’s how it helps:

Speed: Computers can analyze massive amounts of biological information rapidly.
Accuracy: Computers reduce the chance of human errors, making the findings more reliable.
New Discoveries: By spotting patterns in the data, bioinformatics can predict new findings, like potential drug targets or understanding diseases better.
Combining Data: It can blend information from different sources for a more holistic understanding. For instance, comparing the data of a healthy person and someone with a disease can highlight what’s going wrong.
Cost-Efficient: Instead of conducting expensive lab experiments repeatedly, researchers can first use bioinformatics to predict outcomes and then validate them in the lab, saving time and money.

Where’s all the biological data?

Here are some popular sources:

1. Genomic Databases

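GenBank: A massive collection of all publicly available DNA sequences, maintained by the NCBI in the United States.
ENA (European Nucleotide Archive): Similar to GenBank but maintained by the European Bioinformatics Institute.
DDBJ (DNA Data Bank of Japan): Japan’s counterpart to the two databases above. The three exchange data with one another, so a sequence submitted to one becomes available through all of them.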

2. Protein Databases

Protein Data Bank (PDB): It stores 3D structures of proteins, nucleic acids, and other biomolecules.
UniProt: A comprehensive database of protein sequences and annotations.

3. Gene Expression Databases

Some context: Nearly every cell in our body carries the same DNA, but not every gene in that DNA is “turned on” or active in every cell. Which genes are active (or “expressed”) determines the function of the cell. For instance, a liver cell has different active genes than a brain cell. Gene expression is the process by which the information in a gene is used to produce a functional product, usually a protein.
GEO (Gene Expression Omnibus): It archives and freely distributes microarray data (a microarray is a technology that measures the expression levels of thousands of genes at once), next-generation sequencing data, and other forms of high-throughput functional genomics data.

4. Pathway and Interaction Databases

Some context: At the core, our cells are bustling mini-cities where a lot of processes happen simultaneously. Genes get turned on and off, proteins are made, and these proteins interact with each other to perform tasks. Just like in a city, where there are roads, intersections, and signals guiding traffic, in cells, there are pathways and interactions that guide these processes.
KEGG (Kyoto Encyclopedia of Genes and Genomes): A database that provides information on systems, functions, and diseases using graphical maps.
BioGRID: Holds data on protein and genetic interactions.

5. Disease and Variation Databases

ClinVar: Contains information about genomic variation and its relationship to human health.
OMIM (Online Mendelian Inheritance in Man): A database that catalogs human genes and genetic disorders.

As you can see, there is a LOT of publicly available data to explore.
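Sequence downloads from databases like these commonly arrive as FASTA text: a header line starting with ">" followed by one or more lines of sequence. The article itself does not name a file format, so treat FASTA here as an assumption. Below is a minimal sketch of a parser; the function name parse_fasta and the two records are made up for illustration and are not real database entries.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {record_id: sequence}.

    The record id is taken as the first whitespace-separated token
    after the '>' on each header line (a simplifying assumption).
    """
    records: dict[str, list[str]] = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    # Sequences may be wrapped over several lines, so join the pieces.
    return {name: "".join(parts) for name, parts in records.items()}


# Made-up example records, not real database entries.
example = """>demo_seq_1 hypothetical DNA fragment
ATCGGCTAAC
GTAAGCTT
>demo_seq_2 another hypothetical fragment
GGCCTTAA
"""

for name, sequence in parse_fasta(example).items():
    print(name, len(sequence), sequence)
```

For real work, an established library such as Biopython is usually a better choice than a hand-written parser, since it handles many more formats and edge cases.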

What does the data actually look like?

The data in these databases may seem abstract, but it comes down to text, numbers, and sometimes visual representations. Here’s a basic idea of what it “looks” like:

Genomic Data: This is usually represented as long sequences of letters corresponding to the four nucleotide bases of DNA: A (adenine), T (thymine), C (cytosine), and G (guanine). For example:

ATCGGCTAACGTAAGCTT...
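Because a DNA sequence is just a string of letters, simple questions about it can be answered with ordinary string operations. The sketch below takes the example fragment above (without the trailing dots) and computes three common quantities: the count of each base, the GC content (the fraction of bases that are G or C), and the reverse complement (the sequence of the opposite strand). The function names are illustrative choices, not part of any library.

```python
from collections import Counter

# Maps each base to its pairing partner on the opposite strand.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def base_counts(seq: str) -> Counter:
    """Count how often each letter occurs in the sequence."""
    return Counter(seq.upper())


def gc_content(seq: str) -> float:
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.upper().translate(COMPLEMENT)[::-1]


fragment = "ATCGGCTAACGTAAGCTT"
print(base_counts(fragment))
print(round(gc_content(fragment), 3))   # 8 of 18 bases are G or C -> 0.444
print(reverse_complement(fragment))     # AAGCTTACGTTAGCCGAT
```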

Protein Data: These are like genomic data but use different letters for the 20 amino acids. For example:

METGFSAKIRL...

Proteins are also represented via 3D Structures. These are visual representations, often colorful diagrams, showing the complex folded structure of proteins. They can be explored using specialized software.
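The link between the DNA alphabet and the protein alphabet is the genetic code: each group of three bases (a codon) stands for one amino acid, and a few codons mean "stop". Here is a minimal sketch of translation using the standard genetic code. It assumes the input is a coding sequence read from its first base; the input DNA string is made up so that it yields the first letters of the protein example above.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding DNA sequence into one-letter amino acid codes.

    Translation starts at the first base and stops at the first stop codon.
    Codons containing unexpected letters become 'X'; trailing bases that do
    not fill a whole codon are ignored.
    """
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Made-up coding sequence: ATG GAA ACC GGC TTT TAA
print(translate("ATGGAAACCGGCTTTTAA"))  # METGF
```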

Gene Expression Data: This can look like a table or spreadsheet where genes are listed on one axis and various conditions or time points on another. The cells of the table have numbers indicating how much each gene is “turned on” or “off.”
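To make the table idea concrete, here is a small sketch that stores expression values per gene and condition and compares two conditions using the log2 fold change, a common way to express how much a gene goes up or down. The gene names, the numbers, the pseudocount, and the threshold of 1.0 (a two-fold change) are all made up for illustration.

```python
import math

# Made-up expression table: rows are genes, columns are conditions.
expression = {
    "geneA": {"control": 10.0, "treated": 40.0},
    "geneB": {"control": 25.0, "treated": 25.0},
    "geneC": {"control": 80.0, "treated": 20.0},
}


def log2_fold_change(treated: float, control: float, pseudocount: float = 1.0) -> float:
    """log2 of the treated/control ratio.

    A small pseudocount is added to both values to avoid dividing by zero.
    """
    return math.log2((treated + pseudocount) / (control + pseudocount))


def classify(change: float, threshold: float = 1.0) -> str:
    """Label a gene as 'up', 'down', or 'unchanged' for a given threshold."""
    if change >= threshold:
        return "up"
    if change <= -threshold:
        return "down"
    return "unchanged"


for gene, values in expression.items():
    change = log2_fold_change(values["treated"], values["control"])
    print(f"{gene}: log2 fold change = {change:+.2f} ({classify(change)})")
```

Real expression analyses also account for replicates and statistical significance; this sketch only shows the shape of the data and one basic comparison.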

What can you do with this data?

We can do a LOT.

Disease Understanding and Treatment
— Identify genetic mutations linked to diseases.
— Find targets for new drugs.
— Understand the genetic basis of disease susceptibility and resistance.
Drug Design and Development
— Predict how drugs will interact with their target proteins.
— Identify potential side effects by seeing what other proteins a drug might interact with.
— Accelerate drug testing by simulating its effects before lab trials.
Comparative Genomics
— Compare DNA sequences across different species to trace evolutionary relationships.
— Identify genes that are conserved (remain largely unchanged) across species, suggesting they have crucial functions.
— Discover genes that give species their unique traits.
Personalized Medicine
— Tailor medical treatments based on an individual’s genetic makeup.
— Predict an individual’s risk of developing certain diseases.
Agriculture
— Enhance crops by identifying genes for desirable traits like drought resistance or higher yield.
— Study pests and pathogens to develop more effective control measures.
Forensics
— Use DNA sequences to identify individuals or determine paternity.
— Trace the origin of outbreaks or bioterrorism acts by analyzing microbial genomes.
Ecology and Conservation
— Understand biodiversity by sequencing DNA from environmental samples.
— Track endangered species or study the genetics of populations to aid in conservation efforts.
Functional Genomics
— Understand what specific genes do, when they’re active, and how they interact with other genes.
— Predict the function of unknown genes based on similarity to known genes.
Structural Biology
— Study the 3D shapes of proteins and other biomolecules.
— Understand how changes in shape (due to mutations or drug binding) might affect function.
Synthetic Biology
— Design new biological systems or modify existing ones for beneficial purposes, like biofuel production or waste treatment.
Evolutionary Studies
— Trace the ancestry and migration of populations or species.
— Study the origin and spread of antibiotic resistance among microbes.

Every single point above can form the basis of a new and niche problem statement that you can create for yourself to explore. The key lies in combining biological data with innovative computational techniques to unlock insights and solutions that were previously unimaginable.
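As a taste of the first item in the list, identifying mutations, here is a minimal sketch that compares a sample sequence against a reference and reports every position where they differ. It assumes the two sequences are already aligned and of equal length, which real data rarely is without an alignment step first. The sequences and the function name are made up for illustration.

```python
def point_differences(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (position, reference_base, sample_base) for every mismatch.

    Positions are 1-based. Both sequences must already be aligned and
    have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Made-up sequences that differ at a single position.
reference = "ATCGGCTA"
sample = "ATCGACTA"

for position, ref_base, sample_base in point_differences(reference, sample):
    print(f"position {position}: {ref_base} -> {sample_base}")
```

Whether a difference like this matters for a disease is a separate question that needs annotation data, such as the records in ClinVar.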

Practical Examples

Let’s dive deep into the world of bioinformatics. In each tutorial, we’ll cover:

Building the intuition to think about a biological problem.
How to obtain and process the required biological data.
Implementing our approach or algorithm to analyze the data.
Interpreting and documenting our findings.

Here are some beginner-friendly ideas:

Analyzing DNA Sequence Similarity
Understand the basics of sequence alignment and find how similar two DNA sequences are.
Identifying Disease-Linked Genetic Mutations
Explore real genomic data to pinpoint mutations that might be associated with certain diseases.
Visualizing Protein Structures
Get a 3D look at proteins and understand their intricate shapes and functions.
Tracing Evolution with DNA
Compare sequences from different species to infer their evolutionary relationships.
Predicting Gene Functions
Based on known gene annotations, try to predict the functions of unknown or lesser-studied genes.
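For the first idea in this list, sequence similarity, one classic approach is global alignment with the Needleman-Wunsch algorithm: score matches positively, penalize mismatches and gaps, and use dynamic programming to find the best total score. The sketch below computes only the score, not the alignment itself. The scoring values (match +1, mismatch -1, gap -2) are arbitrary illustrative choices, and the helper percent_identity is a much simpler measure that only works for sequences of equal length.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of a and b (Needleman-Wunsch).

    Only the previous row of the scoring table is kept in memory.
    """
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            gap_in_b = previous[j] + gap      # a[i-1] aligned to a gap
            gap_in_a = current[j - 1] + gap   # b[j-1] aligned to a gap
            current.append(max(diagonal, gap_in_b, gap_in_a))
        previous = current
    return previous[-1]


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions with the same letter (equal-length input only)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and the same length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


print(global_alignment_score("ACGT", "ACGT"))  # 4: four matches
print(global_alignment_score("ACGT", "AGT"))   # 1: three matches, one gap
print(percent_identity("ACGT", "ACGA"))        # 75.0
```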

Here are more examples for inspiration if you are looking for something more specific and complex.

Explore the evolution of antibiotic resistance in a common pathogen over the past decade.
— Data: Archived genomic sequences of the pathogen from various time points.
Investigate the similarity in genes related to any function between humans and other mammals.
— Data: Genomic sequences of humans and select mammals (e.g., dogs, cats, mice).
Explore underutilized aquatic plants or algae with genomic indicators of high nutritional value that could be promoted as “superfoods.”
— Data: Genomic sequences of various aquatic plants and algae, existing nutritional data of known superfoods.
Examine the evolution of bioluminescence in deep-sea creatures.
— Data: Genomic or protein data from bioluminescent organisms and closely related non-bioluminescent species.
Predict the potential function of uncharacterized genes in a newly discovered deep-sea microbe by comparing them to known genes in terrestrial microbes.
— Data: Whole genome sequence of the deep-sea microbe and genomic databases of well-studied terrestrial microbes.
Investigate whether the metabolic pathways in E. coli can be modified to enhance the production of biofuel (like bioethanol) using genes from biofuel-producing algae.
— Data: Genomic and metabolic pathway data for E. coli and the biofuel-producing algae.
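Several of the ideas above start from the same basic step: measuring how different a set of sequences are from one another. A simple measure is the p-distance, the fraction of positions at which two aligned sequences differ. The sketch below computes it for every pair in a small set and reports the closest pair. The three sequences and their labels are invented for illustration and do not come from real organisms.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Invented, pre-aligned toy sequences.
toy_sequences = {
    "species_1": "ATCGGCTAAC",
    "species_2": "ATCGGCTTAC",
    "species_3": "ATGGACTTGC",
}

distances = {
    (first, second): p_distance(toy_sequences[first], toy_sequences[second])
    for first, second in combinations(toy_sequences, 2)
}

for (first, second), distance in distances.items():
    print(f"{first} vs {second}: {distance:.1f}")

closest_pair = min(distances, key=distances.get)
print("closest pair:", closest_pair)
```

A table of pairwise distances like this is the usual input to tree-building methods, which are a natural next step for the evolution-themed ideas.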

We shall explore these ideas in future tutorials, where we will work through the problems and implement various techniques. #StayTuned.

(This article will be updated with the links to those tutorials once they are ready.)

Future Directions

Integrative Multi-Omics Analysis
— As technology advances, researchers can collect data not just on genomes (genomics) but also on transcriptomes (transcriptomics), proteomes (proteomics), metabolomes (metabolomics), and more. Integrating data from all these “omics” layers can provide a holistic understanding of biological systems.
— This will lead to a more comprehensive understanding of diseases, better drug design, and insights into complex biological phenomena that can’t be understood by looking at just one layer.
— The sheer volume and diversity of data types require advanced computational techniques and algorithms for integration and new methods to visualize and interpret the integrated data.
Personalized and Precision Medicine
— Bioinformatics plays a central role in analyzing individual genomic information to tailor medical treatments to individual patients.
— This can lead to more effective treatments with fewer side effects, better disease prediction, and prevention strategies tailored to an individual’s genetic makeup.
— There are ethical and privacy concerns surrounding personal genomic data. Additionally, while we can sequence genomes rapidly, interpreting what all the genetic variations mean for an individual’s health remains a massive task.
Artificial Intelligence and Machine Learning in Bioinformatics
— With the abundance of biological data available, AI techniques are becoming indispensable tools for data analysis, pattern recognition, and prediction in bioinformatics.
— AI can accelerate drug discovery, improve the accuracy of disease diagnosis, predict disease outbreaks, and even decipher the functions of mysterious genes or genetic regions.
— These techniques require large, well-curated datasets for training. There’s also the “black box” nature of some AI models, where their decision-making process might not be transparent, making it crucial to develop interpretable models.
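At its simplest, integrating two "omics" layers means matching measurements by a shared identifier and asking how they relate. The sketch below joins made-up transcript and protein abundance values by gene name and computes their Pearson correlation. All names and numbers are hypothetical, and real multi-omics integration involves normalization and far more careful statistics than this.

```python
import math


def integrate(transcripts: dict[str, float],
              proteins: dict[str, float]) -> list[tuple[str, float, float]]:
    """Keep only genes measured in both layers, as (gene, rna, protein)."""
    shared = sorted(transcripts.keys() & proteins.keys())
    return [(gene, transcripts[gene], proteins[gene]) for gene in shared]


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical abundance values for two layers.
transcript_levels = {"geneA": 12.0, "geneB": 30.0, "geneC": 55.0, "geneD": 8.0}
protein_levels = {"geneB": 2.1, "geneC": 4.0, "geneD": 0.9, "geneE": 3.3}

merged = integrate(transcript_levels, protein_levels)
rna_values = [rna for _, rna, _ in merged]
protein_values = [protein for _, _, protein in merged]

print(merged)
print(round(pearson(rna_values, protein_values), 3))
```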

Want to explore more and collaborate on projects? Find me on LinkedIn or Twitter!

Feel free to leave your thoughts and ideas in the responses.

Read a similar analysis I did on astronomical data science here.