Bioinformatics: An Exploration Journey Expanding the Frontiers of Science

Pınar YILDIZ
16 min read · Jan 3, 2024


Bioinformatics — PINAR YILDIZ

Hello!

After completing my undergraduate education in biology, I decided to pursue further education in bioinformatics to expand my knowledge and skills. This new journey, combined with the data science training I recently completed, has become an exciting area of exploration for me. That’s why I’ve decided to share my knowledge and daily bioinformatics work with you.

In this series of articles, I will share with you the latest developments in the field of bioinformatics, my own learning process, and my experiences in this field. My goal is to increase my knowledge in this field while also inspiring you.

I will regularly share updates about the exciting world of bioinformatics, my experiences, and my learning notes. I hope you enjoy my articles and join me on this informative journey.

This content was informed and inspired by the Miuul Bioinformatics Bootcamp, where you can find educational resources and materials in this field. I owe special thanks to our instructor, Zeynep Akdeniz, whose inspiring attitude and energy kept us motivated and focused throughout the bootcamp. Her support and knowledge made the training much more valuable and effective for me. For more information, you can visit the Miuul Bioinformatics Bootcamp website.

In today’s world, where science and technology are advancing rapidly, there lies a field at the heart of this progress: Bioinformatics. This fascinating discipline, where biology, computer science, and statistics intersect, utilizes modern technology to decipher the secrets of genetic codes, protein structures, and biological systems. So, what contributions does Bioinformatics bring to our lives, and what does it promise for those wishing to pursue a career in this field?

Bioinformatics is revolutionizing the understanding of health and diseases. The analysis of genetic data is opening new horizons in the diagnosis and treatment of diseases. For instance, by uncovering the genetic foundations of complex diseases like cancer, it allows us to develop more effective treatment methods. Personalized medicine practices are also a significant advantage provided by Bioinformatics.

The drug development process is also becoming faster and more effective thanks to this discipline. The discovery of new drug targets and the modeling of drug interactions become possible with Bioinformatics tools. This has the potential to revolutionize the treatment of diseases.

Additionally, genetic research is a significant part of Bioinformatics. The analysis of the human genome and other large genetic datasets helps us better understand our health and diseases.

For those interested in the field of Bioinformatics, this discipline offers not only a scientific curiosity but also a path to develop critical skills such as analytical thinking, problem-solving, and continuous learning. Regardless of your past field of work, by developing these skills, you can succeed in Bioinformatics.

In this article series, I will provide a guide for those considering stepping into the field of Bioinformatics, outlining the experiences, learning opportunities, and career prospects this journey offers. Each new article will explore a different aspect of the field, the skills to be learned, and the challenges you may face. I hope the series provides the inspiration and information you need to step into the world of Bioinformatics so that, together, we can expand the frontiers of science.

Genomics

Each of us possesses a unique story made up of billions of letters: our genome. This story is hidden in our genetic code, and Bioinformatics is the powerful tool we use to decipher this code. The sequencing of the human genome is widely regarded as one of the great scientific achievements of our time, and thanks to Bioinformatics, we can delve into the depths of research in this field.

Genomics is the scientific study of the complete genetic material, or genome, of living organisms. The human genome consists of approximately 3 billion DNA base pairs, and each part of this complex structure contains valuable information about our health and diseases. Bioinformatics provides the necessary tools and methodologies to analyze, understand, and derive practical applications from this genetic information.

If we take a closer look at genomics:

Genomic Databases: Genomic databases are large data repositories, often accessible online, that contain genomic sequences and related information of various organisms. These databases play a crucial role in genetic research and in understanding the genetic basis of diseases.

Genome Assembly: Genome assembly is the process of reconstructing an organism’s complete genome sequence using reads, which are typically short DNA sequences. This process is a fundamental step in understanding genetic information and discovering the genetic structure of organisms.
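
To make the idea of assembly concrete, here is a toy sketch that greedily merges short reads by their longest suffix-to-prefix overlap. The reads, the min_overlap threshold, and the function names are illustrative assumptions; real assemblers use graph-based methods and must cope with sequencing errors and repeats, which this sketch ignores.

```python
def overlap(a: str, b: str, min_overlap: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    max_len = min(len(a), len(b))
    for length in range(max_len, min_overlap - 1, -1):
        if a.endswith(b[:length]):
            return length
    return 0


def greedy_assemble(reads: list[str], min_overlap: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    length = overlap(a, b, min_overlap)
                    if length > best_len:
                        best_len, best_i, best_j = length, i, j
        if best_len == 0:
            break  # nothing overlaps any more; leave the contigs separate
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three made-up reads that overlap end to end.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
print(greedy_assemble(toy_reads))
```

With these three toy reads, each neighbouring pair shares a four-base overlap, so the sketch merges them into a single contig.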

Genome Annotation: Genome annotation is the process of identifying and characterizing genes, genetic markers, and functional regions in a genome sequence. This process involves determining the locations and functions of genes and other genetic elements.

Comparative Genomics: Comparative genomics is a field that studies genetic differences and similarities by comparing the genomes of different species. This approach provides insights into evolutionary processes, genetic diversity, and the genetic origins of diseases.

Clustering: In a genetic context, clustering is the process of grouping genes or proteins that have similar genetic characteristics or expression patterns. This method is used in the analysis of genetic data and in understanding functional relationships.

Functional Enrichment: Functional enrichment analysis tests whether a specific set of genes is associated with certain biological functions or pathways more often than would be expected by chance, relative to a background set such as all genes in the genome. This analysis helps identify the functions and pathways that play significant roles in a genetic study.
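
As a rough illustration of functional enrichment, the sketch below computes a one-sided hypergeometric p-value, a common choice for over-representation analysis. The gene counts are invented for the example, and the function name is hypothetical; dedicated enrichment tools also correct for multiple testing, which is left out here.

```python
from math import comb


def enrichment_p_value(population: int, annotated: int, sample: int, hits: int) -> float:
    """Probability of seeing at least `hits` annotated genes in a random sample.

    population: number of genes in the background set
    annotated:  background genes that belong to the pathway
    sample:     size of the gene set being tested
    hits:       genes in the tested set that belong to the pathway
    """
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20 background genes, 5 in the pathway,
# and a tested set of 5 genes of which 3 are in the pathway.
p = enrichment_p_value(population=20, annotated=5, sample=5, hits=3)
print(f"p-value: {p:.4f}")
```

A small p-value suggests that the pathway appears in the gene set more often than chance alone would explain.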

Tools

In our bioinformatics studies, we frequently utilize tools such as Python, Snakemake, PyCharm, GitHub, and Conda. These powerful tools offer us unique capabilities in many areas, from genetic data analysis to the management of complex scientific workflows.

Python: Python is a high-level, interpreted, and general-purpose programming language that focuses on readability and efficiency. It is used for writing both simple scripts and developing large applications. Python is popular in various fields such as science, data analysis, web development, and automation.

PyCharm: PyCharm is an Integrated Development Environment (IDE) developed for the Python programming language. It includes numerous tools necessary for code editing, debugging, version control, and managing Python projects. PyCharm is available in both professional and community editions.

GitHub: GitHub is a web-based version control and collaboration platform where software developers can store, share, and collaborate on their code with other developers. It uses the Git version control system and is used for a wide range of projects, from open-source to commercial projects.

Conda: Conda is an open-source package and environment manager. Although it is best known in the Python world, it can install software written in any language, which is why many bioinformatics tools are distributed through it. Conda simplifies package installation, dependency management, and the creation of isolated environments for different projects. It is frequently used in the fields of scientific computing and data science.

Snakemake: Snakemake is a Python-based workflow management tool, particularly used in the field of Bioinformatics. It is used to define and execute complex data analysis workflows. Snakemake is flexible, scalable, and user-friendly, offering the ability to define workflows as code.

So far, we have shared important information about the tools we frequently use. In my next Medium article, I will share detailed information on how to download these essential tools to your computer and their installation processes as soon as possible. This will help you lay a solid foundation for your bioinformatics studies.

Hello World for Bioinformatics

As an example, I will walk through a basic bioinformatics project in PyCharm, step by step.

Open PyCharm: First, open the PyCharm application. If PyCharm is not yet installed, you can find details on how to install it in my next Medium article; I will explain everything step by step, so stay tuned!

Main Screen

View the Main Screen: When PyCharm opens, you will be presented with a main screen. Here, you can create a new project, open an existing one, or learn more about PyCharm.

New Project Screen

Create a New Project: Click on the “New Project” option on the main screen.

Select the Type of Project: PyCharm offers templates for different types of Python projects (e.g., Django, Flask). For a simple Python application, you can choose the standard project.

Configure Project Settings: When creating a new project, specify the project’s name and the location where it will be saved. Additionally, you will need to select a Python interpreter.

When creating a simple Python application in PyCharm, you would typically choose the “Pure Python” option as the “Standard Project.” In this step, you will encounter a field to specify the project’s name. This field is usually located on the right side of the interface, at the top of the list under the “Name” heading. Here, you enter a name for the project you plan to create. This name will serve as the identifier of your project and will appear under this name both in the PyCharm project explorer and in your file system. This naming should be chosen to reflect the content and purpose of your project, making it easier for you to return to and continue working on your project later.

When you create a project, PyCharm automatically fills in a default storage location under your user profile. For example, if your username is ‘pinar’ on macOS, PyCharm suggests a location like ‘/Users/pinar/PycharmProjects’. This is the default directory recommended by PyCharm for new and existing projects. Because it sits in an easily accessible location, it lets you store your projects centrally, which makes them easier to manage and locate.

When creating a new project in PyCharm, enabling the “Create Git repository” option is meant to facilitate the management of your project using Git, a version control system. This option allows your project to be managed as a Git repository from the beginning.

Git facilitates tracking code changes and reverting to previous versions, which simplifies debugging and teamwork. Additionally, it provides documentation of changes and, through remote storage, enables backup and internet access to your project.

When creating a new project in PyCharm, enabling the “Create a main.py welcome script” option is intended to add a main Python file (main.py) to your project, which can be used as a starting point.

The “Create a main.py welcome script” option provides a quick start for beginners in Python by including sample codes with basic Python features and makes understanding the workings of the IDE easier. This option also offers an ideal platform for testing new concepts and learning about Python project structure. Additionally, it provides a suitable development and testing environment for simple projects or rapid prototyping.

Interpreter type

When creating a project in PyCharm, selecting “Project: Virtualenv” as the “Interpreter” type allows you to create a dedicated virtual environment (venv) for your project.

The “Project: Virtualenv” option facilitates dependency management and prevents system-wide conflicts by providing an isolated development environment for Python projects. This choice makes your projects more organized and manageable, while also offering flexibility for testing and experimentation, allowing developers to comfortably try different packages and versions. Additionally, it enhances the portability of the project, enabling easy sharing and use across different systems and among developers.

When creating a project in PyCharm, selecting “Base: Conda” as the “Interpreter” type allows you to use a virtual environment provided by Anaconda for your project. Anaconda is a popular Python distribution widely used especially for data science, machine learning, and scientific computing.

The “Base: Conda” option allows you to benefit from the rich set of packages offered by Anaconda and to use the powerful Conda package management system. This provides a significant advantage, especially for those working in data science and similar fields.

When creating a project in PyCharm, selecting “Custom Environment” as the “Interpreter” type allows you to use a customized Python interpreter that you specify for your project.

The “Custom Environment” option allows you to use customized or different versions of Python interpreters for your projects, providing the opportunity for specialized configurations and advanced control. It offers compatibility between different Python environments to meet the unique requirements of each project. Additionally, it ensures full compatibility with Integrated Development Environments (IDEs) like PyCharm, facilitating the use of standard features like debugging and code completion with custom interpreters.

Opting for the “Custom Environment” option in the “Interpreter” settings is a strategic decision, because we will need to create various customized virtual environments for different projects in the future. This approach lets us manage a specially configured Python environment for each project or study, each isolated with its own dependencies, libraries, and Python version. That flexibility matters in rapidly evolving fields with constantly changing needs, such as Bioinformatics, and it helps us manage our current projects while staying ready for future ones.

In our project, we specifically used Conda. The primary advantages of using Conda as a “Custom Environment” are its cross-platform compatibility, comprehensive package management, and provision of isolated environments. These features offer flexibility to developers and make Conda an ideal choice, especially for scientific and analytical projects.

Why Do We Create Different Environments?

Creating a dedicated environment that contains only the tools and libraries a project needs is very important for efficiency and manageability in bioinformatics studies. For example, if you plan to use tRNAscan-SE, a specialized tool for tRNA detection and analysis, it is sensible to create a dedicated “trnascan” environment for it.

A dedicated environment prevents dependency conflicts between tools and records the versions you used, which supports the reproducibility of your research.

How Do We Create?

conda create -n trnascan python=3.11.5

Note: You should change the Python version part according to the version you are using.

When we want to use the ‘trnascan’ environment, we need to activate it. To do this:

conda activate trnascan

When we no longer need the active environment, we deactivate it. Note that this command takes no environment name:

conda deactivate

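After creating and activating the environment, we install the tRNAscan-SE package into it, and then we can start working.

We install it with mamba, a faster drop-in replacement for the conda command (this assumes mamba is already installed; conda install accepts the same arguments):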

mamba install -c bioconda trnascan-se
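
After installing, it can be handy to confirm that the tool is actually reachable from the active environment. The following is a minimal sketch using only the Python standard library; the helper name tool_available is hypothetical, and the check only looks for an executable on the current PATH, so it reports "not found" whenever the right environment is not active.

```python
import shutil


def tool_available(name: str) -> bool:
    """True if an executable called `name` is found on the current PATH."""
    return shutil.which(name) is not None


for tool in ("tRNAscan-SE", "snakemake"):
    status = "found" if tool_available(tool) else "not found (is the environment active?)"
    print(f"{tool}: {status}")
```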

Now it’s your turn to follow the same steps in the PyCharm terminal. Create a ‘snakemake’ environment and then install Snakemake into it.

Let’s go ahead and start by creating our project.

After configuring your settings, click on the “Create” button to create your new project.

As an example, in our project, we will be studying Caenorhabditis elegans.

Caenorhabditis elegans

Caenorhabditis elegans, commonly known as C. elegans, is a species of microscopic roundworm (nematode). It is a widely used model organism for studying fundamental biological processes. C. elegans is about 1 mm in length and has a transparent body. An adult hermaphrodite has exactly 959 somatic cells, which facilitates studies in cell and developmental biology. Additionally, in 1998 it became the first multicellular organism to have its entire genome sequenced. The known structure of its genome provides an excellent foundation for genetic and molecular biology studies.

You can download the file related to C. elegans from here and add it to your project.

Creating a new directory

First, we create a folder named “resources” in our project.

Name of the new directory/folder to be created

We move the relevant file to this folder. This file will form the basis of our analysis.
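
Before running any analysis, it helps to check that the FASTA file looks as expected. The sketch below is a minimal, standard-library FASTA reader that reports the length of each record; the function name fasta_summary and the demo records are assumptions made for illustration, and the demo writes a tiny temporary file so that the code runs without the real genome.

```python
import tempfile
from pathlib import Path


def fasta_summary(path):
    """Return {record name: sequence length} for a FASTA file."""
    lengths = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # The record name is the first word after '>'.
                parts = line[1:].split()
                name = parts[0] if parts else f"record_{len(lengths) + 1}"
                lengths[name] = 0
            elif name is not None:
                lengths[name] += len(line)
    return lengths


# Demo on a tiny, made-up FASTA file so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fa"
    demo.write_text(">chrDemo1 toy record\nACGTAC\nGT\n>chrDemo2\nTTGA\n")
    summary = fasta_summary(demo)
    print(summary)
    print(sum(summary.values()), "bases in", len(summary), "records")

# For the real data you would call: fasta_summary("resources/C_elegans.fa")
```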

In our project, we will use the tool tRNAscan-SE to analyze genomic sequences. tRNAscan-SE is a computer program that identifies and characterizes transfer RNA (tRNA) genes within genetic sequences. tRNAs are small RNA molecules that play a crucial role in protein synthesis by carrying amino acids to the ribosome.

https://en.wikipedia.org/wiki/Transfer_RNA

The primary functions of tRNAscan-SE are as follows:

  1. Detection of tRNA Genes: tRNAscan-SE scans DNA or RNA sequences to identify tRNA genes. This is a crucial step in the analysis of genetic sequences because tRNA genes are of critical importance for cellular functions.
  2. High Accuracy and Precision: The program offers high accuracy and precision in identifying tRNA genes. Utilizing advanced algorithms and databases, it can distinguish tRNA genes from other genetic elements.
  3. Wide Application Range: tRNAscan-SE can be used on the genetic sequences of a wide variety of organisms, from bacteria to humans, effectively identifying tRNA genes across a broad spectrum.
  4. Essential Tool for Biological Research: The analysis of tRNA genes provides fundamental insights into topics such as genetic regulation, gene expression, and protein synthesis. Therefore, tRNAscan-SE is a frequently used tool in molecular biology and genetic research.
  5. Isolation and Characterization of tRNA: The program not only detects tRNA genes but can also characterize their structural and functional properties.

For more detailed information about the tRNAscan-SE tool, visit the official website: http://lowelab.ucsc.edu/tRNAscan-SE/

We are creating an output folder for our results

We are creating an “outputs” folder for storing the results obtained with the tRNAscan-SE tool.

In our project, we will conduct analyses in two different ways: both through Python code and by obtaining results via Bash Script.

Firstly, let’s examine step by step how to do it via Bash Script.

Let’s create a folder named ‘scripts’. We will keep our Bash script files here.
We are creating a file named ‘tRNAscan.sh’ in our ‘scripts’ folder
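
The folders used in this project can also be created from code rather than through the PyCharm menus. This is a small sketch with pathlib; the helper name create_layout is hypothetical, and the demo runs inside a temporary directory so that it does not touch your real project.

```python
import tempfile
from pathlib import Path


def create_layout(root, folders=("resources", "outputs", "scripts")):
    """Create the project folders under `root` if they do not exist yet."""
    created = []
    for folder in folders:
        path = Path(root) / folder
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demo in a throwaway directory; pass your project root instead.
with tempfile.TemporaryDirectory() as tmp:
    made = create_layout(tmp)
    print(sorted(p.name for p in made))
```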

As you can see in the lines below, we first write the shebang line #!/usr/bin/env bash so that the file is executed with bash.

Then, to activate the ‘trnascan’ environment we previously created with Conda, we first give bash access to Conda by sourcing its conda.sh file. In my case, the file is located at /Users/pinar/anaconda3/etc/profile.d/conda.sh. You should write the path where it is located on your system.

Now, we activate the ‘trnascan’ environment with ‘conda activate trnascan’.

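#!/usr/bin/env bash

# Make the conda command available in this non-interactive shell
# (adjust the path to your own Conda installation).
source /Users/pinar/anaconda3/etc/profile.d/conda.sh

# Activate the environment that contains tRNAscan-SE.
conda activate trnascan

# $1: input FASTA file, $2: output file for the results.
tRNAscan-SE "$1" -o "$2"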

Finally, we write the tRNAscan-SE command and its parameters. The first parameter ($1) is our source file, i.e., C_elegans.fa, and the second parameter ($2), passed to the -o option, is the file to which the results will be written.

Important Note: To execute the command in the terminal section below, you need to grant permission to the file with the command:

chmod 755 scripts/tRNAscan.sh

Now it’s time to run our command file. We specify the file path with the command ./scripts/tRNAscan.sh.

The resources/C_elegans.fa part is our input, i.e., the fasta file.

The outputs/tRNAscan_results part specifies our output file.

./scripts/tRNAscan.sh resources/C_elegans.fa outputs/tRNAscan_results

When we run the command, the output will be as follows.

tRNAscan result
You can see the result file in the ‘outputs’ folder
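
Once the result file exists, a few lines of Python can summarize it. The sketch below counts how many tRNA genes of each type were reported. It assumes the default tab-separated output in which the third and fourth columns hold the begin and end coordinates and the fifth column holds the tRNA type; check these positions against your own file. The sample rows are made up, and the function name is hypothetical.

```python
from collections import Counter


def count_trna_types(lines):
    """Count tRNA types in tRNAscan-SE tabular output (assumed column layout)."""
    counts = Counter()
    for line in lines:
        fields = [field.strip() for field in line.rstrip("\n").split("\t")]
        # Header and separator rows have no numeric coordinates, so skip them.
        if len(fields) < 5 or not (fields[2].isdigit() and fields[3].isdigit()):
            continue
        counts[fields[4]] += 1
    return counts


# Made-up rows that imitate the assumed layout.
sample = [
    "Sequence\t\ttRNA\tBounds\ttRNA\tAnti\tIntron Bounds\tInf\n",
    "Name\ttRNA #\tBegin\tEnd\tType\tCodon\tBegin\tEnd\tScore\tNote\n",
    "--------\t------\t-----\t------\t----\t-----\t-----\t----\t------\t------\n",
    "chrDemo\t1\t1000\t1071\tGly\tGCC\t0\t0\t65.5\t\n",
    "chrDemo\t2\t5200\t5129\tAla\tAGC\t0\t0\t70.1\t\n",
    "chrDemo\t3\t9000\t9071\tGly\tTCC\t0\t0\t61.3\t\n",
]
print(count_trna_types(sample).most_common())

# For the real file: count_trna_types(open("outputs/tRNAscan_results"))
```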

Attention: This command line file may not automatically be added to the version control system (like Git) depending on the IDE settings. Check and make sure it is added.

Now let’s examine step by step how to do it via Python.

We will use Snakemake for the part we do with Python. Optionally, to work with Snakemake more comfortably in PyCharm, you can install the SnakeCharm plugin. The plugin is not mandatory, but it is useful for formatting and syntax highlighting in the file we will create.

To install the plugin, you can open “Preferences” by clicking on it
In this screen, you can reach the Plugins section by typing ‘plugins’ in the search area
In the Plugins section, you can find and install the relevant plugin by searching for ‘SnakeCharm’.

Before moving to the Snakefile, let’s create our env.yaml file for our environment settings.

We are creating our ‘env’ folder and ‘env.yaml’ file

We write the following configuration inside it.

name: Hello_World_For_Bioinformatics
channels:
  - bioconda
  - conda-forge
  - nanoporetech
dependencies:
  - trnascan-se

We are creating a ‘tRNAscan_stats.py’ file in the scripts folder

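In the tRNAscan_stats.py script, we first import shell from the snakemake.shell module. When Snakemake runs a file through the script directive, it provides a snakemake object, from which we read the input and output paths defined in the Snakefile. Finally, the tRNAscan-SE command is executed via shell, similar to the process written in the Bash script; the -m option additionally saves a statistics summary of the run.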

from snakemake.shell import shell

genome = snakemake.input.genome
tRNA = snakemake.output.tRNA
stats = snakemake.output.stats

shell(f"""tRNAscan-SE {genome} -o {tRNA} -m {stats} """)
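
The script above only works when Snakemake launches it, because the snakemake object does not exist otherwise. To see which command it would build, the sketch below imitates that object with a stand-in made from SimpleNamespace. The stand-in and the build_command helper are assumptions for illustration only and are not part of Snakemake.

```python
from types import SimpleNamespace


def build_command(snakemake) -> str:
    """Build the tRNAscan-SE command from the rule's named inputs and outputs."""
    genome = snakemake.input.genome
    trna = snakemake.output.tRNA
    stats = snakemake.output.stats
    return f"tRNAscan-SE {genome} -o {trna} -m {stats}"


# Stand-in for the object Snakemake would normally provide.
mock_snakemake = SimpleNamespace(
    input=SimpleNamespace(genome="resources/C_elegans.fa"),
    output=SimpleNamespace(
        tRNA="outputs/C_elegans.tRNA",
        stats="outputs/C_elegans.stats",
    ),
)
print(build_command(mock_snakemake))
```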

We are creating a ‘Snakefile’ in the main directory. Snakemake looks for a file with this name by default; a file with a different name must be passed explicitly with the --snakefile option

We write the following lines in the Snakefile:

In the Snakefile, we start with “rule all,” where we specify our final product, “outputs/C_elegans.tRNA”. In the “rule tRNAscan” section, we take the “resources/C_elegans.fa” file and run the tRNAscan-SE program to produce the “outputs/tRNA_scan_result.txt” output. Because “rule all” does not request that file, this rule runs only if you ask for its output explicitly or add it to the input of “rule all”. In the “rule tRNAscan_stats” section, using the same input file, we generate two outputs, “outputs/C_elegans.tRNA” and “outputs/C_elegans.stats”. In this rule, we request two threads and specify a Conda environment for the process. Finally, we use our Python script named “scripts/tRNAscan_stats.py” for this task.

rule all:
    input:
        "outputs/C_elegans.tRNA"

rule tRNAscan:
    input:
        "resources/C_elegans.fa"
    output:
        "outputs/tRNA_scan_result.txt"
    shell:
        "tRNAscan-SE {input} -o {output}"

rule tRNAscan_stats:
    input:
        genome="resources/C_elegans.fa"
    output:
        tRNA="outputs/C_elegans.tRNA",
        stats="outputs/C_elegans.stats"
    threads: 2
    conda:
        "env/env.yaml"
    script:
        "scripts/tRNAscan_stats.py"

To run the Snakemake workflow, we use two main commands: “Dry run” and “Run”. The “Dry run” command (snakemake --jobs 1 -c1 --use-conda --printshellcmds -n) shows which jobs the workflow would run without executing any of them. The “Run” command (snakemake --jobs 1 -c1 --use-conda --printshellcmds) actually executes the workflow and produces the results. In both commands, --jobs 1 -c1 limits the run to one job at a time on a single core, --use-conda tells Snakemake to run rules in the Conda environments they declare, and --printshellcmds displays the executed shell commands.

# Dry run
snakemake --jobs 1 -c1 --use-conda --printshellcmds -n

# Run
snakemake --jobs 1 -c1 --use-conda --printshellcmds

The first part of the output of our snakemake command
The last part of the output of our snakemake command
You can see the prepared files in the ‘outputs’ folder
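
Snakemake works backwards from the requested targets to decide which rules have to run. The sketch below is a deliberately simplified model of that idea, written for this article's three file names; it is not Snakemake's actual implementation, which also handles wildcards, file timestamps, and much more. The dictionary layout and the rules_needed helper are assumptions made for illustration.

```python
# Each rule lists the files it reads and the files it writes.
rules = {
    "tRNAscan": {
        "input": ["resources/C_elegans.fa"],
        "output": ["outputs/tRNA_scan_result.txt"],
    },
    "tRNAscan_stats": {
        "input": ["resources/C_elegans.fa"],
        "output": ["outputs/C_elegans.tRNA", "outputs/C_elegans.stats"],
    },
}


def rules_needed(targets, rules):
    """Return the names of the rules required to produce `targets`."""
    producers = {out: name for name, rule in rules.items() for out in rule["output"]}
    needed, seen = [], set()
    stack = list(targets)
    while stack:
        target = stack.pop()
        name = producers.get(target)
        if name is None or name in seen:
            continue  # a source file, or a rule that is already scheduled
        seen.add(name)
        needed.append(name)
        stack.extend(rules[name]["input"])  # the rule's inputs become new targets
    return needed


# Only the file requested by "rule all" is used as the target here.
print(rules_needed(["outputs/C_elegans.tRNA"], rules))
```

In this model, requesting only outputs/C_elegans.tRNA schedules tRNAscan_stats alone, which matches the files that appear in the ‘outputs’ folder.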

We’ve reached the end of our first article in the field of bioinformatics. I hope this piece has served as a guide for you in this exciting field. But this is just the beginning! In my upcoming articles, we will delve into more comprehensive and deeper topics in Bioinformatics. Don’t forget to follow me to join me on this journey and to access similar articles. I’m looking forward to sharing more knowledge and discoveries with you. See you soon!
