Guide to Bioinformatics File Formats

A Beginner’s Primer

Eric Gathirwa
Jenomu Bioinformatics
4 min read · Jun 18, 2024


In our recent articles, we addressed how to bridge the gap between dry and wet labs and which programming languages are the cornerstone for a beginner in bioinformatics. We provided some resources for that, so check the articles out.

Photo by Catrina Carrigan on Unsplash

After you have learned the basics of bioinformatics programming and are ready to run an analysis, you may realize you have no idea which file types you will be dealing with at each stage. To put it another way, it is like knowing how to drive but having no idea what the road signs mean or which side of the road you are supposed to drive on. I don’t think I need to explain how that will likely turn out…

In this short article, I will briefly explain the main bioinformatics file formats and the role each plays in bioinformatics analysis.

Sequence and Quality Files

First, let’s set the stage with the raw materials. DNA and RNA sequencing generates massive amounts of data, stored as plain text in FASTQ and FASTA files. Think of a FASTA file as the script of a play: each record has a header line starting with “>” followed by the sequence itself, in this case the DNA or RNA bases. FASTQ files add another layer: a quality score for every base, recording how confident the sequencer was in that base call, much like notes on how cleanly each line of the script was delivered. The scores do not guarantee accuracy, but they let you filter or trim unreliable reads. Sequencers typically produce FASTQ files right after a run in the wet lab, while FASTA is more commonly used for reference genomes and assembled sequences. Together, these files serve as the foundation for all downstream analyses.
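To make the quality scores concrete, here is a minimal Python sketch that reads a FASTQ record and decodes its quality string. It assumes the common Phred+33 encoding and simple four-line records; the read name and bases are made up for illustration, and parse_fastq and error_probability are hypothetical helper names, not part of any library.

```python
def parse_fastq(text):
    """Yield (read_id, sequence, scores) from simple 4-line FASTQ records."""
    lines = [line.strip() for line in text.strip().splitlines()]
    for i in range(0, len(lines), 4):
        header, sequence, _separator, quality = lines[i:i + 4]
        # Assumption: Phred+33 encoding, so score = ASCII code minus 33.
        scores = [ord(char) - 33 for char in quality]
        yield header[1:], sequence, scores  # drop the leading "@"


def error_probability(score):
    """Convert a Phred score Q into an error probability, 10^(-Q/10)."""
    return 10 ** (-score / 10)


# Made-up record: header, bases, separator, one quality character per base.
example_fastq = """@read1
GATTACA
+
IIII5+!
"""

records = list(parse_fastq(example_fastq))
for read_id, sequence, scores in records:
    print(read_id, sequence, scores)
```

A score of 20 corresponds to a 1-in-100 chance that the base is wrong, and a score of 40 to 1 in 10,000. For real data, an established parser such as the one in Biopython is a safer choice than hand-rolled code like this.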

Mapping the Data

Once they have the sequences, bioinformaticians use alignment software to map the reads to a reference genome. In line with our driving analogy, this is like following the GPS on the dashboard as it tells us where to go. This process creates SAM (Sequence Alignment/Map) or BAM (Binary Alignment/Map) files, the digital equivalent of actors blocking out their movements on stage. SAM is a human-readable, tab-delimited text format that records where and how each read aligned. BAM is the compressed binary version of the same information, which makes it ideal for storage and large-scale analyses. Alignment takes place after sequencing and is crucial for identifying genetic variants.
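As a rough illustration of what “human-readable” means here, the sketch below splits one SAM alignment line into its 11 mandatory columns and checks two of the standard FLAG bits. The alignment line is invented for the example, and parse_sam_line is a hypothetical helper, not a library function.

```python
# The 11 mandatory SAM columns, in order.
SAM_FIELDS = [
    "qname", "flag", "rname", "pos", "mapq", "cigar",
    "rnext", "pnext", "tlen", "seq", "qual",
]


def parse_sam_line(line):
    """Turn one tab-delimited SAM alignment line into a dictionary."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("flag", "pos", "mapq", "pnext", "tlen"):
        record[key] = int(record[key])
    # FLAG is a bit field: 0x4 means unmapped, 0x10 means reverse strand.
    record["is_unmapped"] = bool(record["flag"] & 0x4)
    record["is_reverse"] = bool(record["flag"] & 0x10)
    return record


# Invented alignment: read1 maps to chr1 at position 100 on the reverse strand.
example_line = "read1\t16\tchr1\t100\t60\t7M\t*\t0\t0\tGATTACA\tIIIIIII"

alignment = parse_sam_line(example_line)
print(alignment["rname"], alignment["pos"], alignment["cigar"])
print("reverse strand:", alignment["is_reverse"])
```

BAM files hold the same fields in binary form, so they cannot be read line by line like this; in practice, tools such as samtools or the pysam library are used for both formats.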

Variant Calling: Unveiling the Nuances

The play’s not complete without its twists! Some actors tend to improvise, affecting the flow and meaning, and the director must address these changes. While some modifications are acceptable and maintain the play’s theme, others distort the intended meaning and require further attention. Variant Call Format (VCF) files come in after alignment. They are produced by variant-calling software and list the positions where the sequenced DNA differs from the reference genome, such as single-base changes, insertions, and deletions. Back to our script analogy, imagine the director noting each of these discrepancies in the margin of the script. VCF files are essential for studies that focus on genetic variation and disease.
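To show what those notes look like in the file, here is a small Python sketch that reads the first eight fixed VCF columns and labels each variant by comparing the reference and alternate alleles. The two data lines are invented, and parse_vcf, parse_info, and variant_type are hypothetical helpers written for this example.

```python
def variant_type(ref, alt):
    """Classify a variant by comparing reference and alternate allele lengths."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNV"
    if len(ref) < len(alt):
        return "insertion"
    if len(ref) > len(alt):
        return "deletion"
    return "other"


def parse_info(info):
    """Split the INFO column (key=value pairs or bare flags) into a dict."""
    fields = {}
    if info == ".":
        return fields
    for item in info.split(";"):
        key, separator, value = item.partition("=")
        fields[key] = value if separator else True
    return fields


def parse_vcf(text):
    """Yield one dict per data line, skipping the '#' header lines."""
    for line in text.strip().splitlines():
        if line.startswith("#"):
            continue
        chrom, pos, var_id, ref, alt, qual, filt, info = line.split("\t")[:8]
        alts = alt.split(",")  # several alternate alleles may share a line
        yield {
            "chrom": chrom, "pos": int(pos), "id": var_id, "ref": ref,
            "alts": alts, "qual": qual, "filter": filt,
            "info": parse_info(info),
            "types": [variant_type(ref, allele) for allele in alts],
        }


# Invented example: one single-base change and one deletion.
example_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=14\n"
    "chr1\t20000\t.\tAT\tA\t30\tPASS\tDP=9\n"
)

variants = list(parse_vcf(example_vcf))
for variant in variants:
    print(variant["chrom"], variant["pos"], variant["ref"], variant["alts"], variant["types"])
```

This only covers the simplest cases. Real VCF files also carry per-sample genotype columns and symbolic alleles, which is why dedicated tools are normally used to read them.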

Behind the Scenes: Annotations and Metadata

Annotations and metadata take the play a notch higher. Annotation files describe where genes and other features sit within a sequence and what they are, like identifying the actors’ roles and stage directions. General Feature Format (GFF) and Gene Transfer Format (GTF) files act as the behind-the-scenes crew: each line records a feature’s coordinates, strand, and attributes such as gene identifiers. These files are often downloaded alongside a reference genome, though they can also be generated during the analysis pipeline, and they are crucial for interpreting the biological meaning of the data.
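For a feel of how an annotation line is laid out, the sketch below parses a single GTF line into its nine tab-separated columns. The gene name and coordinates are invented, parse_gtf_line is a hypothetical helper, and the attribute handling assumes the GTF style of key "value"; pairs (GFF3 uses key=value pairs instead).

```python
def parse_gtf_line(line):
    """Parse one GTF feature line (9 tab-separated columns) into a dict."""
    (seqname, source, feature, start, end,
     score, strand, frame, attributes) = line.rstrip("\n").split("\t")

    # GTF attributes look like: gene_id "GENE1"; gene_name "demo";
    parsed_attributes = {}
    for item in attributes.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        parsed_attributes[key] = value.strip('"')

    return {
        "seqname": seqname, "source": source, "feature": feature,
        "start": int(start), "end": int(end), "score": score,
        "strand": strand, "frame": frame,
        "attributes": parsed_attributes,
    }


# Invented annotation line for a gene on the forward strand of chr1.
example_line = 'chr1\texample\tgene\t1000\t2000\t.\t+\t.\tgene_id "GENE1"; gene_name "demo";'

feature = parse_gtf_line(example_line)
# Coordinates are 1-based and inclusive, so length is end - start + 1.
feature_length = feature["end"] - feature["start"] + 1
print(feature["feature"], feature["attributes"]["gene_id"], feature_length)
```

Because the coordinates are 1-based and include both ends, the example feature spans 1,001 bases rather than 1,000, a detail that is easy to get wrong when mixing formats.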

The Show Must Go On: Sharing and Collaboration

Finally, the results must be shared! Many file formats, like FASTA, BAM, and VCF, are widely used and easily shared between researchers. This allows for collaboration and reproducibility of findings, like sharing the script with other theatres for future productions.

These formats are just the tip of the iceberg; the world of bioinformatics file formats is vast, but understanding these essential files is a significant first step. Remember, bioinformatics thrives on collaboration, and just like a well-rehearsed play, success comes from the seamless integration of the wet lab’s data generation with the bioinformatician’s analysis skills.

At Jenomu Bioinformatics, we nurture this integration by providing a platform where skilled bioinformaticians can apply their skills to impactful biological projects and researchers can access a large pool of bioinformaticians to help them meet their analysis needs. Visit jenomu.com to learn more and join us.

Also, if you are new to bioinformatics and would like a ‘less intimidating’ introduction to bioinformatics and computational biology, we have you covered! You can follow our series of articles on Medium at https://medium.com/jenomubioinfo.

We look forward to partnering with you on your bioinformatics journey!
