File Formats Every Bioinformatician — Established or Upcoming — Must Know (and then some)

8 min read · Apr 1, 2023

Dwight Schrute once said, “I’m ready to face any challenge that may be foolish enough to face me.” That’s the attitude of a successful bioinformatician right there. I don’t mean Dwight, I mean me. And maybe you. Anywho, challenges in bioinformatics analysis sometimes happen when you don’t treat a file right.

The easiest way to avoid annoying challenges caused by file formats is to understand file formats. Right?

In this piece, I’ve documented some common bioinformatics file formats I’ve encountered with short descriptions and their uses. Basic knowledge of each file format is important to understand bioinformatics research papers, tutorials, other bioinformatics-related content, and bioinformatics analysis. So, I’m writing this for me and for you.

Different Bioinformatics File Formats: Their Meanings and Uses

.fasta

The two most common biological data types for bioinformaticians, DNA and protein sequences, usually exist in fasta format. Every record in a fasta file starts with a signature greater than (>) sign, followed by a description statement (or the jargony term, sequence identifier). The lines after it are filled with a bunch of As, Ts, Cs, and Gs for DNA, or a combination of 20 amino acid letters for protein, until the next > sign starts a new record. Another way to identify a fasta file without opening it is the extension: .fasta, .fa, .faa, .fna, and .ffn are the usual ones.

You can download fasta files directly from the gene database, NCBI (National Center for Biotechnology Information). Depending on the intended usage, one fasta file can have one sequence or many. It is the simplest representation of biodata and is often the required format for downstream analysis like sequence alignment and phylogenetic analysis, especially when you’re using local or web-based software.
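To make the layout concrete, here is a minimal sketch of reading fasta records in plain Python. The function name `parse_fasta` and the two example records are made up for illustration, and it assumes the identifier is the first word after the > sign. For real projects, an established library is probably the safer choice.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for fasta-formatted lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Header line: keep the first word after ">" as the identifier.
            current_id = line[1:].split()[0]
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines may be wrapped, so join them together.
            records[current_id] += line
    return records


# Made-up example records, not real genes.
example_fasta = """>seq1 hypothetical DNA sequence
ATGCGT
ACGT
>seq2 hypothetical protein sequence
MKVLA
"""

fasta_records = parse_fasta(example_fasta.splitlines())
for name, seq in fasta_records.items():
    print(name, len(seq), seq)
```

Notice that the wrapped DNA sequence comes back as one string, which is what most downstream tools expect.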

.fastq

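Q here stands for quality. So, fastq is an extension of the fasta file format that includes a quality score for every base in a sequence read, assigned during basecalling. Each read takes up four lines: an identifier line that starts with @, the sequence, a separator line that starts with +, and a string of characters that encodes the quality scores. Unlike fasta, fastq comes straight from sequencing machines, so you’ll find it for nucleotide reads, not protein sequences.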

.fastq files, like .fasta, are also used for sequence alignment, and for other larger-scale analyses like variant calling.
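Here is a small sketch of how the quality string in a fastq record turns into numbers. It assumes the common Phred+33 encoding (quality score = ASCII code of the character minus 33) and unwrapped four-line records; the name `parse_fastq` and the example read are hypothetical.

```python
from typing import Iterator, List, Tuple


def parse_fastq(lines: List[str]) -> Iterator[Tuple[str, str, List[int]]]:
    """Yield (read_id, sequence, phred_scores) from four-line fastq records."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (item.strip() for item in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed record starting at line {i + 1}")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33.
        scores = [ord(ch) - 33 for ch in qual]
        yield header[1:], seq, scores


# Made-up read for illustration.
example_fastq = """@read1
ACGT
+
II5!
"""

reads = list(parse_fastq(example_fastq.splitlines()))
for read_id, seq, scores in reads:
    print(read_id, seq, scores)
```

A higher score means the basecaller was more confident about that base, so the last base in this made-up read is the least trustworthy one.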

.fast5

As biomedical research evolves, so will the kinds of files we use for analysis. The .fast5 file contains the raw electrical current signals from nanopore sequencing. Nanopore sequencing is a next-generation sequencing technique that reads a strand by tracking changes in current as it passes through a nanopore, which generates long reads and leaves fewer sequence fragments to piece together. It’s one of the leading long-read technologies in sequencing right now.

If you peek into this file, you’ll see a bunch of nonsense that only a computer program can understand (it’s a binary format built on HDF5), plus, in some files, a few occurrences of sequence bases. Using this file requires a process called basecalling with a package like guppy to convert the signals into sequences (it can optionally align them to reference genome sequences too).

.sam

Sequence Alignment/Map files contain information about, you guessed it, aligned sequences: reads that have been arranged based on a reference genome. Like .fasta files, they have description lines or headers, and the symbol @ is their signature sign. Every line without the sign describes one aligned read in tab-separated columns, including the read name, the reference sequence it maps to, the position, the mapping quality, and the read sequence itself.

.sam files are human readable, which means you can open them on a command line or in a notepad on your PC, and understand what stares back at you to an extent. The downside is, in real-life research, they’re pretty heavy. So, to save space and make our analysis more efficient, we save them in another format: .bam.
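Since .sam is plain text, you can pull an alignment line apart with a few lines of Python. This is a rough sketch: the eleven column names follow the SAM specification, but the function name `parse_sam_line` and the example lines are made up, and real work is usually done with `samtools` or a dedicated library.

```python
from typing import Dict, List, Optional

# The eleven mandatory columns of a SAM alignment line, in order.
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
INT_FIELDS = ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN")


def parse_sam_line(line: str) -> Optional[Dict[str, object]]:
    """Return a dict for an alignment line, or None for an @ header line."""
    if line.startswith("@"):
        return None
    columns = line.rstrip("\n").split("\t")
    record: Dict[str, object] = dict(zip(SAM_FIELDS, columns[:11]))
    for field in INT_FIELDS:
        record[field] = int(record[field])
    # Anything after column 11 is an optional TAG:TYPE:VALUE field.
    record["TAGS"] = columns[11:]
    return record


# Made-up header and alignment lines for illustration.
sam_lines: List[str] = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@SQ\tSN:chr1\tLN:1000",
    "read1\t0\tchr1\t101\t60\t4M\t*\t0\t0\tACGT\tIIII\tNM:i:0",
]

alignments = [rec for rec in map(parse_sam_line, sam_lines) if rec is not None]
for rec in alignments:
    print(rec["QNAME"], rec["RNAME"], rec["POS"], rec["MAPQ"], rec["CIGAR"])
```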

.bam

A bam file is the compact and binary version of the corresponding sam file. If you open this, you’ll see a bunch of random symbols that make absolutely no sense to you. You can still view its content though, on the command line using a software package called `samtools` (the sam is not a coincidence).

.bam files often have a companion file with .bai extension. .bai files usually share the name of the corresponding .bam file (say, sample.bam and sample.bam.bai), and serve as a table of contents that allows computer programs to jump straight to a region of the .bam file; they don’t have any sequence information. In other words, without a corresponding .bam file, .bai is useless. If you want more insight about how to view bam files, there’s a tutorial on working with .bam files by NCBI.

.bed

A BED file is a widely used file format in bioinformatics research for representing genomic intervals or regions. It is tab-delimited and easy to read as a human. The three required columns are: “chromosome,” “start position,” and “end position.” Depending on the intended use and the input command, it can have other optional “additional information” columns.

In a BED file:

Chromosome: This column specifies the chromosome or sequence identifier where the genomic region is located.
Start Position: This column represents the starting position (base pair) of the genomic region on the specified chromosome. It is 0-based, so the first base of a chromosome is position 0.
End Position: This column indicates the ending position (base pair) of the genomic region on the specified chromosome. The end itself is not included in the region, so the length of a region is simply end minus start.
Additional Information: These optional columns can contain any additional data related to the genomic region, such as a name, score, strand information, gene names, or functional annotations.

We use BED files for genome annotation to define or label a region as an exon or promoter, for instance; variant calling to specify regions of interest during the analysis; and also genome browser visualisation.
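The columns described above can be read with a short Python sketch like the one below. The function name `parse_bed_line` and the two regions are made up for illustration, and it assumes the standard BED convention of a 0-based start and an end that is not included in the region.

```python
from typing import Dict, List


def parse_bed_line(line: str) -> Dict[str, object]:
    """Split one tab-delimited BED line into its columns."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    return {
        "chrom": chrom,
        "start": start,            # 0-based
        "end": end,                # not included in the region
        "length": end - start,     # no +1 needed with this convention
        "extra": fields[3:],       # optional columns such as name, score, strand
    }


# Made-up regions for illustration.
bed_lines: List[str] = [
    "chr1\t100\t200\texon1\t0\t+",
    "chr2\t0\t50",
]

regions = [parse_bed_line(line) for line in bed_lines]
for region in regions:
    print(region["chrom"], region["start"], region["end"], region["length"])
```

The second region starts at 0, which is perfectly legal in BED and a common source of off-by-one errors when mixing BED with 1-based formats.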

Bottom line, if you see a file with .bed extension, spend some time looking through to understand what you can do with it.

.vcf

Variant is the fancy biological term for differences. Variant call format is a text file that stores genetic sequence variations after running a comparison analysis. For instance, if you’re trying to find the genetic differences between a tumor genome and a normal genome, you can perform a variant calling analysis and store your result in a .vcf file.

.vcf files are human readable, so you can easily view the content with a notepad on your PC. A .vcf file has metadata (or description or header lines, whatever tickles your fancy) with the symbol # as its signature sign. Lines that start with ## describe the file and define the abbreviations used in it, and there can be anywhere from a handful to hundreds of them. A single line that starts with one # names the columns. Every other line describes one variant: the chromosome and position, the reference and alternate alleles, the quality of the call, and extra annotations. The easiest way to understand the abbreviated letters is to read the metadata.
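As a rough sketch of that structure, the Python below separates the ## metadata, the single # column line, and the variant lines. The names `parse_vcf` and `variant_type` are hypothetical, the example variants are made up, and the SNV-versus-indel rule is a simplification that ignores multi-allelic and structural variants.

```python
from typing import Dict, List, Tuple


def parse_vcf(lines: List[str]) -> Tuple[List[str], List[Dict[str, str]]]:
    """Return (metadata_lines, variants) from VCF-formatted lines."""
    metadata: List[str] = []
    columns: List[str] = []
    variants: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            metadata.append(line)              # file-level metadata
        elif line.startswith("#"):
            columns = line[1:].split("\t")     # column names, e.g. CHROM, POS
        elif line:
            variants.append(dict(zip(columns, line.split("\t"))))
    return metadata, variants


def variant_type(variant: Dict[str, str]) -> str:
    """Simplified rule: single base swapped for a single base is an SNV."""
    if len(variant["REF"]) == 1 and len(variant["ALT"]) == 1:
        return "SNV"
    return "indel"


# Made-up file content for illustration.
vcf_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t50\tPASS\tDP=30",
    "chr1\t20000\t.\tAT\tA\t42\tPASS\tDP=25",
]

vcf_metadata, vcf_variants = parse_vcf(vcf_lines)
for v in vcf_variants:
    print(v["CHROM"], v["POS"], v["REF"], v["ALT"], v["QUAL"], variant_type(v))
```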

.gtf/.gff

Apart from the letters of DNA and protein sequences, other forms of biological data exist. Feature or annotation data provides more information about genes, DNA clusters, and other sequence combinations. It helps to make sense of genetic data obtained from sequencing and other downstream processing; it’s like a guide to data interpretation. GTF stands for gene transfer format, and GFF stands for general feature format.

GFF files are more robust than gtf, but they serve the same purpose. Both have nine tab-separated columns: the sequence (chromosome) name, source, feature type, start, end, score, strand, frame or phase, and a final attributes column that holds details like gene name and gene biotype. If you have a file of chromosome coordinates for instance, you can use a gtf or gff file (usually freely available on bio databases) to determine what kind of information exists in that region like genes, promoters, enhancers, or whatever else the sequence holds.
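Here is a minimal sketch of splitting a gtf line into its nine columns and unpacking the attributes column. The function name `parse_gtf_line`, the gene identifiers, and the source name are all made up. It assumes gtf-style attributes (`key "value";`), which differ from the `key=value` style used by GFF3, and it would trip on values that themselves contain a semicolon.

```python
from typing import Dict


def parse_gtf_line(line: str) -> Dict[str, object]:
    """Parse one gtf line; attributes become a nested dict."""
    cols = line.rstrip("\n").split("\t")
    attributes: Dict[str, str] = {}
    for item in cols[8].split(";"):
        item = item.strip()
        if not item:
            continue
        # Each attribute looks like: key "value"
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    start, end = int(cols[3]), int(cols[4])
    return {
        "seqname": cols[0],
        "source": cols[1],
        "feature": cols[2],
        "start": start,
        "end": end,
        "strand": cols[6],
        # gtf coordinates are 1-based and include the end, hence the +1.
        "width": end - start + 1,
        "attributes": attributes,
    }


# Made-up annotation line for illustration.
gtf_line = ('chr1\texample\tgene\t1000\t2000\t.\t+\t.\t'
            'gene_id "GENE0001"; gene_name "EXAMPLE1"; gene_biotype "protein_coding";')

gene = parse_gtf_line(gtf_line)
print(gene["feature"], gene["width"], gene["attributes"]["gene_name"])
```

Width is not stored in the file; it is computed from start and end, and the +1 is the opposite of the BED convention.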

.txt

Okay, I know it’s not a bioinformatics file per se, but we use it a lot, so it might as well make the list. Text files are the default output format for analyses that don’t have a specific output file format.

For instance, using the `samtools` package gives you either a .sam or .bam file, or another specified format like .pileup. But running a simple manipulation command like cutting out columns (with `cut`) or picking specific patterns (with `awk` or `grep`) usually produces a standard output (displays on the screen) that you might want to save for future use. That’s where .txt comes in.

The good thing about text files is that they are easy to view and manipulate (like converting them to other file formats).

.gz

This isn’t necessarily a bioinformatics file format, but more of a storage file format. It’s the gzip-compressed version of any kind of file, biological data included (despite the nickname, it is not the same thing as a .zip archive). As I’ve mentioned, biodata is heavy in real-life instances, and therefore much easier to store and move around when compressed. Decompressing files is straightforward and convenient, both on the command line and on a PC, and many bioinformatics tools can read .gz files directly, so there’s no reason not to compress.
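You often don’t even need to decompress a file to disk before reading it. The sketch below uses Python’s built-in `gzip` module to write a tiny compressed fasta file and read it straight back; the file name and its content are made up for illustration.

```python
import gzip
import os
import tempfile

# Made-up fasta content for illustration.
fasta_text = ">seq1 made-up example\nACGTACGT\n"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.fasta.gz")

    # "wt" opens the compressed file for writing text.
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(fasta_text)

    # "rt" reads it back as text, decompressing on the fly.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        restored_text = handle.read()

print(restored_text == fasta_text)
```

The same `gzip.open(path, "rt")` call works for compressed fastq, vcf, or gtf files, since they are all text underneath.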

.pdb

You didn’t think I’d only talk about sequence data, did you? Protein data exist as amino acid sequences and as three-dimensional structures in the Protein Data Bank. The 3D structures are stored and retrieved in .pdb format, a text file that lists the coordinates of each atom, and are often used for molecular analysis like computational drug discovery.
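To give a feel for what sits inside a .pdb file, here is a hedged sketch of reading one ATOM line. It assumes the fixed-width column layout of the legacy PDB format (for example, x, y, and z in columns 31 to 54). The function name `parse_atom_line` is hypothetical, and the sample line is built in code with illustrative values so the column widths are explicit.

```python
from typing import Dict

# Build one illustrative ATOM line with explicit column widths.
sample_atom = (
    f"{'ATOM':<6}{1:>5} {' N':<4}{'':1}{'MET':>3} {'A':1}{1:>4}{'':1}"
    + " " * 3
    + f"{27.340:>8.3f}{24.430:>8.3f}{2.614:>8.3f}{1.00:>6.2f}{9.67:>6.2f}"
    + " " * 10
    + f"{'N':>2}"
)


def parse_atom_line(line: str) -> Dict[str, object]:
    """Slice a fixed-width ATOM record into named fields."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "atom_name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21],
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "element": line[76:78].strip(),
    }


atom = parse_atom_line(sample_atom)
print(sample_atom)
print(atom["res_name"], atom["chain"], atom["x"], atom["y"], atom["z"])
```

Because the format is positional rather than delimited, splitting on whitespace can break when columns run together, which is why the sketch slices by position instead.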

More file formats exist, like .json, .csv, and .xml, but those are more computational than biological, so I’ll leave them out. Being familiar with file formats is one crucial part of being a bioinformatics scientist, because it gives you the first clue about what has been done to the data, what you can do with it, and what you may need to use it.

The images I’ve attached are directly from a command line, which is not always a nice way to view and understand certain files, but you get used to it with time. If you’re familiar with R and RStudio, they can be useful tools to give a better view of the data, especially those that you can read in as dataframes, like .gtf and .gff.

Conclusion

These file formats only exist because we keep finding new ways to simplify our jobs and use newly developed software packages. All I can say is, this is nowhere near the end of file formats we’ll be using for bioinformatics analysis, so I will definitely keep updating this list as I find more. Thanks for reading and letting me plant my seed inside you (get it? It’s another Dwight Schrute reference).

If you’ve used a peculiar file format that I’ve skipped, let me know. See you in the comment section or my next article.