SAM files processing and variants calling in bacterial genomes

Processing and analysis of SAM files is a key step in many bioinformatics pipelines. It is essential for both biomedical and comparative bioinformatics. As we explained before, SAM files contain coordinates of reads aligned (mapped) to a reference sequence. Reference sequences can be a single assembled genome or a set of contigs or scaffolds. SAM files can be produced by various read mappers — for example, Bowtie2, BWA, BBmap. In this tutorial, we continue to work with the drug-resistant strains of Mycobacterium tuberculosis and demonstrate how to post-process, sort and clean SAM files with mapped reads and do basic variant (SNP) calling with varcall from ea-utils.

A SAM file is a TAB-delimited file. It has header lines, which start with the ‘@’ symbol, and a tag (HD, SQ, RG, PG), and alignments. Each alignment has the following columns:

  1. QNAME — Query template/pair NAME
  2. FLAG — bitwise FLAG
  3. RNAME — Reference sequence NAME
  4. POS — 1-based leftmost POSition/coordinate of clipped sequence
  5. MAPQ — MAPping Quality (Phred-scaled)
  6. CIGAR — extended CIGAR string
  7. MRNM — Mate Reference sequence NaMe (‘=’ if same as RNAME)
  8. MPOS — 1-based Mate POSistion
  9. LEN — inferred Template LENgth (insert size)
  10. SEQ — query SEQuence, on the same strand as the reference
  11. QUAL — query QUALity (ASCII-33 gives the Phred base quality)
  12. OPT — variable OPTional fields in the format TAG:VTYPE:VALUE

It is important to understand that Bowtie2 and analogous bioinformatics tools retain the same order of reads in the SAM file as they appear in the original FASTQ/FASTA files used in mapping. Nevertheless, it is possible to analyze SAM files much faster if reads are listed in the order of coordinates in the reference genome. The process of reads re-ordering is called sorting. So, the raw SAM-files need sorting. Remember, most downstream bioinformatics tools take only sorted SAM/BAM files. Sorting is done based on column 3 (RNAME) and 4 (POS) of the SAM file (so, the reference sequence name and a position). When sorting is defined, a SAM file will also contain @SQ lines defining the exact order of sorting.

Once reads are sorted, the mapped reads can be cleaned of reads of low quality. For instance, we may want to discard reads that weren’t aligned to any coordinates of a reference genome or alternatively, filter reads based on the quality of the alignment.

We will use a toolset called samtools to carry out sorting and cleaning. Once such pre-processing is complete, clean SAM files can be used to produce so-called vcf-files. VCF files contain information about all varying positions (SNPs) in a sample, relative to a reference genome. In this tutorial, we will use a tool called varcall from ea-utils tools to carry out SNP calling from the SAM files.

In this tutorial you will need a SAM-file and a reference genome. We will continue to work with the genetic information of drug-resistant strains of Mycobacterium tuberculosis. In the previous tutorials in this series we processed raw reads from bacterial genomes and obtained SAM-files with the coordinates of aligned reads relative to the reference genome. To follow all of our steps in the process you can check out the tutorials “NGS reads quality control and filtering” and “NGS reads alignment on reference genome”.

Read more: