MM and ML tags in mod-basecalled BAMs using Remora

Shloka Negi
Nov 14, 2022


MM and ML are optional SAM/BAM tags that record base modifications. They are written by DNA mod-basecalling tools (e.g. Guppy, Megalodon, Bonito) run on nanopore long reads.

This blog focuses on understanding the MM and ML tags in BAM files generated by Remora when detecting both 5mC and 5hmC mods (DUAL mode).

Brief Background

Remora

Remora was launched by Oxford Nanopore Technologies (ONT) in 2022, with the claim that it provides the highest-accuracy methylation detection from PCR-free long-read nanopore data. Because it needs only a relatively simple training dataset, Remora can also detect infrequent mods like 5hmC with high accuracy.

Remora has 2 modes:
1. SINGLE mode — Detects 5mC mods only
2. DUAL mode — Detects 5mC and 5hmC mods both

5mC and 5hmC modifications

DNA mods: 5mC and 5hmC

5-methylcytosine (5mC) and 5-hydroxymethylcytosine (5hmC) are common DNA modifications, often called the 5th and 6th bases respectively, since 5hmC is the second most common DNA mod after 5mC. 5hmC is an oxidation product of 5mC.

Explanation of MM and ML tag formats in mod-basecalled BAM

The detailed documentation for the MM and ML tags is in Section 1.7 of the SAM optional fields specification (SAMtags).
I will explain these tags through an example using modBAMs generated from T2T-CHM13 nanopore sequencing data, mod-basecalled with the Guppy basecaller using Remora (SINGLE and DUAL modes).

1. Number of ML values for a given read in DUAL mode = 2*(Number of ML values for a given read in SINGLE mode)
2. MM values for a given read in DUAL mode = MM values for a given read in SINGLE mode

For example -

This is what actual MM and ML tag values would look like for a given read in DUAL mode and SINGLE mode.

For MM:

Represents positions of nucleotides (here Cytosine) which might be modified.

C+h: This means a Cytosine [C] on the same strand as the sequenced read [+] is potentially modified to 5hmC [h].
C+m: This means a Cytosine [C] on the same strand as the sequenced read [+] is potentially modified to 5mC [m].

Each value denotes the number of Cs to skip before reaching the next potentially modified C. For example, C+h:1,2,0,... means: the 1st C is skipped (value 1) and the 2nd C is potentially modified; the 3rd and 4th Cs are skipped (value 2) and the 5th C is potentially modified; no C is skipped (value 0), so the 6th C is potentially modified as well, and so on.
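The skip-count encoding above can be sketched in a few lines of Python. This is only an illustration, not code from Remora or Guppy: the function name `mm_deltas_to_read_positions` and the toy sequence are hypothetical, and the sketch assumes the sequence is given in its originally sequenced orientation so the Cs can be counted directly.

```python
def mm_deltas_to_read_positions(seq, deltas, base="C"):
    """Turn MM skip counts into 0-based positions in the read sequence.

    Each delta is the number of `base` occurrences to skip before the
    next potentially modified one.
    """
    # 0-based positions of every occurrence of the canonical base in the read
    base_positions = [i for i, b in enumerate(seq) if b == base]

    positions = []
    idx = -1  # index (into base_positions) of the last reported base
    for delta in deltas:
        idx += delta + 1  # skip `delta` bases, then land on the next one
        positions.append(base_positions[idx])
    return positions


# Toy read (hypothetical) with six Cs at positions 1, 3, 4, 6, 8 and 9
seq = "ACGCCTCGCC"
print(mm_deltas_to_read_positions(seq, [1, 2, 0]))  # [3, 8, 9]
```

With the skip counts 1,2,0 the function reports the 2nd, 5th and 6th Cs, which is the same reading as in the worked example above.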

What’s interesting to note here is that in DUAL mode outputs, C+h and C+m carry exactly the same values. This means that the candidate Cytosines are the same for both mods, and the probability values (ML) are what determine whether each of these cytosines is a 5mC mod, a 5hmC mod or unmodified.

Also, the MM tag from the SINGLE mode output gives exactly the same values as the DUAL mode output for a selected read. This tells us that MM positions remain the same irrespective of using the SINGLE or DUAL mode.

For ML:

Represents the probability of each modification listed in MM being correct, in order of occurrence, stored as an integer in the range [0, 255].

In DUAL mode, we see double the number of ML values as seen in SINGLE mode. This is because:

The first half of ML values correspond to C+h MM values.
The second half of ML values correspond to C+m MM values.
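As a rough sketch of this split, the snippet below pairs each modification in an MM string with its share of the ML values. The function name `split_ml_by_modification` and the numbers are made up for illustration. It assumes the MM string uses the SAM specification layout (`C+h,1,2,0;C+m,1,2,0;`, with a comma after the code and a semicolon closing each entry) and that each modification has its own entry, as in the DUAL mode output described here.

```python
def split_ml_by_modification(mm_tag, ml_values):
    """Pair each modification in an MM string with its slice of ML values."""
    per_mod = {}
    offset = 0
    # Entries are separated by ';', e.g. "C+h,1,2,0;C+m,1,2,0;"
    for entry in mm_tag.strip().split(";"):
        if not entry:
            continue  # the trailing ';' leaves an empty string
        fields = entry.split(",")
        code = fields[0].rstrip("?.")  # drop the optional '?' or '.' flag
        deltas = [int(x) for x in fields[1:]]
        # ML holds one value per delta, in the same order as the MM entries
        per_mod[code] = {
            "deltas": deltas,
            "ml": list(ml_values[offset:offset + len(deltas)]),
        }
        offset += len(deltas)
    if offset != len(ml_values):
        raise ValueError("ML length does not match the number of MM positions")
    return per_mod


# Hypothetical tag values for one read in DUAL mode
mm = "C+h,1,2,0;C+m,1,2,0;"
ml = [253, 4, 10, 0, 240, 12]
result = split_ml_by_modification(mm, ml)
print(result["C+h"]["ml"])  # first half of ML
print(result["C+m"]["ml"])  # second half of ML
```

The length check at the end is a cheap sanity test: the number of ML values should equal the total number of skip counts across all MM entries.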

The following figure illustrates it more clearly… I have divided the ML column into 2 parts for easy visualization.

As we can see for DUAL mode, corresponding to MM value = 1, there is a 253/256 chance [red] of that C being a 5hmC and a 0/256 chance [green] of that C being a 5mC, which leaves a 3/256 chance of that C being unmodified.
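The same arithmetic can be written as a small helper. This is a sketch with a hypothetical function name, `classify_cytosine`, and it follows the simple reading used in this post, where an ML value N is treated as a chance of N/256. Strictly, the SAM specification maps N to the probability interval from N/256 to (N+1)/256.

```python
def classify_cytosine(ml_h, ml_m):
    """Return (most likely state, chances out of 256) for one candidate C."""
    chances = {
        "5hmC": ml_h,
        "5mC": ml_m,
        # whatever is left over is attributed to the unmodified base
        "unmodified": 256 - ml_h - ml_m,
    }
    label = max(chances, key=chances.get)
    return label, chances


# ML values for the first candidate C in the example: 253 (C+h) and 0 (C+m)
label, chances = classify_cytosine(253, 0)
print(label, chances)
```

Picking the largest of the three chances is just one possible way to make a call. Downstream tools may apply their own thresholds instead.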

The following figure is an even better visualization….

Thus, understanding MM and ML tags is particularly important when using Remora DUAL mode to detect both 5mC and 5hmC modifications, so that downstream tools interpret the two sets of probabilities correctly.

Hope this was useful…..



Shloka Negi

Pursuing Ph.D. in Biomolecular Engineering and Bioinformatics from the University of California, Santa Cruz | IIT-BHU (Batch 2022) President Gold Medalist.