Bioinformatics 1: K-mer Counting

A challenging yet intriguing interdisciplinary problem

Gunavaran Brihadiswaran
The Startup


K-mer counting is an interesting yet challenging problem in bioinformatics. In this article, we’ll talk about what k-mers are, the problem of k-mer counting, its applications, and some interesting insights from the computer science perspective.

What are k-mers?

In simple terms, k-mers are substrings of length k in a given string (which can be DNA, RNA, protein, or any other sequence of characters). Since our interest is in bioinformatics, we will focus on k-mers in a DNA sequence.

Consider the DNA sequence “ACGAGGTACGA” which consists of 11 nucleotides. Let’s try to obtain all the 4-mers (substrings of length 4) in this DNA sequence.

4-mers in the sequence ACGAGGTACGA

The idea is simple. We create a window of length 4 and slide it from left to right, shifting one character at a time. If the length of the given DNA sequence is N, we end up with N - k + 1 k-mers, one for each position at which the window can start.

Total no. of k-mers = N - k + 1
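
To make the sliding window concrete, here is a minimal Python sketch. It is only an illustration: the function name kmers and the variable names are our own choices, and real k-mer counting tools use far more memory-efficient techniques than a plain list and dictionary.

```python
from collections import Counter


def kmers(seq, k):
    """Return every k-mer of seq, left to right, one per window position."""
    # No window fits if k is non-positive or longer than the sequence.
    if k <= 0 or k > len(seq):
        return []
    # The window can start at positions 0 .. N - k, giving N - k + 1 k-mers.
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


dna = "ACGAGGTACGA"          # the example sequence, N = 11
four_mers = kmers(dna, 4)    # k = 4
counts = Counter(four_mers)  # how many times each distinct 4-mer occurs

print(len(four_mers))        # 8, which matches N - k + 1 = 11 - 4 + 1
print(counts["ACGA"])        # 2, since ACGA starts at the first and eighth positions
```

Note that the window yields eight 4-mers but only seven distinct ones, because ACGA occurs twice. Tallying such repeats is what k-mer counting is about.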

In the above example, the given DNA sequence is 11 characters long (N=11) and k = 4, thus we get eight…
