Vectorization of DNA sequences

Efficient conversion of ACGT sequences into k-mer frequency vectors

Published in

The Computational Biology Magazine

5 min readMar 31, 2020

In this article, I will present you with a fast method to convert DNA sequences into vectors. We will be going through the method and code since this step should not slow down much important downstream analysis.

DNA sequences are long strings of the alphabet ACGT. Since the corpus of the ACGT alphabet increases exponentially as k⁴ with k number of letters, usually DNA sequences are represented in the forms of vectors of small k (3, 4, 5, 7, 9). Vectorization is also an important step in feeding variable-length DNA sequences into machine learning models. This is because machine learning models rely on fixed-length vectors for processing. This has gotten very popular with the increasing availability of Nanopore and PacBio reads. The efficiency of the methodology will determine the time taken for the vectorization. In a reasonable dataset, we shall expect to see at least 1 million reads. This can further increase with high depth sequencing of metagenomic samples where the role of such vectors is prominent in scenarios such as binning.

The Size of the Vector

There are k⁴ possible distinct k-mers, however, in reality, lies nearly half once we merge the reverse complements. However, it is always wise to compute them on the fly since odd-sized k-mers and even-sized k-mers might behave a bit differently. The idea is to count k-mers and record the counts in a hash-map data structure. This way we can record k-mer counts with a single read into a hash-map without changing the size of the hash-map (Changing sizes in a c++ map will be an expensive task as it uses a balanced tree. Using an unordered map can save trouble, but the insertion order can have lead to varying order which we need to avoid). However, we can easily compute the set of k-mers like this. Here we are using an optimized version of reveres complementing.

Encoding Letters and Reverse Complements

A = 1000001
C = 1000011
G = 1000111
T = 1010100

We have 4 letters to encode in binary for efficiency. We need 2 bits for this purpose. We can see that the bits in bold resembles integers 0, 1, 3 and 2 respectively. Therefore, we use the operation X >> 1 & 3 to convert letters to the integers. Note that we shift by 1 bit and take bitwise AND operation to retain only two bits. We use the following code to get the reverse complement.

I will not explain the code line by line for simplicity. However, what basically happens is we reverse the bits in blocks of 2 bits and compute complements with bit operations. Hexadecimal values are used to make the code short. This approach can only handle k-mers up to k=15, which is far more than what's usually used for vectorization.

Creating a Vector Placeholder

As discussed before, we need to have a hash-map of all the k-mers relevant. This makes the process much faster by leaving us with O(1) operations. Let’s have a look at how the code would look like.

Note that we only pick the minimum valued k-mer out of all possible k-mers.

Computing the Vectors

Now that we looked at how one might compute all possible k-mers, reverse complements and encoding, let us look at how we can obtain vectors. First, let’s have a look at the code that would convert a single string to a vector. The explanation will follow the code.

Function Parameters

Reference to the string
Size of k
Whether to normalize or not (For future TF-IDF like computations if needed)
Set of all k-mers for fast hash-map copying

Function Implementation

First, we copy the set of k-mers to a local variable and initialize the hash-map. Then we initialize the vector to hold our final vector values, which is equal to the size of the hash-map. In the first for loop we iterate the string once and perform the bit operation to record the k-mer. Note that we skip letters that aren’t ACGT. We keep a length variable to skip such windows and consider the value of val when the recorded length is the size of expected k. When the size is met, we update the hash-map and continue the same.

In the second for loop we normalize (if normalize = true) or just record the vectors in the stats variable. Now we have a vector for the string we supply.

Note that in this function, no string copy operation is performed. All computations are mathematical/bitwise which makes things much faster. This can process millions of reads within minutes or seconds if multi-processing facilities are available. This can also be easily extended to fit into the domain of GPU programming (Let me leave that bit out for now).

The Complete Implementation

I hope you had a good idea as to how we can compute the vectors efficiently using C++. However, to make things easier, I have attached the complete code for you to use or modify to fit your requirements (code supports FASTQ i.e. reads line numbers 2, 6, 10, etc.).

Compiling

g++ vectorize.cpp -o vectorize -fopenmp --std=c++17

Running

./vectorize reads.fastq K-SIZE THREADS BATCH_SIZE > vectors.txt

Complete Program

I hope this code will help you improve your research work and experimentation efficiency.

Thanks for reading. Cheers! 😊