DNA STEGANOGRAPHY AND THE ROLE OF SNPS

SIAM Student's Chapter VIT Bhopal
Jun 20, 2023

DNA sequences have attracted significant attention as a medium for storing information, solving problems, and encrypting messages, since the four nucleotides can represent quaternary (base-4) digits. DNA cryptography, the encryption of messages as DNA, has also been used to hide secret messages. Clelland et al. devised a method to conceal encrypted messages by converting them into quaternary digit strings and then into the corresponding nucleotide sequences. These sequences, flanked by specific primer binding sites, were mixed with fragmented human genomic DNA so that the genome acted as background noise obscuring the secret message.
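The conversion from text to quaternary digits to nucleotides can be illustrated with a minimal sketch. The digit-to-nucleotide mapping (0=A, 1=C, 2=G, 3=T) and the function names `text_to_dna` and `dna_to_text` are assumptions made for illustration; the article does not specify the mapping used by Clelland et al., and a real scheme would also encrypt the message first.

```python
# Hypothetical mapping of base-4 digits to nucleotides (0=A, 1=C, 2=G, 3=T).
DIGIT_TO_NT = "ACGT"


def text_to_dna(message: str) -> str:
    """Encode each UTF-8 byte as four base-4 digits, then map digits to nucleotides."""
    out = []
    for byte in message.encode("utf-8"):
        # One byte is 8 bits, i.e. four 2-bit (base-4) digits, most significant first.
        for shift in (6, 4, 2, 0):
            out.append(DIGIT_TO_NT[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_text(dna: str) -> str:
    """Reverse of text_to_dna: four nucleotides back into one byte."""
    if len(dna) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    data = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for nt in dna[i:i + 4]:
            byte = (byte << 2) | DIGIT_TO_NT.index(nt)
        data.append(byte)
    return data.decode("utf-8")


print(text_to_dna("A"))  # 'A' is 65 = 01 00 00 01 in binary, so this prints CAAC
```

With this mapping every character costs four nucleotides, which is why the capacity of a hiding place matters later in the article.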

To decipher the message, a specific primer set is required for polymerase chain reaction (PCR) and sequencing. However, advances in Next-Generation Sequencing (NGS) technology have made it easier to detect secret messages hidden in this way, rendering the approach unsuitable for concealing information within a genome. A new method of DNA steganography was therefore proposed, which hides secret messages in variable positions of the genome known as single nucleotide polymorphisms (SNPs). The message is first encrypted into a nucleotide sequence, as in other DNA cryptography techniques, and that sequence is then written into the SNP positions. Since SNPs occur naturally and are polymorphic, it becomes difficult to tell whether a given nucleotide is a natural SNP variant or part of an encrypted message.

More SNPs have been discovered in humans than in other organisms, providing ample opportunities for DNA steganography. To exploit this, an extensive search was conducted to identify SNPs at which any of the four nucleotides (A/T/G/C) can occur. Pathogenic SNPs were discarded, and only SNPs whose surrounding 21-nucleotide sequence is unique within the human genome were kept. SNPs located within transposable elements, CpG islands, or conserved regions were also eliminated. Dfam, the Sequence Manipulation Suite, and PhastCons were used, respectively, to identify SNPs in transposable elements, predict CpG islands, and identify SNPs in conserved regions. A total of 275,967 SNPs were ultimately chosen.
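The selection criteria above amount to a chain of filters. The sketch below assumes that each SNP has already been annotated (for example from Dfam, the Sequence Manipulation Suite, and PhastCons results) and only shows how the criteria combine. The `SnpRecord` type, its field names, and the example IDs are hypothetical and do not come from any particular database.

```python
from dataclasses import dataclass

ALL_FOUR = frozenset("ACGT")


@dataclass(frozen=True)
class SnpRecord:
    """Hypothetical, pre-annotated SNP record (field names are illustrative)."""
    snp_id: str
    alleles: frozenset              # nucleotides observed at this position
    pathogenic: bool                # flagged as disease-associated
    context_is_unique: bool         # 21-nt window occurs once in the genome
    in_transposable_element: bool
    in_cpg_island: bool
    in_conserved_region: bool


def is_usable(snp: SnpRecord) -> bool:
    """Apply the article's criteria: four alleles, benign, unique, outside excluded regions."""
    return (
        snp.alleles == ALL_FOUR
        and not snp.pathogenic
        and snp.context_is_unique
        and not snp.in_transposable_element
        and not snp.in_cpg_island
        and not snp.in_conserved_region
    )


def select_snps(records):
    """Keep only the SNPs that pass every filter, preserving input order."""
    return [r for r in records if is_usable(r)]
```

Requiring all four alleles is what lets a single SNP carry one full quaternary digit without looking unusual.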

In theory, all of these SNPs can be used to store encrypted messages. However, current genome editing technologies, including CRISPR/Cas, are not capable of editing many scattered genomic sites simultaneously. To overcome this limitation, SNP hotspots were identified so that the modification could be confined to a short stretch of DNA. Regions with at least 35 SNPs within 1 kb were selected as hotspots. If at least two hotspots were available, the encrypted sequence could be inserted through two iterations of genetic recombination. Five SNP hotspots met the criterion of at least 35 SNPs within a 1 kb region.
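One way to search for such hotspots is a sliding window over the sorted SNP coordinates of a chromosome. This is a minimal sketch, not the procedure used in the original study: the function name `find_hotspots`, the greedy non-overlapping windows, and the interpretation of "within 1 kb" as positions spanning less than 1,000 bases are all assumptions.

```python
def find_hotspots(positions, window=1000, min_snps=35):
    """Return (first_pos, last_pos, snp_count) for non-overlapping hotspot windows.

    positions: SNP coordinates on a single chromosome.
    A window starts at a SNP and covers every SNP less than `window` bases after it.
    """
    pos = sorted(positions)
    hotspots = []
    i = 0
    while i < len(pos):
        # Extend j past every SNP that falls inside the window starting at pos[i].
        j = i
        while j < len(pos) and pos[j] - pos[i] < window:
            j += 1
        count = j - i
        if count >= min_snps:
            hotspots.append((pos[i], pos[j - 1], count))
            i = j       # jump past this window so hotspots do not overlap
        else:
            i += 1      # too sparse: try the window starting at the next SNP
    return hotspots


# 35 SNPs spaced 10 bases apart form one hotspot; the two distant SNPs do not.
demo = list(range(100, 450, 10)) + [5000, 9000]
print(find_hotspots(demo))  # [(100, 440, 35)]
```

A hotspot of 35 SNPs holds 35 quaternary digits, so message capacity per hotspot is small and the choice of threshold trades capacity against the number of candidate regions.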

To decrypt the hidden message within the genome, the user must possess the encryption table and know which SNP positions were used. Decryption is the reverse of encryption. First, if the message is hidden within an SNP hotspot, the region can be sequenced easily because the hotspot is only 1 kb long, and the nucleotides at the predefined SNP positions are combined into a 1D DNA sequence. Second, the DNA sequence is rearranged into a 2D format in which each row contains 10 nucleotides (9 for the message and 1 for error checking). Additional parity nucleotides are used to detect mutational changes. Third, if no mutations are detected, the message nucleotides in the main body are rearranged back into a 1D sequence. Finally, much as in mRNA translation, DNA triplets are translated into characters using the encryption table, which yields the decrypted message.
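The decryption steps can be sketched as follows. Several details are assumptions made for illustration, because the article does not specify them: the parity nucleotide of each row is taken to be the sum of the nine message nucleotides modulo 4 (with A=0, C=1, G=2, T=3), only per-row parity is checked (the additional parity nucleotides are omitted), and `DEMO_TABLE` is a tiny hypothetical encryption table rather than a real shared secret.

```python
NT_VALUE = {"A": 0, "C": 1, "G": 2, "T": 3}
VALUE_NT = "ACGT"

# Hypothetical encryption table; a real one would cover all 64 triplets.
DEMO_TABLE = {"AAA": "D", "AAC": "N", "AAG": "A", "AAT": " "}


def read_snps(region: str, offsets) -> str:
    """Step 1: collect the nucleotides at the predefined SNP offsets (0-based)."""
    return "".join(region[o] for o in offsets)


def parity_nt(block: str) -> str:
    """Assumed parity rule: sum of nucleotide values modulo 4, as a nucleotide."""
    return VALUE_NT[sum(NT_VALUE[n] for n in block) % 4]


def decode(seq: str, table, row_len: int = 10) -> str:
    """Steps 2 and 3: split into rows, verify parity, then translate triplets."""
    if len(seq) % row_len:
        raise ValueError("sequence length must be a multiple of the row length")
    body = []
    for row_no, start in enumerate(range(0, len(seq), row_len)):
        row = seq[start:start + row_len]
        data, check = row[:-1], row[-1]   # 9 message nucleotides + 1 parity
        if parity_nt(data) != check:
            raise ValueError(f"parity mismatch in row {row_no}: possible mutation")
        body.append(data)
    flat = "".join(body)                  # back to a 1D message sequence
    # 9 message nucleotides per row means flat always splits cleanly into triplets.
    return "".join(table[flat[i:i + 3]] for i in range(0, len(flat), 3))


print(decode("AAAAACAAGT", DEMO_TABLE))  # values sum to 3, parity T, prints DNA
```

A single-row parity of this kind detects any one substitution in a row but cannot locate or correct it, which is one reason a scheme might add further parity nucleotides across rows.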
