The Arrival of DNA Data Storage

Sparsh Gupta
Published in CodeX
8 min read · Jul 17, 2021

Computation is a fundamental primitive of information processing carried out by computers, and it requires the storing and retrieving of data. When computations are performed by biomolecules such as DNA and proteins, the result is ‘biomolecular computing’, which can be classified into DNA, RNA, bacterial, membrane, and peptide computing. One of the most popular emerging areas is DNA computing, where computation is carried out using DNA strands. In a classical computer, a computational problem is solved by circuits that produce an output in the form of binary strings of 0s and 1s. In DNA computing, by contrast, the problem is encoded into DNA strings over A, C, G, and T, the computation is carried out in test tubes, and the output is again a set of DNA strands. Applications of DNA computing include DNA data storage systems, DNA tiles, DNA nanostructures, DNA circuits, DNA strand displacement, and DNA cryptography, all of which pose interesting mathematical problems.

Computations in all of these areas need one crucial thing to work on: data. We are producing so much of it that a data explosion crisis will be very hard to handle in the near future. For example, since 2016, humans have produced more than 90% of the world’s current total data, and we create a massive 2.5 quintillion (2.5 × 10¹⁸) bytes of data each day. A nice estimate of the data explosion is given by Eric Schmidt:

“From the dawn of civilization until 2003, humankind has generated five exabytes of data. Now we produce five exabytes every two days and the pace is accelerating.” — Eric Schmidt, Former-Chairman Google, August 2010

These figures keep surging, so the question to ask is: how big is this data? A terabyte (10¹² bytes) hard disk can hold about 200,000 photos, whereas an exabyte (10¹⁸ bytes) would take up 2000 cabinets and fill a 2-story data center occupying a whole city block. Finally, a yottabyte (10²⁴ bytes) would cover the US states of Delaware and Rhode Island with a million data centers. What about the cost of storing data? One can buy a 1 terabyte hard drive today for US$ 50, and a yottabyte is 10¹² terabytes, so a yottabyte of storage would cost a tremendous US$ 50 trillion. For comparison, the GDP of the USA was around US$ 21 trillion in 2019. Clearly, storing a yottabyte is not viable, as it would cost more than the output of some of the biggest economies in the world! We conclude that storage space and cost are burning issues that need to be solved.
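The cost estimate above is simple arithmetic. The short sketch below reproduces it, using only the figures quoted in the text (US$ 50 per terabyte drive and a 2019 US GDP of about US$ 21 trillion); the variable names are illustrative.

```python
# Back-of-the-envelope cost of storing one yottabyte on 1 TB hard drives.
TERABYTE = 10**12            # bytes
YOTTABYTE = 10**24           # bytes
PRICE_PER_TB_USD = 50        # price of a 1 TB drive, as quoted in the text
US_GDP_2019_USD = 21 * 10**12  # approximate figure quoted in the text

drives_needed = YOTTABYTE // TERABYTE            # number of 1 TB drives
total_cost_usd = drives_needed * PRICE_PER_TB_USD
gdp_ratio = total_cost_usd / US_GDP_2019_USD     # cost relative to US GDP

print(f"Drives needed: {drives_needed:,}")
print(f"Total cost: US$ {total_cost_usd // 10**12} trillion")
print(f"Cost relative to 2019 US GDP: {gdp_ratio:.1f}x")
```

The estimate ignores power, housing, and replacement costs, so the real figure would only be higher.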

The next-generation storage medium that comes into play is DNA. One gram of DNA can store up to 455 exabytes of data for thousands of years. The current size of the internet is approximately 700 exabytes, so 2 grams of DNA would be sufficient to store the whole internet, which is remarkable.
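As a quick check of the "2 grams" claim, the sketch below divides the size of the internet by the storage density of DNA. It assumes the figure of 700 quoted above is in exabytes, which is the unit that makes the claim consistent with 455 exabytes per gram.

```python
import math

EXABYTES_PER_GRAM = 455   # DNA storage density quoted in the text
INTERNET_SIZE_EB = 700    # assumption: the quoted size is in exabytes

grams_exact = INTERNET_SIZE_EB / EXABYTES_PER_GRAM
grams_needed = math.ceil(grams_exact)   # round up to whole grams

print(f"{grams_exact:.2f} g exactly, so {grams_needed} g is enough")
```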

A DNA data storage system begins by encoding binary data into DNA strings. This DNA is then synthesized in a lab and stored in test tubes. To read the data back, the DNA is retrieved from the test tubes and sequenced, and the resulting DNA strings are decoded into binary strings (see Fig.1). DNA synthesis is simply the writing or creation of DNA molecules, and DNA sequencing is the process of reading the order of the DNA nucleotides. During the writing, storage, and reading of data, however, errors such as base substitutions, deletions, and insertions do occur, and this is where coding theory comes in to correct them.

Fig.1: The working mechanism of a DNA Data Storage System.
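To make the three error types mentioned above concrete, here is a minimal sketch that applies each of them to a short DNA string. The function names and the example strand are illustrative and are not taken from any of the cited papers.

```python
def substitute(strand: str, pos: int, base: str) -> str:
    """Replace the base at index pos with another base."""
    return strand[:pos] + base + strand[pos + 1:]


def delete(strand: str, pos: int) -> str:
    """Drop the base at index pos; everything after it shifts left."""
    return strand[:pos] + strand[pos + 1:]


def insert(strand: str, pos: int, base: str) -> str:
    """Add an extra base before index pos; everything after it shifts right."""
    return strand[:pos] + base + strand[pos:]


original = "ACGTAC"
print(substitute(original, 2, "T"))  # ACTTAC
print(delete(original, 2))           # ACTAC
print(insert(original, 2, "T"))      # ACTGTAC
```

A substitution keeps the length of the strand, while a deletion or an insertion changes it and shifts every later base, which is one reason these errors need dedicated error-correcting codes.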

The first major DNA data storage experiment, which successfully encoded a 65, was done by George Church ¹ and colleagues in 2011. In 2013, Goldman et al. ² proposed a model that encoded data of over five million bits with the use of an error-correcting scheme. Multiple algorithms to store data on DNA were developed later on, but we’ll be looking at an early one of them, which developed a memory-efficient version of the Goldman model ³. Here’s an example to encode the string NATURE as a text file into a DNA string:

Fig.2: The encoding of the string NATURE into ternary codewords is demonstrated.

In Fig.2, we start by encoding each character of the text as its ASCII value. ASCII, a character encoding scheme for electronic communication, maps digits (0–9), English letters (upper and lower case), and special characters to numbers; in its extended 8-bit form these values span 0–255. Any computer file can therefore be transformed into a list of such values. Each value is then converted into a ternary codeword made of trits (base-3 digits 0, 1, and 2), with every ASCII value being assigned a codeword of length 11 (see Table 2 in ³). The complete ternary string for NATURE, of length 66, is called G1. We also take the ternary codewords for the separators “.” and “:” and store them as strings G2 and G4, respectively. The file extension “txt” and the file size of G1 are encoded into trits as well, giving the strings G3 and G5 (see Fig.3).

Fig.3: The ternary codewords for separators, file extension, and the file size.
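The first step of Fig.2 and the length of G1 can be checked with a few lines of Python. This sketch only computes the ASCII values and the resulting length; it does not reproduce the actual codeword table of ³, which is not given in this article.

```python
text = "NATURE"

# ASCII value of every character in the text.
ascii_values = [ord(ch) for ch in text]
print(ascii_values)  # [78, 65, 84, 85, 82, 69]

# Every ASCII value is assigned a ternary codeword of length 11 (per the text).
TRITS_PER_CHARACTER = 11
g1_length = len(text) * TRITS_PER_CHARACTER
print(g1_length)     # 66
```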

Note that the total combined length of strings G1, G2, G3, G4, G5 is 143. Now, we need to add 55 zeros to this string such that the total length is divisible by 99. (See Fig.4)

Fig.4: The total length of the file & the padding of zeros are exhibited.
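The number of padding zeros follows from a modulo computation. The helper below, whose name is illustrative, returns how many zeros must be appended so that the length becomes a multiple of 99.

```python
BLOCK_LENGTH = 99  # chunk length used in the text


def zeros_to_pad(length: int, block: int = BLOCK_LENGTH) -> int:
    """Number of zeros to append so that length becomes a multiple of block."""
    return (-length) % block


combined_length = 143                 # total length of G1..G5 from the text
padding = zeros_to_pad(combined_length)

print(padding)                        # 55
print(combined_length + padding)      # 198
```

A length that is already a multiple of 99 needs no padding, since the function then returns 0.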

In Fig.6, we now obtain a final string G7 of length 198 consisting of ternary subcodes. To store this string in DNA, we now convert G7 into a DNA string G8 using the table in Fig.5.

Fig.5: The algorithm table used to convert the ternary codeword string into a DNA string.

The conversion looks at two things: the nucleotide that was written last and the next trit to be encoded. For example, suppose the previous base is C and the next trit is 1. Looking up this pair in the table gives the base T, so the trit 1 is converted into the nucleotide T. The process continues in the same fashion until we obtain the complete DNA string. (See Fig.6)

Fig.6: The conversion of the ternary codeword string into a DNA string.
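The lookup described above can be sketched in a few lines. The table of Fig.5 is not reproduced in this text, so the mapping below is an assumption: it is the rotating code of Goldman et al. ², which agrees with the example of C followed by trit 1 giving T and never writes the same base twice in a row. The starting base "A" and the function names are also assumptions made for illustration.

```python
# Assumed table: NEXT_BASE[previous_base][trit] gives the base to write.
# Each row lists the three bases that differ from the previous one.
NEXT_BASE = {
    "A": "CGT",
    "C": "GTA",
    "G": "TAC",
    "T": "ACG",
}


def trits_to_dna(trits: str, prev: str = "A") -> str:
    """Convert a string of trits (0, 1, 2) into a DNA string."""
    bases = []
    for trit in trits:
        prev = NEXT_BASE[prev][int(trit)]  # look up (previous base, trit)
        bases.append(prev)
    return "".join(bases)


def dna_to_trits(dna: str, prev: str = "A") -> str:
    """Invert trits_to_dna by reading the table backwards."""
    trits = []
    for base in dna:
        trits.append(str(NEXT_BASE[prev].index(base)))
        prev = base
    return "".join(trits)


print(trits_to_dna("1", prev="C"))  # T
print(trits_to_dna("0120"))         # CTGT
print(dna_to_trits("CTGT"))         # 0120
```

Because the next base always differs from the previous one, the output contains no repeated neighbouring bases, whatever the input trits are.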

In Fig.7, the DNA string so obtained is now cut into two chunks, each of length 99. These two chunks are subsequently affixed with chunk identifiers CGTA and CGAG, respectively. We now acquire the final encoded DNA strings!

Fig.7: The two DNA chunks are shown.
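The chunking step can be sketched as follows. The text says the identifiers are affixed to the chunks without stating where, so placing them at the front is an assumption here. The input string is a placeholder of the right length, not the actual encoding of NATURE, and the function name is illustrative.

```python
CHUNK_LENGTH = 99


def split_into_chunks(dna, identifiers, chunk_length=CHUNK_LENGTH):
    """Cut dna into equal chunks and attach one identifier to each chunk."""
    if len(dna) % chunk_length != 0:
        raise ValueError("DNA length must be a multiple of the chunk length")
    chunks = [dna[i:i + chunk_length] for i in range(0, len(dna), chunk_length)]
    if len(chunks) != len(identifiers):
        raise ValueError("need exactly one identifier per chunk")
    # Assumption: the identifier is placed in front of its chunk.
    return [ident + chunk for ident, chunk in zip(identifiers, chunks)]


placeholder = "ACGT" * 49 + "AC"   # 198 bases, stands in for the real string
labelled = split_into_chunks(placeholder, ["CGTA", "CGAG"])

for strand in labelled:
    print(strand[:4], len(strand))
```

Each labelled strand has 99 + 4 = 103 bases, and the identifier tells the decoder which chunk it is reading.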

The field of DNA data storage poses many interesting mathematical problems. For example, the map described in Fig.5 can be extended to other alphabets such as binary ⁴ or quaternary, or more generally to other algebraic structures such as finite fields and finite rings. We need to make sure that the map preserves properties such as the absence of homopolymers (runs of the same base), constraints on DNA string structures, and other biological constraints, so that errors in DNA storage can be minimized. Providing secure storage of DNA strings is also an interesting mathematical challenge. Another important problem is finding encoding methods that approach both the maximum physical density and the theoretical data density of DNA data storage. Many error models and encoding schemes have already been studied by researchers.
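One of the constraints mentioned above, the absence of homopolymers, is easy to test for. The sketch below is an illustrative check, not a method from the cited papers: it reports whether a DNA string contains a run of identical bases of at least a given length, with the default of 2 flagging any repeated neighbouring base.

```python
def has_homopolymer(strand: str, min_run: int = 2) -> bool:
    """Return True if strand contains min_run or more identical bases in a row."""
    run = 1
    for prev, cur in zip(strand, strand[1:]):
        run = run + 1 if cur == prev else 1  # extend or restart the current run
        if run >= min_run:
            return True
    return False


print(has_homopolymer("CGTA"))             # False
print(has_homopolymer("CGGA"))             # True
print(has_homopolymer("CGGA", min_run=3))  # False
```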

Some Recent Developments

In March 2019, Microsoft, in collaboration with the University of Washington, succeeded in encoding digital information into DNA and decoding it back into bits. The machine, built from glass bottles of chemicals and a sequencer, was able to store and retrieve the word “hello”, a mere 5 bytes of data. Moreover, the process took 21 hours and yielded approximately 1 mg of DNA. This shows that we are still far from a billion-times faster and super-compact method of DNA data storage.

Startups like Twist Bioscience, Catalog, Helixworks, and DNA Script are leading the pack to create a commercial method of DNA data storage, and Microsoft is a game changer in this field as well. Companies like Evonetix, Molecular Assemblies, Iridia, Kilobaser, and Synthomics are also working in this area. In late 2020, Illumina, Twist Bioscience, Microsoft, and Western Digital formed an alliance to build an industry around DNA data storage and drive its commercialization.

Current sequencing technologies such as Sanger sequencing, a method that uses an enzyme, DNA nucleotides, dideoxynucleotides, and a primer to sequence DNA, have limitations such as high cost, low output, and a lack of compactness. Next-generation sequencing techniques are better but still fall short. More recently, nanopore DNA sequencing has seen significant development and is capable of sequencing long DNA fragments in very little time. This technology, developed by Oxford Nanopore Technologies, is low-cost and ultra-compact, so that DNA can be sequenced in a device as small as a pen drive. Further, nanopore sequencing requires neither PCR (Polymerase Chain Reaction), a lab technique used to make copies of DNA fragments, nor chemical labelling, and hence is a very viable technique for the future.

Some recent headlines also explore the possibility of bacteria-based data storage. Excitingly, one would be able to store data in their stomach on gut bacteria!

Also, not long ago, a digital movie was successfully stored in the genomes of a living population of bacteria using CRISPR-Cas. CRISPR is a gene-editing technology that can be used to alter DNA sequences. It can be easily understood as a pair of molecular scissors capable of changing the function of a genome.

For more insights into the field, one can look at ⁵, ⁶, and ⁷.

Finally, we end with a quote by Adleman inviting researchers to dive into the ocean of this emerging field.

“Biology and computer science — life and computation — are related. I am confident that at their interface great discoveries await those who seek them” — Leonard Adleman, Scientific American, August 1998

References

¹ George M. Church, Yuan Gao, and Sriram Kosuri, Next-generation digital information storage in DNA, Science 337 (2012), no. 6102, 1628-1628.

² Nick Goldman, Paul Bertone, Siyuan Chen, Christophe Dessimoz, Emily M. LeProust, Botond Sipos, and Ewan Birney, Towards practical, high-capacity, low-maintenance information storage in synthesized DNA, Nature 494 (2013), no. 7435, 77-80.

³ Dixita Limbachiya, Vijay Dhameliya, Madhav Khakhar, and Manish K. Gupta, On optimal family of codes for archival DNA storage, 2015 Seventh International Workshop on Signal Design and its Applications in Communications (IWSDA), 2015, pp. 123-127.

⁴ Krishna Gopal Benerjee, Sourav Deb, and Manish K. Gupta, On conflict free DNA codes, Cryptography and Communications 13 (2021), no. 1, 143-171.

⁵ S. M. Hossein Tabatabaei Yazdi, Han Mao Kiah, Eva Garcia-Ruiz, Jian Ma, Huimin Zhao, and Olgica Milenkovic, DNA-based storage: Trends and methods, IEEE Transactions on Molecular, Biological and Multi-Scale Communications 1 (2015), no. 3, 230-248.

⁶ Luis Ceze, Je Nivala, and Karin Strauss, Molecular digital data storage using DNA, Nature Reviews Genetics 20 (2019), no. 8, 456-466.

⁷ DNA Data Storage Alliance, Preserving our Digital Legacy: An Introduction to DNA Data Storage, 2021. Published June 2021, Available online at https://dnastoragealliance.org/publications/
