The pink smear in the test tube is DNA, synthesized for long-term storage of digital data. Microsoft used the same technique to store ~200 megabytes of data. Photograph by Tara Brown | University of Washington

Storing data at the tip of a pencil

Evangelos Marios Nikolados
16 min read · Apr 7, 2019

--

Synthetic Biology seeks to use and expand the mechanisms that control biological organisms through engineering approaches (1,3). Bringing the engineering paradigm to biology makes it possible to apply existing biological knowledge far more rationally and systematically than before. Design principles such as modularity, standardization of parts and devices, and the adoption of abstract design procedures are applied at all scales of biological complexity (17,18).

Figure 1. Scales of biological complexity and Synthetic Biology. From the basic units (design and synthesis of novel genes and proteins, expansion and modification of the genetic code), to novel interaction between these units (regulation mechanisms, signal sensing, enzymatic reactions), to novel multi-component modules that generate complex logical behaviour, and even to completely or partially engineered cells.

In this sense, Synthetic Biology is not primarily a “discovery science” (that is, concerned with investigating how nature works), but is ultimately about a new way of making things. Adapting biological mechanisms to the requirements of an engineering approach tremendously increases the possibilities for assembling biological systems by design (3,18).

This short essay focuses on the intersection of Synthetic Biology and information storage, a field where the need for new media is growing rapidly. The first part introduces the emerging technology of DNA-based data storage, along with some of the most notable works in the field. The second part evaluates the potential of DNA to address the three most important challenges of storage capacity. Finally, the third part discusses the core limitations of DNA storage.

I. SynBio meets Information Storage

Information storage has been a fundamental requirement for human society since ancient times. In the modern era of computing and communication, worldwide digital data is forecast to grow from 33 zettabytes in 2018 to over 175 zettabytes by 2025 (14). Consequently, there is a pressing need for dense, cost-effective storage media. According to a research report by Mordor Intelligence (23), the global enterprise market for highly scalable, fast, and economical storage solutions is expected to grow at an average annual rate of 34.11%, from $3.32 billion in 2017 to $19.36 billion by the end of 2023.

Figure 2. Annual Size of the Global Datasphere (13).

Some of the factors driving the growth of this market are the growing demand for data storage in small and medium enterprises (SMEs) and the proliferation of smartphones, laptops, tablets, and Internet of Things (IoT) devices (26). Rising demand, however, is mainly attributed to the increase in data volume and the growing complexity of data storage, together with rapid advances in semiconductor technology and falling flash-memory prices (20,23,26). Driven by the promise of data analytics, enterprises have started gathering vast amounts of data from diverse sources (14,20). Yet, as a recent white paper by Intel points out, nearly 80% of data is “cold”, or infrequently accessed, and this share is growing 60% annually, making it an ideal candidate for archival on cheaper storage devices (15). Unfortunately, the rapid rate of data growth outpaces density improvements in traditional storage hardware such as HDDs and tape.

This leads to a situation in which demand for storage (and hence total storage cost) rises faster than the unit cost of storage falls. As a result, enterprises face the challenge of managing storage more efficiently if they are to exploit new forms of information, such as more detailed financial data, digital images in the life sciences, or video in media companies. Combating the rising volume, cost, and complexity of storage is a daunting challenge for IT managers, who must make decisions on a wide range of issues in storage architecture, design, and performance.

DNA-based Data Storage

Imagine fitting the world’s data on the tip of a pencil. This compelling thought, one that elicits puzzled looks in casual conversation, motivated several researchers to pursue synthetic DNA sequences, long considered a potential medium for digital data storage (5,6,7,12). The resulting technology is a system in which artificial DNA, produced by commercially available oligonucleotide synthesis, stores the data, and DNA sequencing retrieves it.

Figure 3. Generic scheme of DNA-data storage.

This type of storage system possesses four key properties that make it a more attractive possibility than current magnetic tape or hard drive storage systems:

  • DNA is an extremely dense, three-dimensional storage medium, with the theoretical ability to store 455 exabytes in 1 gram; in contrast, a 3.5” HDD today stores 10 TB and weighs 600 grams.
  • DNA can last several centuries when stored at standard temperature in an anoxic, anhydrous atmosphere (2, 11); HDD and tape have lifetimes of five and thirty years, respectively.
  • In-vitro replication of DNA is easy, quick, and cheap, whereas the bandwidth limitations of tape and HDD mean that copying large, EB-sized archives takes hours or days.
  • Finally, there is a difference in the rate at which media density improves every year, also known as Kryder’s rate. For HDD and tape, Kryder’s rate is around 10% and 30%, respectively (4, 10, 24). This means that if one stores 1 PB on 100 tape drives today, it would be possible to store the same data on roughly 25 drives within five years (see the sketch below). Since storage space is a premium commodity in datacenters, using tape for archival storage implies constant data migration with each new generation of tape (read here how this practice can easily run into the millions for a large film archive). DNA does not have this problem: its Kryder’s rate is effectively zero, since the density of DNA is biologically fixed.
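To make the arithmetic concrete, here is a small Python sketch of how Kryder’s rate compounds. The 1 PB archive and the 100-drive starting point are simply the figures from the example above.

```python
# How a Kryder's rate (annual density growth) shrinks the number of
# drives needed to hold a fixed-size archive over time.

def drives_needed(initial_drives: int, kryder_rate: float, years: int) -> float:
    """Drives needed for the same data after `years` of density growth."""
    return initial_drives / (1 + kryder_rate) ** years

print(drives_needed(100, 0.30, 5))  # tape at ~30%/yr: ~27 drives (the ~25 above)
print(drives_needed(100, 0.10, 5))  # HDD at ~10%/yr: ~62 drives
```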

It should be noted, however, that DNA-based storage is slow to read, since the DNA must be sequenced to retrieve the data. Hence, DNA-based data storage is considered a form of auxiliary storage, i.e. a medium intended for uses with a low access rate, such as long-term archival of large amounts of data.

Figure 4. DNA-based Data Storage as the deepest level of storage hierarchy (12).

Notable Works

Little practical progress was made until 2012 and the seminal work of Church and his team at Harvard Medical School (6). They developed a scheme that encoded digital information using next-generation DNA synthesis and sequencing technologies. First, they converted an HTML-coded draft of a book, comprising 53,426 words, 11 JPG images, and 1 JavaScript program, into a 5.27-megabit bit stream. They then encoded these bits onto 54,898 oligonucleotides, each 159 nucleotides long. Every oligonucleotide contained a 96-bit data block, a 19-bit address specifying the location of the data block in the bit stream, and flanking common sequences for amplification and sequencing.
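For intuition, here is a toy Python sketch of the encoding idea, assuming the scheme’s one-bit-per-base rule (0 as A or C, 1 as G or T); the freedom to pick either base for a given bit is what lets an encoder sidestep homopolymers. This illustrates the principle only, not the paper’s actual pipeline.

```python
import random

# Toy Church-style encoding: one bit per base, 0 -> A or C, 1 -> G or T.
# When the randomly chosen base would repeat the previous one, we swap to
# the other base for the same bit, so no two adjacent bases are identical.

def encode_bits(bits: str) -> str:
    options = {"0": "AC", "1": "GT"}
    seq = []
    for bit in bits:
        base = random.choice(options[bit])
        if seq and base == seq[-1]:
            base = options[bit].replace(base, "")  # the other allowed base
        seq.append(base)
    return "".join(seq)

def decode_seq(seq: str) -> str:
    return "".join("0" if base in "AC" else "1" for base in seq)

payload = "0110100111"
assert decode_seq(encode_bits(payload)) == payload
```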

To read the encoded book, they amplified the library using limited-cycle PCR and then sequenced it. Finally, using only reads of the expected length (115 nucleotides) with perfect barcode sequences, they generated a consensus at each base of each data block, at an average coverage of ~3000-fold. Nevertheless, this method still had a high error rate, due to homopolymers at the ends of the oligonucleotides, where there was only single-sequence coverage.

Figure 5. The Goldman et al. (12) scheme for encoding digital information in DNA. Part of Shakespeare’s sonnet 18 (a, in blue) was converted to base-3 (b, red) using a Huffman code that replaces each byte with five or six base-3 digits (trits). This in turn was converted in silico to DNA code (c, green). Replacing each trit with one of the three nucleotides different from the previous one ensured that no homopolymers were generated. This formed the basis for a large number of overlapping segments of length 100 bases, with a 75-base overlap, creating fourfold redundancy (d, green and, with alternate segments reverse complemented for added data security, violet). Indexing DNA codes were added (yellow), also encoded as non-repeating DNA nucleotides.

In the following year, Goldman and his group overcame this limitation (12). Like Church, they chose a range of common formats (ASCII text, PDF, JPEG 2000, and MP3) to emphasize the ability to store arbitrary digital information in DNA. The bytes comprising each file were represented as DNA sequences with no homopolymers. To reduce the probability of systematic failure for any particular string, each sequence was split into overlapping segments, generating fourfold redundancy, and alternate segments were converted to their reverse complement. In all, the files were represented by 153,335 DNA strings of 117 nucleotides each. In the decoding phase, error-detection bases and properties of the coding scheme allowed them to discard strings containing errors. Full-length DNA sequences representing the original files were then reconstructed in silico and decoded.
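The homopolymer-avoiding trick is simple enough to sketch in a few lines of Python: each trit selects one of the three nucleotides that differ from the previously written base, so no base can ever repeat. The rotation table below is illustrative; the paper’s exact table may differ.

```python
# Rotating ternary code: given the previous base, a trit (0, 1, or 2)
# picks one of the three *other* bases, making homopolymers impossible.

ROTATION = {"A": "CGT", "C": "GTA", "G": "TAC", "T": "ACG"}

def trits_to_dna(trits: str, prev: str = "A") -> str:
    seq = []
    for trit in trits:
        prev = ROTATION[prev][int(trit)]
        seq.append(prev)
    return "".join(seq)

def dna_to_trits(seq: str, prev: str = "A") -> str:
    trits = []
    for base in seq:
        trits.append(str(ROTATION[prev].index(base)))
        prev = base
    return "".join(trits)

assert dna_to_trits(trits_to_dna("0201120")) == "0201120"
```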

The main drawback of the aforementioned schemes is that the whole library must be sequenced and reconstructed in order to read or retrieve even a few bases of the encoded information. To address this problem, in 2016 a team of scientists from the University of Washington, in collaboration with Microsoft Research, proposed a full-fledged DNA-based archival storage system with an architecture that enables random access to data blocks and rewriting of information (5). The rationale behind this approach is that each block is equipped with an address, allowing its unique selection and amplification via DNA sequence primers.

Figure 6. DNA Fountain encoding (7). (Left) Three main algorithmic steps. (Right) Example with a small file of 32 bits. For simplicity, we partitioned the file into eight segments of 4 bits each. The seeds are represented as 2-bit numbers and are presented for display purposes only.

Finally, in March 2017, the work of Yaniv Erlich and Dina Zielinski came even closer to the theoretical information capacity of DNA molecules (7). Their strategy, dubbed DNA Fountain, utilizes Luby transform codes, a relatively recent addition to coding theory, which package data into any desired number of short messages called droplets. A user can recover the file from any subset of droplets, as long as their accumulated size is slightly larger than that of the original file.
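A minimal sketch of droplet generation helps make this concrete. Each droplet stores a short random seed plus the XOR of the file segments that the seed selects; the decoder replays the seed to learn which segments were combined. In the real scheme the subset size is drawn from a robust soliton distribution and each candidate droplet is screened against biochemical constraints, both of which this sketch omits.

```python
import random

# Simplified Luby-transform droplet: XOR a seed-determined random subset
# of equal-length file segments. Only (seed, payload) needs to be stored.

def make_droplet(segments: list[bytes], seed: int) -> bytes:
    rng = random.Random(seed)
    degree = rng.randint(1, len(segments))  # robust soliton in the real scheme
    payload = bytes(len(segments[0]))
    for i in rng.sample(range(len(segments)), degree):
        payload = bytes(a ^ b for a, b in zip(payload, segments[i]))
    return payload

segments = [b"ACGT", b"data", b"in a", b"tube"]
droplets = [(seed, make_droplet(segments, seed)) for seed in range(8)]
```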

Below, Table 1 compares the aforementioned works, ordered by date of publication, along the following metrics:

  • Coding potential: maximal information content of each nucleotide before indexing or error correcting;
  • Redundancy: the excess of synthesized oligos to provide robustness to dropouts;
  • The presence of error correcting/detection code to handle synthesis and sequencing errors;
  • Net density: input information in bits divided by the number of overall synthesized bases (excluding adapter annealing sites);
  • Physical density: ratio of the number of encoded bytes to the minimal weight of the DNA library needed to retrieve the information.
Table 1. Comparison of DNA storage coding schemes and experimental results.

II. Meeting the Demand for Data Storage

DNA’s inherent properties make it an attractive possibility for storing information. The next step is to assess its potential to address capacity, the major concern in the demand for data storage. According to James Kaplan, an expert on IT infrastructure at McKinsey, three of the biggest problems with data storage capacity are (16):

  • wasted space from stale data, such as duplicates, data with low re-reference rates or orphan data;
  • placing data on inappropriate storage configurations;
  • and adding storage capacity.

Too Many Copies of Data

In many cases, companies find it quicker and easier to back up information by making multiple copies (16). Across a large organization, however, this approach increases storage volumes dramatically, as companies keep too many copies of business data. For instance, depending on the chosen storage configuration, an enterprise (e.g. an electronics manufacturer or a pharmaceutical company) requires a certain number of replicas (copies of data) to meet its legal and other obligations. Hence, for an 11-terabyte database, keeping multiple replicas could result in a tenfold increase in raw storage. In DNA-based data storage systems, creating copies of data translates into duplicating the DNA sequences using PCR. A useful visualization comes from figure 11 of the Supplementary Information of Erlich and Zielinski (7).

Figure 7. Exponential amplification process creating a deep copy of a DNA-encoded 2MB file.

In the figure above, each PCR reaction generates enough volume for multiple subsequent reactions. The first reaction uses only 10 ng of the 3 μg master pool of DNA. Nine subsequent rounds are then performed without cleanup, each using only 1 μl of the 25 μl generated by the previous reaction. An exponential process can therefore multiply the number of copies 25-fold in each reaction (gray tubes). According to Erlich and Zielinski, this procedure can theoretically create 30 × 25⁹ × 2 ≈ 228 trillion copies.
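The same arithmetic in code, using the figures quoted above:

```python
# Deep-copy arithmetic: 30 first-round reactions off the master pool,
# nine ~25x amplification rounds, and a final doubling.

copies = 30 * 25**9 * 2
print(f"{copies:.3e}")  # ~2.289e14, i.e. ~228 trillion copies
```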

Placing Data on Inappropriate Storage Configurations and Adding Storage Capacity

Each organization’s needs are different. Most have varying kinds of data that require different forms of storage and processing to maximize productivity. Thus, for every storage decision, managers have to consider the type of network, the size of drives, and the required mirroring and replication. Each choice has cost implications, but these decisions are often made haphazardly and can place data on storage alternatives whose performance differs from what the applications require (21).

These limitations stem from the fact that the configuration of digital storage depends on its total capacity (13). This means that, in configuring a storage array, one needs to find the best combination among all possible configurations in order for the system to achieve maximum performance. Furthermore, buying more capacity (i.e. extra storage space) is not a trivial matter, since the configuration may need to be reorganized.

Contrary to silicon-based methods, DNA offers a one-size-fits-all storage alternative. New data need only be ‘translated’ from binary to the four-base alphabet {A, T, G, C}, synthesized as oligonucleotides, and ‘thrown’ into the pool of DNA data. While DNA concentration must be considered for large volumes of data (more than 13 million sequences were used in Microsoft’s latest experiment (28)), there is no inherent limit on the system’s ability to store information, since there is no need to design and allocate memory a priori for its components.
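To make the ‘translation’ step concrete, here is a minimal Python sketch assuming the simplest possible mapping of two bits per base; real encoders layer indexing, error correction, and biochemical-constraint screening on top of this.

```python
# Naive binary-to-DNA translation at two bits per base.

TO_BASE = {0b00: "A", 0b01: "C", 0b10: "G", 0b11: "T"}

def bytes_to_dna(data: bytes) -> str:
    seq = []
    for byte in data:
        for shift in (6, 4, 2, 0):          # four 2-bit pairs per byte
            seq.append(TO_BASE[(byte >> shift) & 0b11])
    return "".join(seq)

print(bytes_to_dna(b"Hi"))  # 'CAGACGGC': four bases per byte
```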

Finally, DNA-based data storage systems can greatly reduce the complexity of storage. Unlike traditional storage systems, which may require specialized configurations or separate subsystems for file handling, data encoded in DNA are always available and can be accessed simply by using an assigned set of primers to perform PCR (Fig. 8).

Figure 8. Flowcharts for the write (put) and read (get) processes of a DNA Library based on the work of Bornholt et al (5).
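In spirit, the interface is a key-value store. Below is a purely in-silico toy of the put/get flow, with the wet-lab steps (synthesis, PCR, sequencing) replaced by list operations; the primer sequences and the digit-based index are illustrative only.

```python
# Toy DNA key-value store: the "pool" is a list of strands, and `get`
# mimics selective PCR by filtering on a key-specific primer pair.

POOL: list[str] = []
PRIMERS = {"fileA": ("ACGTAC", "TGCATG"), "fileB": ("GGATCC", "CCTAGG")}

def put(key: str, chunks: list[str]) -> None:
    fwd, rev = PRIMERS[key]
    for index, chunk in enumerate(chunks):
        # strand layout: forward primer | index | payload | reverse primer
        POOL.append(f"{fwd}{index:03d}{chunk}{rev}")

def get(key: str) -> list[str]:
    fwd, rev = PRIMERS[key]
    hits = [s for s in POOL if s.startswith(fwd) and s.endswith(rev)]
    hits.sort(key=lambda s: int(s[len(fwd):len(fwd) + 3]))  # restore order
    return [s[len(fwd) + 3:-len(rev)] for s in hits]

put("fileA", ["ACGT", "TTAA"])
assert get("fileA") == ["ACGT", "TTAA"]
```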

III. With great power come … new problems

While DNA is an excellent candidate for storing information, and a series of proof-of-principle experiments have already demonstrated its value as a storage medium, there are several obstacles that need to be considered.

The theoretical density of 1 exabyte of data per gram has yet to be achieved.

Looking again at Table 1 (Part I), it is evident that progress has been rapid: since 2012, the physical density of DNA storage systems has scaled from 1.24 Pbytes/g to 215 Pbytes/g in just five years. At the same time, error detection and correction techniques, as well as various methods for increased robustness, have been employed, further increasing the potential of DNA as a storage medium. Nonetheless, no encoding scheme has yet proven strong enough to push past the petabyte-per-gram scale. Undoubtedly, as the field attracts more attention and more strategies from coding theory are employed, the predicted physical density (1 exabyte/g) will be realized.

Standard limitations of DNA synthesis and sequencing.

While the shift away from silicon-based technology offers a variety of new perspectives and possibilities, it also brings problems of a different nature. First, not all DNA sequences are created equal (8,9). Biochemical constraints dictate that sequences with high GC content or long homopolymer runs (e.g. AAAAAA…) should be avoided, as they are difficult to synthesize and prone to sequencing errors. Second, oligo synthesis, PCR amplification, and the decay of DNA during storage all induce uneven representation of the oligos (25,27), which can cause a small fraction of oligos to drop out and become unavailable for decoding. In addition to the biochemical constraints, oligos are sequenced in a pool and require indexing to infer their order, which further limits the real estate available for encoding information (5,6,7,12). Comparing the first and fourth rows of Table 1 reveals the difference between the coding potential, i.e. the maximum number of bits a nucleotide can represent, and the net density, the number of bits a nucleotide actually ends up representing. For example, adding indexes reduces the capacity of DNA Fountain (7) from 1.98 bits/nt (or ~270 Pbytes/g) to 1.57 bits/nt (or the observed 215 Pbytes/g).
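A screening step of the kind described above is easy to sketch. The GC window and maximum run length below are illustrative thresholds, not values taken from any of the cited papers.

```python
import re

# Reject candidate oligos with extreme GC content or long homopolymers.

def passes_constraints(oligo: str, gc_lo=0.4, gc_hi=0.6, max_run=3) -> bool:
    gc = (oligo.count("G") + oligo.count("C")) / len(oligo)
    longest_run = max(len(m.group()) for m in re.finditer(r"(.)\1*", oligo))
    return gc_lo <= gc <= gc_hi and longest_run <= max_run

print(passes_constraints("ACGTACGTACGT"))  # True: 50% GC, no runs
print(passes_constraints("AAAAAAGCGCGC"))  # False: homopolymer run of six
```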

Cost of synthesis is still a significant factor.

While this point is self-explanatory, it is worth noting that in the work of Erlich & Zielinski (7), encoding a 2.1-MByte file in 72,000 oligos of 152 nucleotides each (excluding the Illumina adapters used for subsequent PCR steps) cost $7,000. Synthesizing 1 GByte with this scheme would cost ~$3.27 million, while previous methods would put the price at nearly $5.63 million. Even if the synthesis price falls by two orders of magnitude (to ~$6.4/Mbase), synthesizing 1 GByte would still cost ~$34,000, which puts TBytes and PBytes at the scale of many millions and billions of dollars, respectively.
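Spelled out in code, with the figures quoted above (small rounding differences aside):

```python
# Synthesis-cost arithmetic for the DNA Fountain experiment (7).

oligo_len, n_oligos, cost_usd = 152, 72_000, 7_000
bases = oligo_len * n_oligos                  # ~10.9 million bases for 2.1 MB
usd_per_mbase = cost_usd / (bases / 1e6)      # ~$640/Mbase at 2017 prices
bases_per_gb = bases * (1e9 / 2.1e6)          # scale 2.1 MB up to 1 GB
print(bases_per_gb / 1e6 * usd_per_mbase)     # ~$3.3M per GB today
print(bases_per_gb / 1e6 * 6.4)               # ~$33k per GB at $6.4/Mbase
```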

Current architectures support only write/read functions.

As described in Part I (see Notable Works), work on DNA storage has mostly focused on different ways to encode data in DNA sequences, i.e. the stage that translates the original data into the sequences that will later be synthesized. The architecture of the molecules themselves is largely disregarded: a linear DNA sequence representing the data, an address block (index) to infer the order of the oligonucleotides, and adapters that facilitate PCR amplification, with writing and reading of the stored information being the only supported operations (Fig. 9).

Figure 9. An overview of the DNA data encoding format. After translating to nucleotides, the stream is divided into strands. Each strand contains a payload from the stream, together with addressing information to identify the strand and primer targets necessary for PCR and sequencing.
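As a data model, this format amounts to four concatenated fields. A minimal sketch follows, with field names of my own choosing; the cited works differ in exact field sizes and ordering.

```python
from dataclasses import dataclass

# One stored strand: primer sites for PCR/sequencing flank an address
# (for reordering after sequencing) and the data-carrying payload.

@dataclass
class Strand:
    fwd_primer: str
    address: str
    payload: str
    rev_primer: str

    def to_sequence(self) -> str:
        return self.fwd_primer + self.address + self.payload + self.rev_primer
```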

While this basic architecture is sufficient for testing encoding schemes on several megabytes, using DNA storage for large data volumes would require systems that enable more than simply storing and retrieving information, such as intrusion prevention and intrusion detection. Within this framework, another critical feature would be the ability to verify that all desired information has been saved successfully, without having to sequence all the oligonucleotides that constitute the DNA store. The present architecture does not allow for such a process.

Final Remarks

DNA-based storage has the potential to be the ultimate auxiliary storage solution: it is extremely dense and durable. At the same time, it offers a one-size-fits-all alternative to silicon-based technology, which must be configured carefully, and it can easily handle many copies of the same data. While it is not yet practical, given the current state of DNA synthesis and sequencing, both technologies are improving at an exponential rate with advances in the biotechnology industry. Furthermore, the design of new architectures that allow more complex schemes and processes is imperative. Given the rising costs and growing demand for storage, combined with the impending limits of silicon technology, the time seems ripe to consider incorporating biomolecules as an integral part of storing information, with DNA-based storage being a clear example of this direction.

References

(1) Alberghina, L., and Westerhoff, H.V. (2005). Systems Biology: Definitions and Perspectives. New York: Springer.

(2) Allentoft, M. E., Collins, M., Harker, D., Haile, J., Oskam, C. L., Hale, M. L., Campos, P. F., Samaniego, J. A., Gilbert, M. T. P., Willerslev, E., Zhang, G., Scofield, R. P., Holdaway, R. N., and Bunce, M. (2012). The half-life of DNA in bone: measuring decay kinetics in 158 dated fossils. Proceedings of the Royal Society of London B: Biological Sciences, 279(1748): 4724–4733.

(3) Andrianantoandro, E., Basu, S., Karig, D.K., and Weiss, R. (2006). Synthetic Biology: New Engineering Rules for an Emerging Discipline. Molecular Systems Biology, 2: 2006.0028.

(4) Appuswamy, R., Borovica-Gajic, R., Graefe, G., and Ailamaki, A. (2017). The Five-minute Rule Thirty Years Later and its Impact on the Storage Hierarchy. In ADMS.

(5) Bornholt, J., Lopez, R., Carmean, D. M., Ceze, L., Seelig, G., and Strauss, K. (2016). A DNA-Based Archival Storage System. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS ‘16). ACM, New York, NY, USA, 637–649.

(6) Church, G. M., Gao, Y., and Kosuri, S. (2012). Next-generation digital information storage in DNA. Science, 337(6102):1628.

(7) Erlich Y., and Zielinski, D. (2017). Capacity-approaching DNA storage. Science, 355(6328), 950–954.

(8) Eroshenko, N., Kosuri, S., Marblestone, A.H., Conway, N., and Church, G.M. (2012). Gene assembly from chip-synthesized oligonucleotides. Current Protocols in Chemical Biology, pages 1–17.

(9) Faircloth, B.C., and Glenn, T.C. (2012). Not all sequence tags are created equal: designing and validating sequence identification tags robust to indels. PloS one, 7(8).

(10) Fontana, R. (2018). A Ten Year Storage Landscape: LTO Tape Media, HDD, NAND. http://storageconference.us/2018/Presentations/Fontana.pdf

(11) Grass, R. N., Heckel, R., Puddu, M., Paunescu, D., and Stark, W. J. (2015). Robust chemical preservation of digital information on DNA in silica with error-correcting codes. Angew. Chem. Int. Ed., 54.

(12) Goldman, N., Bertone, P., Chen, S., Dessimoz, C., LeProust, E. M., Sipos, B., and Birney, E. (2013). Towards practical, high-capacity, low maintenance information storage in synthesized DNA. Nature, 494:77–80.

(13) IBM Knowledge Center. (2017). Storage Configuration and Management. https://www.ibm.com/support/knowledgecenter/STPVGU_8.1.0/com.ibm.storage.svc.console.810.doc/svc_svcplanconfiguration_21gf04.html

(14) IDC. (2018). The Digitization of the World From Edge to Core. https://www.seagate.com/files/www-content/our-story/trends/files/idc-seagate-dataage-whitepaper.pdf

(15) Intel. Cold Storage in the Cloud: Trends, Challenges, and Solutions. White Paper. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/cold-storage-atom-xeon-paper.pdf

(16) Kaplan, J.M., Roy, R., and Srinivasaraghavan, R. (2018). How can businesses optimize the management of the cost, volume and complexity of the stored information? McKinsey Quarterly.

(17) Khalil, A.S., and Collins, J.J. (2010). Synthetic biology: applications come of age. Nat Rev Genet, 11(5): 367–379.

(18) Knuuttila, T., and Loettgers. A. (2013). Basic Science Through Engineering? Synthetic Modeling and the Idea of Biology-inspired Engineering. Studies in History and Philosophy of Biological and Biomedical Sciences, 44(2): 158–169.

(19) Kosuri, S., Eroshenko, N., Leproust, E. M., Super, M., Way, J., Li, J. B. and Church, G. M. (2010). Scalable gene synthesis by selective amplification of DNA pools from high-fidelity microchips. Nat. Biotechnol., 28(12):1295-1299.

(20) McKinsey Quarterly. (2017). How the semiconductor industry is taking charge of its transformation. https://www.mckinsey.com/industries/semiconductors/our-insights/how-the-semiconductor-industry-is-takingcharge-of-its-transformation

(21) Newman, D. (2016). Data Storage is Not One Size Fits All. Converge.XYZ

(22) The Royal Academy of Engineering. (2009). Synthetic Biology: scope, applications and implications.

(23) Mordor Intelligence. (2018). Next-generation Storage Market — Segmented by Storage System (Direct-attached, Network-attached, Cloud), Storage Architecture (File- and Object-based, Block), Storage Technology (Magnetic, Solid State), End-user Industry (BFSI, Retail, IT and Telecom), and Region — Growth, Trends, and Forecast (2019–2024).

(24) Rosenthal, D. (2016). The Medium Term Prospects for Long Term Storage. https://blog.dshr.org/2016/12/the-medium-term-prospects-for-long-term.html

(25) Ross, M. G., Russ, C., Costello, M., Hollinger, A., Lennon, N. J., Hegarty, R., Nusbaum, C., and Jaffe, D. B. (2013). Characterizing and measuring bias in sequence data. Genome Biol., 14(5).

(26) SpectraLogic. (2018). Digital Data Storage Outlook 2018.

(27) Schwartz, J. J., Lee, C., and Shendure, J. (2012). Accurate gene synthesis with tag-directed retrieval of sequence-verified DNA molecules. Nat. Methods, 9(9):913–915.

(28) Twist Bioscience. (2017). Twist Bioscience Expands Agreement to Pursue Higher Density Digital Data Storage on DNA with Microsoft and University of Washington.

--


Evangelos Marios Nikolados

A Synthetic Biologist who fell in love with AI. Follow for deep dives on SynBio, DL, Crypto, and Greece! Let’s chat on: https://www.linkedin.com/in/emnikolados/