What We Are Going to Do With All That Data
A 3-billion-year-old answer to a very modern problem
It’s no secret that we are generating far more information than we can possibly store — more than 2.5 quintillion bytes per day. When you click an aging link and get that “404 file not found” message, you likely tried to access old information vaporized to make room for something newer. And “aging” may mean a few months old. Even with the insanely cheap cost of modern data storage, we can’t keep everything. When you think about storing all this information on the scale of a decade or century, the problem is staggering.
Our solution to storing massive amounts of information may be a 3-billion-year-old technology: DNA.
DNA has a couple of inherent advantages over silicon. Its basic unit, the nucleotide, can take one of four values (A, C, G, or T), so it encodes 2 bits, and it has dimensions of about 1 nanometer (a millionth of a millimeter). A silicon transistor, by contrast, has two states, encodes only 1 bit, and can’t get much smaller than 10 nanometers.
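To make the bits-per-nucleotide idea concrete, here is a minimal sketch of packing bytes into a DNA sequence. With four possible bases, each nucleotide carries log2(4) = 2 bits, so one byte becomes four bases. The function names and the particular bit-pair-to-base assignment are illustrative assumptions, not a published encoding; practical schemes add error correction and avoid troublesome sequences such as long runs of a single base.

```python
# Hypothetical mapping of bit pairs to bases: 00->A, 01->C, 10->G, 11->T.
# Any fixed assignment would work; this one is chosen only for illustration.
BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four nucleotides, two bits per nucleotide."""
    out = []
    for byte in data:
        # Walk the byte from its most significant bit pair to its least.
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(seq: str) -> bytes:
    """Decode a sequence produced by bytes_to_dna back into bytes."""
    if len(seq) % 4 != 0:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            # str.index raises ValueError on characters outside ACGT.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


if __name__ == "__main__":
    message = b"DNA"
    strand = bytes_to_dna(message)
    print(strand, len(strand))            # 12 bases for 3 bytes
    print(dna_to_bytes(strand) == message)
```

The round trip is lossless by construction, which is the whole appeal: the hard parts in practice are the chemistry of writing and reading the strand, not the mapping itself.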
DNA information density is about one exabyte (a billion GB) per cubic millimeter. This is a billion times denser than the most advanced technologies available today. A hectare’s worth of storage tapes stacked 2 meters deep (20,000 m³, a very large warehouse) could be reduced to 20 cubic centimeters, a volume smaller than an iPhone.
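The warehouse-to-iPhone comparison is just unit conversion, and a short sketch makes it easy to check. The constants below are the figures quoted in the text (a billion-fold density advantage and one exabyte per cubic millimeter); the function names are made up for this example.

```python
# Figures taken from the text above; treat them as rough estimates.
DENSITY_RATIO = 10**9      # DNA is about a billion times denser
EXABYTES_PER_MM3 = 1       # about one exabyte per cubic millimeter

CM3_PER_M3 = 100**3        # 1 m^3 = 1,000,000 cm^3
MM3_PER_CM3 = 10**3        # 1 cm^3 = 1,000 mm^3


def dna_volume_cm3(conventional_volume_m3, ratio=DENSITY_RATIO):
    """Volume of DNA (cm^3) holding the same data as a conventional store."""
    return conventional_volume_m3 * CM3_PER_M3 / ratio


def dna_capacity_exabytes(volume_cm3):
    """Data capacity (exabytes) of a given volume of DNA."""
    return volume_cm3 * MM3_PER_CM3 * EXABYTES_PER_MM3


if __name__ == "__main__":
    volume = dna_volume_cm3(20_000)
    print(volume)                          # 20.0 cm^3
    print(dna_capacity_exabytes(volume))   # 20000.0 exabytes
```

At these densities, 20 cubic centimeters works out to about 20,000 exabytes, or 20 zettabytes.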
Limitations of DNA
Sounds great. But of course, we are not there yet. The principal obstacles to using DNA devices are the accuracy and speed of read-write operations. Cost is an issue now but will fall dramatically as technologies mature and scale.
In biological systems, the error rates of DNA readout range from 1 in 10,000 to 1 in 1,000,000,000. Even the lowest of these rates is about a thousand times higher than that of SATA drives, and it is achieved only through the use of elaborate proofreading machinery.
The speed of that readout is even more problematic. RNA transcription proceeds at about 90 nucleotides per second, which at 2 bits per nucleotide is about 22 bytes per second. At that rate, the average web page of 3 MB would take about 37 hours to load.
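The readout estimate can be reproduced with a few lines of arithmetic. This sketch assumes 2 bits per nucleotide (four possible bases) and takes 3 MB to mean 3,000,000 bytes; both are assumptions of the example, and the function names are illustrative. The bits-per-nucleotide value is a parameter, so other assumptions can be plugged in.

```python
def readout_bytes_per_second(nt_per_second=90, bits_per_nt=2):
    """Convert a nucleotide readout rate into bytes per second."""
    return nt_per_second * bits_per_nt / 8


def readout_hours(size_bytes, nt_per_second=90, bits_per_nt=2):
    """Hours needed to read size_bytes at the given nucleotide rate."""
    rate = readout_bytes_per_second(nt_per_second, bits_per_nt)
    return size_bytes / rate / 3600


if __name__ == "__main__":
    page = 3_000_000  # a 3 MB web page, assuming 1 MB = 10**6 bytes
    print(readout_bytes_per_second())      # 22.5
    print(round(readout_hours(page), 1))   # 37.0
```

Whatever the exact assumptions, a single enzyme reading a strand is many orders of magnitude too slow for interactive use, which is the point of the comparison.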
Slow but steady
Although the enzymes that read and write DNA are slow, they are very energy-efficient. DNA replication is a major energy drain for single-celled organisms (all organisms were single-celled until about 500 million years ago). Replication has been under selective pressure for a couple of billion years. The result is efficiencies that blow away those of modern technologies. It takes one ten-billionth of a watt to write a GB of data in DNA. That’s the amount of solar power available per square micron (a micron is a thousandth of a millimeter) at the Earth’s surface.
DNA is also very stable, as befits its role as a repository of genetic information. We can recover DNA from environmental samples that are thousands of years old. DNA is damaged by ionizing radiation, strong oxidizers, and chemical breakdown in watery environments. But in mild conditions, DNA-encoded information will stay intact for thousands of years.
Where we are today
Here’s a summary comparison of silicon and DNA-based information storage:
These are basic properties that give some idea of the potential of DNA computing. We are still very much at the early stages of figuring out how to convert potential into performance.
But we are making progress:
• A UK group reported the writing and retrieval of Shakespeare’s Sonnets, Watson and Crick’s classic paper, a JPEG color image, and an MP3 file of a speech by Martin Luther King. This information (a Shannon information content of about 5 million bits) was recovered with 100% fidelity in a scalable system.
• A ‘DNA Fountain’ is able to store 2 MB of data, including a movie and a complete computer operating system. The system is capable of quadrillions of retrieval operations. It achieves a storage density of 200 petabytes per gram.
• 200 MB of data were stored in a random-access format and recovered with no errors.
Despite these advances, the principal limitation for storing information in DNA is likely to be read-write speed. Any method that relies on chemical processes is limited by diffusion rates.
Diffusion is an incredibly slow process. It is about 15 orders of magnitude slower than the propagation of electrical signals. Perhaps advanced imaging methods, such as atomic force microscopy, could circumvent the need for solution-based information retrieval methods. They could speed this process up substantially, but not by anything like 15 orders of magnitude.
Where DNA storage is going
Given this limitation, the impact of DNA storage devices will be small in the near term. We will not be wearing DNA-enabled devices on our wrists or using them to control our driverless cars. But they will have a significant role in storing huge sets of data for long periods of time. Information from gigantic particle physics and astrophysics projects is an example. 3D imaging of specimens in museums is another. Combined with 3D printing, this type of storage would allow us to replicate any artifact or specimen in any museum.
Some of the information stored this way will be used in the future to solve some very big problems. It could enable us to resurrect a keystone species or enable teleportation technologies. The impact of DNA storage devices will not be in preserving cat videos, but in retaining a full description of the natural world at a given point in time. That would be a very big impact indeed.
Drew Smith synthesized and sequenced DNA as a grad student in the early 1980s, using the crude technology of that primitive time.