What We Are Going to Do With All That Data

A 3 billion-year-old answer to a very modern problem

Drew Smith, PhD
Jan 14 · 5 min read
Lots of data generated every night. The center of the Milky Way galaxy, imaged by NASA’s Spitzer Space Telescope. Image credit: NASA/Ames/JPL-Caltech. Public domain.

It’s no secret that we are generating far more information than we can possibly store — more than 2.5 quintillion bytes per day. When you click an aging link and get that “404 file not found” message, you likely tried to access old information vaporized to make room for something newer. And “aging” may mean a few months old. Even with the insanely cheap cost of modern data storage, we can’t keep everything. When you think about storing all this information on the scale of a decade or century, the problem is staggering.

Our solution to storing massive amounts of information may be a 3 billion-year-old technology: DNA.

DNA has a couple of inherent advantages over silicon. Its basic unit, the nucleotide, can take one of four values (A, C, G, or T), so it encodes 2 bits, and it has dimensions of about 1 nanometer (a millionth of a millimeter). Silicon transistors, by contrast, encode only 1 bit and can’t get much smaller than 10 nanometers.
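With four possible bases, each position in a strand can hold two bits. Here is a minimal Python sketch of the simplest such scheme, in which every pair of bits maps to one fixed base. The bit-to-base assignment and the function names are illustrative assumptions, not a scheme taken from any published system; practical encodings add error correction and avoid troublesome sequences such as long runs of a single base.

```python
# Hypothetical mapping: 00 -> A, 01 -> C, 10 -> G, 11 -> T (the assignment is arbitrary).
BASES = "ACGT"


def bytes_to_dna(data: bytes) -> str:
    """Encode each byte as four bases, most significant bit pair first."""
    return "".join(
        BASES[(byte >> shift) & 0b11]
        for byte in data
        for shift in (6, 4, 2, 0)
    )


def dna_to_bytes(seq: str) -> bytes:
    """Decode a sequence produced by bytes_to_dna back into bytes."""
    if len(seq) % 4:
        raise ValueError("sequence length must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(seq), 4):
        byte = 0
        for base in seq[i:i + 4]:
            # BASES.index raises ValueError on any character outside A, C, G, T.
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)


encoded = bytes_to_dna(b"Hi")
print(encoded)                # CAGACGGC
print(dna_to_bytes(encoded))  # b'Hi'
```

Two bytes become eight bases, which is the 2-bits-per-nucleotide ceiling for this kind of direct mapping.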

DNA information density is about one exabyte (a billion GB) per cubic millimeter. This is a billion times more dense than the most advanced technologies available today. A hectare’s worth of storage tapes (20,000 m³, a very large warehouse) could be reduced to 20 cubic centimeters, a volume smaller than an iPhone.
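The warehouse comparison is simple arithmetic on the figures above. A short Python check, in which the variable and function names are my own and the billion-fold density ratio is taken at face value from the text:

```python
CM3_PER_M3 = 1_000_000  # 100 cm x 100 cm x 100 cm


def dna_volume_cm3(tape_volume_m3: float, density_ratio: float) -> float:
    """Volume of DNA (in cm^3) holding the same data as the given volume of tape."""
    return tape_volume_m3 * CM3_PER_M3 / density_ratio


# 20,000 m^3 of tape, with DNA assumed a billion times denser.
print(dna_volume_cm3(20_000, 1e9))  # 20.0
```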

Binary transcoding methods used in DNA-based data storage schemes. (A) One binary bit is mapped to 2 optional bases. Two binary bits are mapped to 1 fixed base. (B) Eight binary bits are transcoded through Huffman coding and then transcoded to 5 or 6 bases. (C) Two bytes (16 binary bits) are mapped to 9 bases [12]. (D) Eight binary bits are mapped to 5 bases. From Carbon-based archiving: current progress and future prospects of DNA-based data storage. Creative Commons Attribution License.

Limitations of DNA

Sounds great. But of course, we are not there yet. The principal obstacles to using DNA devices are the accuracy and speed of read-write operations. Cost is an issue now but will fall dramatically as technologies mature and scale.

In biological systems, the error rates of DNA readout range from 1 in 10,000 to 1 in 1,000,000,000. Even the lowest of these rates is a thousand times higher than that of SATA drives, and it is achieved only through the use of elaborate proofreading machinery.

The speed of that readout is even more problematic. It is about 90 nucleotides per second in RNA transcription. At 2 bits per nucleotide, that is about 22 bytes per second. At that rate, the average web page of 3MB would take about 37 hours to load.
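As a rough check on readout speed, here is the calculation in Python. It assumes 2 bits per nucleotide (four bases carry log2(4) = 2 bits) and a decimal 3 MB page of 3,000,000 bytes; the function names are illustrative, and the result scales linearly with the bits-per-nucleotide assumption.

```python
def readout_bytes_per_second(nucleotides_per_s: float, bits_per_nucleotide: float = 2) -> float:
    """Convert a readout rate in nucleotides per second to bytes per second."""
    return nucleotides_per_s * bits_per_nucleotide / 8


def load_time_hours(size_bytes: float, bytes_per_s: float) -> float:
    """Time in hours to read size_bytes at the given rate."""
    return size_bytes / bytes_per_s / 3600


rate = readout_bytes_per_second(90)      # transcription at ~90 nucleotides per second
hours = load_time_hours(3_000_000, rate)  # assumed 3 MB web page
print(rate)             # 22.5
print(round(hours, 1))  # 37.0
```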

Readout of DNA information by RNA polymerase (Public domain)

Slow but steady

Although the enzymes that read and write DNA are slow, they are very energy-efficient. DNA replication is a major energy drain for single-celled organisms (all organisms were single-cell until about 500 million years ago). Replication has been under selective pressure for a couple of billion years. The result is efficiencies that blow away those of modern technologies. It takes one ten-billionth of a watt to write a GB of data in DNA. That’s the amount of solar power available per square micron (a micron is a thousandth of a millimeter) at the Earth’s surface.

DNA is also very stable, as befits its role as a repository of genetic information. We can recover DNA from environmental samples that are thousands of years old. DNA is damaged by ionizing radiation, strong oxidizers, and chemical breakdown in watery environments. But in mild conditions, DNA-encoded information will stay intact for thousands of years.

Where we are today

Here’s a summary comparison of silicon and DNA-based information storage:

See DNA as a digital information storage device: hope or hype? for a good summary of DNA vs silicon

These are basic properties that give some idea of the potential of DNA computing. We are still very much at the early stages of figuring out how to convert potential into performance.

But we are making progress:

• A UK group reported the writing and retrieval of Shakespeare’s Sonnets, Watson and Crick’s classic paper, a JPEG color image, and an MP3 file of a speech by Martin Luther King. This information (a Shannon information content of about 5 million bits) was recovered with 100% fidelity in a scalable system.

• A ‘DNA Fountain’ method was used to store 2MB of data, including a movie and a complete computer operating system. The system is capable of quadrillions of retrieval operations. It achieves a storage density of 200 petabytes per gram.

• 200MB of data were stored in a random-access format and recovered with no errors.
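To put the 200 petabytes per gram figure in perspective, here is a back-of-the-envelope Python calculation of how much DNA would be needed to hold one day of the world’s data at the 2.5 quintillion bytes per day quoted earlier. It assumes, optimistically, that the demonstrated density holds at scale and ignores any overhead for redundancy or indexing; the names are my own.

```python
PETABYTE = 10**15  # bytes, decimal definition


def grams_of_dna(data_bytes: float, petabytes_per_gram: float = 200) -> float:
    """Mass of DNA (in grams) needed to hold data_bytes at the given density."""
    return data_bytes / (petabytes_per_gram * PETABYTE)


daily_bytes = 2.5e18  # 2.5 quintillion bytes per day
print(grams_of_dna(daily_bytes))  # 12.5
```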

Despite these advances, the principal limitation for storing information in DNA is likely to be read-write speed. Any method that relies on chemical processes is limited by diffusion rates.

Diffusion is an incredibly slow process. It is about 15 orders of magnitude slower than the propagation of electrical signals. Perhaps advanced imaging methods, such as atomic force microscopy, could circumvent the need for solution-based information retrieval methods. They could speed this process up substantially, but not by anything like 15 orders of magnitude.

Slow, but much faster than DNA read-write ops. From Open Data. Public domain.

Where DNA storage is going

Given this limitation, the impact of DNA storage devices will be small in the near term. We will not be wearing DNA-enabled devices on our wrists or using them to control our driverless cars. But they will have a significant role in storing huge sets of data for long periods of time. Information from gigantic particle physics and astrophysics projects is one example. 3D imaging of specimens in museums is another. Combined with 3D printing, this type of storage would allow us to replicate any artifact or specimen in any museum.

Some of the information stored this way will be used in the future to solve some very big problems. It could enable us to resurrect a keystone species or enable teleportation technologies. The impact of DNA storage devices will not be in preserving cat videos, but in retaining a full description of the natural world at a given point in time. That would be a very big impact indeed.

The long-extinct dodo. By Frederick William Frohawk. Public Domain, https://commons.wikimedia.org/w/index.php?curid=55214730

Drew Smith synthesized and sequenced DNA as a grad student in the early 1980s, using the crude technology of that primitive time.

The Startup

Medium's largest active publication, followed by +566K people. Follow to join our community.
