Beyond the Hard Drive: Encoding Data in DNA

This article is part of a series about how OS Fund (OSF) companies are radically redefining our future by rewriting the operating systems of life. Or as we prefer to think about it: Step 1: Put a dent into the universe. And Step 2: Rewrite the universe. You can see the full OSF collection here and read more about Building a Biological Immune System.

In contemplating the future, I love imagining how our daily lives today will be thought of in the future. What appears sci-fi to us today but will be “normal” 50 years from now? What inefficient and boneheaded things do we do today that future generations will look back and laugh at?

Seeing beyond what’s possible is a rare skill. Being able to design and build beyond what’s possible is even rarer. Put together, this is the unique set of skills and abilities that OSF founders all have in common. Most importantly, they’ve chosen to focus their abilities on tackling the biggest problems humanity faces.

But who are they? What makes them tick? Why do this versus other things? And how might their technologies change the world? These are their stories.

I. More Data Than We Know What to Do With

We are in a golden age of information. Over 90% of all the data created throughout the history of humanity has been generated in just the past two years. The world’s population is creating 2.5 quintillion (2.5 × 10¹⁸) bytes of data every day, and every person will soon be generating 1.7 MB of data each second of their lives. And that’s just humans. The earth is generating orders of magnitude more data than that every minute.

The speed at which we make data is outpacing our ability to store, transport, and access it. This is problematic because managing this massive increase requires energy (the IT industry burns between 7–12% of global energy every year!) and physical space. It’s slightly counterintuitive, but digital storage takes up space and energy, too. Saving to the cloud requires enormous warehouses of servers that consume massive amounts of energy to maintain at the right temperature and humidity.

As recently as a few decades ago, we mostly relied on good ol’ pen and paper to store our data. Then we shifted to magnetic tapes and disks before graduating to digital storage. In some ways, the graduation wasn’t all progress: data still has to be migrated every generation to the newest technology, to ensure that it can still be preserved and read out.

How can our storage abilities keep up with our data generation and recording ability? How can we store data in a medium that will never be obsolete? How can we preserve our history long into the future?

What if we could find a way to store the world’s digital knowledge bank in a way that wasn’t subject to global or regional power loss, consumed a fraction of the energy resources of existing IT infrastructure, and was cost-effective?

The answer was all around us this whole time. In us. It is us: DNA. What if we could store all the world’s information in DNA?

I met recently with Catalog, which is attempting to do just that, and immediately knew that they were on to something big.

II. Small, Efficient, and Durable: Why DNA is the Future of Data Storage

Catalog co-founders Hyunjun Park and Nathaniel Roquet are working to make DNA the next-generation, mainstream storage medium for digital data.

Park has a Ph.D. in microbiology; Roquet has a Ph.D. in biophysics. The two met at MIT and saw a world-changing opportunity to solve the challenge of data storage.

Catalog has invented a methodology that would allow them to fit all of the world’s data into a coat closet and store it for…a very, very long time.

“DNA is an extremely stable material,” Park explained.

We know this because we have found mostly intact DNA from preserved animals hundreds of thousands of years old.

“You see things like horses that were frozen in permafrost in Canada for 700,000 years, and you’re still able to read back the genome of that animal,” Park explained. “The fact that our genetic information is encoded in the medium means that we’ll always be able to read this back. We won’t need to worry about the reading technology evolving past the medium.”

Photo: pip / photocase.com

While early success with storing data as DNA strands using synthetic biology was achieved in 1986, the practice itself is still relatively undefined for purposes of mass adoption. (Recently, entire books and short movies have been stored as DNA, as proofs of principle.)

But Catalog is the first company that has a shot at scalability in a market that’s been benefiting from bulky hardware and planned obsolescence. This is because they’ve found a way that may, within a few years, make DNA storage cheaper than tape storage today.

“There’s currently no way for an individual to physically own really large amounts of data. But it’s easy to just keep a capsule of DNA that’s encoding petabytes of data,” said Park.

“It’s a lot easier to ship small vials of DNA than semi trucks full of hard drives. Right now, the internet is very slow, and when you’re sending petabytes of information, the largest bandwidth you’re going to get is FedEx. In fact, it is cheaper to physically move a company’s hard drives via semi trucks than it is to send over the internet.”
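To get a feel for why shipping beats sending, here is a back-of-the-envelope sketch. The 1 Gbps sustained link speed is my own assumption for illustration, not a figure from Park, and `transfer_days` is a hypothetical helper.

```python
PETABYTE_BYTES = 10**15           # one petabyte, decimal definition
LINK_BITS_PER_SECOND = 10**9      # assumption: a sustained 1 Gbps connection
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes: int, bits_per_second: int) -> float:
    """Days needed to push num_bytes through a link of the given speed."""
    seconds = num_bytes * 8 / bits_per_second
    return seconds / SECONDS_PER_DAY


days = transfer_days(PETABYTE_BYTES, LINK_BITS_PER_SECOND)
print(f"1 PB over 1 Gbps: about {days:.1f} days")
```

Under that assumption a single petabyte ties up the link for roughly three months, which is why a package in the mail can win on effective bandwidth.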

III. DNA Storage Benefits: Universality, Density, and Cost

All life uses DNA (or some form of nucleic acids) for storage. Evolution has settled on one of the most efficient, long-term data storage options we know of, so why wouldn’t we take advantage of its inherent properties to preserve the creations of humanity as well?

DNA is one million times more information dense than flash drives, so storing and transporting it would require significantly fewer resources than the status quo, drastically reducing both the environmental footprint and cost of data storage.

DNA is also much cheaper and easier to copy — something we likely take for granted since our bodies do it automatically countless times a day.

“A thousand copies of the same information doesn’t cost a thousand times the amount of one, as it does with flash drives or hard drives,” Park pointed out.

IV. How DNA Storage Works

TL;DR: Step 1: Build a grid; Step 2: Encode the grid.

Catalog’s novel methodology relies on a deep understanding of the age-old philosophical question: What is information?

When it comes to digital information, data is just a series of ones and zeros. (DNA has four basic units of information: A, T, G, and C. These can be mapped to zeros and ones as long as one maintains a code, so the ones and zeros get stored as ATGCs.)
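As a rough illustration of what "maintaining a code" could look like, here is a minimal sketch that packs two bits into each nucleotide. The particular mapping (00 to A, 01 to C, and so on) and the function names are assumptions of mine; this is the naive textbook approach, not Catalog's encoding.

```python
# Assumed mapping for illustration only: two bits per nucleotide.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def bytes_to_dna(data: bytes) -> str:
    """Encode bytes as a string of bases, two bits per base."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def dna_to_bytes(strand: str) -> bytes:
    """Decode a strand produced by bytes_to_dna back into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = bytes_to_dna(b"Hi")
print(strand)                 # CAGACGGC
print(dna_to_bytes(strand))   # b'Hi'
```

Each byte becomes four bases, so the strand is always four times as long as the input in characters.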

The bottleneck is the high cost of printing out strands of whatever DNA you want, about ten cents per nucleotide today. Think of it like printing beads on a string, each bead costing a dime. Printing the human genome in full, at roughly 3.2 billion nucleotides, would cost 320 million dollars today!
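The arithmetic behind that figure is easy to check. The sketch below uses the ten-cents-per-nucleotide price from above and a genome length of about 3.2 billion nucleotides; the second calculation assumes a naive two-bits-per-nucleotide encoding, which is my assumption and not how Catalog stores data.

```python
COST_PER_NUCLEOTIDE_CENTS = 10            # "about ten cents per nucleotide"
HUMAN_GENOME_NUCLEOTIDES = 3_200_000_000  # roughly 3.2 billion


def synthesis_cost_usd(nucleotides: int) -> int:
    """Cost in whole dollars to print the given number of nucleotides."""
    return nucleotides * COST_PER_NUCLEOTIDE_CENTS // 100


def naive_storage_cost_usd(num_bytes: int) -> int:
    """Cost to store bytes one nucleotide at a time, assuming 2 bits per nucleotide."""
    nucleotides = num_bytes * 8 // 2
    return synthesis_cost_usd(nucleotides)


print(f"Human genome: ${synthesis_cost_usd(HUMAN_GENOME_NUCLEOTIDES):,}")
print(f"One megabyte, naively: ${naive_storage_cost_usd(1_000_000):,}")
```

At these prices a single megabyte written base by base would run to hundreds of thousands of dollars, which is the bottleneck in a nutshell.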

How to get around the bottleneck?

Catalog’s new method breaks many storage and cost bottlenecks by synthesizing large quantities of just a few different DNA molecules and mixing them in different combinations to generate a huge variety of different molecules. These molecules are then used in conjunction with their innovative encoding methods to represent long series of 1s and 0s.

Crudely, this is the same cost-saving technique that, say, the manufacturer of Legos uses. It is very expensive to make the cast mold for a new piece, so Lego designers are generally encouraged to make new sets with pre-existing bricks that can be made cheaply and easily.

So, Lego makes a finite number of mass-produced bricks, and the information is contained in the instruction manual, which comes along and tells those basic pieces where to go in a near-infinite array of sets. These instruction manuals are the equivalent of Catalog’s encoding scheme.

Whereas old methods rely on producing an indeterminate number of new bricks each time from scratch, Catalog pre-defines all of the bricks that will be used and makes large quantities of them beforehand. Then, each time something new needs to be stored, they simply print out a new instruction manual.
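To make the bricks-and-manual idea concrete, here is a toy sketch of how a small, fixed library of parts can stand in for a much larger set of distinct molecules. Everything in it is hypothetical: the component names, the three-layer layout, and the presence-or-absence rule for representing bits are my assumptions for illustration, since the details of Catalog's actual encoding scheme are not described here.

```python
from itertools import product

# Hypothetical pre-made library: 3 positions, 4 interchangeable components each.
LIBRARY = [
    ["A1", "A2", "A3", "A4"],
    ["B1", "B2", "B3", "B4"],
    ["C1", "C2", "C3", "C4"],
]

# Picking one component per position yields one distinct assembled molecule.
IDENTIFIERS = ["-".join(combo) for combo in product(*LIBRARY)]


def encode(bits: str) -> set[str]:
    """The 'instruction manual': include molecule i in the pool if bit i is 1."""
    if len(bits) > len(IDENTIFIERS):
        raise ValueError("bit string longer than the number of identifiers")
    return {IDENTIFIERS[i] for i, bit in enumerate(bits) if bit == "1"}


def decode(pool: set[str], length: int) -> str:
    """Read the bits back by checking which molecules are present in the pool."""
    return "".join("1" if IDENTIFIERS[i] in pool else "0" for i in range(length))


component_count = sum(len(layer) for layer in LIBRARY)
print(component_count, "components ->", len(IDENTIFIERS), "possible molecules")
pool = encode("1011001")
print(decode(pool, 7))  # 1011001
```

In this toy version, 12 pre-made components give 64 distinct molecules, and the number of molecules grows multiplicatively with each added position while the number of parts to synthesize grows only additively.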

Photograph by Zane Thorn

To store one terabyte of data in DNA, for example, they only need to use a few hundred “bricks”. They’ve taken the code of life itself and made it more efficient.

“It doesn’t really matter that it’s a song, or a picture, or video — as long as it can be distilled to a series of ones and zeros, we can use the same protocol for doing it,” Park explained.

“What we’ve done differently at Catalog is that we began by asking what information is and what the best way is for us to represent it using DNA. This was so that we do not constrain ourselves by the way DNA is used in nature.”

V. DNA Storage in Action

While Catalog’s approach may sound like something so technologically complicated it would be limited only to synthetic biologists, you wouldn’t have to know how to sequence DNA to use DNA storage, just like you don’t need to know how to set up a server farm to use the cloud.

In theory, anybody with encoded synthetic DNA could send it to any number of vendors to do the DNA sequencing for them.

“When you store things on Amazon AWS, you don’t worry whether that’s being backed up in Blu-rays, or magnetic tape or hard drives. You just care that the information is safe somewhere and that you’re able to retrieve it. We want the customers not to have to worry about whether it’s ever backed up in DNA or not, just that they’re getting a really good service and have peace of mind that it has multiple redundancies.”

Park’s hope is that storing data in DNA will become as ubiquitous a practice as saving a photo in iCloud. Catalog’s first storage focus will be archival data: data that is typically stored for a long time in dead-tree libraries and isn’t recalled too frequently. Archivists, librarians, governments, and anyone else dedicated to maintaining accurate historical records would have an immediate use case for Catalog’s product.

Currently, the Library of Congress is planning a symposium on data preservation. Park will be there, along with representatives of the Internet Archive (the creators of the Wayback Machine). “They want to be the Library of Alexandria for the modern world, and to make knowledge accessible by everyone,” Park explained, about the Internet Archive.


“They store a lot of data as a result of that. They want to keep that safe in perpetuity, and we’d love to add a layer of safety to their mission using DNA.”

VI. The Future of Information: DNA Across the Universe

So far, Catalog DNA has proven its capacity by encoding full books, like Douglas Adams’ The Hitchhiker’s Guide to the Galaxy, into DNA. Soon, they will be translating entire libraries of books into molecular codes. And one day, they will be able to store all the data the world has ever generated in just a room full of DNA.

Once they pull that off, it’s not a far cry to imagine disseminating Earth’s data to all corners of space, which they’re already working on with Arch Mission.

“In thinking about colonizing other planets, if we want to send the entire internet to Mars for human civilization to continue there, there’s really no other way to store and send all of that information, except by DNA,” Park told me.

Here on Earth, Park said he would love to see data-encoded DNA stored somewhere future-proof, like the Svalbard Global Seed Vault on an island in Norway.

Seeds, of course, are just nature’s way of storing information about a plant, in DNA. Placing all of human knowledge — everything we are, everything we’ve ever done, everything we’ve ever made — in the same seed vault seems somehow fitting.

If we want to make sure the coming generations carry the wisdom of the world forward, so that they too can stand on the shoulders of giants, we need to make sure they won’t lose that opportunity in one flash of a solar flare. We need to make sure that wisdom is packaged in a way that will stand the test of time. Catalog is on the path to making that a reality.