A DNA-Based Archival Storage System

aka using DNA to store computer data

Andrew Linfoot
Jun 10, 2018 · 4 min read

At Pesto we preach “Programming Deliberately”, meaning engineers should understand every line of code that they write.

It’s not sufficient to write code that works. A good engineer should be able to explain in detail why it works.

To develop a better understanding of why things work, it’s useful to learn about programming concepts at least one level of abstraction deeper than you typically work with. If you use ExpressJS, you might want to consider learning how to implement body-parser. If you use Heroku, you should probably know how to set up a blank Ubuntu Digital Ocean droplet.

One way to really dive deep into a topic is to read academic research papers. Research papers by definition can’t leave any detail out. As part of their homework, every Pesto student needs to read an academic research paper from PapersWeLove.org and write a Medium post about at least five things that they learned from that paper. Since it’s only fair that I assign homework that I’d be willing to do myself, I’ve done this assignment as well. Here is what I learned.

The paper I read is A DNA-Based Archival Storage System. It studies the feasibility of using DNA to store binary data. I had heard of the concept before but never actually understood its pros and cons or the potential issues with implementing it.
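To make "storing binary data in DNA" a little more concrete, here is a minimal Python sketch that maps every two bits to one of the four nucleotides. The mapping table and the `encode`/`decode` helpers are assumptions made up for illustration. The paper's actual encoding is more involved, so treat this as a toy, not as the researchers' method.

```python
# Hypothetical mapping of two bits per nucleotide. This is an illustrative
# assumption, not the encoding used in the paper.
BITS_TO_BASE = {"00": "A", "01": "C", "10": "G", "11": "T"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of nucleotides, four bases per byte."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(strand: str) -> bytes:
    """Reverse encode(): read two bits per base and regroup them into bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in strand)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"Hi")
print(strand)          # the two bytes become eight bases
print(decode(strand))  # and decode back to the original bytes
```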

You might wonder, why would we want to use DNA to store data anyway? Don’t hard drives work just fine?

Well, it turns out that we produce a lot of data, and the amount we produce is growing significantly faster than the capacity of existing storage solutions.

DNA, on the other hand, is insanely good at storing data. In theory, it can hold about 1 exabyte of data per cubic millimeter. By comparison, Facebook recently built an entire datacenter dedicated to 1 exabyte of cold storage. On top of that, DNA has an observed half-life of over 500 years, while your standard rotating disks only last 3–5 years.

Actually reading and writing large amounts of data to DNA is not an easy task. With current DNA synthesis and sequencing technologies, it is prohibitively expensive. However, the costs are decreasing rapidly, which leads the researchers to believe it may one day be a viable solution.

Not only are reading and writing DNA expensive, they can also be inaccurate. A modern synthesis machine can add one additional base to a strand of DNA with about 99% accuracy. This may seem pretty good, but when encoding long pieces of information, the inaccuracy adds up quickly. Much of the paper involves the researchers exploring different encoding methods to cope with this inaccuracy.
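A quick back-of-the-envelope calculation shows how fast a 99% per-base accuracy erodes. The sketch below assumes each base is added independently with the same accuracy, which is a simplification. The function name and the strand lengths are made up for illustration.

```python
def error_free_probability(per_base_accuracy: float, strand_length: int) -> float:
    """Chance that every base in a strand is correct, assuming independent errors."""
    return per_base_accuracy ** strand_length


for length in (10, 50, 100, 200):
    p = error_free_probability(0.99, length)
    print(f"{length:>3} bases: {p:.1%} chance of a perfect strand")
```

Under these assumptions, a strand of 100 bases comes out perfect only a little over a third of the time.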

For example, you can write the same piece of data to multiple partially overlapping strands, so that several copies of each piece exist. You can then recover the original data in the event that one strand’s copy is inaccurate.
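The overlapping idea can be sketched in a few lines of Python. The helper names, fragment length and step size below are hypothetical, and the sketch leaves out details a real scheme needs, so it only shows why overlap gives redundancy. In this simplified version the two ends of the data are covered by a single fragment each.

```python
def overlapping_fragments(data: str, fragment_len: int, step: int) -> list[tuple[int, str]]:
    """Slice data into (offset, fragment) windows; they overlap when step < fragment_len."""
    last_start = max(len(data) - fragment_len, 0)
    starts = list(range(0, last_start + 1, step))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the tail of the data is covered
    return [(start, data[start:start + fragment_len]) for start in starts]


def reassemble(fragments: list[tuple[int, str]], total_len: int) -> str:
    """Rebuild the data from whichever fragments survived; '?' marks any gap."""
    out = ["?"] * total_len
    for start, fragment in fragments:
        out[start:start + len(fragment)] = fragment
    return "".join(out)


data = "ACGTTGCAAGCTTCGA"
fragments = overlapping_fragments(data, fragment_len=8, step=4)

# Pretend the middle fragment was lost or unreadable.
survivors = [f for i, f in enumerate(fragments) if i != 1]
print(reassemble(survivors, len(data)) == data)  # its bases are still covered by the neighbours
```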

The problem with this method is that it requires many copies of the same data, leading to a decrease in data density. The researchers devised a new encoding method that takes two pieces of DNA data and writes a third piece that is the exclusive-or of the other two. You can then recover any one of the three pieces as long as you retain the other two.
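The exclusive-or trick is easy to try on ordinary bytes. The sketch below is not the paper's implementation: it works on raw byte strings, not on DNA strands, and `xor_bytes` and the sample payloads are made up for illustration.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Exclusive-or two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("payloads must be the same length")
    return bytes(x ^ y for x, y in zip(a, b))


a = b"GATTACA!"
b = b"payload2"
parity = xor_bytes(a, b)  # the third, redundant piece

# Lose any one of the three pieces and rebuild it from the other two.
assert xor_bytes(b, parity) == a
assert xor_bytes(a, parity) == b
assert xor_bytes(a, b) == parity
```

This works because XOR-ing a value with the same thing twice cancels out, so two payloads are protected by one extra piece where plain duplication would need a full copy of each.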

They tested this encoding technique in their lab by writing three images to DNA using both the existing overlapping technique and their new exclusive-or technique.

They found that their new exclusive-or encoding provided enough redundancy to match the read accuracy of the existing overlapping technique. They were able to recover the data from DNA for all three images, while achieving 2.6x the data density of the overlapping encoding technique.

We are still very far from seeing anything like DNA-based hard drives in our MacBooks. However, reading this paper got the bio-geek in me excited and definitely made me interested in reading more CS academic papers. It forced me to deepen my understanding of how sequencing machines work and made me spend some time googling information theory. Since I’ve never had a formal computer science education, many of the information theory details, like using exclusive-or for data redundancy, were new to me.

Hopefully you learned a few things from my brief summary. The paper goes into way more detail about things like:

  • Optimal ways to break up the data to work with sequencer strand length limits

If you want to dive deep into the details, feel free to read the whole paper yourself or check out other interesting papers on PapersWeLove.org.

Pesto is a career accelerator for software engineers in India. We teach engineers how to be effective remote employees and then match them with international tech companies for full-time remote jobs.

If you are looking to hire engineers who dive deep, shoot me a message at andrew@pesto.tech.

If you are an engineer who wants to level up your skills, apply for our next batch here.

Pesto

Pesto is on a mission to give everyone equal access to…

Andrew Linfoot

Written by

Co-Founder @pestotech | Using education and remote work to give everyone equal access to opportunity, regardless of where they were born.

