A DNA-Based Archival Storage System
aka using DNA to store computer data
--
At Pesto we preach “Programming Deliberately”, meaning engineers should understand every line of code that they write.
It’s not sufficient to write code that works. A good engineer should be able to explain in detail why it works.
To develop a better understanding of why things work, it’s useful to learn about programming concepts at least one level of abstraction deeper than you typically work with. If you use ExpressJS, you might want to consider learning how to implement body-parser. If you use Heroku, you should probably know how to set up a blank Ubuntu Digital Ocean droplet.
One way to really dive deep into a topic is to read academic research papers. Research papers by definition can’t leave any detail out. As part of their homework, every Pesto student needs to read an academic research paper from PapersWeLove.org and write a Medium post about at least five things that they learned from that paper. Since it’s only fair that I assign homework that I’d be willing to do myself, I’ve done this assignment as well. Here is what I learned.
The paper I read is A DNA-Based Archival Storage System. It studies the feasibility of using DNA to store binary data. I had heard of the concept before but never really understood its pros and cons or the potential issues with implementing it.
You might wonder, why would we want to use DNA to store data anyway? Don’t hard drives work just fine?
Well, it turns out that we produce a lot of data, and we are producing it significantly faster than the capacity of existing storage solutions is growing.
DNA, on the other hand, is insanely good at storing data. In theory, you can store about 1 exabyte of data per cubic millimeter. By comparison, Facebook recently built an entire datacenter dedicated to 1 exabyte of cold storage. On top of that, DNA has a half-life of over 500 years. Your standard rotating disks only last 3–5 years.
Actually reading and writing large amounts of data to DNA is not an easy task. With current DNA synthesis (writing) and sequencing (reading) technologies, it is prohibitively expensive. However, the costs are rapidly decreasing, which leads the researchers to believe it may one day be a viable solution.
Not only are reading and writing DNA expensive, but they can also be inaccurate. A modern synthesis machine can add one additional base to a strand of DNA with 99% accuracy. This may seem pretty good, but when encoding long pieces of information, this inaccuracy adds up quickly. Much of the paper explores different encoding methods to cope with this inaccuracy.
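To get a feel for how quickly that 99% adds up, here is a small back-of-the-envelope sketch. It assumes each base is added independently with the same success rate, which is my simplification and not a model taken from the paper. The function name is just something I made up for illustration.

```python
def strand_success_probability(per_base_accuracy: float, length: int) -> float:
    """Chance that all `length` base additions succeed.

    Assumes every addition is independent and equally reliable,
    which is a simplification for illustration only.
    """
    return per_base_accuracy ** length


for length in (10, 100, 200):
    p = strand_success_probability(0.99, length)
    print(f"{length:>3} bases: {p:.1%} chance of an error-free strand")
```

Under that assumption, a 100-base strand comes out error-free only about 37% of the time, and a 200-base strand only about 13% of the time.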
For example, you can write the same piece of data to multiple partially overlapping strands, making it so that there are multiple copies of the data. You then can recover the original data in the event that one strand’s copy is inaccurate.
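Here is a rough sketch of the overlapping idea. The window length, the step size, the example sequence, and the majority-vote decoder are all my own assumptions for illustration, not the scheme as specified in the paper. In particular, I assume each strand's starting position is already known, whereas a real system has to encode that in the strand itself.

```python
from collections import Counter


def overlapping_strands(seq: str, length: int, step: int) -> list[tuple[int, str]]:
    """Cut `seq` into windows of `length` bases, one starting every `step` bases."""
    return [
        (start, seq[start:start + length])
        for start in range(0, len(seq) - length + 1, step)
    ]


def majority_decode(strands: list[tuple[int, str]], total_length: int) -> str:
    """Rebuild the sequence by letting every strand vote on each position it covers."""
    votes = [Counter() for _ in range(total_length)]
    for start, fragment in strands:
        for offset, base in enumerate(fragment):
            votes[start + offset][base] += 1
    return "".join(v.most_common(1)[0][0] for v in votes)


seq = "ACGTTGCAAGCTTCGA"
strands = overlapping_strands(seq, length=8, step=2)

# Simulate one bad copy: flip a single base in the strand that starts at 4.
damaged = [
    (start, frag[:3] + "T" + frag[4:]) if start == 4 else (start, frag)
    for start, frag in strands
]

decoded = majority_decode(damaged, len(seq))
print(decoded == seq)
```

The damaged base sits at a position covered by four strands, so the three good copies outvote the bad one. Notice that the bases near the two ends are covered by fewer strands in this toy version, so they get less protection.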
The problem with this method is that it requires many copies of the same data, leading to a decrease in data density. The researchers devised a new encoding method that involves taking two pieces of DNA data and writing a third piece of DNA that is the exclusive-or of the other two. With this scheme, you can recover any one of the three pieces as long as you still have the other two.
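The exclusive-or trick is easy to try out on ordinary bytes. The sketch below only shows the recovery property. It works on Python bytes, not on DNA bases, and the payloads and the function name are placeholders I made up, so it should not be read as the paper's actual encoding.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise exclusive-or of two equal-length payloads."""
    if len(a) != len(b):
        raise ValueError("payloads must have the same length")
    return bytes(x ^ y for x, y in zip(a, b))


# Two made-up payloads of equal length, plus the third "parity" piece.
piece_a = b"first..."
piece_b = b"second.."
parity = xor_bytes(piece_a, piece_b)

# If piece_a is lost, XOR-ing the two survivors brings it back.
recovered_a = xor_bytes(parity, piece_b)
# The same works if piece_b is the one that is lost.
recovered_b = xor_bytes(parity, piece_a)

print(recovered_a == piece_a, recovered_b == piece_b)
```

This works because XOR-ing a value with itself cancels out, so (A xor B) xor B leaves just A. You store three pieces to protect two, which costs less space than keeping several full copies of each.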
They implemented this encoding technique in their lab by writing three images to DNA using both the existing overlapping technique and their new exclusive-or technique.
They found that their new exclusive-or encoding technique achieved the data redundancy necessary to match the read accuracy of the existing overlapping technique. They were able to recover the data from DNA for all three images, while maintaining 2.6x the data density of the overlapping encoding technique.
We are still very far from seeing anything like DNA-based hard drives in our MacBooks. However, reading this paper got the bio-geek in me excited and definitely made me interested in reading more CS academic papers. It forced me to deepen my knowledge of how sequencing machines work and made me spend some time googling information theory. Since I’ve never had a formal computer science education, many of the information theory details, like using exclusive-or for data redundancy, were new to me.
Hopefully you learned a few things from my brief summary. The paper goes into way more detail about things like:
- Optimal ways to break up the data to work with sequencer strand length limits
- How to mark the data so that you can read just the information you need without sequencing all of the other DNA in the storage container
- Issues with the chemical stability of different base pair sequences at different temperatures
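The first item on that list, breaking data up to fit strand length limits, can be sketched roughly like this. The chunk size and the idea of tagging each chunk with a plain integer index are assumptions of mine for illustration. The paper's actual strand layout and addressing are more involved.

```python
import random


def split_into_strands(data: bytes, payload_size: int) -> list[tuple[int, bytes]]:
    """Split `data` into (index, payload) chunks of at most `payload_size` bytes."""
    return [
        (i // payload_size, data[i:i + payload_size])
        for i in range(0, len(data), payload_size)
    ]


def reassemble(strands: list[tuple[int, bytes]]) -> bytes:
    """Put the chunks back in index order and join their payloads."""
    ordered = sorted(strands, key=lambda strand: strand[0])
    return b"".join(payload for _, payload in ordered)


message = b"DNA strands are short, so long files must be cut into pieces."
strands = split_into_strands(message, payload_size=8)

# Strands in a test tube have no inherent order, so shuffle before reading back.
random.shuffle(strands)
print(reassemble(strands) == message)
```

The index is what lets you put the pieces back together, since there is no "first" or "last" strand once everything is mixed in one container.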
If you want to dive deep into the details, feel free to read the whole paper yourself or check out other interesting papers on PapersWeLove.org.
Pesto is a career accelerator for software engineers in India. We teach engineers how to be effective remote employees and then match them with international tech companies for full-time remote jobs.
If you are looking to hire engineers who dive deep, shoot me a message at andrew@pesto.tech.
If you are an engineer that wants to level up your skills, apply for our next batch here.