Animated sequence of a race horse galloping, from the original photos. Photos taken by Eadweard Muybridge (died 1904), first published in 1887 at Philadelphia (Animal Locomotion). Source: Wikimedia Commons.

The galloping horse GIF is just the trailer…

Kevin O'Connell
BioQuest

--

In the July 20, 2017 issue of Nature, Shipman and colleagues in George Church’s lab at Harvard reported the use of CRISPR/Cas to encode a brief, low-resolution movie (a five-frame GIF) into the genomes of a population of E. coli bacteria. The news was widely discussed in the tech press, including Wired, TechCrunch, Technology Review, Mashable, and others. While certainly clever, the work continues a string of reports (summarized in this 2016 report by Blawat et al.) in which entire books, videos, silent movies, audio clips and images have been encoded and stored in DNA.

Short movie: so what?

So what did the Church lab report last month, and why is it remarkable? The data capture feat alone was unremarkable: five low-resolution frames of a GIF, taken from Eadweard Muybridge’s Human and Animal Locomotion, that together total only 2600 bytes (as opposed to a 22 Mb video encoded in DNA in 2015, referenced in the Blawat paper). The information storage capacity of DNA is phenomenal; all of the information ever recorded by humanity in any form could be stored in a two-car garage if encoded in DNA.

We have already sequenced the genomes of many individual Neanderthals, who last walked the earth over 30,000 years ago. Their DNA was recovered from their bones, found in caves. The fact is that DNA is stable in a way that magnetic or optical storage just can’t duplicate. Newly synthesized DNA that has been purified and is stored in a cool, dry, dark place has a potential lifetime of well over 100,000 years, so DNA as a medium will be readable so long as humans remain a technological species. (Anyone have a Zip drive handy?) Large amounts of data can therefore be encoded as a DNA sequence, synthesized as molecules that embody that sequence, and stored essentially indefinitely. We don’t have a practical need at the moment to store data for multiple millennia, but that kind of stability and information density will recommend DNA as an archival storage medium once DNA synthesis (writing) and DNA sequencing (reading) reach the right price points (which depend on how often you need to rewrite your archive).

What sets this study apart from the others is the storage of the images in living cells. While some past workers have embedded “watermarks” or “signatures” in the engineered genomes of living cells, the horse GIF is the first complex dataset encoded, cultured, extracted, sequenced and reconstructed from a live bacterial population.

In Living Memory…

Why is this important? The ability of living cells to store data is half of a system in which cells sense their environments (internal or external) and create a temporally sequenced record of their experience. For example, brain cells could record and later tell us how they developed their specific functions, helping us understand how the brain develops and works. Cells throughout the body could record their exposure to infection in ways that would speed diagnosis. Microbes synthesizing products during industrial fermentation could preserve a living record of process conditions to enhance quality control.

The mechanism used by the authors to incorporate data into cells of E. coli, called CRISPR/Cas, is derived from a defense mechanism in which bacteria, in a certain sense, record their own history of exposure to viral threats. When attacked by a virus, a bacterium that possesses the CRISPR/Cas system clips a portion of the genome of the virus and writes that fragment into its own genome. That genomic record serves as a kind of immunological memory, allowing the bacterium to “counterattack” when the same kind of virus attacks again. Because the snippet of viral DNA has become part of the bacterium’s genome, all its descendants inherit this “memory” of the viral exposure.

The Harvard authors are obviously not encoding GIFs so that bacteria can show us movies. However, they do look forward to a day when cells can sense an external (or internal) stimulus, encode that event in a DNA sequence, and save the “memory” of that stimulus in their genomes so that we can read it later. Those mechanisms for sensing stimuli and encoding the results aren’t addressed in this paper; the authors performed all that data transformation manually, as they explain. We are years away from developing cells that can sense stimuli and record complex data in their genomes on their own. The authors’ intent was to illustrate the use of living cells as a data repository, and to explore some of the relevant performance characteristics. For example, the efficiency of data recovery from a static image of one of the authors’ hands was 100%, but the efficiency of data recovery for the GIF was just 90%. While 90% accuracy might not suffice for critical data, it may be sufficient for the playback of some kinds of recorded events.

Some Clever Features

The data encoding techniques used in the study illustrate a few interesting phenomena. First, the CRISPR/Cas system in bacteria saves each record of a viral attack in chronological order. Shipman et al. exploit this feature to insert the data for each frame of the GIF into bacteria, and then reconstruct the frames of the GIF in their correct order. So data can already be encoded in temporal sequence.
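To make the ordering idea concrete, here is a minimal Python sketch of how frame order might be inferred from many short, ordered records. It is an illustration under our own assumptions, not the analysis used in the paper: each inner list stands for one cell's record, written oldest to newest, and the function name infer_order and the labels f1 to f5 are hypothetical.

```python
from collections import Counter
from itertools import combinations

def infer_order(arrays):
    """Order labels by how many other labels they usually precede."""
    before = Counter()  # before[(a, b)] = times a was seen earlier than b
    labels = set()
    for array in arrays:
        labels.update(array)
        # combinations() keeps input order, so a always sits before b here
        for a, b in combinations(array, 2):
            before[(a, b)] += 1

    def wins(x):
        # count the labels that x precedes more often than it follows
        return sum(1 for y in labels
                   if y != x and before[(x, y)] > before[(y, x)])

    return sorted(labels, key=lambda x: (-wins(x), x))

# Toy data: no single cell holds all five frames, but pairs overlap.
records = [["f1", "f2"], ["f1", "f3"], ["f1", "f4"], ["f1", "f5"],
           ["f2", "f3"], ["f2", "f4"], ["f2", "f5"],
           ["f3", "f4"], ["f3", "f5"], ["f4", "f5"]]
frame_order = infer_order(records)
```

The point of the sketch is that no single cell needs to hold the whole movie; a population of partial records, each internally ordered, can be enough to recover the sequence.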

Second, the use of an entire bacterial population to store the data is both clever and convenient. Our ability to synthesize DNA is limited to strings about 100 “characters”, or bases, long. To store the entire dataset encoding the GIF in a single cell, one would need to design that entire sequence, synthesize it in fragments, assemble them into a single molecule, and then clone that molecule into a bacterial genome. However, this is not the approach taken by the authors. To avoid the task of assembling the entire GIF manually into one large sequence for storage, the authors developed an encoding scheme in which they encoded small groups of pixels (“pixets”) into individual DNA molecules about 100 bases long. Each molecule included an index sequence to identify the pixet. The authors designed a group of molecules that collectively encode one frame of the GIF and introduced the group into a population of E. coli. After a “rest period”, molecules encoding the next frame of data were introduced and so on until all five frames’ worth of data were introduced into the genomes of the target population.
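The indexing idea can be sketched in a few lines of Python. This is a toy scheme of our own, not the encoding used in the paper: it assumes a four-color image so that one base stands for one pixel (A=0, C=1, G=2, T=3), and it assumes a hypothetical four-base index at the start of each molecule. All function names are ours.

```python
BASES = "ACGT"   # assumption: A=0, C=1, G=2, T=3
INDEX_WIDTH = 4  # hypothetical: 4 bases allow 4**4 = 256 pixet indices

def int_to_bases(value, width):
    """Write a non-negative integer as a fixed-width base-4 string."""
    digits = []
    for _ in range(width):
        digits.append(BASES[value % 4])
        value //= 4
    if value:
        raise ValueError("value does not fit in the requested width")
    return "".join(reversed(digits))

def bases_to_int(seq):
    """Read a base-4 string back into an integer."""
    value = 0
    for base in seq:
        value = value * 4 + BASES.index(base)
    return value

def encode_frame(pixels, pixet_size):
    """Split a flat list of pixel values (0-3) into indexed molecules."""
    molecules = []
    for start in range(0, len(pixels), pixet_size):
        index = int_to_bases(start // pixet_size, INDEX_WIDTH)
        payload = "".join(BASES[p] for p in pixels[start:start + pixet_size])
        molecules.append(index + payload)
    return molecules

def decode_frame(molecules, n_pixels, pixet_size):
    """Rebuild the frame from molecules in any order; gaps stay None."""
    frame = [None] * n_pixels
    for seq in molecules:
        start = bases_to_int(seq[:INDEX_WIDTH]) * pixet_size
        payload = [BASES.index(b) for b in seq[INDEX_WIDTH:]]
        frame[start:start + len(payload)] = payload
    return frame

frame = [0, 1, 2, 3, 3, 2, 1, 0, 1, 1]
molecules = encode_frame(frame, pixet_size=4)
```

Because every molecule carries its own index, decode_frame does not care what order the molecules come back in, and a molecule that never comes back simply leaves None in its slots instead of shifting the rest of the image.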

This approach saves the encoding scientist the trouble of assembling the data into a single large molecule. The downside is that it is possible, as the authors observed, that some molecules may not be effectively taken up by the bacterial population, resulting in gaps or holes in the resulting image. To improve the capture and retention of data, one could increase the redundancy of each data-containing molecule in the experiment. Alternatively, one could improve the encoding scheme to avoid sequences that the target cells may have difficulty incorporating (this strategy is discussed in some detail in the paper).

Third, one can see from the reconstructed GIF that the missing pixels are not contiguous. The authors’ encoding scheme for pixets is cleverly designed to ensure that no pixet contains pixels that are adjacent in the image, so that if an entire pixet is lost, the damage to any one area of the image is limited.
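A short Python sketch shows why spreading pixels out helps. It uses a simple strided assignment along a single row of pixels, which is our assumption for illustration; the paper's actual assignment of pixels to pixets may differ, and the function names are hypothetical.

```python
def contiguous(n_pixels, n_pixets):
    """Naive grouping: each pixet holds one unbroken run of pixels."""
    size = -(-n_pixels // n_pixets)  # ceiling division
    return [list(range(s, min(s + size, n_pixels)))
            for s in range(0, n_pixels, size)]

def interleaved(n_pixels, n_pixets):
    """Strided grouping: pixet k holds pixels k, k + n_pixets, ..."""
    return [list(range(k, n_pixels, n_pixets)) for k in range(n_pixets)]

def longest_run(positions):
    """Length of the longest stretch of consecutive pixel positions."""
    best = run = 0
    prev = None
    for p in sorted(positions):
        run = run + 1 if prev is not None and p == prev + 1 else 1
        best = max(best, run)
        prev = p
    return best

# Lose the middle pixet of three in a 12-pixel row and measure the hole.
hole_contiguous = longest_run(contiguous(12, 3)[1])
hole_interleaved = longest_run(interleaved(12, 3)[1])
```

With the naive grouping, losing one pixet blanks four neighboring pixels in a row; with the strided grouping, the same loss leaves four isolated single-pixel holes. For a two-dimensional image, the stride would also have to be chosen so that vertical neighbors land in different pixets.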

A Parting Thought

What more can be done to facilitate data storage in DNA generally, and specifically in living cells by the authors’ process? More than anything else, we need a breakthrough in DNA synthesis that precipitously drops the cost per base pair. We’ll discuss that in a future blog post.

B.Next is designing a biodefense technology strategy, demonstrating the potential that innovative tools and techniques can provide, and supporting the investment strategies of these innovations.

Check out our work at www.bnext.org
