Drinking from the Fire Hose

Published in

TL;DR Innovation

4 min readApr 17, 2018

Storage Algorithms for Genomic Data

Early 20th century economist Thorstein Veblen is attributed with the quip “Invention is the Mother of Necessity.” More than a witty reversal of the original idiom, Veblen’s observation is a recognition of the circle of life involving problems and solutions. For example, when Apple developed the iPhone it created a huge need for docking stations, colorful cases, protective laminates, and apps. This cycle of [problem]-[solution]-[more problems] is an important area of study within the discipline of Systems Science and is particularly evident in the wake of the Human Genome Project. Given the project’s goal to determine the sequence of the 3 billion chemical base pairs that make up human DNA, at first glance 3 Gigabytes (GB) of data does not seem imposing. However, current technology requires that an individual sample of human DNA be sequenced an equivalent of 28 times (called 28X) in order to obtain reliable sampling statistics — resulting in the generation of around 100 GB of raw sequence data. The storage of this amount of data associated with a single sample is not beyond current commercial technology, but processing, searching, analyzing and copying 100 GB files is at the very least time consuming and inconvenient.

Logo by U.S. Department of Energy on Wikimedia

Now that analysis of the entire Human Genome is possible, eight teams are currently competing for the Archon X Prize for Genomics which seeks the development of technology capable of sequencing 100 human genomes in under 10 days while the 1000 Genomes Project is working to catalog 1000 human genomes to take an initial look at genetic variation. As these projects continue to spur new inventions and opportunities, the necessity for increased data storage, transfer and processing technology will continue to explode. The development of new data compression techniques can help to reduce file size while new storage technologies are being invented.

Genome data files primarily consist of ordered lists of bases from an alphabet comprised of the letters A, C, G and T. We could save the file by printing the ordered list of 3 billion bases on an extremely large poster. One way to conserve paper would be to reduce the font size of the letters — specifically, reducing the physical size of the symbols while preserving the message. This is the same magic used by the Huffman Coding Procedure. While a graduate student at MIT in 1952, David A. Huffman wrote a term paper in which he developed an elegant method for reducing the digital size of characters in a file. Huffman Codes are now used in all types of digital technology including FAX transmissions, WiFi, and HDTV.

Another method for compressing sequential data was developed by Israeli scientists Abraham Lempel and Jacop Ziv in 1977, termed LZ77, in which identical sequences of characters are replaced by references to their first appearance earlier in the file. LZ77 and Huffman Coding are components of the “gzip” compression algorithm currently used by the The National Center for Biotechnology Information (NCBI). Gzip compresses the text-based sequence files to around 25% of their original size, however, it is a general purpose compression algorithm that is not optimized for the contents of genome files.

When the genomic data is examined closely, it is evident that any two human genomes are more than 99% identical — less than 1% of our DNA is required to create the observable differences among individuals. Researchers from the laboratory of Professor Xiaohui Xie at the University of California Irvine (UCI) utilized this fact to develop a compression technique for genomic data in 2008. [1] Using a reference genome and four data-specific algorithms (including Huffman Coding) the UCI technique is capable of compressing a 3 GB genome file down to 4.1 MB, a compression ratio of 99.9%. This is an amazing feat that permits researchers to conceivably access and share genomic data files as email attachments.

However, the UCI technique does require around 4.2 GB of reference genome information and the encoding process uses position values that are relative to previous positions in the genome — meaning large portions of the sequence information need to be decoded completely before the data can be used. Research led by Dr. Waibhav Tembe at the Translational Genomic Research Institute have recently announced the availability of a new compression algorithm called Genomic SQeeZ (G-SQZ).[2] G-SQZ uses Huffman Coding that has been optimized for genomic sequence data. In addition to providing increased compression over the use of the generic gzip, G-SQZ allows the sequence data to be decompressed and retrieved from any position in the file — alleviating the need to decode the entire 3 GB genome.

The U.S. Department of Energy Human Genome Project logo lists six disciplines involved. New inventions and developments in each of these areas will continue to provide the technology necessary to spur research toward a better understanding of ourselves while creating additional opportunities for rewarding careers in problem solving.

[1] S. Christley, Y. Lu, C. Li and X. Xie, “Human genomes as email attachments”, Bioniformatics 25(2) 2009, pp. 274–275.

[2] W. Tembe, J. Lowey and E. Suh, “G-SQZ: Compact Encoding of Genomic Sequence and Quality Data”, Bioinformatics 2010;0:btq34v1-btq346 (Advance Access Manuscript).

________

This material originally appeared as a Contributed Editorial in HPC Source, August 15, 2010, pg. 26.

William L. Weaver is an Associate Professor in the Department of Integrated Science, Business, and Technology at La Salle University in Philadelphia, PA USA. He holds a B.S. Degree with Double Majors in Chemistry and Physics and earned his Ph.D. in Analytical Chemistry with expertise in Ultrafast LASER Spectroscopy. He teaches, writes, and speaks on the application of Systems Thinking to the development of New Products and Innovation.

Drinking from the Fire Hose

Written by William L. Weaver