The “Computational Storage” method in genomics: useful or not?

CATHERINE COSTE
Biomedical Chronicles
Feb 28, 2019 · 9 min read
My twitter account (Dec. 1st, 2018) @cathcoste

Why it matters: computational storage is a new way to combine algorithms with large data sets, which could be used in computational genomics. Until now, fragments of data had to be brought to the algorithm, bit by bit, because of a critical bottleneck inherent in the storage process. Computational storage removes that bottleneck, making it possible to bring the algorithm to the data instead.

This biomedical chronicle was written in French on Dec. 1st, 2018 and was updated on February 27th, 2019.

It is about bioinformatics, a multi-disciplinary field of research in which biologists, physicians, computer scientists, mathematicians, physicists and bioinformaticians work together to solve a scientific problem posed by biology. The specialist who works at the crossroads of these sciences (biology, physics, maths, medicine) and computer science is called a bioinformatician. The search for the “killer app” for computational storage in genomics is currently happening in basic research.

Keywords: bioinformatics, data storage.


Whole genome sequencing is still a science that disrupts itself at a quick pace. Why is it not yet widely used in the clinic, in precision genomic medicine (cancer, rare diseases, genetic diseases)? What is the main obstacle? The wet lab (the equipment and experiments found in biology labs), the dry lab (computer programming, algorithms), or both? Biology merging with computer science is still a new field, so when a disruptive data storage method comes along, you still have to figure out whether it will be of any use in the process of digitising biology and medicine.


Today, we are able to digitise the living, including the human cell, but we still do not know how to casually sequence whole human genomes in the clinic and read (and interpret) them from one end to the other. We have to fragment the data, usually DNA fragment sequences and protein structures. If you no longer need to bring fragments of data to the algorithm, and instead bring the algorithm directly to the whole data set you need to study, it means you have to rewrite the algorithms, or write new ones. So I would guess that the question “is computational storage useful in genomics (DNA, cDNA, RNA) or not?” should be put to bioinformatics specialists (in fundamental research), since using a new data storage method means writing and using new algorithms. Before we do that, though, we need to assess the relevance of computational storage technology to genomics.

https://software.broadinstitute.org/firecloud
https://www.broadinstitute.org

The innovative storage solution offered by NGD Systems makes it possible to keep all the data in storage and search it all at once; there is no need to cut it into fragments as we have done so far. Think of it this way: instead of one woman taking nine months to make a baby, we have nine women who, in just one month, will deliver a baby (Géraldine Van der Auwera, GATK, Broad Institute). Great. But is it useful? If, instead of running 42 km, you could win the marathon event with 42 runners each running one km, would that still be a marathon? I suppose not.

At NGD Systems, they make very fast hard drives with integrated Linux, so that clients can install or deploy their own software.
Once installed, the software gets very fast, direct access to the entire hard disk, without the usual bottleneck: the input-output interface of the disk.

The client has to install software that takes advantage of this.
The constraint is that the embedded processor has limited power: roughly that of a smartphone with a good processor.
As a result, parallelizable algorithms are needed in order to take advantage of this new architecture, or paradigm: quick (near-instant) access to large quantities of data, with limited computing power; in a nutshell, the opposite of what we usually have.
So far it is not known whether this disruptive “solution” is useful in genomics and computational biology. The “solution” is still looking for the problems it might help solve, and this is where the expertise of bioinformaticians and data storage specialists is needed…
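To make “parallelizable algorithms with limited computing power” concrete, here is a toy sketch of my own (not NGD Systems code): an embarrassingly parallel motif count, where each worker scans its own shard of sequence data, the kind of cheap, index-free task a modest in-drive CPU could run directly against the data it stores.

```python
# Toy illustration of a parallelizable, low-compute-per-byte workload.
# Each "drive" scans its own shard for a motif; results are merged.
from multiprocessing import Pool

def count_motif(args):
    shard, motif = args
    # Simple overlapping scan; no index, no large RAM footprint.
    count, pos = 0, shard.find(motif)
    while pos != -1:
        count += 1
        pos = shard.find(motif, pos + 1)
    return count

def parallel_count(shards, motif):
    # One worker per shard, standing in for one CPU per drive.
    with Pool() as pool:
        return sum(pool.map(count_motif, [(s, motif) for s in shards]))

if __name__ == "__main__":
    shards = ["ACGTACGT", "TTACGTT", "GGGG"]
    print(parallel_count(shards, "ACGT"))  # → 3
```

The shard names and data here are invented for illustration; the point is only the shape of the computation: heavy on data access, light on arithmetic.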

Now let’s take a closer look:

https://www.ngdsystems.com/blasting-the-analysis-of-protein-sequencing

“Computational storage” refers to processors embedded in hard drives that offload tasks from the CPU, performing simple calculations such as indexing as well as more complicated ones. An example of use in genomics: “BLASTing the analysis of protein sequencing”.
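For readers unfamiliar with BLAST: it is built on a “seed-and-extend” idea. Below is a vastly simplified, hypothetical sketch of the seed step only (real BLAST adds scoring matrices, extension and statistics); the heavy lifting is a scan over stored sequence data, which is exactly the part computational storage proposes to push into the drive.

```python
# Hypothetical, minimal "seed" step of a BLAST-style search:
# find every k-letter word of the query inside the database sequence.
def seed_hits(query, database, k=3):
    """Return (query_pos, db_pos) pairs where a k-letter word matches."""
    words = {query[i:i+k]: i for i in range(len(query) - k + 1)}
    hits = []
    for j in range(len(database) - k + 1):
        word = database[j:j+k]
        if word in words:
            hits.append((words[word], j))
    return hits

print(seed_hits("MKVL", "AAMKVLGG"))  # → [(0, 2), (1, 3)]
```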

NGD Systems is a startup exploring disruptive ways to store data with computational storage technology: a computer (like the one in a smartphone) embedded in a hard drive of the SSD type (solid state drive), i.e. the modern kind of disk, not the vintage one with a small platter and head spinning inside. NGD Systems started a couple of years ago with over 100 technical patents; it is partially funded by the US government (and got started thanks to private funding). It is a startup in the sense that its innovative storage tech works, yet the “killer app” remains to be found. Some of its prospects and clients are still looking, while others may have found what they seemed to be looking for… They have a YouTube channel with videos you can check out, as well as a website. To understand what computational storage is, and what kinds of problems it can solve, you might want to start here:

https://youtu.be/CO4nisxil0I

Queries for bioinformatics specialists, on behalf of NGD Systems:

A) Is it possible to speed up genome sequencing thanks to computational storage technology? Can we find new, existing, or adapted algorithms which, together with “computational storage”, will help sequence more quickly? Bioinformaticians might start by pondering whether there are calculations for which they do not know how to build indexes (indexes are created to allow more direct access to the data). But this is just a start…

French biologist Patrick Merel in San Diego https://www.sudouest.fr/2011/01/17/son-genome-sur-smartphone-292337-2780.php

French biologist Patrick Merel (Portable Genomics, California), 26th Nov. 2018:

“With Edico Genome (a startup that the biotech giant Illumina Inc., of San Diego, CA, recently bought), genomic computing was in the hardware. Why should we put it in the storage? Would it be of any use? That remains to be seen.”

The bottleneck with an SSD-type hard drive is the wire between the hard drive and the computer; with the old type of hard drive there were two bottlenecks: the wire, plus the head spinning inside the drive. With an SSD, a bottleneck remains between short-term and long-term memory. This requires time-consuming gymnastics: every program copies the data it needs for a search or calculation from the disk (long-term memory) into RAM (short-term memory), does the calculation or search, then copies the result(s) back to long-term memory. RAM (random access memory) is immediate, volatile memory: it consumes a lot of power, is expensive, and has no persistent storage capacity; as soon as the computer is turned off, whatever data was in RAM is gone. It is only used during calculations. A whole genome would not fit in it, it goes without saying.
Inside the computer there is a processor with a very small internal memory. A very fast RAM is added, very close to the processor, and the processor can access any part of this memory at any time (that is what “random” means: any location, at any time, not “anything at random” as one might imagine). There is therefore a bottleneck between the calculation and the data, because of the “pipe” between the hard disk and the computer, between the storage and the computation (to put it simply).
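The “gymnastics” described above can be sketched schematically (file names and the GC-content task are my own invented example): copy from disk to RAM, compute in RAM, copy the result back to disk. Computational storage aims to collapse the two copies by computing where the data already lives.

```python
# Schematic of today's copy-compute-copy cycle.
def gc_content(path_in, path_out):
    with open(path_in) as f:
        data = f.read()                      # copy: disk -> RAM
    gc = sum(data.count(b) for b in "GC")    # compute, entirely in RAM
    ratio = gc / max(len(data.strip()), 1)
    with open(path_out, "w") as f:
        f.write(f"{ratio:.3f}\n")            # copy: RAM -> disk
```

For data sets far larger than RAM, the first copy must itself be chunked, which is exactly why today's tools fragment the data before bringing it to the algorithm.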

Merging processor and storage to avoid any bottleneck at all would be too expensive; we do not know how to do it today, because the research would cost too much. Instead, an existing computer (a smartphone's) is put inside the SSD-type hard drive, to “overcome” one of the two bottlenecks (so one remains). Today, to work around that remaining bottleneck, computer scientists create indexes that give direct access to the data they want (the index serves no purpose other than to get around the persistent bottleneck).


If the computer in the hard disk lets us skip the “copying” phase, that is, if we can get rid of what is hindering us (the remaining bottleneck) by putting the computation in the storage, we would still have to invent new (or recycle existing) algorithms, as today's algorithms are optimised for the case where storage is not computational (by far the most common case today). If we can come up with new algorithmic solutions, we will be able to cross-reference large data sets very quickly, in all directions. This might be an interesting opportunity in genomics, but it still needs to be confirmed and, last but not least, implemented…

We accept an imperfect “wet lab” because the “dry lab” compensates for this imperfection, making it disappear

B) The other question to ask is much more prospective, even science-fiction oriented: in sequencing and medical tests, the “wet lab” is the most expensive part, whereas the “dry lab” does not cost much. What if we could “distort” the “wet lab” a little, while offsetting that with the powerful tools we have in the “dry lab”, in order to save money? Let me explain: a shared “wet lab” (lots of samples together, mixed, grouped) would get everything mixed up. A recipe for disaster. However, we do have powerful tools to sort things out in the “dry lab” (computer programming, R statistics, the Python language). In other words, the “dry lab” could compensate for the mess we knowingly created in the “wet lab”, in order to save money. Wouldn't a “computational storage” solution of some kind come in handy here?

Let's take an example from photography. In the days of negatives, camera lenses had to be “faithful” (correct, accurate, true), that is to say, the assembly of optical elements in the lens had to reproduce an accurate image on the negative. There was this obligation to have a “faithful” lens attached to the camera.

Today, the negative has been replaced by an electronic sensor. So the lenses we sell to photo enthusiasts, who buy them as spare parts and fix them to their cameras, distort the image! Now, in this story, replace the lens with the “wet lab” and see what I am getting at. We accept an imperfect “wet lab” because the “dry lab”, the computer inside the camera, compensates for this imperfection, making it disappear. We can afford to produce cheaper and smaller lenses because we have a “dry lab” so powerful that we can settle for an imperfect “wet lab”… We do so knowingly: we use an imperfect “wet lab” to achieve substantial savings (and substantial savings always come from reducing “wet lab” costs; the dry lab is cheap).

The “computational storage” technology is new. It will be interesting to follow developments in bioinformatics, and especially new implementations. At this stage, the technology of genome sequencing has been disrupting itself almost every other day.

The plethora of repetitions the human genome is made of suggests it should be possible to compress its size. Repetition is often a way to guard against reproduction errors, but genomics experts will have to validate these hypotheses.
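The intuition that repetitive sequences compress well can be checked with a quick, generic experiment (standard zlib, nothing genome-specific; the sequences below are invented): a heavily repetitive string shrinks to a tiny fraction of its size, while a random string of the same length and alphabet barely does.

```python
# Repetitive vs random sequence under a general-purpose compressor.
import random
import zlib

random.seed(0)
repetitive = "ACGTACGT" * 1000                      # 8000 chars, pure repeats
shuffled = "".join(random.choices("ACGT", k=8000))  # 8000 chars, random

print(len(zlib.compress(repetitive.encode())))  # very small
print(len(zlib.compress(shuffled.encode())))    # much larger
```

Real genome compressors exploit repeat structure far more aggressively than zlib, but the direction of the effect is the same.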


Interested in knowing more about computational storage technology? Do not hesitate to get in touch with Scott Shadley, Marketing Executive with NGD Systems.



CATHERINE COSTE
Biomedical Chronicles

MITx EdX 7.00x, 7.28.1x, 7.28.2x, 7.QBWx certified. Early adopter of scientific MOOCs & teacher. Editor of The French Tech Comedy.