This week … 2017–1–31
It’s been an interesting week in the world. I can’t do justice here to the disaster that is the executive order Muslim Ban, but suffice it to say that the lives and livelihoods of dear and extraordinarily talented friends in and outside academia are threatened, for no appreciable reduction in risk for anyone. It is bald-faced cruelty. If you haven’t already, I urge you to donate money to your favorite legal charity such as the ACLU or Southern Poverty Law Center. I know I will do so.
Back to bioinformatics. Or rather, right now my life is mostly data management and structure design. Working with a petabyte of data is a logistical nightmare and a high financial risk, as bad decisions can cost my lab hundreds of thousands of dollars. So, let’s go over the major costs.
- Storage : This is the single largest cost in our lab by far. In fact, other bioinformaticians estimate storage as 90% of the total cost of their work. Computation is relatively cheap, assuming that you can get the data to the CPU, which means…
- Data transfer : It’s cheap to get data in, expensive to get it out. Let’s say I want to rerun haplotype calling with the new GATK 4.0. Our lab already has dedicated compute resources that are bought and paid for, but our cluster simply does not have the storage to hold the data. So, to rerun the analysis, we have to pipe a petabyte of data down from the cloud and run the analysis on our cluster. Data transfer alone would cost tens of thousands of dollars, and that is to run a single analysis. Compounding the problem, the limited on-cluster storage means we have to toss each BAM file as soon as we finish processing it. Fortunately, a new cluster on campus is coming online that has the storage capacity to hold our data, but its storage costs are hundreds of thousands of dollars more than our best offer for storage.
- Data transfer/storage tradeoff : Most analyses I would want to do can get by on just the gVCF or VCF files generated by GATK. But what if I want to use a new deep learning algorithm to discover the functional changes due to mutation in the non-coding regions, say an algorithm like DanQ? I would need to download the files and run them locally on the cluster we own, or run them in the cloud and essentially let our on-cluster compute resources languish. It makes more sense than ever to simply run all compute operations in the cloud, as data ingress/egress costs rapidly pile up for each analysis we would want to do.
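To make the tradeoff above a bit more concrete, here is a back-of-the-envelope sketch in Python. The per-GB prices are hypothetical placeholders I picked for illustration, not quotes from any provider or our actual offers, and the function names are my own. The only point is that egress is paid again for every analysis that re-downloads the data, while storage is paid per month regardless.

```python
GB_PER_PB = 1_000_000  # decimal gigabytes per petabyte

# Hypothetical placeholder prices (assumptions, not real quotes).
EGRESS_USD_PER_GB = 0.05
STORAGE_USD_PER_GB_MONTH = 0.01


def egress_cost(size_pb, usd_per_gb=EGRESS_USD_PER_GB):
    """Dollars to download `size_pb` petabytes from the cloud once."""
    return size_pb * GB_PER_PB * usd_per_gb


def storage_cost(size_pb, months, usd_per_gb_month=STORAGE_USD_PER_GB_MONTH):
    """Dollars to keep `size_pb` petabytes stored for `months` months."""
    return size_pb * GB_PER_PB * usd_per_gb_month * months


def reanalysis_cost(size_pb, n_analyses, usd_per_gb=EGRESS_USD_PER_GB):
    """Egress dollars when every analysis re-downloads the full data set,
    which is what happens if on-cluster storage cannot keep the files."""
    return n_analyses * egress_cost(size_pb, usd_per_gb)


if __name__ == "__main__":
    size_pb = 1
    print(f"One full download:        ${egress_cost(size_pb):,.0f}")
    print(f"Three local re-analyses:  ${reanalysis_cost(size_pb, 3):,.0f}")
    print(f"One year of storage:      ${storage_cost(size_pb, 12):,.0f}")
```

With these placeholder prices, a single full download of one petabyte comes to $50,000, so a handful of local re-analyses quickly rivals a year of storage. Swap in real quotes before drawing any conclusions.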
More on this problem later this week. Oh, and Julia rocks.