Making Cloud Computing a Viable Option for Scientists
The world of science has changed, and there is no question about this. The new model is for the data to be captured by instruments or generated by simulations before being processed by software and for the resulting information or knowledge to be stored in computers. Scientists only get to look at their data fairly late in this pipeline. The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm for scientific exploration. — Jim Gray on eScience: A Transformed Scientific Method (2007)
Imagine that you are a scientist studying seismic waves generated by earthquakes. You have many instruments collecting data the moment an earthquake strikes. Because of the volume of data generated by even a single earthquake, you need high-performance computers to process the data and run the analyses that determine its intensity or the exact location of the epicenter. For that, you need access to a university- or government-owned cluster. Acquiring and installing a computing cluster costs hundreds of thousands (if not millions) of dollars, not to mention the maintenance and operational costs. You probably won’t have one in your own lab because it would sit idle most of the time, since there are only a few earthquakes in a year. That would not be a wise way to spend your research funding.
In another scenario, imagine that you lead a team of scientists who study outbreaks caused by antibiotic-resistant strains of bacteria. You would want to get samples of the strain when such an incident happens. You would want your team to sequence its genome, see how it evolved its resistance, and know exactly which mutations made it more infectious or more resistant than its parental strain. Again, in this case, you don’t get many outbreaks in a year.
The traditional setup of outsourcing computational jobs to state-owned computing clusters is fine. It has been serving the scientific community quite well. In fact, for some really large-scale, computationally intensive projects, pooling resources from governments, companies, and well-funded institutions is the only solution. Think about the LHC (Large Hadron Collider) project, for instance. For the many small labs that only occasionally need to scale up their computational capacity, however, cloud computing is becoming a viable option.
The main barrier to mainstream adoption of cloud computing by the scientific community is cost. The good news is that prices for raw computing power have been going down as a result of competition between the major cloud providers. However, the cost and hassle of data transfer can become prohibitive for use cases where huge volumes of data have to be moved to the cloud, although there are signs of improvement in this space as well. Recently, Amazon Web Services introduced its Import/Export Snowball service, which is billed as faster, cleaner, simpler, more efficient, and more secure than its older data transfer models.
In any case, cloud computing is poised to play a big role in realizing Jim Gray’s e-science vision. Having access to an inexpensive compute cluster in the cloud is one thing; having the software and tools that run on it is quite another. To really have an efficient e-science environment built on top of the cloud, we need software that abstracts away the complexities of doing scientific simulations, data crunching, and analyses in the cloud, and that is accessible through the web browser. Just as we have made easy-to-use high-tech instruments for collecting or generating data, we should also develop user-friendly software platforms that can scale.
The whole business of going from an instrument to a Web browser involves a vast number of skills. Yet what’s going on is actually very simple. We ought to be able to create a Beowulf-like package and some templates that would allow people who are doing wet-lab experiments to be able to just collect their data, put it into a database, and publish it. This could be done by building a few prototypes and documenting them. It will take several years to do this, but it will have a big impact on the way science is done. — Jim Gray
Together with two co-founders, I have embarked on an adventure to develop scientific computational platforms in the cloud through our fledgling startup company, NanoTechGalaxy. From our previous experience doing data-intensive science in labs with limited access to university or institutional compute clusters, we have come to realize the potential of cloud computing for improving how we do science. But since existing tools are built to run on desktops or through command-line interfaces to computing grids, the scarcity of ready-to-use software solutions may in fact be a more immediate barrier than cost. Big technology and e-commerce companies like Facebook, Amazon, and Walmart have managed to create browser-based user experiences that hide the complexity of running background tasks in the cloud, but few if any tech companies have created similar tools for scientists. As Jim Gray pointed out, the whole business of going from instruments to the web browser involves a vast number of skills, but what’s going on is actually very simple. From our point of view, we just need three things: a system for creating and tearing down compute clusters on demand (regardless of the number of cores), a system for submitting and executing background tasks on those clusters, and an efficient way to manage the storage of input and output data. All of these should be built from the ground up to leverage the scalable nature of the cloud. On top of those systems, we can develop powerful tools for visualization, exploration, and analysis.
Commercial organizations like Walmart can afford to build their own data management software, but in science we do not have that luxury. At present, we have hardly any data visualization and analysis tools. Some research communities use MATLAB, for example, but the funding agencies in the U.S. and elsewhere need to do a lot more to foster the building of tools to make scientists more productive. When you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. — Jim Gray
So we set out to create a web-based scientific platform to serve as a proof of concept. The idea is not totally novel; we just wanted to know whether we could create something better, both in how efficiently it leverages the power of the cloud and in how much it improves the user experience. We chose to work on a platform for doing virtual screening through molecular docking. The choice was based on our familiarity with the process and the relative ease of data management. I set up virtual screening runs during my PhD using a quad-core iMac in our lab, so I am familiar with the practical aspects of it, and I understand the biology behind the process. Also, the volume of data to manage and transfer is not huge compared to that of a genomics platform.
Molecular docking is a process for identifying molecules that have potential activity against a protein, either to inhibit or to enhance its function. It is done as part of the pre-clinical hit identification process in a structure-based approach to drug discovery. Basically, it’s like being given a lock and having to find a compatible key out of millions of keys. The key is analogous to a small molecule (that is, a drug) and the lock corresponds to the protein. Molecular docking can give you a short list of small molecules that have the potential to inhibit your protein of interest, which you can then test in the lab. Since the docking software attempts to model the actual biological interaction, it will give you a short list that is way better than a random selection from a database of millions of molecules.
We decided to build the platform on AWS (Amazon Web Services) since, in our assessment, it is the cloud provider with the most complete application programming interface. We automated the elastic provisioning of servers in the cloud to provide a dedicated cluster for every user on the fly, and we created a web interface for submitting jobs and executing them in the background. Together, these give us a base system upon which to build our virtual screening platform. Furthermore, we have made the task execution robust and fault tolerant so that we can use spot instances in AWS. Spot instances are a pricing model in which AWS offers its unused servers at a discount (up to 80% less than the on-demand price), with the trade-off that AWS can reclaim those servers when it needs the capacity back.
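To make the idea of fault-tolerant task execution on spot instances more concrete, here is a minimal sketch in Python. It is not our production code and it does not call any AWS API: the names `SpotInterruption`, `run_with_requeue`, and `flaky_dock` are hypothetical, and the interruption is simulated. The only point it illustrates is that a task whose server disappears is put back on the queue and retried instead of being lost.

```python
from collections import deque


class SpotInterruption(Exception):
    """Hypothetical signal that the (simulated) spot server was reclaimed mid-task."""


def run_with_requeue(tasks, worker, max_attempts=5):
    """Run every task once to completion, requeueing the ones that get interrupted.

    Returns two dicts keyed by task: the results and the number of attempts used.
    """
    queue = deque(tasks)
    results, attempts = {}, {}
    while queue:
        task = queue.popleft()
        attempts[task] = attempts.get(task, 0) + 1
        try:
            results[task] = worker(task, attempts[task])
        except SpotInterruption:
            if attempts[task] >= max_attempts:
                raise RuntimeError(f"task {task!r} was interrupted {max_attempts} times")
            queue.append(task)  # retry later, possibly on a different server
    return results, attempts


def flaky_dock(ligand, attempt):
    """Stand-in for a docking run (assumption: real work would call a docking program)."""
    # Simulate a reclaimed server: ligands ending in "b" fail on their first attempt.
    if attempt == 1 and ligand.endswith("b"):
        raise SpotInterruption(ligand)
    return f"docked:{ligand}"


results, attempts = run_with_requeue(["lig-a", "lig-b", "lig-c"], flaky_dock)
print(results)
print(attempts)
```

In this toy run, `lig-b` is interrupted once and completes on its second attempt, while the other two ligands finish on their first. A real system would also need to persist the queue outside the workers so that the queue itself survives a reclaimed server.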
We were able to make the whole experience browser-based, including the molecular visualizations during job submission and when viewing the docking results. For one of our pilot projects for a client, we were able to spin up 440 cores and perform about 42,000 dockings per hour, which is around a million compounds screened per day. Doing virtual screening on the quad-core iMac in my previous lab, we could only manage to screen about 14,000 compounds per day. Using our platform, one can easily scale up to thousands of cores with the click of a button without disrupting any currently running screening jobs.
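The arithmetic behind these figures can be checked with a few lines of Python. This is only a back-of-the-envelope sketch using the numbers quoted above; it assumes that one docking corresponds to one compound screened and that both machines run around the clock. The variable names are ours.

```python
# Figures quoted in the text.
cloud_cores = 440
cloud_dockings_per_hour = 42_000
imac_cores = 4
imac_compounds_per_day = 14_000

# Daily throughput of the cloud cluster, assuming 24 hours of continuous running.
cloud_compounds_per_day = cloud_dockings_per_hour * 24

# How many times more compounds per day the cluster screens than the iMac.
speedup = cloud_compounds_per_day / imac_compounds_per_day

# Per-core throughput, in dockings per core-hour, for each setup.
cloud_per_core_hour = cloud_dockings_per_hour / cloud_cores
imac_per_core_hour = imac_compounds_per_day / 24 / imac_cores

print(f"cloud: {cloud_compounds_per_day:,} compounds/day")
print(f"speedup over the iMac: {speedup:.0f}x")
print(f"per core-hour: cloud {cloud_per_core_hour:.1f}, iMac {imac_per_core_hour:.1f}")
```

The cluster comes out at 1,008,000 compounds per day, roughly 72 times the iMac's daily throughput. The per-core figures are not directly comparable, since the two runs used different hardware, compound libraries, and docking settings; the gain here comes from the number of cores available on demand.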
Our team has been actively developing the base system and the docking platform built on top of it. In particular, we have been incorporating elements of AI (artificial intelligence) through deep learning, both to speed up screening by conventional docking programs and to develop our own in-house docking algorithms. However, the merits of virtual screening through molecular docking are still being hotly debated. In particular, the interaction models used by popular docking programs are quite simplistic and are not appropriate for certain types of proteins (e.g. some enzymes). It is an active area of research nonetheless, and we could see mainstream adoption of routine virtual screening within the next 5 to 10 years.
For now, we are moving into the genomics space. It has its own set of implementation challenges, but genome sequencing technologies are now considered mature enough to provide precious biological insights. It is becoming routine for labs and hospitals to do whole-genome sequencing for research and for diagnostics, and the processing and analysis of the resulting terabytes or petabytes of genomic data is now considered the bottleneck. We have our proof of technology: we can indeed use the cloud to create a platform where any scientist can spin up a cluster at any time and at any scale, then initiate jobs with as little friction as possible. With that in hand, we think we are ready for the challenge.
The quotes from Jim Gray were derived from:
Hey, Tony; Tansley, Stewart; Tolle, Kristin. The Fourth Paradigm: Data-Intensive Scientific Discovery (2009). Microsoft Research. Kindle Edition.