At GRAIL, our mission is to detect cancer early, when it can be cured. Our approach is a particularly data-intensive one. We collect genomic sequencing data from large cohorts of study participants. These data have to be processed in computationally intensive ways, and then analyzed en masse to train models that can predict an individual’s cancer status.
This approach to building models is iterative: during development, researchers must test their ideas on real data in order to gain more insight, yielding further improvements. This kind of development benefits from quick feedback cycles; researchers must be able to execute on their ideas quickly, and learn from results. The approach is viable only if we can perform the required data processing in a timely and efficient manner.
At GRAIL, we use the Go programming language for most of our bioinformatics, data processing, and machine learning tasks. Although Go is a somewhat unconventional language for these domains, we feel well served by our choice: Go’s simplicity makes it easy for newcomers to learn; its transparent runtime semantics make it easy to reason about performance; and its ability to control data layout and allocation makes it possible to write highly performant data processing code.
However, the Go ecosystem lacked a tool for large-scale data processing. Bigslice is that tool.
Bigslice is a library for data processing in Go. Bigslice provides a coherent set of operators that helps the user efficiently compute over large datasets using ordinary Go code. The user’s computations are written sequentially (they specify how a dataset is to be transformed, step by step, into the desired result), but Bigslice parallelizes the computation and can distribute it across many processors and over large compute clusters.
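To make the idea concrete, here is a minimal sketch in plain Go, using only the standard library. It is not Bigslice's API or implementation: the names `split` and `parallelMap` are hypothetical, and goroutines on a single machine stand in for a cluster. What it shows is the division of labor described above: the user supplies an ordinary sequential function over one record, and the framework decides how to run it in parallel.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// split divides records into at most nshard contiguous pieces.
// (Hypothetical helper for illustration; not part of Bigslice.)
func split(records []string, nshard int) [][]string {
	if nshard < 1 {
		nshard = 1
	}
	size := (len(records) + nshard - 1) / nshard // ceiling division
	var shards [][]string
	for start := 0; start < len(records); start += size {
		end := start + size
		if end > len(records) {
			end = len(records)
		}
		shards = append(shards, records[start:end])
	}
	return shards
}

// parallelMap applies fn to every record, running one goroutine per
// shard, and returns the results in the original input order.
func parallelMap(records []string, nshard int, fn func(string) string) []string {
	shards := split(records, nshard)
	results := make([][]string, len(shards))
	var wg sync.WaitGroup
	for i, shard := range shards {
		wg.Add(1)
		go func(i int, shard []string) {
			defer wg.Done()
			out := make([]string, len(shard))
			for j, rec := range shard {
				out[j] = fn(rec) // the user's sequential, per-record logic
			}
			results[i] = out // each goroutine writes only its own slot
		}(i, shard)
	}
	wg.Wait()
	var merged []string
	for _, out := range results {
		merged = append(merged, out...)
	}
	return merged
}

func main() {
	reads := []string{"acgt", "ttga", "ccat", "gatc", "tacg"}
	fmt.Println(parallelMap(reads, 2, strings.ToUpper))
	// prints [ACGT TTGA CCAT GATC TACG]
}
```

The function passed to `parallelMap` knows nothing about shards or goroutines, which is the property that lets a framework change the degree of parallelism without touching user code.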
Under the hood, Bigslice splits each dataset into many smaller pieces and applies the transformations to each piece individually, so that each piece fits in memory and the pieces can be processed in parallel across many machines. When a transformation requires that data be rearranged (for operations like join or reduce), Bigslice reshuffles the data accordingly.
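The reshuffling step can be sketched the same way. The following is a simplified, single-process illustration, not Bigslice's actual shuffle: `partition`, `shuffle`, and `countShard` are hypothetical names, and hashing keys with FNV-1a is an assumption made for the example. The point is that once every occurrence of a key lands in the same output piece, each piece can be reduced independently.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partition returns the index of the output shard that owns key.
// (Hypothetical helper; the hash function is an assumption.)
func partition(key string, nshard int) int {
	h := fnv.New32a()
	h.Write([]byte(key))
	return int(h.Sum32() % uint32(nshard))
}

// shuffle regroups the input shards so that every occurrence of a
// key ends up in the same output shard.
func shuffle(in [][]string, nshard int) [][]string {
	out := make([][]string, nshard)
	for _, shard := range in {
		for _, key := range shard {
			p := partition(key, nshard)
			out[p] = append(out[p], key)
		}
	}
	return out
}

// countShard reduces a single shard without looking at any other.
func countShard(shard []string) map[string]int {
	counts := make(map[string]int)
	for _, key := range shard {
		counts[key]++
	}
	return counts
}

func main() {
	// Three input shards with keys scattered across them.
	in := [][]string{
		{"BRCA1", "TP53"},
		{"TP53", "EGFR"},
		{"BRCA1", "TP53"},
	}
	total := make(map[string]int)
	for _, shard := range shuffle(in, 2) {
		// No key spans two shards after the shuffle, so the
		// per-shard counts are already final.
		for key, n := range countShard(shard) {
			total[key] += n
		}
	}
	fmt.Println(total["BRCA1"], total["TP53"], total["EGFR"])
	// prints 2 3 1
}
```

In a distributed setting the output shards would live on different machines and the shuffle would move data over the network, but the invariant is the same: a key's records are co-located before the reduce runs.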
Bigslice is an ordinary Go library. Users can, for example: write their own libraries that use Bigslice; use Bigslice inside a program’s main function; or write unit tests. By default, Bigslice will distribute its computation among the available processor cores on the machine it’s running on. This is useful for testing and development.
Most use cases of Bigslice, however, require clusters of machines to perform the task at hand. For these, Bigslice can be instructed to self-distribute across compute instances in the user’s cloud provider. Bigslice uses Bigmachine to manage an ad hoc cluster of compute nodes to support distribution. Once the job is done, the nodes are torn down again; Bigslice allocates only the compute that is needed for the job. Since Bigslice also provides fault tolerance, it can provision cheaper but less reliable instances (for example, from AWS’s EC2 spot market); Bigslice transparently manages recovery when and if needed.
Using Bigslice to distribute computation requires no additional infrastructure: through Bigmachine, Bigslice manages all of the underlying infrastructure concerns itself. The only requirement is that cloud credentials are available in the user’s environment. In this sense, you could call Bigslice a “serverless” data processing system.
As with Bigmachine, Bigslice goes to great lengths to cohere with the existing Go ecosystem and tooling. Even when performing data processing on a large cluster of machines, Bigslice acts like a single process. A particularly useful aspect of this is that the user can retrieve runtime performance profiles across the full cluster, making it easier to understand why a particular job may not be performing as well as the user expected.
At GRAIL, we are using Bigslice extensively across many applications, including:
- data processing and ETL;
- generating datasets for machine learning tasks;
- bioinformatics pipelines;
- training machine learning models; and
- a high-level query language for genomic data.
We believe that Bigslice is now in a state that will be useful to a much broader audience. As of today, Bigslice is open source: the code is at https://github.com/grailbio/bigslice, and the project site is https://bigslice.io/. We are looking forward to accepting contributions from the greater community.
Thanks to Nick Schrock, Jaran Charumilind, Oggie Nikolic, and Demetri Nicolaou for feedback on this post.