Self-managing serverless computing with Bigmachine

Marius Eriksen
Published in grail-eng · Oct 2, 2019

Building systems on modern cloud infrastructure can often feel like an exercise in systems integration rather than systems building. Cloud providers supply powerful components that, with a little code here, a dash of configuration there, can be put together in countless ways to achieve the task at hand. Mostly this is good — a modern version of the vaunted software components, providing a high degree of reusability and specialization — but it can also get in the way: code can become diffuse and difficult to reason about as a whole; and it can seem like we spend more time provisioning, configuring, and monitoring infrastructure than we do solving the actual problem. Worse, reasoning about a system often requires juggling multiple layers of systems, most of them opaque to you.

Bigmachine is an attempt to reclaim programmability in cloud computing.

Bigmachine is a Go library that lets the user construct a system by writing ordinary Go code in a single, self-contained, monolithic program. This program then manages the necessary compute resources, and distributes itself across them. No infrastructure is required besides credentials to your cloud provider.

Bigmachine achieves this by defining a unified programming model: a Bigmachine binary distributes callable services onto abstract “machines”. The Bigmachine library manages how these machines are provisioned, and transparently provides a mutually authenticated RPC mechanism. From the programmer’s perspective, a machine is an object that represents code running in a different address space, exposing a set of user-defined methods that may be called from any other machine in the cluster.
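To make the model concrete, here is a minimal sketch of a Bigmachine program: a service with one method, installed on a single machine and invoked over the built-in RPC mechanism. The exact names and signatures shown (driver.Start, bigmachine.Services, Machine.Call) should be treated as approximate rather than a definitive reference.

```go
package main

import (
	"context"
	"flag"
	"fmt"
	"log"

	"github.com/grailbio/bigmachine"
	"github.com/grailbio/bigmachine/driver"
)

// AddArgs is the request payload for Adder.Add.
type AddArgs struct{ X, Y int }

// Adder is a service: its exported methods become RPCs that can be invoked
// on whichever machines the service is installed on.
type Adder struct{}

// Add computes X+Y on the remote machine and returns it through sum.
func (Adder) Add(ctx context.Context, args AddArgs, sum *int) error {
	*sum = args.X + args.Y
	return nil
}

func main() {
	flag.Parse()
	// driver.Start bootstraps Bigmachine: run as the driver, the program
	// gets a handle that can launch machines; run as a bootstrapped
	// machine, it serves the installed services instead.
	b := driver.Start()
	defer b.Shutdown()

	ctx := context.Background()
	// Ask for one machine with the "Adder" service installed. On a cloud
	// backend this provisions an instance; locally it forks a process.
	// (A real program may need to wait for the machine to become ready.)
	machines, err := b.Start(ctx, 1, bigmachine.Services{"Adder": Adder{}})
	if err != nil {
		log.Fatal(err)
	}

	// Invoke a method by "Service.Method" name, much like net/rpc.
	var sum int
	if err := machines[0].Call(ctx, "Adder.Add", AddArgs{1, 2}, &sum); err != nil {
		log.Fatal(err)
	}
	fmt.Println("1 + 2 =", sum)
}
```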

Bigmachine tries hard to preserve the advantages and simplicity of local computing. For example, when the user queries any of Go’s profiling endpoints, e.g., /debug/pprof/heap to retrieve a heap profile, Bigmachine returns a profile merged from all the machines currently under management. Likewise, anything written to standard output or error (e.g., log messages) is written to the local standard output or error, prefixed by the machine name. The same goes for any monitoring variables exported by the standard expvar package.
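Exporting such a variable is just ordinary Go: the snippet below is plain standard-library expvar code, and the cross-machine aggregation is the Bigmachine behavior described above rather than anything the snippet itself does.

```go
package main

import (
	"expvar"
	"fmt"
)

// recordsProcessed is an ordinary expvar counter, served at /debug/vars as
// in any Go program. Per the behavior described above, values exported this
// way on each machine are also surfaced through the driver.
var recordsProcessed = expvar.NewInt("records_processed")

func process(record string) {
	// ... do the actual work ...
	recordsProcessed.Add(1)
}

func main() {
	process("example")
	fmt.Println("processed:", recordsProcessed.Value())
}
```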

Because Bigmachine provides an abstract model of cluster computing, the same code can be run on a simulated cluster simply by forking many local processes. This is especially useful during development and testing. For example, you can write a single unit test for a distributed system, and even simulate poor network connections and let loose some chaos monkeys! (Bigmachine provides testing utilities for this.)
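A unit test along these lines might look like the following sketch, reusing the Adder service from the earlier example. The testsystem.New helper shown here is an assumption about the shape of those testing utilities, not their exact API.

```go
package main

import (
	"context"
	"testing"

	"github.com/grailbio/bigmachine"
	"github.com/grailbio/bigmachine/testsystem"
)

// TestAdd exercises the Adder service on an in-process "cluster": the test
// system stands in for a cloud provider, so the same service code runs
// without provisioning any real machines.
func TestAdd(t *testing.T) {
	// testsystem.New is assumed; the point is that a System implementation
	// backs the machines, so tests can swap in a simulated one.
	b := bigmachine.Start(testsystem.New())
	defer b.Shutdown()

	ctx := context.Background()
	machines, err := b.Start(ctx, 1, bigmachine.Services{"Adder": Adder{}})
	if err != nil {
		t.Fatal(err)
	}
	var sum int
	if err := machines[0].Call(ctx, "Adder.Add", AddArgs{40, 2}, &sum); err != nil {
		t.Fatal(err)
	}
	if sum != 42 {
		t.Errorf("got %d, want 42", sum)
	}
}
```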

Since Bigmachine is just a library, it is also simple to extend and to build higher-level services on top of. For example, we built in a simple time-series database and visualizer, so that you automatically get charts of pertinent time-series data like memory, CPU, and disk utilization (or any user-defined variables exported by Go’s expvar package).

At GRAIL, we’re now using Bigmachine to power two of our distributed computing platforms, which we plan to open source soon.

The first is Bigslice, a cluster computing system in the style of Spark. Bigslice is a Go library with which users can express high-level transformations of data. These transformations operate on partitioned input, which lets the runtime transparently distribute fine-grained operations and perform data shuffling across operation boundaries. We use Bigslice in many of our large-scale data processing and machine learning workloads.
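For a flavor of the style, here is a word-count sketch: transformations composed over partitioned data, with the runtime deciding where each shard runs. The function names and signatures (bigslice.Func, Const, Flatmap, Map, Reduce, sliceconfig.Parse) are illustrative and may differ from Bigslice's actual API.

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/grailbio/bigslice"
	"github.com/grailbio/bigslice/sliceconfig"
)

// wordCount composes high-level transformations over partitioned input.
// Data is shuffled between the Map and Reduce stages by the runtime.
var wordCount = bigslice.Func(func(lines []string, nshard int) bigslice.Slice {
	slice := bigslice.Const(nshard, lines)
	slice = bigslice.Flatmap(slice, func(line string) []string {
		return strings.Fields(line)
	})
	words := bigslice.Map(slice, func(word string) (string, int) {
		return word, 1
	})
	return bigslice.Reduce(words, func(a, b int) int { return a + b })
})

func main() {
	// The session chooses the backend (local processes or cloud machines
	// via Bigmachine) from command-line flags.
	sess := sliceconfig.Parse()
	defer sess.Shutdown()

	lines := []string{"a rose is a rose", "is a rose"}
	if _, err := sess.Run(context.Background(), wordCount, lines, 2); err != nil {
		log.Fatal(err)
	}
}
```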

The second is Diviner, a black-box optimization framework for machine learning applications. Users define a black-box process that takes a set of parameters (with predefined ranges) as input and produces a set of metrics (e.g., accuracy on a test set) as output. Diviner then searches the parameter space in parallel, using Bigmachine to distribute executions (which can be expensive, e.g., training a deep neural network) over as many machines as the task requires.
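In outline, the contract a user implements looks something like the following. This is purely illustrative Go, not Diviner's actual interface: it only shows the shape of the black box (parameters in, metrics out) that the optimizer drives.

```go
package main

import "fmt"

// Params is one point in the search space; each field has a predefined
// range the optimizer draws from. (Illustrative only.)
type Params struct {
	LearningRate float64 // e.g., sampled from [1e-5, 1e-1]
	BatchSize    int     // e.g., chosen from {32, 64, 128}
}

// Metrics is what one black-box run reports back.
type Metrics struct {
	Accuracy float64 // accuracy on a held-out test set
}

// train is the black-box process: parameters in, metrics out. The
// framework's job is to pick which Params to try next and to distribute
// the (potentially expensive) runs across machines via Bigmachine.
func train(p Params) Metrics {
	// ... train and evaluate a model with p ...
	return Metrics{Accuracy: 0}
}

func main() {
	fmt.Println(train(Params{LearningRate: 1e-3, BatchSize: 64}))
}
```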

Thanks to Nick Schrock, Jaran Charumilind, Oggie Nikolic, and Demetri Nicolaou for feedback on this post.
