Using Kubernetes for Simulation Infrastructure Automation

Lawrence May
Gauntlet
Feb 13, 2020 · 7 min read


At Gauntlet, we’re building a platform that allows users to construct robust financial models of decentralized systems and smart contracts. Modeling a decentralized system means making assumptions about how the participants in the system will behave. Since these assumptions can dramatically affect the behavior of the model, we test a wide range of assumptions in a simulated environment. Simulation results allow us to discover how robust a protocol might be and to quantify tradeoffs between different parameterizations. For example, the safety of many DeFi protocols is driven by the price of assets used as collateral. To see how protocols fare under worst-case asset volatility, we use simulation to examine tens of thousands of historical and generated price trajectories. How do we efficiently and repeatedly spin up these simulations? How do we keep maintenance manageable?

To create flexible simulation infrastructure, we developed purpose-built automation using Kubernetes. We’ll cover how we designed our solution, what other options we considered, and what the future looks like.

Background

Clients interact with their simulations via our Python SDK, mainly through hosted Jupyter notebook environments. These clients use the SDK to make requests to our backend system, which initiates simulations covering the entire specified parameter space. Simulations are run in parallel on a cluster of machines. Each simulation backend needs to be paired with its own blockchain instance for maintaining contract state, allowing us to run simulations directly on “production” contracts.
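
To make that flow concrete, here is a purely hypothetical sketch of what a notebook interaction might look like. The gauntlet_sdk module, Client class, and parameter names are illustrative stand-ins, not our actual SDK surface:

```python
# Hypothetical notebook usage; module, class, and argument names are illustrative only.
from gauntlet_sdk import Client  # assumed entry point, not the real package

client = Client(api_key="...")

# Ask the backend to sweep a small parameter grid; each combination becomes one simulation.
handle = client.run_simulation(
    protocol="example-lending-protocol",
    params={
        "collateral_factor": [0.6, 0.7, 0.8],
        "liquidation_incentive": [1.05, 1.10],
    },
)

results = handle.wait()  # blocks until every simulation in the sweep completes
```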

The initial version of our system consisted of a fixed-size cluster that ran two identical Kubernetes pods on each node (a pod is a set of colocated containers): a gateway pod to handle incoming requests and a worker pod to run simulations. When a gateway received a simulation request, it expanded the parameter space and generated simulations for all relevant parameter combinations. These simulations were then dispatched to the worker pods. Each worker pod had access to a simulated blockchain, which it used to run simulation operations against real contracts. Running simulations across nodes achieved our primary goal for the initial version of our infrastructure: parallelism. When a simulation finished, the gateway responded to the user with the results.
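
Conceptually, expanding the parameter space is just a Cartesian product over the swept values. A minimal sketch in Python (not our actual gateway code) looks something like this:

```python
import itertools
from typing import Any, Dict, Iterable, List

def expand_parameter_space(sweep: Dict[str, List[Any]]) -> Iterable[Dict[str, Any]]:
    """Yield one concrete parameter set per combination in the sweep.

    Example: {"a": [1, 2], "b": [10]} -> {"a": 1, "b": 10}, {"a": 2, "b": 10}
    """
    keys = list(sweep.keys())
    for values in itertools.product(*(sweep[k] for k in keys)):
        yield dict(zip(keys, values))

# Each yielded parameter set would be dispatched to a worker pod as one simulation.
for params in expand_parameter_space({"collateral_factor": [0.6, 0.7, 0.8],
                                      "liquidation_incentive": [1.05, 1.10]}):
    print(params)
```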

Our old infrastructure

Despite running jobs in parallel, the legacy cluster displayed a few shortcomings:

  • A fixed cluster size meant throughput didn’t scale with demand, i.e. throughput was bottlenecked by the number of nodes in the cluster. In addition, an idle node not running a simulation still cost money, so we effectively paid for our peak usage at all times.
  • Since the worker pods were deployed as a DaemonSet to every node, requests were randomly routed, leaving some nodes much busier than others (we had planned to create another service to manage cluster state or to implement a timeout/retry mechanism).
  • Running a stateful blockchain instance on each pod created another issue: state needed to be reset after each simulation run. The homogeneous nature of our nodes also meant that running a backend with a different blockchain (e.g. Ganache- and Geth-based simulations) required running a completely separate cluster in parallel.
  • Since our workers were stateful, lost TCP connections or lost nodes meant lost results. If a client’s connection to our gateway was interrupted, they would never receive the results of their simulation. Disconnections happened in a number of ways, including Wi-Fi drops and timeouts while waiting for a long-running simulation.
  • Memory leaks related to state degraded performance over time; we ran into an issue where restarting the blockchain wasn’t fully freeing memory.
  • Node failure caused unpredictable request behavior, thrashing the request queue.

New simulation requirements

Having achieved some level of parallelism, and aware of the issues with our old infrastructure, we decided at the beginning of Q4 2019 that the next phase of development would focus on scalability and reliability.

Concretely, this meant prioritizing two things:

  1. Auto-scaling infrastructure, since running a fixed cluster of identical nodes led to poor utilization
  2. Storing simulation parameters, status, and results in a database, so we would be resilient to job failures, which had been too costly

We expected a bursty demand profile for simulations, so auto-scaling infrastructure would mean nodes spin up only to meet demand, shrinking our cluster footprint and minimizing cost. While our previous system returned results to users directly, storing the results of completed runs would allow us to reference them in the future. More importantly, clients would no longer need to maintain continuous connections to our gateway to receive results, and could still receive them when gateway nodes fail.

Potential Solutions

There are many solutions out there that meet the above requirements. Although our pipeline resembles a MapReduce-style system, running big data infrastructure like Hadoop didn’t fit our use case: Hadoop and similar systems are designed to crunch very large quantities of data in parallel, whereas our simulations don’t operate on large datasets and depend on external systems that those frameworks don’t support.

Since we depend on Jupyter Notebooks, we also considered leveraging a distributed machine learning framework such as Kubeflow, paired with some type of workflow coordinator (e.g. Argo, Airflow, Pachyderm, etc.).

We knew we wanted something that required minimal maintenance from a small team. Many of the systems above exist to support complex, multi-stage workflows, and most are built on top of Kubernetes. Since we already had institutional knowledge around Kubernetes, it was more natural to build on that foundation directly than to start from scratch, and we didn’t feel our workflow was complex enough to justify the additional layer of overhead these frameworks add.

Kubernetes Jobs

The Kubernetes Job feature came to our attention and we decided to investigate it. Unlike DaemonSets, StatefulSets, and Deployments, which are meant to keep services up indefinitely, Jobs run pods to completion: they run once, until they succeed. This is a great abstraction for our use case!

Jobs allow us to start a new pod for each parameter combination in a parameter search. In this model, we start jobs and then wait for them to report their status. The gateway stays stateless: it doesn’t track job progress or wait for jobs to complete; it simply checks the current state when asked. Jobs are also stateless and start from a blank slate every time. Each job pulls its simulation parameters from a central store, runs the simulation, reports results on completion, and terminates. This eliminates the need for long-running pods whose state must be reset after each run. Moreover, we can leverage auto-scaling to operate a minimally sized cluster that scales on demand and handles thousands of simulations while our costs increase only linearly.
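
As a rough sketch of this pattern, the official Kubernetes Python client can create one run-to-completion Job per simulation. This is not our production code; the image name, namespace, and SIM_ID environment variable are placeholders:

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when running inside the cluster
batch = client.BatchV1Api()

def submit_simulation_job(sim_id: str, namespace: str = "simulations") -> None:
    """Create a run-to-completion Job for a single simulation.

    The worker image and SIM_ID variable are illustrative; the real job would
    pull its full parameter set from a central store keyed by sim_id.
    """
    container = client.V1Container(
        name="simulation",
        image="gcr.io/example-project/sim-worker:latest",  # placeholder image
        env=[client.V1EnvVar(name="SIM_ID", value=sim_id)],
    )
    template = client.V1PodTemplateSpec(
        spec=client.V1PodSpec(containers=[container], restart_policy="Never")
    )
    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name=f"sim-{sim_id}"),
        spec=client.V1JobSpec(template=template, backoff_limit=2),
    )
    batch.create_namespaced_job(namespace=namespace, body=job)
```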

The migration to stateless services reduced our operational costs considerably. Misbehaving jobs can easily be killed and restarted. Bad upgrades can be trivially rolled back. Now that we don’t have to deal with issues related to running a long-lived instance (e.g. memory leaks), our data scientists can focus on simulation.

Versioning simulations is also easier with job-based infrastructure. To update a simulation backend, all we do is upload a new image for that backend and reconfigure the gateway to use it for subsequent jobs. Kubernetes namespaces also let us easily maintain separate environments for testing, development, and CI in addition to production. Upgrades are more fluid as well; previously they required restarting every pod on the cluster.
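
For illustration, the gateway-side configuration can be as simple as a mapping from backend name to image tag; the project and tag names below are made up:

```python
# Illustrative only: swapping a backend version becomes a one-line config change.
BACKEND_IMAGES = {
    "ganache": "gcr.io/example-project/sim-worker-ganache:v1.4.2",  # hypothetical tags
    "geth": "gcr.io/example-project/sim-worker-geth:v0.9.1",
}

def image_for_backend(backend: str) -> str:
    """The gateway would pass this image into the Job spec sketched earlier."""
    return BACKEND_IMAGES[backend]
```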

Since we now store results in a database, any previous simulation results can be accessed at any time. Using gRPC streaming alongside Firestore for caching gives us the best of both worlds: the responsiveness of streaming for job progress, plus the ability to resume a connection. If a simulation has already completed, its results are returned immediately. Resumability lets our data scientists work asynchronously: they can start a simulation, close their laptop, and check progress later, or kick off a different simulation at any time. This is all possible because the gateway’s API is idempotent; submitting the same request again simply polls for, or returns, the results of the existing simulation rather than creating a new one.
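
Here is a minimal sketch of that idempotent read path, assuming a Firestore collection keyed by a deterministic simulation ID; the collection and field names are illustrative, not our actual schema:

```python
from google.cloud import firestore

db = firestore.Client()

def get_or_create_simulation(sim_id: str, params: dict) -> dict:
    """Idempotent entry point: repeated calls with the same sim_id return the same record.

    If the simulation already exists, its current status (and results, if complete)
    are returned immediately; otherwise a new record is created and jobs are started.
    """
    doc_ref = db.collection("simulations").document(sim_id)
    snapshot = doc_ref.get()
    if snapshot.exists:
        return snapshot.to_dict()  # e.g. {"status": "running"} or completed results

    record = {"status": "pending", "params": params, "results": None}
    doc_ref.set(record)
    # ...kick off Kubernetes Jobs for each parameter combination here...
    return record
```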

Our current architecture

Future Work

We’re looking at ways to improve job submission to Kubernetes; submitting a batch of a thousand jobs can take minutes. There is a Kubernetes feature proposal called “Indexed Jobs” that would give us native support for running a large number of related jobs, but sadly there hasn’t been much activity on it recently. This pushes us closer to spinning up our own intermediate service to handle job submissions.
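
Until something like Indexed Jobs lands, one interim option is to submit Jobs concurrently from a small worker pool to amortize per-request API latency. A sketch, reusing the hypothetical submit_simulation_job from earlier:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import List

def submit_batch(sim_ids: List[str], max_workers: int = 32) -> None:
    """Submit many Jobs concurrently; the Kubernetes API calls are I/O-bound,
    so a thread pool meaningfully reduces wall-clock submission time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(submit_simulation_job, sim_id): sim_id for sim_id in sim_ids}
        for future in as_completed(futures):
            future.result()  # re-raise any submission errors
```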

Something we could explore in the future is running the entire pipeline on GCP Cloud Functions to further reduce the amount of state we manage.

Leveraging Kubernetes has worked wonders to lower our operational overhead. With an appropriate use case, implementing a job-based system could be a big win for other developers faced with similar problems.

This system illustrates the advantages of Kubernetes and in particular, the power of Jobs! Kubernetes is a wonderful piece of technology that solves a litany of common infrastructure problems. We continue to learn, and it’s been exciting to see what services we can use straight out-of-the-box that would previously have required many engineers to build and manage!
