Optimising cloud compute costs for complex transport simulations

Moving from AWS On-Demand to Spot Instances in EC2

Rory Sedgwick
Arup’s City Modelling Lab
5 min read · Apr 15, 2021


Much of the work done by Arup’s City Modelling Lab involves running complex transport simulations, largely underpinned by the open-source MATSim modelling software. Running a simulation can require large input and output data volumes, as well as significant compute resource. Utilising cloud services brings the opportunity to scale up simulations in terms of fidelity and parallelisation, as well as reducing execution time. But these potential improvements are likely to incur additional cost. So how do we effectively scale our simulation capacity without breaking the bank?

Photo by Michael Longmire on Unsplash

Running simulations incrementally

MATSim involves simulating the behaviour of individual agents within a transport network over a single day. Multiple iterations of the same day are simulated, with agents “learning” from previous days’ plans and outcomes, making it advantageous to run simulations for many iterations. MATSim’s default behaviour is to run until some key measurements begin to stabilise, but we have tooling, known as BitSim, that allows incremental simulation execution, with introspection & status reporting at select intervals.

BitSim defines a sequence of tasks to iterate through, centred around the core MATSim engine, but also including pre- and post-processing of data, and simulation status feedback. This is done using AWS Step Functions to create a state machine describing the various tasks involved, which gives us a framework for stepping through a simulation incrementally.

Fig 1. A BitSim model simulation pipeline broken into incremental steps, with feedback loop. Each step comprises a single task submitted to AWS Batch
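To make this concrete, here is a minimal sketch of what such a state machine definition might look like, expressed in the Amazon States Language and created via boto3. All names (job queue, job definition, Lambda function, roles and account IDs) are illustrative placeholders rather than BitSim’s actual configuration.

```python
import json
import boto3

# Hypothetical sketch of a BitSim-style state machine: run one MATSim step
# as an AWS Batch job, check the stopping criteria, then either loop back
# for another step or finish.
definition = {
    "StartAt": "RunSimulationStep",
    "States": {
        "RunSimulationStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::batch:submitJob.sync",
            "Parameters": {
                "JobName": "matsim-step",
                "JobQueue": "bitsim-job-queue",     # placeholder queue
                "JobDefinition": "bitsim-matsim",   # placeholder job definition
            },
            "Next": "CheckStoppingCriteria",
        },
        "CheckStoppingCriteria": {
            "Type": "Task",
            # Placeholder Lambda that returns e.g. {"converged": true}
            "Resource": "arn:aws:lambda:eu-west-1:123456789012:function:check-criteria",
            "Next": "Converged",
        },
        "Converged": {
            "Type": "Choice",
            "Choices": [
                {"Variable": "$.converged", "BooleanEquals": True, "Next": "Done"}
            ],
            "Default": "RunSimulationStep",
        },
        "Done": {"Type": "Succeed"},
    },
}

sfn = boto3.client("stepfunctions")
sfn.create_state_machine(
    name="bitsim-pipeline",
    definition=json.dumps(definition),
    roleArn="arn:aws:iam::123456789012:role/bitsim-sfn-role",  # placeholder
)
```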

Breaking a simulation into steps this way gives greater flexibility to stop the model early if certain conditions are met or breached, as well as offering an opportunity to reassess the choice of compute infrastructure used to run each task. The feedback loop is now drastically shorter than it was before.

Incremental simulation execution also provides a flexible framework in which we can include other modularised operations and models. We can define more complex stopping criteria for simulations, post-process outputs for additional analysis, or introduce additional modelling tools into the simulation itself.
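For illustration, a stopping criterion can be as simple as checking whether a score metric has stopped moving between steps. The metric, window and tolerance below are hypothetical, not BitSim’s actual convergence rules.

```python
# Illustrative stopping criterion: stop once the average agent score has
# changed by less than some relative tolerance over the last few steps.
def should_stop(scores: list[float], window: int = 5, tol: float = 0.01) -> bool:
    """Return True when the score history appears to have stabilised."""
    if len(scores) < window + 1:
        return False  # not enough history to judge convergence yet
    recent = scores[-(window + 1):]
    relative_change = abs(recent[-1] - recent[0]) / max(abs(recent[0]), 1e-9)
    return relative_change < tol
```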

Cloud compute: On-Demand vs Spot

To execute each step in the BitSim simulation pipeline, a containerised task is submitted to an AWS Batch Job Queue, which runs it in a suitable Compute Environment. The one or more virtual machines that make up this compute environment are provisioned in response to the requirements of the task, largely determined by the model’s input resolution and complexity. We also have the option to assess the availability of cheap compute resources.

Fig 2. AWS Batch service interactions, courtesy of AWS
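As a hedged sketch, submitting one such containerised step with boto3 might look like the following; the job name, queue, job definition, command and resource sizes are all hypothetical.

```python
import boto3

batch = boto3.client("batch")

# Submit a single BitSim step to a Batch job queue. Batch then places it
# in a compute environment that satisfies the vCPU/memory requirements.
response = batch.submit_job(
    jobName="bitsim-matsim-step-12",
    jobQueue="bitsim-job-queue",        # placeholder queue name
    jobDefinition="bitsim-matsim:3",    # placeholder job definition revision
    containerOverrides={
        "command": ["python", "run_step.py", "--iteration", "12"],
        "resourceRequirements": [
            {"type": "VCPU", "value": "16"},
            {"type": "MEMORY", "value": "65536"},  # MiB, sized to the model
        ],
    },
)
print(response["jobId"])
```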

In addition to the default “On-Demand” pricing model for virtual machines, AWS offers spare compute capacity at significant discounts, caveated by the potential for instances to be reclaimed by AWS if customer demand spikes. These are known as Spot instances. Using them effectively requires fault-tolerant workloads, meaning an interrupted individual task will not cause the whole process to fail.

Spot & BitSim

BitSim’s incremental execution approach means we have “save points” part way through a simulation. Since each task starts with a read from, and ends with a write to, the shared persistence layer, we can treat them as atomic, and will never have partial outputs from a single iteration. Even if a task is interrupted, it can be rerun and start from the last known “save point”, determined by the last set of outputs.
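Assuming, for illustration, that outputs land in S3 under per-iteration prefixes (a hypothetical layout, not necessarily BitSim’s), finding the last save point could look like this:

```python
import boto3

s3 = boto3.client("s3")

def last_completed_iteration(bucket: str, prefix: str = "outputs/it.") -> int:
    """Find the highest iteration with outputs in the persistence layer.

    Assumes keys shaped like outputs/it.<n>/plans.xml.gz (illustrative).
    """
    iterations = set()
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            part = obj["Key"][len(prefix):].split("/", 1)[0]
            if part.isdigit():
                iterations.add(int(part))
    return max(iterations, default=0)

# A rerun of an interrupted task would resume from this iteration's outputs.
```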

What happens when a task is interrupted?

  1. The instance running the task is notified of impending shut-down
  2. The task exits, triggering retry logic within the BitSim simulation pipeline (see the sketch after this list)
  3. BitSim re-submits the task to the Batch Job Queue
  4. The Batch Job Queue decides where to run the task, potentially provisioning a new instance if no suitable one exists
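In Step Functions terms, that retry behaviour can be declared directly on the Batch task state. A minimal sketch, reusing the hypothetical state machine from earlier:

```python
# Retry configuration on the hypothetical Batch task state: if a Spot
# reclaim kills the job, the state fails and Step Functions re-submits
# it to the queue after a short backoff.
run_simulation_step = {
    "Type": "Task",
    "Resource": "arn:aws:states:::batch:submitJob.sync",
    "Parameters": {
        "JobName": "matsim-step",
        "JobQueue": "bitsim-job-queue",
        "JobDefinition": "bitsim-matsim",
    },
    "Retry": [
        {
            "ErrorEquals": ["States.TaskFailed"],  # covers Spot interruptions
            "IntervalSeconds": 60,
            "MaxAttempts": 3,
            "BackoffRate": 2.0,
        }
    ],
    "Next": "CheckStoppingCriteria",
}
```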

The final task placement decision (step 4 above) factors in live instance costs & each instance type’s chance of being interrupted, rebalancing the simulation towards instance types that are less in demand. If no Spot instances are available (due to sustained surge demand), there is an optional fallback to On-Demand instances to guarantee the task runs.
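Concretely, this maps onto AWS Batch’s managed Spot compute environments. The sketch below is illustrative (all identifiers are placeholders); the SPOT_CAPACITY_OPTIMIZED allocation strategy steers jobs towards Spot pools with the most spare capacity, i.e. those least likely to be interrupted.

```python
import boto3

batch = boto3.client("batch")

# A managed Spot compute environment for the BitSim job queue.
batch.create_compute_environment(
    computeEnvironmentName="bitsim-spot",
    type="MANAGED",
    computeResources={
        "type": "SPOT",
        "allocationStrategy": "SPOT_CAPACITY_OPTIMIZED",
        "minvCpus": 0,                  # scale to zero between simulations
        "maxvCpus": 256,
        "instanceTypes": ["m5", "r5"],  # let Batch pick sizes within families
        "subnets": ["subnet-0123456789abcdef0"],       # placeholder
        "securityGroupIds": ["sg-0123456789abcdef0"],  # placeholder
        "instanceRole": "ecsInstanceRole",             # placeholder
        "spotIamFleetRole": "arn:aws:iam::123456789012:role/spot-fleet-role",
    },
    serviceRole="arn:aws:iam::123456789012:role/AWSBatchServiceRole",
)
```

The On-Demand fallback can then be a second, EC2-type compute environment attached to the same job queue at a lower priority order, so Batch only falls back to it when Spot capacity is unavailable.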

By using this simulation pipeline to partition one large, compute-intensive modelling task into incremental stages, we have created an opportunity to continually reassess simulation status, and optimise use of compute resources based on current availability of low cost infrastructure.

Fig 3. Typical Spot instance saving summary for a simulation

Summary

The use of Spot instances can bring massive savings on compute costs, provided the workload expects some tasks to be interrupted and the overall pipeline can gracefully manage those failures. Because our simulations are already broken down into incremental steps, and a framework is in place to make decisions about task placement and underlying instance requirements at each step, Spot instances are a great fit: large potential cost savings with minimal chance of impacting simulation execution.

Fig 4. Daily simulation compute costs as the team switches to using Spot instances instead of On-Demand

Reducing cost by such a significant degree means it is possible to:

  • run higher-iteration simulations
  • run parallel simulations of the same model, providing greater confidence in the aggregate outputs of what is ultimately a stochastic process
  • run multiple simulations with varying parameters to improve model calibration

These enhancements mean we can potentially tackle more complex questions posed by clients, and explore problem spaces not previously feasible.

Want to know more about the work being done by Arup’s City Modelling Lab? Check out the rest of their Medium articles.
