Introducing vapor: a high availability ROS 1.x master

Published in ROSHub, Jan 8, 2019

Vapor-master is a drop-in replacement for rosmaster that enables high-availability ROS service discovery. Vapor removes the single point of failure fundamental to ROS1, opening new options for scaling without sacrificing stability.

Vapor uses MongoDB replication to work through failure

Embedded Ready and at home in the cloud

Vapor is well suited for use in ROS solutions requiring high availability such as assembly lines, mobile robots, and cloud services.

In assembly lines, Vapor enables hardware developers to achieve hot-swap capability in high-complexity equipment.

In mobile robots, Vapor enables developers to accomplish high-speed handover in mission-critical software stacks using commodity compute nodes.

In the cloud, Vapor enables ROS workloads to scale to the size of data centers.

With Vapor, the single master no longer limits your design, and you don’t even have to modify legacy code.

To get started, follow the project’s README.

ROS1 single point of failure

If you’ve used ROS in automation applications requiring multiple robotic agents or more than one compute node, you’ve likely experienced the limitations imposed by rosmaster’s single point of failure.

Only one computer can run the rosmaster, and in ROS1 your only option is to choose wisely and hope luck is on your side. Any unexpected failure on the selected compute node can quickly bring your automation to a halt.

In an ideal world, you’ll never experience a node failure.

In ROS1 a single point of failure stops all new communications :(

Sadly, that’s just not the world we live in. Unexpected failure happens for many reasons:

Memory Exhaustion
Network Fault
Power Fault
Hardware Failure
Operating System crash

In ROS1, an uncommunicative roscore is a show stopper. Suddenly none of the working compute nodes can accomplish service discovery. They can rapidly be orphaned, with no way to elect a leader and no clear process for what to do if the roscore does come back online. Parameter lookups immediately begin failing, and no new topic or service connections can be established while the roscore is away.
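To make the failure mode concrete, here is a minimal sketch of how a node could check whether its master still answers. It assumes the standard ROS Master XML-RPC API, where getUri returns a [code, statusMessage, value] triple and a code of 1 means success. The helper name master_is_reachable, the caller ID /master_probe, and the one second timeout are hypothetical choices for illustration and are not part of Vapor or ROS.

```python
import http.client
import xmlrpc.client


class TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport that gives up after `timeout` seconds."""

    def __init__(self, timeout):
        super().__init__()
        self._timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        conn.timeout = self._timeout  # applied when the socket connects
        return conn


def master_is_reachable(master_uri, timeout=1.0, caller_id="/master_probe"):
    """Return True if the master at `master_uri` answers getUri successfully.

    `caller_id` is a hypothetical name used only to identify this probe.
    """
    proxy = xmlrpc.client.ServerProxy(
        master_uri, transport=TimeoutTransport(timeout)
    )
    try:
        # Master API calls return [code, statusMessage, value].
        code, _status, _value = proxy.getUri(caller_id)
    except (OSError, http.client.HTTPException,
            xmlrpc.client.Error, ValueError):
        # Refused connection, timeout, bad HTTP reply, or malformed answer.
        return False
    return code == 1


if __name__ == "__main__":
    # 11311 is the conventional default port for a ROS1 master.
    print(master_is_reachable("http://localhost:11311"))
```

When a probe like this returns False against a single rosmaster, every lookup that depends on the master fails the same way until it comes back.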

Even if the compute node running roscore recovers, it may have lost all rosgraph data, which is stored only in memory. Rebooting the failed compute node then produces what is effectively a network partition: the roscore forgets about the remaining compute nodes, even though they may still be in a working state.

ROS1 failure recovery? Reboot everything >.<

The only solution to this messy outcome is to reboot all computers or restart the ROS software. Obviously downtime sucks in the cloud, but in mobile robots and assembly lines, poor failure recovery behaviors can be dangerous.

Flying a drone, driving down a street, or making hamburgers are all tasks that can be disastrous if stopped mid-process!

Vapor makes ROS1 resilient

When failure strikes in a Vapor-enabled solution, it affects only the failing computer and its direct peers. The other nodes continue using their local Vapor instances to accomplish service discovery. Importantly, no parameter reads or writes are lost, and new and existing topics and services keep operating; anything on the failed node can be recontacted once it recovers.

Winning!

With Vapor, failure is localized and contained

Since all instances of Vapor can be synchronized via a MongoDB replica set, only the failed compute node needs to be rebooted, and it can recover from failure rapidly.
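As a rough illustration of what pointing at a replica set looks like, the sketch below builds a connection string in MongoDB's standard mongodb:// URI format with the replicaSet option. The host names, the replica set name rs0, and the database name vapor are hypothetical, and how Vapor itself is configured is described in the project's README, not here.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, database=""):
    """Build a standard MongoDB connection string listing every member.

    Listing several members lets a client find the replica set even
    when one of the listed hosts is down.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    host_list = ",".join(hosts)
    query = urlencode({"replicaSet": replica_set})
    return "mongodb://{}/{}?{}".format(host_list, database, query)


if __name__ == "__main__":
    # Hypothetical three-node deployment, one MongoDB member per compute node.
    members = ["robot-a:27017", "robot-b:27017", "robot-c:27017"]
    print(replica_set_uri(members, "rs0", "vapor"))
    # prints mongodb://robot-a:27017,robot-b:27017,robot-c:27017/vapor?replicaSet=rs0
```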