Introducing vapor: a high availability ROS 1.x master

Alan Meekins
Jan 8, 2019 · 4 min read

Vapor-master is a drop-in replacement for rosmaster that enables high-availability ROS service discovery. Vapor removes the single point of failure fundamental to ROS1, opening new options for achieving scale without sacrificing stability.

Vapor uses MongoDB replication to work through failure

Embedded Ready and at home in the cloud

Vapor is well suited for use in ROS solutions requiring high availability such as assembly lines, mobile robots, and cloud services.

In assembly lines, Vapor enables hardware developers to achieve hot-swap capability in high-complexity equipment.

In mobile robots, Vapor enables developers to accomplish high-speed handover in mission-critical software stacks using commodity compute nodes.

In the cloud, Vapor enables ROS workloads to scale to the size of data centers.

Because Vapor is a drop-in replacement for rosmaster, you don’t even have to modify legacy code.

To get started follow the project’s README.

ROS1 single point of failure

A typical ROS1 topology

If you’ve used ROS in automation applications requiring multiple robotic agents or more than one compute node, you’ve likely experienced the limitations imposed by rosmaster’s single point of failure.

Only one computer can run the rosmaster, and in ROS 1 your only option is to choose wisely and hope luck is on your side. Any unexpected failures on the selected compute node can quickly bring your automation to a halt.

In an ideal world, you’ll never experience a node failure.

In ROS1 a single point of failure stops all new communications :(

Sadly, that’s just not the world we live in. Unexpected failure happens for many reasons:

  • Memory Exhaustion
  • Network Fault
  • Power Fault
  • Hardware Failure
  • Operating System Crash

In ROS1, an uncommunicative roscore is a show stopper. Suddenly none of the working compute nodes can perform service discovery. They are quickly orphaned, with no way to elect a leader and no clear process for what to do if the roscore does come back online. Parameter lookups immediately begin failing. No new topic or service connections can be established while the roscore is away.
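To make this failure mode concrete, here is a minimal sketch of how a client might check whether its master is still answering. It assumes only the standard ROS 1 Master API over XML-RPC (the `getUri` call). The helper name `master_is_reachable`, the caller ID `/master_probe`, and the one-second timeout are illustrative choices, not part of ROS or Vapor.

```python
import xmlrpc.client


class _TimeoutTransport(xmlrpc.client.Transport):
    """XML-RPC transport that gives up after a fixed number of seconds."""

    def __init__(self, timeout):
        super().__init__()
        self._timeout = timeout

    def make_connection(self, host):
        conn = super().make_connection(host)
        conn.timeout = self._timeout  # applied when the socket is opened
        return conn


def master_is_reachable(uri, timeout=1.0, caller_id="/master_probe"):
    """Return True if a ROS 1 master at `uri` answers getUri successfully.

    `caller_id` is a hypothetical name chosen for this sketch.
    """
    proxy = xmlrpc.client.ServerProxy(uri, transport=_TimeoutTransport(timeout))
    try:
        # Master API calls return (code, statusMessage, value); code 1 is success.
        code, _message, _value = proxy.getUri(caller_id)
    except (OSError, xmlrpc.client.Error):
        # Connection refused, timeout, HTTP error, or XML-RPC fault.
        return False
    return code == 1
```

Called with the value of `ROS_MASTER_URI` (rosmaster listens on port 11311 by default), this returns `False` as soon as the master stops responding, which is the point at which new lookups and registrations start to fail.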

Even if the compute node running roscore recovers, it may have lost all rosgraph data, because that data is stored only in memory. Rebooting the failed compute node effectively creates a partition: the restarted roscore has no record of the remaining compute nodes, even though they may still be in a working state.

ROS1 failure recovery? Reboot everything >.<

The only solution to this messy outcome is to reboot all computers or restart the ROS software. Obviously downtime sucks in the cloud, but in mobile robots and assembly lines, poor failure recovery can be dangerous.

Flying a drone, driving down a street, and making hamburgers are all tasks that can be disastrous if stopped mid-process!

Vapor makes ROS1 resilient

When failure strikes in a Vapor-enabled solution, it affects only the failing computer and its direct peers. The other nodes continue using their local Vapor instances for service discovery. Importantly, no parameter reads or writes are lost. New and existing topics and services keep operating, and those on the failed node can be recontacted once it recovers.

Winning!

With Vapor, failure is localized and contained

Since all instances of Vapor can be synchronized via a MongoDB replica set, it is possible for just the failed compute node to be rebooted and recover from failure rapidly.

Vapor recovers from failure quickly

Once the replacement is online, it can rejoin the MongoDB replica set and will automatically sync the rosgraph changes that occurred while it was away.
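For readers new to replica sets, the sketch below shows how a standard MongoDB replica-set connection string is put together: every member is listed as a seed, so a client can still find the set when one member is down. The host names, the `vapor` database name, and the `rs0` replica set name are hypothetical. How Vapor itself accepts its database settings is not covered here, so check the project's README for that.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, database, replica_set):
    """Build a standard MongoDB connection string for a replica set.

    `hosts` is a list of (hostname, port) pairs, one per replica set member.
    """
    if not hosts:
        raise ValueError("at least one replica set member is required")
    # Listing every member lets a client connect even if one seed is offline.
    seeds = ",".join(f"{host}:{port}" for host, port in hosts)
    query = urlencode({"replicaSet": replica_set})
    return f"mongodb://{seeds}/{database}?{query}"


# Hypothetical three-node deployment, one MongoDB member per compute node.
uri = replica_set_uri(
    [("node-a", 27017), ("node-b", 27017), ("node-c", 27017)],
    database="vapor",
    replica_set="rs0",
)
```

With a three-member set like this one, a majority of members remains available if any single compute node fails, which is what lets the surviving nodes keep reading and writing rosgraph data.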

Getting started

Vapor is open source, built by ROSHub

The ROSHub team is hard at work on cloud platforms that help ROS developers scale faster and accomplish more.

Need help scaling your ROS solutions?

ROSHub

The robotics cloud
