Building a Production-Ready Robotic System

Marcelina
Published in Shade Robotics
Jul 26, 2022

Building a robotic product, from concept to manufacturing and release, is a vast, involved process that requires the synergy of many moving parts. A crucial component is the development of production-grade robotic software that will ultimately be deployed on a fleet of robots and ingest petabytes of data. Productizing a ROS 2 application can be a tricky course to navigate, as a multitude of factors define a system as “production-ready”. This article serves as a guide to and discussion of the tools and workflows provided by the ROS Tooling Working Group that can help you build, test, release, monitor, and deploy a ROS 2 package on a production robot: the modern RobotOps stack.

Adapting ROS 2 Workflow to ARM CPUs

Today, while most developer workstations run on x86-64 processors, the robots they target often use a different architecture. ARM in particular has become popular for the platforms robots are built on, such as the Raspberry Pi and the Nvidia Jetson family, because its reduced chip complexity makes it more cost- and power-efficient. In fact, over 70% of IoT and robotic devices on the market run on ARM instead of x86. As a result, when building a ROS package we need to think about how to develop for ARM devices, and modify our development workflow so that ROS 2 runs on the ARM computer the robot is built on.

Many developers just starting out approach this task by treating their robot’s computer as a smaller version of their workstation, building a workspace right on the robot. While this approach is suitable for early-stage development and simulation-based workflows, and works well when testing code on the workstation, there are a few reasons why building on the target device is flawed. First, computing systems may not always give you the physical access you need: ports for plugging in peripherals can be hard to reach or missing altogether because they weren’t intended to be used this way, and when they are used this way it can cause mechanical strain and shorten the robot’s lifetime. This approach also doesn’t scale, since you can only deploy builds to one robot at a time (seriously, imagine trying to deploy code across a thousand robots manually). Further, there are storage, safety, and performance concerns. The disk space used up by installing build tools and source code on the robot could probably be better used for things that actually add value. Each additional package installed on the robot increases its attack surface, and it is far more dangerous for an attacker to gain access to your entire code base than to just its binary packages (we’ll touch on this later). Finally, many embedded systems don’t have nearly enough RAM to run the build process, so it would be impossible to run it on these target machines, and ARM processors are generally not as fast as the x86 processors available in developer workstations or in the cloud.

This brings us to the solutions of cross-compilation, emulation, and containerization. These can be difficult to set up and maintain within a complex codebase, so the ROS Tooling Working Group provides a tool, cross_compile, which can build ROS 2 workspaces for both ARM and x86-64 with minimal setup. Cross-compilation means using source code to build binaries for a CPU architecture different from that of the computer running the build; in this case, building ARM binaries on an x86 machine. Emulation, similarly, is the reproduction of the function or behavior of a different computer or software system. By building on your x86 machines you save a ton of space on the robot (no build tools or source code), reduce security vulnerabilities, and save time on every change you make to your code, which can add up to months or even years saved over a robot’s development cycle. Finally, containerization, the most streamlined approach, offers both cross-compilation and emulation without impacting the base OS on either system. By containerizing specific systems and structures, a security compromise can be isolated to a single container process rather than spreading to the entire robot. The most common way to do containerization is via Docker, which provides an easy, declarative way of defining the build strategy. Combined with Docker’s buildx system, this lets us cross-compile our containers for any target architecture. Check out our article here on setting up cross-compilation with Docker, along with GPU acceleration of Docker if you’re running on a Jetson or other CUDA-supported device.
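
As a rough illustration (not part of the cross_compile tool itself), here is a minimal Python helper that shells out to Docker buildx to produce an ARM64 image from an x86-64 host. It assumes Docker with buildx and QEMU binfmt emulation are already installed; the image tag and build context are placeholders.

```python
import subprocess


def build_arm64_image(context_dir: str, tag: str = "my-ros2-app:arm64") -> None:
    """Cross-build an ARM64 container image on an x86-64 workstation.

    Assumes `docker buildx` and QEMU binfmt emulation are set up; the
    Dockerfile in `context_dir` would install ROS 2 and build the workspace.
    """
    subprocess.run(
        [
            "docker", "buildx", "build",
            "--platform", "linux/arm64",  # target the robot's ARM CPU
            "--tag", tag,                 # placeholder image name
            "--load",                     # keep the result in the local Docker daemon
            context_dir,                  # directory containing the Dockerfile
        ],
        check=True,
    )


if __name__ == "__main__":
    build_arm64_image(".")
```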

Tuning ROS 2 Applications for Performance Using QoS

Supervising and managing the performance of your ROS 2 applications ensures proper function and promotes reliability of your system. Quality of Service (QoS), a new dimension of topics in the ROS 2 publisher-subscriber system, gives us the ability to achieve this. Traditionally, ROS 1 built its own messaging system under the hood, which led to security risks and almost no way to customize QoS or typing. ROS 2 instead sits on top of DDS, the standard messaging protocol across industry, with QoS built in.

QoS defines extra promises about the behavior of your publishers and subscribers on topics and has a variety of settings and policies. Before diving further, take a moment to familiarize yourself with these QoS components:

  • QoS Policy — an individual QoS type or setting
  • QoS Profile — a complete group of all policies
  • QoS Offer — publishers offer a QoS profile — this offer is the maximum quality that the publisher promises to provide
  • QoS Request — subscribers request a QoS profile — this request is the minimum quality that the sub is willing to accept
  • Compatibility — QoS policies have compatibility rules. If an offered policy has a matching or higher quality than the request then the publisher and subscriber can be connected. A publisher can match and pass messages to multiple subscribers with different requested QoS as long as each request is compatible with the offer. Some policies affect compatibility while others don’t: reliability, durability, deadline, and liveliness do affect whether a pub and sub are compatible while history, depth, and lifespan do not as they are purely local configurations.

Developers will primarily interact with QoS in their application code. All publishers and subscribers must specify a QoS profile on creation. Using the ros2 topic utility, you can get information about the QoS of topics by running ros2 topic info --verbose (use ros2 topic --help to see more available options). Additionally, the rosbag2 utility uses QoS to intelligently record and play back messages from a ROS 2 system. This data can then be visualized and used in a CI/CD stack for testing, data ingestion, and machine learning.
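
To make this concrete, here is a minimal rclpy sketch of a publisher that declares its QoS offer at creation time. The node and topic names are illustrative; the profile offers reliable delivery and keeps the last ten messages for late-joining subscribers.

```python
import rclpy
from rclpy.node import Node
from rclpy.qos import (QoSProfile, QoSHistoryPolicy,
                       QoSReliabilityPolicy, QoSDurabilityPolicy)
from std_msgs.msg import String


class StatusPublisher(Node):
    def __init__(self):
        super().__init__('status_publisher')
        # Offer: keep the last 10 messages, deliver them reliably, and
        # replay the stored history to subscribers that join late.
        qos = QoSProfile(
            history=QoSHistoryPolicy.KEEP_LAST,
            depth=10,
            reliability=QoSReliabilityPolicy.RELIABLE,
            durability=QoSDurabilityPolicy.TRANSIENT_LOCAL,
        )
        self.publisher = self.create_publisher(String, 'robot_status', qos)
        self.timer = self.create_timer(1.0, self.tick)

    def tick(self):
        msg = String()
        msg.data = 'nominal'
        self.publisher.publish(msg)


def main():
    rclpy.init()
    rclpy.spin(StatusPublisher())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```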

A hazard of the extra promises from QoS is that they can be broken. This is where the concept of Events comes into play: events are generated when a change in quality of service is detected that breaks a policy, triggering callbacks that let you handle these cases. ROS 2 defines preset QoS profiles for common cases and these are a good place to start when tuning the QoS of a topic, but it is necessary to go beyond them to deal with more advanced behaviors. In general it is good practice to consider policies for every publisher and subscriber. Below we list and describe some of the policies you should definitely take into consideration to improve the robustness of your robotic application.

  • Incompatibility Callbacks — it is useful to have a mechanism in place to know when we encounter a mismatch (i.e. the QoS profiles of a publisher and subscriber don’t match). For example, if an application is run where a subscriber asks for Reliability X but the publisher only offers Reliability Y (with X being of higher reliability than Y), then incompatibility callbacks are triggered for both the publisher and the subscriber (see the sketch after this list).
  • Lifespan — we can use the Lifespan QoS policy to ensure message freshness by guaranteeing that our node will never publish messages older than the time we decide they are useful for. If messages in the outgoing queue are older than the defined Lifespan duration, the policy drops them.
  • Deadline — the Deadline policy in ROS 2 gives us watchdogs which trigger warnings if the gap between message arrivals becomes too large at any time. This is important because in a system with data regularly coming in it is useful to promise a specific frequency or period for messages on a topic. The Deadline is the promised maximum duration between messages, or the period for the publishing frequency.
  • Durability — the Durability policy defines whether we should provide old messages to subscribers that join late. The default Durability is VOLATILE, for which messages are no longer kept anywhere after being published. The other option is TRANSIENT_LOCAL, which means a number of messages up to the defined Depth are stored locally on the publisher for later access, and any subscriber that requests TRANSIENT_LOCAL Durability from this publisher will receive the stored history of messages. The ROS 2 Durability policy is much like the latching option in ROS 1.
  • Liveliness — this policy is used to detect the absence and presence of publishers. The default Liveliness is AUTOMATIC, meaning the middleware considers a publisher alive as long as its node is running. The alternative, MANUAL_BY_TOPIC, is useful for detecting situations where your node may have fatal problems that don’t cause its publishers to be destroyed. This setting requires calling the assert_liveliness function on each publisher periodically.
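
Below is a sketch of how a subscription might request a Deadline and register callbacks for deadline misses and incompatible QoS in rclpy. Exact module and callback argument names can differ slightly between ROS 2 distros, so treat this as a starting point rather than a drop-in snippet; the topic name matches the publisher sketch above.

```python
import rclpy
from rclpy.node import Node
from rclpy.duration import Duration
from rclpy.qos import QoSProfile
from rclpy.qos_event import SubscriptionEventCallbacks
from std_msgs.msg import String


class WatchfulSubscriber(Node):
    def __init__(self):
        super().__init__('watchful_subscriber')
        # Request no more than 0.5 s between messages on this topic.
        qos = QoSProfile(depth=10, deadline=Duration(seconds=0.5))

        events = SubscriptionEventCallbacks(
            deadline=lambda e: self.get_logger().warn(
                f'deadline missed ({e.total_count} misses so far)'),
            incompatible_qos=lambda e: self.get_logger().error(
                'publisher QoS offer is incompatible with this request'),
        )

        self.subscription = self.create_subscription(
            String, 'robot_status', self.on_status, qos,
            event_callbacks=events)

    def on_status(self, msg):
        self.get_logger().info(f'status: {msg.data}')


def main():
    rclpy.init()
    rclpy.spin(WatchfulSubscriber())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```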

Overall, QoS provides callbacks to give feedback to publishers and subscribers and controls the behavior of message passing. Keeping an eye on compatibility and registering callbacks to detect mismatches can save a lot of time, because a mismatch can be hard to spot when your subscriber is silently not receiving messages on a topic that is publishing. Be on the lookout for an article coming soon on our way of implementing all of this in practice, but in the meantime, check out these docs on ROS QoS.

Monitoring

Monitoring software in production is necessary to gain insight into system behavior and state; it lets you act on any detected issues and provides useful metrics such as memory usage, CPU usage, and the publish frequency of topics. Monitoring and troubleshooting robot applications is commonly accomplished through logs or rosbag, but these do not scale: once we are working with large fleets of robots, far too much data is generated to expect understandable logs.

Note: Despite this, it’s still recommended to send ROS bags up to the cloud and utilize something like Amazon S3 to ingest all of the blob data. You can consider transforming the data into Parquet format for easier indexing for your perception and navigation teams.
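
For instance, a small upload script using boto3 could ship a recorded bag file to S3 for later transformation and indexing. The bucket name and object key below are hypothetical, and credentials are assumed to come from the standard AWS configuration chain on the robot or an edge gateway.

```python
import boto3


def upload_bag_to_s3(bag_path: str, bucket: str, key: str) -> None:
    """Upload one recorded bag file to S3 for fleet-wide ingestion."""
    s3 = boto3.client("s3")
    s3.upload_file(Filename=bag_path, Bucket=bucket, Key=key)


if __name__ == "__main__":
    upload_bag_to_s3(
        "rosbag2_run/rosbag2_run_0.db3",
        bucket="my-robot-telemetry",        # hypothetical bucket
        key="fleet/robot-001/run-42.db3",   # hypothetical object key
    )
```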

Instead we take advantage of the tools ROS 2 gives us for straightforward data aggregation by which we can gather and express our application’s data in a summary form for statistical analysis. Here we’ll introduce some ROS 2 tools to help you aggregate metrics and publish them to a cloud service to provide robust, scalable monitoring for robotic production systems.

First up are tools for collecting and publishing data statistics. The ROS 2 libstatistics_collector package provides aggregation tools to calculate statistics in constant time while using constant memory. Aggregating a value produces an average (specifically a moving average), maximum, minimum, standard deviation, and sample count. We also have the statistics_msgs message definitions, which are used to publish the collected data; besides the calculated statistics, each message includes the start and stop time of the collection window, the metric unit, the metric source, and the measurement name. Additionally, ROS 2 nodes that collect system metrics convert the collected statistics into such a message, which is periodically published to a topic where it can be viewed using ROS 2 CLI tools or consumed by any subscriber.
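
To give a feel for what constant-time, constant-memory aggregation means, here is a small Python reimplementation of the idea. This is not the libstatistics_collector code, just the same running-statistics technique, using Welford’s online algorithm for the standard deviation.

```python
import math


class RunningStatistics:
    """Constant-memory aggregation of a metric stream (illustrative only)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0          # sum of squared deviations (Welford)
        self.minimum = math.inf
        self.maximum = -math.inf

    def add(self, sample: float) -> None:
        # Update mean and variance accumulator incrementally, O(1) per sample.
        self.count += 1
        delta = sample - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (sample - self.mean)
        self.minimum = min(self.minimum, sample)
        self.maximum = max(self.maximum, sample)

    @property
    def stddev(self) -> float:
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


stats = RunningStatistics()
for cpu_percent in (12.0, 18.5, 9.3, 22.1):
    stats.add(cpu_percent)
print(stats.mean, stats.minimum, stats.maximum, stats.stddev, stats.count)
```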

The system_metrics_collector package provides the ability to monitor a Linux system’s CPU percent used and system memory percent used, and to measure both of these for individual ROS 2 processes. The package defines a composable ROS 2 lifecycle node that collects the Linux CPU measurements. It is important to note that this is a Lifecycle node, so the collection, aggregation, and publication of statistics can be started and stopped with activate and deactivate commands, which provides flexibility in monitoring system behavior and performance without stopping ROS 2 executables. The node periodically measures, aggregates, and publishes a message of CPU percentage used until it is stopped. Besides monitoring application performance, we can also monitor node performance by creating a node designated to measure a process’s CPU and memory percentage used in a similar fashion.
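
Since the collector publishes standard statistics_msgs messages, any node can consume them. The sketch below subscribes to a metrics topic and logs each window’s summary; the topic name 'system_metrics' and the exact field names reflect the statistics_msgs and system_metrics_collector defaults as I understand them, so verify them against your distro.

```python
import rclpy
from rclpy.node import Node
from statistics_msgs.msg import MetricsMessage


class MetricsListener(Node):
    def __init__(self):
        super().__init__('metrics_listener')
        # 'system_metrics' is the topic the collector is assumed to publish on.
        self.create_subscription(
            MetricsMessage, 'system_metrics', self.on_metrics, 10)

    def on_metrics(self, msg: MetricsMessage):
        # Each StatisticDataPoint carries a data_type code (average, min,
        # max, std dev, sample count) and the corresponding value.
        points = ', '.join(f'{p.data_type}={p.data:.2f}' for p in msg.statistics)
        self.get_logger().info(
            f'{msg.measurement_source_name} [{msg.unit}]: {points}')


def main():
    rclpy.init()
    rclpy.spin(MetricsListener())
    rclpy.shutdown()


if __name__ == '__main__':
    main()
```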

Finally, we arrive at the discussion of streaming aggregated data to the cloud. One example: with the cloudwatch_metrics_collector package we can listen to the published aggregated data and stream it to the cloud instead of manually inspecting locally published data. This package gives ROS 2 an interface for metrics nodes to listen for monitoring messages on configurable topics and stream them to the cloud. The cloud offers major performance and storage advantages, specifically the opportunity to offload expensive computation, store data on remote servers instead of locally, and gain access to other data such as maps or images. A huge benefit of cloudwatch_metrics_collector is the ability to track the health of a single node and scale to a fleet with automated monitoring and automated actions.
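
As a simplified stand-in for what such a collector does (this is plain boto3, not the cloudwatch_metrics_collector package itself; the namespace, dimension, and metric names are placeholders), pushing an aggregated data point to CloudWatch looks roughly like this:

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def publish_cpu_metric(robot_id: str, cpu_percent: float) -> None:
    """Push one aggregated CPU data point to CloudWatch."""
    cloudwatch.put_metric_data(
        Namespace="RobotFleet",  # placeholder namespace
        MetricData=[{
            "MetricName": "CpuPercentUsed",
            "Dimensions": [{"Name": "RobotId", "Value": robot_id}],
            "Value": cpu_percent,
            "Unit": "Percent",
        }],
    )


if __name__ == "__main__":
    publish_cpu_metric("robot-001", 37.5)
```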

Now, this monitoring is by no means complete. Additional metrics should be gathered on the business performance of the robot (i.e. is it actually doing its job?). In addition, monitoring of systems like over-the-air updates, inter-robot communication, and traceback logs will become more and more important due to the variability out in the field. Thus, it’s more important than ever to have these observability and deployment systems in place.

The CI/CD Pipeline + GitHub Actions

Robotic systems often need continuous updates, changes, and fixes to sustain a robust, satisfactory product. For this we employ the CI/CD pipeline, which motivates small, incremental changes, frequent check-ins to version control, and testing. Continuous integration (CI) and continuous delivery (CD) follow a set of operating principles that enable software developers to deliver code changes more frequently and reliably.

Recently we’ve seen a trend in moving CI/CD pipelines to the cloud because of the affordable and reliable hardware infrastructure, useful runtime diagnostic metrics, handy user interfaces, and good documentation that the cloud offers. Developers can easily offload testing workflows to the cloud to reduce the need for maintaining separate infrastructure.

The ROS Tooling Working Group provides Actions that help create and set up a cloud-based CI/CD pipeline for ROS packages and can extend the pipeline with more functions such as detecting memory access errors and automatically tracking code coverage (though these can be hard to configure). GitHub, a software hosting platform with version control that most developers are familiar with, hosts a collection of common Actions and workflows that are easy to start with and reuse; they can also be distributed publicly to the dev community through the GitHub Marketplace. Simply put, Actions let developers run arbitrary computations as workflows from their repositories, triggered on specified events such as pull requests or new commits. Let’s delve in a little further.

Workflows

At their core, workflows are simple YAML files made up of trigger conditions and execution steps. The trigger conditions determine when the workflow is executed, for example on a new commit to the master branch. It is also good practice to periodically run CI even when no changes are made, in order to detect breakages caused by external dependencies. Execution steps call individual Actions, which are computation scripts (for example a JavaScript program, or an arbitrary program run in a Docker container) that perform the building and testing. A workflow can contain more than one Action; however, it is good practice to separate different kinds of tests into different workflows, for example linting tests, functional tests, and end-to-end tests.

Actions

The following Actions will be useful in setting up and developing a CI/CD pipeline for your ROS 2 packages:

  • setup-ros — this Action supports setting up a ROS distribution environment to build one or more packages and it will install all the dependencies needed to build a ROS 2 package
  • action-ros-ci — this Action builds a ROS package from source and runs colcon build followed by colcon test on it (the package on which colcon test is to be invoked is provided as an argument to the Action). The Action requires a ROS environment to already be set up, either with the aforementioned setup-ros Action or using a Docker image with all necessary dependencies installed.
  • action-ros-lint — this Action runs ament_lint linters on a ROS package (the package name to lint and the linters are provided as arguments to the Action). A linter is a code analysis tool used to flag programming errors, bugs, style errors, and suspicious constructs.
  • colcon mixins — these can be used within Actions as extra command line arguments for the build tool throughout building and testing. For example, some mixins can help detect memory access errors and concurrency issues in the code, helping catch software bugs that are otherwise hard to find and avoid potential crashes. The colcon-mixin-repository provides many mixins that can be reused.

CI/CD emphasizes the importance of code coverage, since high code coverage is a strong indicator of software quality. Code coverage data is generated by running a package’s test suite and determining which lines of the source code were covered, or executed, by the tests. Developers can focus on improving their code’s health with the help of code coverage tools and web services such as Codecov.io, which reads coverage data output files, shows source code lines hit, missed, and partially hit, and displays coverage history to help track a repository’s code coverage over its lifecycle. Codecov has also developed a GitHub Action so that users automatically get code coverage updates when the Action executes, as coverage results from the Actions CI pipeline are uploaded to the Codecov website.

Release and Maintenance

Hooray! You’ve taken all these steps to ensure your robotic system is suitable for production and you’re finally ready to release your hard work. There are a few different approaches you can take towards this next step, some open source, some closed, and it is up to the developer to decide which is most suitable based on the nature of the project (commercial vs hobby).

  • a compiled library/binary — this is a closed source option that gives users of the system the ability to use a library or executable without being able to see or modify the source code
  • a hosted repository — this entails a ROS package complete with all its source code and documentation on how to compile locally, but no executables or binaries. A hosted repository is usually released on a hosting service such as GitHub
  • Bloom release — Bloom is Open Robotics’ meta-tool for releasing ROS packages into the public build farm. It makes packages publicly accessible via package managers and does not require users to compile. The Bloom CLI walks developers through the necessary steps in preparing a package for distribution. The full tutorial can be found in Open Robotics’ official documentation.

Now, deploying to the actual robot is the hard part. To scale to a fleet, you have to consider proper over-the-air update infrastructure, A/B testing, feature flagging, and overall release management, including questions like “did my build fail on this robot?”, “how do I incrementally update my fleet?”, and “what features are working and what failed?”, and then gather additional metrics on the improved performance of the software.

If this is a problem that you’re facing, we’d love to set up a call with you because this is what we hope to solve.

After all that work you’ll surely want to sit back and relax, but a robotic product requires constant attention in order to remain functional and valuable for consumers. When it comes to maintenance, the goal is to provide users with an experience as seamless as possible and there are a few defined steps to take towards ensuring that.

Quick Note on Contributing to ROS 2

Releasing requires a stable version control scheme to support different distributions. Commonly, each distribution has a distribution-specific development branch and a Git release tag marking a stable release. Each development branch stems from the master branch and highlights the changes that have gone into a specific release. Any changes to a branch after release should be followed by a new release tag to indicate that a change has occurred.

Another thing to mention is that release cycle times for ROS are longer than those of conventional software such as Kubernetes, since robotic applications require stability and thus can’t be updated at the same pace as, say, a website. It is important to remember that updating robots means putting them in a safe state where they can be configured, updated, and restarted properly.

Updating Releases

All changes made to a previously released package must be API and ABI compatible. This means that source code relying on the API does not need to be changed and remains portable, and that changes do not necessitate recompiling the package and its dependencies. If this practice is not followed, the development workflow of every user of the updated package, and of every package that depends on it, is broken, forcing users to recompile and presenting cryptic errors.

Sometimes it is only necessary to backport specific commits that resolve issues in a previous release. Backporting is when a software patch or update is taken from a recent software version and applied to an older version of the same software.

Semantic Versioning

Utilizing a universal versioning system is a useful way to give meaning to your software releases and to keep track of every transition in your software. A system like SemVer gives software authors a way to communicate to the consumers of their software the important information they should know about each release. After introducing a feature or bug fix to a package, it is important to increment the package’s version so users can understand the impact of each update. Versions are incremented according to the type of update made: major, minor, or patch (for example, 1.4.2 becomes 2.0.0 for a breaking change, 1.5.0 for a new backward-compatible feature, and 1.4.3 for a bug fix).

Documentation

I cannot stress enough the importance of good documentation for your systems. A lack of proper documentation can break a product, because it leaves developers on their own to figure out how specific systems work and even how to get started installing and running applications. A difficult onboarding experience is a huge turn-off for users.

A recommended baseline is to include a README.md documenting information about your ROS package and the software it contains (one for every package in a repository). A good README should include:

  • info on the objective of the package
  • maintenance status of the package
  • type and status of CI that is maintained for the package
  • description of software modules included in the package
  • a guide for users to start using the software in the package
  • optionally, a guide for potential contributors, either in the README or in a separate CONTRIBUTING.md file

Go out and build for production!

Wow… that was a lot, I know. Now you have the knowledge to put this all into practice in your next robotic system. We just examined the many, many pieces that go into building a production-ready robotic system and the tools available to do so. Hopefully you’ve gained an understanding of the challenges relating to building robust software for robotics and making it ready for distribution and scaling, and can use this to think about how to develop production ready code from the start of your process.

For the advancement of the open source robotics ecosystem, it is important that all developers learn, practice, and produce documentation on production elements such as running a ROS 2 node on a robot, tweaking its behavior, monitoring its output, building a CI/CD pipeline, and releasing packages following the best community practices. When we continue building out production ready code, it will be way easier to utilize robots as a service.
