A Simple Approach to Teaching Marvin New Tricks

Using Open Source Remote Services to Give a NAO Robot New Cognitive Skills

A couple of years ago I helped teach a NAO robot, named Marvin, to play Rock-paper-scissors using Apache Spark Machine Learning and Apache CouchDB. Our developer advocacy team took Marvin on tour to a bunch of tech conferences around the world.

Marvin, a NAO Robot, playing Rock-paper-scissors

With access to a wide range of free, open source, state-of-the-art deep learning models, I have always wanted to teach him a new set of cognitive skills. Specialized object detection capabilities, for example, could serve as the foundation for a Where’s Wally-ish game. A PoseNet model could launch a new career as a mime. Endless possibilities!

While investigating how to implement those skills, I quickly settled on an approach that minimizes the resource requirements on the robot and provides a high degree of flexibility: connect him to remote services that implement those skills.

Diagram showing how microservices will be implemented via HTTP requests. Back to school, without the fuss.

To leverage a skill, Marvin:

  • captures the desired input (such as video, image, or audio),
  • sends an HTTP request with the appropriate payload to a local or cloud hosted service that implements the skill using deep learning models, and
  • interprets the response and executes the appropriate action.

A basic behavior definition in Choregraphe (the desktop app used to create animations and behaviors for NAOs) that implements these steps might look as follows:

View of a basic behavior definition in Choregraphe that leverages a vision-based skill.

This basic behavior definition is made up of:

  • A Python script box, configuring connectivity for the service(s) that the behavior consumes.
  • A Select Camera box, choosing the desired input for image capture or video recording.
  • A Take Picture, Record Audio, or Record Video box, capturing the desired input.
  • A Python script box, processing the captured input using the local or cloud service.

To achieve acceptable performance, I mostly used a local network setup, connecting the robot to microservices running in Docker on a laptop.
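The connectivity settings held by the first script box can be as small as a lookup table. The sketch below is one hypothetical way to organize them: the skill name, the IP address, and port 5000 are assumptions for a laptop on the same network, not values taken from the actual behavior.

```python
# Hypothetical connectivity settings for the services a behavior consumes.
# Host and port are placeholders; replace them with the address of the
# machine that runs the Docker container.
SERVICES = {
    "object_detection": {
        "host": "192.168.1.20",   # assumed laptop address on the local network
        "port": 5000,             # assumed port published by the container
        "path": "/model/predict",
    },
}


def service_url(skill, services=SERVICES, scheme="http"):
    """Return the full endpoint URL for a named skill."""
    cfg = services[skill]
    return "{}://{}:{}{}".format(scheme, cfg["host"], cfg["port"], cfg["path"])
```

Switching from a local to a cloud-hosted service then only means changing one entry in the table, which is the flexibility this approach is meant to provide.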

Can you see me now?

Let’s take an object detection-based interaction as an example.

After deploying the Inception ResNet v2 model-serving microservice from the free Model Asset eXchange, you can use the service’s REST API endpoint (Swagger spec) to detect and identify objects in the robot’s field of vision.

Marvin’s view of my workspace.

When you make a POST request to /model/predict and pass an image that the camera captured as a parameter, the response includes an array of predictions, each containing a label and a probability for a detected object.
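A request of this kind can be assembled with the Python standard library alone. The sketch below builds (but does not send) the POST. It assumes the service expects the picture as a multipart form field named `image` and listens at a hypothetical `http://localhost:5000`, so check both against the service's Swagger spec. It is written for Python 3; the imports on the robot itself may differ.

```python
import uuid
import urllib.request


def build_predict_request(image_bytes, base_url="http://localhost:5000",
                          filename="view.jpg"):
    """Build a multipart/form-data POST request for /model/predict.

    The form field name "image" and the default base_url are assumptions.
    """
    boundary = uuid.uuid4().hex
    # Part header describing the uploaded file.
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="image"; filename="{f}"\r\n'
        "Content-Type: image/jpeg\r\n\r\n"
    ).format(b=boundary, f=filename).encode("utf-8")
    # Closing boundary that terminates the multipart body.
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/model/predict",
        data=head + image_bytes + tail,
        headers={"Content-Type": "multipart/form-data; boundary=" + boundary},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` (ideally with a timeout) sends the request, and the body of the reply can then be decoded with `json.loads`.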

This is what the relevant portion of the JSON response looks like for Marvin’s view of my workspace:

"predictions": [
  {
    "label_id": "n03782006",
    "label": "monitor",
    "probability": 0.3031136095523834
  },
  {
    "label_id": "n03180011",
    "label": "desktop_computer",
    "probability": 0.26051175594329834
  },
  {
    "label_id": "n03179701",
    "label": "desk",
    "probability": 0.17006246745586395
  }
]

The Python source code in the analyze_view script in the robot’s behavior definition implements the interaction with the microservice as follows:
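This is a minimal sketch of what such a script might do, not the original source. The function names, the 0.2 probability cutoff, and the wording of the spoken sentence are all assumptions. The service call and the speech output are passed in as plain functions so the logic can be tried on a desktop without a robot.

```python
# Sample reply, using the values shown for Marvin's view of the workspace.
SAMPLE_RESPONSE = {
    "predictions": [
        {"label_id": "n03782006", "label": "monitor",
         "probability": 0.3031136095523834},
        {"label_id": "n03180011", "label": "desktop_computer",
         "probability": 0.26051175594329834},
        {"label_id": "n03179701", "label": "desk",
         "probability": 0.17006246745586395},
    ]
}


def spoken_summary(labels):
    """Turn a list of labels into one sentence for text-to-speech."""
    if not labels:
        return "I am not sure what I am looking at."
    if len(labels) == 1:
        return "I see: {}.".format(labels[0])
    return "I see: {} and {}.".format(", ".join(labels[:-1]), labels[-1])


def analyze_view(image_bytes, detect, say, min_probability=0.2):
    """Send the image to the service, keep confident labels, and speak them.

    detect: callable taking image bytes and returning the parsed JSON reply.
    say:    callable taking a string. On the robot this could be the `say`
            method of a NAOqi ALProxy("ALTextToSpeech", robot_ip, 9559).
    """
    response = detect(image_bytes)
    labels = [
        p["label"].replace("_", " ")          # "desktop_computer" -> "desktop computer"
        for p in response.get("predictions", [])
        if p["probability"] >= min_probability
    ]
    sentence = spoken_summary(labels)
    say(sentence)
    return sentence
```

For a dry run, `analyze_view(b"", detect=lambda image: SAMPLE_RESPONSE, say=print)` exercises the logic with the canned reply instead of a live service.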

To test the basic setup, I let the robot simply vocalize the detected objects (as shown in the code snippet above) using the built-in text-to-speech module.

In general, it takes just a couple of minutes to “teach” Marvin a new skill. What’s left to do (and this is in most cases the time-consuming part of these kinds of projects) is to wrap the skill in engaging behaviors that entertain the audience and highlight Marvin’s snarky personality.

If you own (or have access to) a NAO robot, follow the instructions I’ve provided on GitHub to teach your robot a new skill or two. Check out the Model Asset eXchange for more free, open source, deployable, and trainable machine learning models.