Enabling Real-World Object Manipulation with Sim-to-Real Transfer

By Thomas Kollar, Michael Laskey, Kevin Stone, Brijen Thananjeyan, and Mark Tjersland

Toyota Research Institute



While a human can easily differentiate between an object and its reflection, transparent or reflective items commonly found in the home befuddle today’s robots. Since most robots are programmed to react to the objects and geometry in front of them without considering the context of the situation, they are easily fooled by a glass table, shiny toaster or transparent cup.

To overcome these challenges, we developed a novel method to perceive the 3D geometry of the scene while also detecting objects and surfaces. A few weeks ago, we released a video to demonstrate this new technical achievement, and this blog post outlines our approach. A key component is the ability to learn how far objects in the scene are from the robot without using specialized depth sensors, such as RGB-D cameras. By learning to predict distance, the robot is able to find both previously seen and unseen objects and surfaces in the scene that it can manipulate.

Building on our previous work, in this post we detail how robots can manipulate a wide range of unknown objects, from transparent and shiny objects to deformable objects such as t-shirts. Our approach has three components: 1) a simulation environment, 2) the SimNet model, and 3) a classical planning technique that enables novel objects to be grasped from the predictions.


Given the complexity of the predictions that the robot needs to make, it is impractical to label a sufficient amount of real data. Instead, we developed a non-photorealistic simulator that randomizes lighting and textures and can be used to automatically produce large-scale domain-randomized data, including stereo images [1][2][3], 3D oriented bounding boxes, object keypoints, and segmentation masks (see the Figure below for multiple domains: cars, small objects and t-shirts). This data forces the robot perception system to focus on the geometry of the scene rather than on the texture of the objects, enabling precise manipulation. Low-quality rendering greatly speeds up computation and allows a dataset to be generated on the order of an hour. Please visit the SimNet repository to see more examples or to download the full dataset used in the paper, including real-world validation datasets.
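To make the idea of domain randomization concrete, here is a minimal sketch of how per-scene lighting and texture parameters could be sampled before rendering. It is not the simulator described above: the `SceneParams` fields, the value ranges, and the `generate_dataset` helper are all hypothetical names chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class SceneParams:
    """Hypothetical per-scene rendering parameters (illustrative only)."""
    light_intensity: float      # relative brightness, arbitrary units
    light_azimuth_deg: float    # direction of the light around the scene
    object_texture_id: int      # index into a pool of random textures
    background_texture_id: int  # index into the same pool


def sample_scene(rng: random.Random, num_textures: int = 1000) -> SceneParams:
    """Draw one randomized scene; the ranges are assumptions, not source values."""
    return SceneParams(
        light_intensity=rng.uniform(0.2, 2.0),
        light_azimuth_deg=rng.uniform(0.0, 360.0),
        object_texture_id=rng.randrange(num_textures),
        background_texture_id=rng.randrange(num_textures),
    )


def generate_dataset(num_scenes: int, seed: int = 0) -> list:
    """Sample parameters for many scenes; a seeded RNG keeps runs reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(num_scenes)]
```

Because the geometry of each scene is known to the simulator, labels such as bounding boxes and masks come for free; only appearance is randomized, which is what pushes a model toward relying on shape.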

SimNet Model

The second component of our approach is SimNet (see Figure below), a lightweight multi-headed neural network used for 3D perception of objects and surfaces. SimNet is trained exclusively on simulated data: the network takes a stereo image pair as input, and during training these pairs are synthetic images from our simulator. There are four primary outputs. The first is a set of segmentation masks, which identify the outline and shape of each object. The second is a set of 3D oriented bounding boxes (OBBs), which define the principal extents of an object for grasping. The third is a set of keypoints, which can be used to identify, for example, the sleeves or neckline of a shirt. The fourth is a full-resolution disparity image, which is used to determine how far each object is from the camera. The model structure is shown in the Figure below:

A key insight of SimNet is that incorporating scene geometry is important for creating models that work in real-world environments, both for downstream prediction tasks and for robust manipulation. For example, we found that there is enough information in the shape of a cup to determine that it is an object that can be manipulated and to know its extents well enough to manipulate it. Disparity predictions from SimNet are better than depth from traditional sensors such as structured-light RGB-D cameras [6][7][8][9][10], since SimNet works both on transparent objects and in challenging lighting conditions, even though it has only been trained in simulation.
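For readers unfamiliar with disparity, the standard relation for a rectified stereo pair is depth = focal length × baseline / disparity. The sketch below applies that textbook formula to a disparity image; it is not code from SimNet, and the function name and the camera parameters in the usage note are assumptions for illustration.

```python
import numpy as np


def disparity_to_depth(disparity_px, focal_length_px, baseline_m, min_disparity=1e-6):
    """Convert a disparity image (pixels) to metric depth (meters).

    Uses the rectified-stereo relation Z = f * B / d. Pixels whose disparity
    is at or below min_disparity are treated as infinitely far away.
    """
    disparity = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    valid = disparity > min_disparity
    depth[valid] = focal_length_px * baseline_m / disparity[valid]
    return depth
```

With an assumed focal length of 500 pixels and a baseline of 0.1 m, a disparity of 50 pixels corresponds to a depth of 1 m. Larger disparities mean closer objects, which is why errors in disparity on transparent surfaces translate directly into errors in where the robot thinks an object is.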

Grasping and Manipulation

The perception output from SimNet contains the building blocks for manipulating both rigid and deformable objects. For rigid objects, the perceived 3D oriented bounding boxes are used to produce grasp positions, a common approach for grasping objects by their convex hull. For deformable objects such as t-shirts, keypoints can be predicted for features such as the sleeves or the neck. These keypoints and oriented bounding boxes are then converted to grasp positions, which, when combined with a classical planner, enable the execution of complex manipulation behaviors [9][11][12][13][14]. One of the key advantages of our approach is that once SimNet has been trained, it can be used to manipulate a wide variety of unknown objects in an environment. SimNet has been demonstrated to manipulate unknown objects in both optically “easy” and “hard” scenarios using our fleet of Toyota HSR robots in four home environments (see Figure below). An example of the grasping pipeline can be seen in the following Figure:

Grasping Pipeline Example
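As a rough illustration of the step from an oriented bounding box to a grasp, the sketch below derives a top-down grasp by closing the gripper along the shorter horizontal side of the box. This heuristic, the function name, the returned fields, and the 0.12 m gripper opening are assumptions made for the example, not the method used on the robot.

```python
import numpy as np


def top_down_grasp_from_obb(center, rotation, extents, max_gripper_width_m=0.12):
    """Derive a simple top-down grasp from an oriented bounding box.

    center:   (3,) box center in the world frame, z pointing up.
    rotation: (3, 3) matrix whose columns are the box axes in the world frame.
    extents:  (3,) full side lengths along each box axis, in meters.
    Returns a dict with the grasp position, gripper yaw, and required opening,
    or None if the box is too wide for the (assumed) gripper.
    """
    center = np.asarray(center, dtype=float)
    rotation = np.asarray(rotation, dtype=float)
    extents = np.asarray(extents, dtype=float)
    up = np.array([0.0, 0.0, 1.0])

    # The box axis most aligned with "up" is treated as the vertical axis.
    vertical_axis = int(np.argmax(np.abs(rotation.T @ up)))
    horizontal_axes = [i for i in range(3) if i != vertical_axis]

    # Close the gripper along the shorter of the two horizontal sides.
    closing_axis = min(horizontal_axes, key=lambda i: extents[i])
    width = float(extents[closing_axis])
    if width > max_gripper_width_m:
        return None

    # Project the closing direction onto the horizontal plane to get a yaw.
    closing_dir = rotation[:, closing_axis] - rotation[:, closing_axis].dot(up) * up
    closing_dir /= np.linalg.norm(closing_dir)
    yaw = float(np.arctan2(closing_dir[1], closing_dir[0]))
    return {"position": center, "yaw_rad": yaw, "width_m": width}
```

A grasp produced this way would still need to be handed to a motion planner, which checks reachability and collisions before the robot moves.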

Like SimNet, a model that uses input from a structured-light RGB-D camera (typical for most computer vision setups in robotics) grasps most “easy” objects. However, the RGB-D model only grasps 35% of “hard” (e.g., transparent) objects, while SimNet grasps 95%. This suggests that SimNet can enable robust manipulation of unknown objects, including transparent objects, in unknown environments. As the image below illustrates, SimNet can perceive a wide variety of transparent objects; the disparity predictions (top right corner of the images) from the learned stereo model are significantly better than RGB-D in these scenarios, leading to better manipulation performance in these challenging environments:

Experiments were performed across four homes. Using predictions from SimNet, the robot is able to grasp many unknown objects. Some of the grasping runs can be seen below (at 10x speed except where noted):

Using a different output head, SimNet can also be used to fold t-shirts by predicting keypoints on the t-shirt. To do this, the robot first predicts keypoints on the shirt (such as a sleeve), and grasp positions are generated from them. Using these grasp positions, a classical planner is called to produce the motion plan, including the motions that execute each fold of the t-shirt. An example can be seen in the video below (at 10x speed):
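To make the keypoint-to-fold step concrete, here is a minimal sketch that turns predicted keypoints into an ordered list of pick-and-place moves for a planner to execute. The keypoint names, the fold order, and the `plan_tshirt_folds` function are hypothetical; the post does not specify which keypoints or fold sequence are used.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]

# Hypothetical keypoint names; the actual set predicted by the model may differ.
REQUIRED_KEYPOINTS = ("left_sleeve", "right_sleeve", "neck", "bottom_center")


def plan_tshirt_folds(keypoints: Dict[str, Point]) -> List[Tuple[Point, Point]]:
    """Return (pick, place) pairs: fold each sleeve inward, then fold the hem up."""
    missing = [name for name in REQUIRED_KEYPOINTS if name not in keypoints]
    if missing:
        raise ValueError(f"missing keypoints: {missing}")

    neck = keypoints["neck"]
    hem = keypoints["bottom_center"]
    # Midpoint of the shirt's centerline, used as the target for both sleeves.
    midline = tuple((a + b) / 2.0 for a, b in zip(neck, hem))

    return [
        (keypoints["left_sleeve"], midline),
        (keypoints["right_sleeve"], midline),
        (hem, neck),
    ]
```

Each pair would be passed to the planner as a grasp at the pick point followed by a release at the place point, so the perception model only needs to supply the keypoints and the planner handles the motion.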

This research, described in more detail in the paper, adds to the body of knowledge helping robots to reliably navigate and operate in home environments. SimNet is an efficient, multi-headed prediction network that leverages approximate stereo matching to transfer from simulation to reality. The network is trained entirely on simulated data and robustly transfers to real images of unknown, optically challenging objects such as glassware, even in direct sunlight. These predictions are sufficient for robot manipulation tasks such as t-shirt folding and grasping. In future work, we plan to use SimNet to automate household chores. Please visit the SimNet repository for real-world validation datasets and code that can be used to reproduce our results.


[1] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017.

[2] F. Sadeghi and S. Levine. Cad2rl: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.

[3] N. Mayer, E. Ilg, P. Fischer, C. Hazirbas, D. Cremers, A. Dosovitskiy, and T. Brox. What makes good synthetic training data for learning disparity and optical flow estimation? International Journal of Computer Vision, 126(9):942–960, 2018.

[4] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox. A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4040–4048, 2016.

[5] A. Kendall, H. Martirosyan, S. Dasgupta, P. Henry, R. Kennedy, A. Bachrach, and A. Bry. End-to-end learning of geometry and context for deep stereo regression. In Proceedings of the IEEE International Conference on Computer Vision, pages 66–75, 2017.

[6] J. Mahler, M. Matl, V. Satish, M. Danielczuk, B. DeRose, S. McKinley, and K. Goldberg. Learning ambidextrous robot grasping policies. Science Robotics, 4(26), 2019.

[7] C. Xie, Y. Xiang, A. Mousavian, and D. Fox. The best of both modes: Separately leveraging rgb and depth for unseen object instance segmentation. In Conference on robot learning, pages 1369–1378. PMLR, 2020.

[8] S. Sajjan, M. Moore, M. Pan, G. Nagaraja, J. Lee, A. Zeng, and S. Song. Clear grasp: 3d shape estimation of transparent objects for manipulation. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 3634–3642. IEEE, 2020.

[9] P. Sundaresan, J. Grannen, B. Thananjeyan, A. Balakrishna, M. Laskey, K. Stone, J. E. Gonzalez, and K. Goldberg. Learning rope manipulation policies using dense object descriptors trained on synthetic depth data. In 2020 IEEE International Conference on Robotics and Automation (ICRA), pages 9411–9418. IEEE, 2020.

[10] M. Danielczuk, M. Matl, S. Gupta, A. Li, A. Lee, J. Mahler, and K. Goldberg. Segmenting unknown 3d objects from real depth images using mask r-cnn trained on synthetic data. In 2019 International Conference on Robotics and Automation (ICRA), pages 7283–7290. IEEE, 2019.

[11] J. Grannen, P. Sundaresan, B. Thananjeyan, J. Ichnowski, A. Balakrishna, M. Hwang, V. Viswanath, M. Laskey, J. E. Gonzalez, and K. Goldberg. Untangling dense knots by learning task-relevant keypoints. arXiv preprint arXiv:2011.04999, 2020.

[12] A. Ganapathi, P. Sundaresan, B. Thananjeyan, A. Balakrishna, D. Seita, J. Grannen, M. Hwang, R. Hoque, J. E. Gonzalez, N. Jamali, et al. Learning dense visual correspondences in simulation to smooth and fold real fabrics. arXiv preprint arXiv:2003.12698, 2020.

[13] D. Seita, N. Jamali, M. Laskey, A. K. Tanwani, R. Berenstein, P. Baskaran, S. Iba, J. Canny, and K. Goldberg. Deep transfer learning of pick points on fabric for robot bed-making. arXiv preprint arXiv:1809.09810, 2018.

[14] P. Sundaresan, B. Thananjeyan, J. Chiu, D. Fer, and K. Goldberg. Automated extraction of surgical needles from tissue phantoms. In 2019 IEEE 15th International Conference on Automation Science and Engineering (CASE), pages 170–177. IEEE, 2019.



