Multimodality
So far I have relied on visual information from the cameras and physical information from the arm, that is, joint positions and angular velocities. But what if the inputs are more complex? Again, imagine an arm whose goal is to pick up vegetables. The visual information provides per-object geometric properties needed to perform the reaching task accurately, as well as the grasping or pre-grasping task. However, if I do not add haptic information I could grasp the object and make it explode, and we do not want that! Haptic feedback provides observational data on the current contact conditions between each object and the environment. It also improves accuracy in the previously mentioned tasks (as it helps with localization) and helps with control, especially under "problematic" conditions such as occlusion (believe me, that concept is very important for the fruit-picking task!). This means we now have more sources of information that are complementary and concurrent during manipulation. Multimodality has often been used in other ML tasks, for example combining textual and visual information, and it has been used in robotics as well, but much of the research focuses mainly on visual data. I do not think this is an "error" on the researchers' part; keep in mind that research is sometimes about devising the best model for a certain task under a certain constraint, in this case having only visual information. But from a practical and anthropomorphic point of view it makes sense to have more sources of data, as they help shape the "state of the world".
So I will now work on training a policy that uses multimodal information; as explained before, the sources will be vision and touch. These two modalities have very different dimensions, characteristics and frequencies, so we will have to take that into account. The goal is to learn, through self-supervision, a policy that generalizes over variations of the same manipulation task in terms of geometry, configurations and clearances, and that is also robust to external perturbations. I will begin with what I previously used, a neural network that learns a representation, but in this case it will be a shared representation of haptic, visual and proprioceptive data. Since the goal is self-supervision, the network is trained to predict optical flow, whether contact will be made in the next control cycle, and whether the visual and haptic streams are concurrent. Also, since we want to encourage the encoding of action-relevant information, the training needs to be action-conditional. This results in a compact representation of high-dimensional and heterogeneous data as input to the policy, so that we can achieve these contact-aware manipulation tasks. Another thing to keep in mind is that I am decoupling state estimation and control, mainly because in my experiments this is more sample-efficient for learning the representation as well as the policy on the real robot. I have tried this model architecture on a "research task" (a peg task) as well as on a real-world task involving harvesting (which obviously needs more work).
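To make this concrete, here is a minimal numpy sketch of the shared representation and its self-supervised heads. All layer sizes, the single-linear-map "encoders", and the weight names are assumptions for illustration — the real model would use proper per-modality encoders and also an optical-flow head, which I omit here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed feature sizes (hypothetical): visual, haptic, proprioceptive
# features, action dimension, and the fused representation size.
D_VIS, D_HAP, D_PROP, D_ACT, D_Z = 128, 32, 8, 3, 64

# One random linear map per modality stands in for a real encoder.
W_vis = rng.standard_normal((D_VIS, D_Z)) * 0.01
W_hap = rng.standard_normal((D_HAP, D_Z)) * 0.01
W_prop = rng.standard_normal((D_PROP, D_Z)) * 0.01
W_fuse = rng.standard_normal((3 * D_Z, D_Z)) * 0.01

# Self-supervised heads: the contact head is action-conditional (it sees
# the fused state AND the action), the pairing head checks concurrency.
W_contact = rng.standard_normal((D_Z + D_ACT, 1)) * 0.01
W_pair = rng.standard_normal((D_Z, 1)) * 0.01

def encode(vis, hap, prop):
    """Fuse per-modality features into one compact state vector z."""
    parts = [np.tanh(vis @ W_vis), np.tanh(hap @ W_hap), np.tanh(prop @ W_prop)]
    return np.tanh(np.concatenate(parts) @ W_fuse)

def self_supervised_heads(z, action):
    """Surrogate logits used only to train the representation:
    will contact occur next control cycle? are vision & touch concurrent?"""
    contact_logit = (np.concatenate([z, action]) @ W_contact).item()
    pair_logit = (z @ W_pair).item()
    return contact_logit, pair_logit

z = encode(rng.standard_normal(D_VIS), rng.standard_normal(D_HAP),
           rng.standard_normal(D_PROP))
contact, pair = self_supervised_heads(z, np.zeros(D_ACT))
```

The important part is the shape of the computation, not the weights: the heads only exist at training time, while at control time the policy consumes the compact vector `z` alone.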
But let's explore this concept of contact-rich manipulation a little more. These tasks are very important in manufacturing. Standard approaches often use haptic feedback and force control and assume accurate state estimation; they are able to generalize to task variations, but new policies are often required for new geometries. Current research uses RL to address these variations in geometry and configuration. For example, a series of learning-based approaches have relied on haptic feedback for manipulation; some of them are concerned with estimating the stability of a grasp before lifting an object, even suggesting a regrasp. There is also work on learning entire manipulation policies using only haptic feedback. Since there has been promising work that trains manipulation policies in simulation and transfers them to a real robot, I will follow that trend, focus on contact-rich tasks, and add haptic feedback in simulation. As far as I know there is little work on this, most likely because of the lack of fidelity of contact simulation and collision modeling for articulated rigid-body systems.
As I said before, this complementarity of heterogeneous sensor modalities is something that has been explored before, especially for inference and decision making. The diversity of modalities includes vision, audio, range, haptics, proprioception and language, and it is this heterogeneity that makes hand-designed fused features very complex to apply; that is why learning methods are key, and this is where current research is going. For example, "Learning relational object categories using behavioral exploration and multimodal perception" mixes visual with haptic data for grasp stability, manipulation, material recognition and object categorization, while "Learning to represent haptic feedback for partially-observable tasks" fuses vision and range but also adds language labels. In my case I am interested in multimodal representation learning for control: when learning for control, action-conditional predictive representations can encourage the state representation to capture action-relevant information. Some studies have attempted to predict full images when pushing objects, with modest success; in those cases either the underlying dynamics is deterministic, or the control runs at a low frequency. In contrast, my setup operates with haptic feedback at 1 kHz and sends Cartesian control commands at 20 Hz, so I use an action-conditional surrogate objective that predicts optical flow and contact events with self-supervision. There is compelling evidence that the interdependence and concurrency of different sensory streams aid perception and manipulation, yet few studies have explicitly exploited this concurrency in representation learning. Following "Audio-visual scene analysis with self-supervised multisensory features", I decided to build a self-supervised objective that fuses visual and haptic data.
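The rate mismatch is easy to picture with a small sketch: at 1 kHz haptics and 20 Hz control, 50 force/torque readings arrive per control cycle, so each cycle the encoder can consume a fixed-size window of the most recent readings. The buffer layout and the 6-axis reading shape are assumptions for illustration.

```python
import numpy as np

HAPTIC_HZ, CONTROL_HZ = 1000, 20
WINDOW = HAPTIC_HZ // CONTROL_HZ   # 50 readings arrive per control cycle

def haptic_window(force_buffer):
    """Slice the newest WINDOW readings into a fixed-size (WINDOW, 6) array.

    force_buffer: list of 6-axis force/torque readings, newest last.
    A real system would feed this window to the haptic encoder each
    control cycle; here we only demonstrate the buffering arithmetic.
    """
    recent = force_buffer[-WINDOW:]
    return np.stack(recent)

# Simulate a bit more than one control cycle's worth of 1 kHz readings.
buf = [np.zeros(6) for _ in range(120)]
win = haptic_window(buf)
```

This is why the haptic stream needs its own temporal encoding rather than being treated like a single camera frame: every visual frame is paired with a whole window of touch data.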
Before jumping into the model details, I need to expand a bit more on the goal of the task: to learn a policy on a robotic arm that performs a contact-rich manipulation task. This means I want to evaluate a combination of multisensory information, and I also want the ability to transfer the multimodal representation across tasks. I also want sample efficiency, so the plan is to first learn a NN-based feature representation of the multisensory data; this compact feature vector is then used as input to a policy that is learned via RL. The manipulation task is modeled as a finite-horizon, discounted Markov Decision Process (MDP) M, with a state space S, an action space A, state transition dynamics 𝒯 : S × A → S, an initial state distribution p0, a reward function r : S × A → ℝ, a horizon T, and a discount factor γ ∈ (0,1]. The goal is to determine the optimal stochastic policy π : S → P(A) that maximizes the expected discounted reward:

J(π) = 𝔼π[ ∑_{t=0}^{T−1} γ^t r(s_t, a_t) ]
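As a toy sanity check of this objective, the discounted return of a single trajectory is just the γ-weighted sum of its rewards:

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one finite-horizon trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Three steps of reward 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
ret = discounted_return([1.0, 1.0, 1.0], 0.5)  # → 1.75
```

The RL algorithm maximizes the expectation of this quantity over trajectories induced by π.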

The policy will be represented by a NN parametrized by Θπ (we will dive into how these parameters are learned later). S is given by the low-dimensional representation learned from the high-dimensional visual and haptic sensory data; this representation is itself a neural network, parameterized by Θs (and again, we will see how to learn it). A is defined over continuously-valued 3D displacements ∆x in Cartesian space. Finally, we need to take the controller design into account, which I will also detail once I get into the model specifics. But I guess this is enough for today; next time it will be model time!
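As a teaser for the model post, here is a minimal sketch of such a policy head: a Gaussian policy over the 3D displacement ∆x. The representation size and the single linear layer standing in for the real network parametrized by Θπ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

D_Z = 64   # assumed size of the learned multimodal representation
D_A = 3    # action: continuously-valued 3D Cartesian displacement ∆x

# Hypothetical policy parameters: a linear layer for the mean plus a
# state-independent log standard deviation (a common continuous-control choice).
W_pi = rng.standard_normal((D_Z, D_A)) * 0.01
log_std = np.full(D_A, -1.0)

def policy(z):
    """Sample a continuous displacement a ~ π(a|z) = N(mean(z), exp(log_std)^2)."""
    mean = z @ W_pi
    return mean + np.exp(log_std) * rng.standard_normal(D_A)

dx = policy(np.zeros(D_Z))   # one sampled 3D displacement command
```

At 20 Hz, each sampled ∆x is handed to the Cartesian controller, which closes the loop at a much higher rate.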
