Manipulation

Manipulation tasks can often be decomposed into multiple subtasks performed in parallel, e.g., sliding an object to a goal pose while maintaining contact with a table. Individual subtasks can be achieved by task-axis controllers defined relative to the objects being manipulated, and a set of object-centric controllers can be combined in a hierarchy. In prior works, such combinations are defined manually or learned from demonstrations. By contrast, we propose using reinforcement learning to dynamically compose hierarchical object-centric controllers for manipulation tasks. Experiments in both simulation and the real world show how the proposed approach leads to improved sample efficiency, zero-shot generalization to novel test environments, and simulation-to-reality transfer without fine-tuning.


Controllers
• End-effector controllers
• Position controllers
• Force controllers
• Rotation controllers
• Null controller
• Compute joint torques using task-space impedance control

Figure 3: Rotation Controller Composition. Here, the agent rotates the Franka robot hand from the initial pose (A) to the final pose (E), so the gripper aligns with a door handle. A) The rotation controllers to choose from, aligning various axes of the gripper with different target axes. B) Two controllers are chosen, with the higher-priority labeled as (0) and the lower-priority as (1). C) The current and target axes of the lower-priority controller (green arrows) are projected down to the nullspace (the planes) of the current axis of the higher-priority controller (the gripper's blue axis). D) The final rotation is formed by combining the higher-priority rotation in the blue plane with the projected lower-priority rotation in the green plane. Note that the lower-priority rotation does not interfere with the higher-priority one.

Controlling the Robot: We use task-space impedance control to convert task-space targets to configuration-space targets via the Jacobian transpose, and we actuate the robot with joint torques. We first concatenate the translation target Δx with the axis-angle form of the rotation target Δr to form the final 6D delta end-effector target Δ. Then, the robot joint-torque command is computed as τ = J^T (K_S Δ + K_D Δ̇), where K_S and K_D are diagonal stiffness and damping matrices, and J is the analytic Jacobian. Terms for compensating gravity and Coriolis forces are omitted. In practice, we cap the magnitude of Δ to limit maximum control effort, and we add a feed-forward term to the force controllers for better convergence. Once a set of controllers is chosen, the combination runs for T timesteps before the RL policy is queried again for the next set.
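As a concrete illustration, below is a minimal numpy sketch of the torque computation above. The function and variable names are ours, and gravity/Coriolis compensation is omitted, as in the text.

```python
import numpy as np

def impedance_torques(J, delta, delta_dot, K_S, K_D, delta_max=0.05):
    """Task-space impedance control sketch: map the 6D delta end-effector
    target (translation + axis-angle rotation) to joint torques via the
    Jacobian transpose, tau = J^T (K_S @ delta + K_D @ delta_dot).

    J:         (6, n_joints) analytic Jacobian
    delta:     (6,) concatenated [dx; dr] end-effector error
    delta_dot: (6,) time derivative of the error
    K_S, K_D:  (6, 6) diagonal stiffness and damping matrices
    delta_max: cap on |delta| to limit maximum control effort (assumed value)
    """
    norm = np.linalg.norm(delta)
    if norm > delta_max:
        delta = delta * (delta_max / norm)  # cap the magnitude of delta
    # Gravity and Coriolis compensation terms are omitted, as in the text.
    return J.T @ (K_S @ delta + K_D @ delta_dot)
```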

RL with Object-Axis Controllers
We use RL to learn a policy that composes object-axis controllers to perform manipulation tasks.
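To make the decision structure concrete, a hypothetical rollout loop might look as follows. The `env`, `policy`, and `compose` interfaces are placeholder assumptions, not the paper's actual code; only the structure (the policy picks controllers and priorities, and the composed controller runs for T timesteps) follows the text.

```python
def rollout(env, policy, compose, T=100, n_decisions=10):
    """Sketch of the high-level/low-level loop, under assumed interfaces:
      policy(obs)   -> (selected_controllers, priority_order)
      compose(...)  -> callable mapping obs to joint torques
      env.step(tau) -> next observation
    """
    obs = env.reset()
    for _ in range(n_decisions):
        selected, priorities = policy(obs)        # high-level RL action
        controller = compose(selected, priorities)  # nullspace composition
        for _ in range(T):                        # run composite controller
            obs = env.step(controller(obs))       # low-level joint torques
    return obs
```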

Controller Composition
• Priority order: apply lower-priority controllers in the nullspace of the higher-priority controllers

Force-Position Composition: The RL policy selects at most 3 force and position controllers to execute. Only 3 force and position controllers can execute concurrently, because there are only 3 translational dimensions. The RL policy outputs a priority order for these controllers. Let the indices 0, 1, 2 denote the controllers in decreasing priority, so 0 is the highest, and 2 the lowest. The final target is formed by projecting the lower-priority targets onto the nullspaces of the higher-priority controllers and summing them:

Δx_0 = K_x (x_d,0 − x_c,0)
Δx_1 = N([u_0]) K_x (x_d,1 − x_c,1)
Δx_2 = N([u_0; u_1]) K_x (x_d,2 − x_c,2)
Δx = Δx_0 + Δx_1 + Δx_2

Here N(U) = I − U†U is a nullspace projection matrix, where † denotes the pseudoinverse, K_x is the position controller gain, u_i is the axis of the i-th controller, and [·; ·] denotes a concatenation operator, i.e., concatenation of vectors into a matrix. Although the above expressions are written with all 3 controllers as position controllers, in our implementation we combine multiple position and force controllers together. If a force controller is used, for the corresponding controller, swap Δx with Δf, x_d with f_d, x_c with f_c, and K_x with K_f. Figure 2 illustrates the force-position controller composition.
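A minimal numpy sketch of this projection-and-sum composition follows. It assumes, per the notation above, that u_i is the constrained axis of the i-th highest-priority controller and that each controller contributes a gain-scaled error term.

```python
import numpy as np

def nullspace(U):
    """N(U) = I - pinv(U) @ U: projector onto the nullspace of U's rows."""
    return np.eye(U.shape[1]) - np.linalg.pinv(U) @ U

def compose_targets(axes, errors, K_x=1.0):
    """Combine up to 3 position/force controller targets by priority.

    axes:   list of (3,) unit controller axes, highest priority first
    errors: list of (3,) errors (x_d - x_c, or f_d - f_c for force)
    """
    delta = K_x * errors[0]                  # highest priority acts freely
    for i in range(1, len(errors)):
        U = np.stack(axes[:i])               # rows: higher-priority axes
        # Project the lower-priority target into the remaining nullspace,
        # then sum, so it cannot disturb the higher-priority controllers.
        delta = delta + nullspace(U) @ (K_x * errors[i])
    return delta
```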
Figure 2: Force-Position Controller Composition. Here, the agent controls the green block to push the red block up along the vertical gray wall. A) The agent is given 4 controllers to choose from, each corresponding to points of interest in the scene. B) The agent chooses 2 controllers, with the force controller into the red block at the higher priority (0), and the position controller toward the wall corner at the lower priority (1). C) The error of the lower-priority controller is projected onto the nullspace of the higher-priority controller.

Controller Composition (Rotation)
• Rotation controllers try to align one axis of the gripper with a given target axis
• Priority order: apply lower-priority controllers in the nullspace of the higher-priority controllers
• Learn a policy to compose basic controllers for solving a task

Rotation Composition: The RL policy selects at most two rotation controllers to compose. This is because once the highest-priority controller fixes one axis of a rotation frame, there is only one degree of freedom left, which is a rotation in the 2D nullspace of the fixed axis. Similar to the force-position controller compositions, we project the errors of the lower-priority controller onto the nullspace of the higher-priority controller when composing rotations, where K_R denotes a rotation error gain. This procedure ensures the higher-priority rotation controller always reaches its goal, and the trajectory of its axis is not affected by the lower-priority controller (see Figure 3 for an illustration).
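The following is a rough numpy sketch of the two-controller rotation composition using small axis-angle errors. The cross-product error and the projection step are our reading of the description above, not the paper's exact formulas.

```python
import numpy as np

def project_to_plane(v, n):
    """Project v onto the plane with unit normal n, then renormalize.
    Assumes v is not parallel to n."""
    w = v - np.dot(v, n) * n
    return w / np.linalg.norm(w)

def compose_rotations(a0, t0, a1, t1, K_R=1.0):
    """Compose two rotation controllers by priority (axis-angle errors).

    a0, t0: current/target unit axes of the higher-priority controller
    a1, t1: current/target unit axes of the lower-priority controller
    """
    # Higher-priority controller: rotate a0 toward t0.
    r0 = K_R * np.cross(a0, t0)
    # Lower-priority controller acts only in the 2D nullspace of a0:
    # project its current and target axes onto the plane orthogonal to a0.
    a1p = project_to_plane(a1, a0)
    t1p = project_to_plane(t1, a0)
    r1 = K_R * np.cross(a1p, t1p)  # this error axis is parallel to a0
    return r0 + r1                 # so it cannot disturb a0's trajectory
```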
Compared Approaches: We set N_c = 3 across all experiments, which we found to be sufficient. To evaluate the utility of our proposed object-axis controllers, we compare against an RL agent that controls the robot directly via end-effector delta-poses; we call this approach EE-Space. We also evaluate the need for executing multiple controllers in parallel by comparing against a baseline that chooses only 1 controller at each timestep; we call this 1-Ctrlr. To show the efficacy of our proposed Expanded-MDP formulation, we compare against both a discrete combinatorial action space (3-Combo) and continuous priority scores (3-Priority). Both of these approaches naively enumerate all possible controller combinations, and we show how this can lead to sub-optimal performance.

Metrics:
We report the success rates of the learned policies separately for train and test environment configurations. Performance on the train set indicates whether or not the approach can robustly solve a task, and performance on the test set evaluates generalization abilities. The test set is split into two subsets, one with small deviations from the train configurations, and another with larger deviations. We report additional results, including more fine-grained analysis for each task, in the Appendix.

Abstract: For humans, the process of grasping an object relies heavily on rich tactile feedback. Most recent robotic grasping work, however, has been based only on visual input, and thus cannot easily benefit from feedback after initiating contact. In this paper, we investigate how a robot can learn to use tactile information to iteratively and efficiently adjust its grasp. To this end, we propose an end-to-end action-conditional model that learns regrasping policies from raw visuo-tactile data. This model, a deep, multimodal convolutional network, predicts the outcome of a candidate grasp adjustment, and then executes a grasp by iteratively selecting the most promising actions. Our approach requires neither calibration of the tactile sensors, nor any analytical modeling of contact forces, thus reducing the engineering effort required to obtain efficient grasping policies. We train our model with data from about 6,450 grasping trials on a two-finger gripper equipped with GelSight high-resolution tactile sensors on each finger. Across extensive experiments, our approach outperforms a variety of baselines at (i) estimating grasp adjustment outcomes, (ii) selecting efficient grasp adjustments for quick grasping, and (iii) reducing the amount of force applied at the fingers, while maintaining competitive performance. Finally, we study the choices made by our model and show that it has successfully [...]

This is due in part to the difficulty of integrating tactile inputs into standard control schemes. Consequently, the predominant input modalities currently used in the robotic grasping literature are vision and depth.
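As a sketch of the iterative procedure the abstract describes, the loop below scores candidate grasp adjustments with the action-conditional outcome model and greedily executes the most promising one. All interfaces here are hypothetical assumptions, not the authors' API.

```python
def regrasp(get_obs, predict_success, sample_actions, adjust, lift,
            max_steps=5, threshold=0.9):
    """Greedy regrasping sketch under assumed interfaces:
      get_obs()               -> raw visuo-tactile observation
      predict_success(obs, a) -> predicted grasp-success probability
      sample_actions()        -> candidate grasp adjustments
      adjust(a), lift()       -> execute an adjustment / lift the object
    The threshold value is an assumption for illustration.
    """
    for _ in range(max_steps):
        obs = get_obs()
        candidates = sample_actions()
        scores = [predict_success(obs, a) for a in candidates]
        best = scores.index(max(scores))
        adjust(candidates[best])          # apply most promising adjustment
        if scores[best] >= threshold:     # confident enough: stop adjusting
            break
    return lift()
```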

Overall Architecture
Results (Analysis)

[Table: per-object grasp success rates (# successes / # trials), 10 trials per object, for test objects weighing roughly 10 g to 380 g; the object and method labels did not survive extraction.]

Figure 4: Predicted grasp success rate when varying the amount of force F. The model learned that, when stably in contact with the object, there is a correlation between the force applied and the success rate. However, for unstable grasps, the model learned that increasing the grasp force might misplace the object and result in an unsuccessful grasp.

The model learned that there is a correlation between the force and the grasp outcome. However, further analysis shows that the model did not just learn to increase the force in all cases: in multiple situations, very high forces seem to reduce the predicted success rate. For example, we saw this occur when the robot grasped a cube whose corner was only half in contact with the fingers. Due to the shape of the fingers, applying large forces in this configuration could misplace the object and result in an unsuccessful grasp.

Figure 5: What does the model learn? Here we show examples where the network predicts that a downward motion will result in a grasp with (a) a higher or (b) a lower chance of succeeding. Notice that downward movement is predicted to be beneficial for cases where the fingers hold the top of an object, but not when they hold it by the bottom. To more clearly visualize the contact on the robot's fingertip, we show the change in intensity of the GelSight images.

[...] the center-of-mass, and the preference for moving downward. In Fig. 5, we show examples, taken from our dataset, of cases in which the model strongly preferred a downward motion. The force applied under the minimum-force grasp optimization was substantially lower than under the maximum-success criterion (mean of 10 N vs. 20 N). Similar results were obtained also when evaluating [...]
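A small sketch of how a minimum-force criterion can be layered on the same success predictor: among candidate adjustments whose predicted success clears a confidence bar, prefer the one commanding the least force rather than the single highest-scoring action. The threshold value and the `force_of` interface are our assumptions.

```python
def min_force_action(candidates, scores, force_of, min_success=0.85):
    """Minimum-force grasp selection sketch.

    candidates: list of grasp adjustments
    scores:     predicted success probability per candidate
    force_of:   function extracting the commanded grip force from an
                action (hypothetical interface)
    """
    confident = [a for a, s in zip(candidates, scores) if s >= min_success]
    if not confident:
        # No candidate is confident enough: fall back to max-success.
        return candidates[scores.index(max(scores))]
    return min(confident, key=force_of)  # least force among confident grasps
```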