Grasp’D: Differentiable Contact-rich
Grasp Synthesis for Multi-fingered Hands
Abstract
The study of hand-object interaction requires generating viable grasp poses for high-dimensional multi-finger models, often relying on analytic grasp synthesis which tends to produce brittle and unnatural results. This paper presents Grasp’D, an approach for grasp synthesis with a differentiable contact simulation from both known object models and visual inputs. We use gradient-based methods as an alternative to sampling-based grasp synthesis, which fails without simplifying assumptions, such as pre-specified contact locations and eigengrasps. Such assumptions limit grasp discovery and, in particular, exclude high-contact power grasps. In contrast, our simulation-based approach allows for stable, efficient, physically realistic, high-contact grasp synthesis, even for gripper morphologies with many degrees of freedom. We identify and address challenges in making grasp simulation amenable to gradient-based optimization, such as non-smooth object surface geometry, contact sparsity, and a rugged optimization landscape. Grasp’D compares favorably to analytic grasp synthesis on human and robotic hand models, and resultant grasps achieve over 4× denser contact, leading to significantly higher grasp stability. Video and code available at: graspd-eccv22.github.io.
Keywords:
Multi-finger grasping, grasp synthesis, vision-based grasping
1 Introduction
Humans use their hands to interact with objects of varying shape, size, and material thousands of times throughout a single day. Despite being effortless – almost instinctive – these interactions employ a complex visuomotor system, with components that correspond to dedicated areas of computer vision research. Visual inputs from the environment are processed in our brain to recognize objects of interest (object recognition [viola2001rapid, dalal2005histograms, felzenszwalb2009object, girshick2014rich, duan2019centernet]), identify modes of interaction to achieve a certain function (affordance prediction [brahmbhatt2019contactdb, do2018affordancenet, lau2016tactile, porzi2016learning, roy2016multi]), and position our hand(s) in a way that enables that function (pose estimation [hamer2009tracking, supanvcivc2018depth, zimmermann2017learning, baek2019pushing, boukhayma20193d, ge20193d], grasping [kokic2020learning, fang2018tog-ijrr, turpin2021gift]). Proficiency in this task comes from accumulated experience in interacting with the same object over time, and readily extends to new categories or different instances of the same category.
This is an intriguing observation: humans can leverage accumulated knowledge from previous interactions, to quickly infer how to successfully manipulate an unknown object, purely from visual input. Granting machines the same ability to directly translate visual cues into plausible grasp predictions can have significant practical implications in the way robotic manipulators interact with novel objects [saxena2006robotic, fang2018tog-ijrr] or in virtual environments in AR/VR [de2017human, gammieri2017coupling].
Grasp prediction has previously been considered in the context of computer vision [yang2015grasp, nakamura2017complexities, huang2015we, heumer2007grasp] and robotics [pirk2017understanding]. It amounts to predicting the base pose (position and rotation) and joint angles of a robotic or human hand that is stably grasping a given object. This prediction is usually conditioned on visual inputs, such as RGB(D) images, point clouds, etc., and is typically performed online for real-time applications. Predicting grasps from visual inputs can be naturally posed as a learning problem, using paired visual data with their respective grasp annotations. However, capturing and annotating human grasps is laborious and not applicable to robotic grasping, so researchers often rely on datasets of synthetically generated grasps instead (see Table 1 for a list of recent works). Consequently, high-quality datasets of plausible, diverse grasps are crucial for any modern vision system performing grasp prediction, motivating the development of better methods for grasp synthesis.
Grasp synthesis assumes that the complete object geometry (e.g., mesh) is known, and is usually achieved by optimizing over a grasping metric which can be computed analytically or through simulation. Analytic metrics are handcrafted measures of a grasp’s quality. For example, the epsilon metric [ferrari1992planning] measures the magnitude of the smallest force that can break a grasp, computed as a function of the contact positions and normals that the grasp induces. While analytic metrics can be faster to compute, they often transfer poorly to the real world. Simulation-based metrics [eppner2021acronym, kappler2015leveraging, zhou20176dof] measure grasp quality by running a simulation to test grasp effectiveness, e.g., by shaking the object and checking whether it is dropped. These can achieve a higher degree of physical fidelity, but require more computation. In both cases, optimization is usually black box, as neither the analytic metric nor the simulator is differentiable. Black box optimization can find good grasps in a reasonable number of steps as long as the search space is low-dimensional, e.g., when searching the pose space of parallel-jaw grippers [eppner2021acronym, veres2017integrated, depierre2018jacquard, mousavian20196, eppner2019billion]. However, when the number of degrees of freedom becomes larger, as in the case of multi-finger grippers, black box optimization over a grasping metric (whether analytic or simulation-based) becomes infeasible. Simplifying assumptions can be made to reduce the dimensionality of the search space, but they often reduce the plausibility of generated grasps.
To address these shortcomings, we propose Grasp’D, a grasp synthesis pipeline based on differentiable simulation which can generate contact-rich grasps that realistically conform to object surface geometry without any simplifying assumptions. A metric based on differentiable simulation admits gradient-based optimization, which is sample-efficient, even in high-dimensional spaces, and affords all the benefits of simulation-based metrics, i.e., physical plausibility, scalability, and extendability. Differentiable grasping simulation, however, also presents new challenges. Non-smooth object geometry (e.g., at the edges or corners of a cube) results in discontinuities in the contact forces and, subsequently, our grasping metric, complicating gradient-based optimization. Adding to that, if the hand and the object are not touching, small perturbations to the hand pose do not generate any additional force, resulting in vanishing gradients. Finally, the optimization landscape is rugged, making optimization challenging. Once the hand is touching the object, small changes to the hand pose may result in large changes to contact forces (and our metric).
We address these challenges as follows: (1) At the start of each optimization, we simulate contact between the hand and a smoothed, padded version of the object surface that gradually resolves to the true, detailed surface geometry, using a coarse-to-fine approach. This smoothing softens discontinuities in surface normals, allowing gradient-based optimization to smoothly move from one continuous surface area to another. This is enabled by our signed-distance function (SDF) approach to collision detection, which lets us freely recover a rounded object surface as the radius-$r$ level set of the SDF. (2) We allow gradients to leak through force computations for contact points that are not yet in collision, introducing a biased gradient that can be followed to create new contacts. The intuition behind this choice is similar to the one for using LeakyReLU activations to prevent the phenomenon of “dying neurons” in deep neural networks [maas2013rectifier]. (3) Inspired by Contact-Invariant Optimization (CIO) [mordatch2012discovery, mordatch2012contact], we relax the problem formulation by introducing additional force variables that allow physics violations to be treated as a cost rather than a constraint. In effect, this decomposes the problem into finding contact forces that solve the task (of keeping the object stably in place) and finding a hand pose that provides those forces. We evaluate our method on synthetic object models from ShapeNet [chang2015shapenet] and object meshes reconstructed from the YCB RGB-D dataset [calli2017yale]. Experimental results show that our method generates contact-rich, physically realistic grasps and compares favorably with an existing analytic method [hasson2019learning].
Figure 1 displays example grasps generated by our method side-by-side with grasps from [hasson2019learning]. Because we do not make assumptions about contact locations or reduce the dimensionality of the search space, our method can discover contact-rich grasps that are more stable and more plausible than the fingertip-only grasps usually discovered by analytic synthesis. The same procedure works equally for robotic hands. Figure 2 displays snapshots of an optimization trajectory for an Allegro hand. As optimization progresses and our simulated metric decreases, the grasp becomes increasingly stable, plausible, and high-contact.
1.0.1 Summary of contributions:
- We propose a differentiable simulation-based protocol for generating synthetic grasps from visual data. Unlike other simulation-based approaches, our method can scale to tens of thousands of dense contacts, and discover plausible, contact-rich grasps, without any simplifying assumptions.
- We address challenges arising from the differentiable nature of our scheme, using a coarse-to-fine SDF collision detection approach, defining leaky gradients for contact points that are not yet in collision, and integrating physics violations as additional terms to our cost function.
- We show that our method finds grasps with better stability, lower interpenetration, and higher contact area when compared to analytic grasp synthesis baselines, and justify our design choices through extensive evaluations.
2 Related Work
Grasp synthesis. Although analytic metrics have been successfully applied to parallel-jaw gripper grasp synthesis (based on grasp wrench space analysis [ferrari1992planning, miller2004graspit, goldfeder2009columbia], robust grasp wrench space analysis [weisz2012pose, mahler2017dex], or caging [rodriguez2012caging, mahler2016energy]), more recent works [depierre2018jacquard, kappler2015leveraging, mousavian20196, eppner2021acronym] have focused on simulation-based synthesis. While they are more computationally costly, simulation-based metrics for parallel-jaw grasps better align with human judgement [kappler2015leveraging] and with real world performance [mahler2019learning, danielczuk2019reach, mousavian20196, eppner2021acronym]. In contrast to parallel-jaw grippers, multi-finger grasp synthesis is still largely analytic, with many recent works in multi-finger robotic grasping [shao2020unigrasp, lundell2020multi, lundell2021ddgc], grasp affordance prediction [karunratanakul2020grasping], and hand-object pose estimation [hasson2019learning, doosti2020hope, jiang2021hand] relying on datasets of analytically synthesized grasps (see Table 1). Notably, [lundell2020multi, lundell2021ddgc, karunratanakul2020grasping, hasson2019learning, doosti2020hope, jiang2021hand] all use datasets synthesized with the GraspIt! [miller2004graspit] simulator, which is widely used for both multi-finger robotic and human grasp synthesis. The ObMan dataset [hasson2019learning] for hand-object pose estimation (also used in [karunratanakul2020grasping, jiang2021hand]) is constructed by performing grasp synthesis with the MANO hand [MANO:SIGGRAPHASIA:2017] in the GraspIt! Eigengrasp planner, and rendering the synthesized grasps against realistic backgrounds. The GraspIt! Eigengrasp planner optimizes analytic metrics based on grasp wrench space analysis. 
Dimensionality reduction [ciocarlie2007dexterous] in the hand joint space, or using pre-specified contact locations for each hand link can be used to make the problem more tractable, but this limits the space of discoverable grasps and requires careful tuning. Our approach can successfully operate in the full grasp space, eschewing such simplifying assumptions while excelling in terms of physical fidelity over analytic synthesis for multi-finger grippers.
Human grasp capture. To estimate human grasps from visual inputs, existing methods train models on large-scale datasets [brahmbhatt2019contactdb, brahmbhatt2020contactpose, hampali2020honnotate, garcia2018first, tzionas2016capturing]. Collecting these datasets puts humans in a lab with precise, calibrated cameras, lidar, and special gloves for accurately capturing human grasp poses. A human in the loop may also be needed for collecting annotations. All these requirements make the data collection process expensive and laborious. In addition, the captured grasps are only appropriate for human hands and not for robotic ones (which are important for many applications [chen2022system, allshire2021transferring]). Some works [lakshmipathy2022contact, brahmbhatt2019contactgrasp] aim to transfer human grasps to robotic hands by matching contact patterns, but these suffer from important limitations, since the same contacts may not be achievable by human and robotic hands, given differences in their morphology and articulation constraints (e.g., see Fig. 8 of [lakshmipathy2022contact]). Our method provides a procedural way of generating high quality grasps for any type of hand – human or robotic.
Vision-based grasp prediction. Whereas grasp synthesis is useful for generating grasps when full object geometry is available (i.e., a mesh or complete SDF is given), practical scenarios require predicting grasps from visual input. GANHand [corona2020ganhand] learns to predict human grasp affordances (as poses of a MANO [MANO:SIGGRAPHASIA:2017] hand model) from input RGBD images using GANs. Since analytically synthesized datasets do not include many high-contact grasps, the authors also released the YCB Affordance dataset of 367 fine-grained grasps of the YCB object set [calli2017yale], created by manually setting MANO hand joint angles in the GraspIt! simulator’s GUI. Rather than predicting joint angles, Grasping Field [karunratanakul2020grasping] takes an implicit approach to grasp representation by learning to jointly predict signed distances for the MANO hand and the object to be grasped. For parallel-jaw grippers, most recent works [mahler2019learning, mousavian20196, sundermeyer2021contact, jiang2021synergies] learn from simulation-based datasets (e.g., [kappler2015leveraging, eppner2021acronym]). In contrast, multi-finger grasp prediction systems are still trained on either analytically synthesized datasets or datasets of captured human grasps (see Table 1). [lundell2020multi, lundell2021ddgc, karunratanakul2020grasping, hasson2019learning, doosti2020hope, jiang2021hand] all use analytically synthesized datasets from the GraspIt! simulator [miller2004graspit], whereas [brahmbhatt2019contactdb, brahmbhatt2020contactpose, taheri2020grab] use datasets of captured human grasps. [grady2021contactopt, jiang2021hand] use captured human grasps to train a contact model, then refine grasps at test-time by optimizing hand pose to match predicted contacts. The higher quality training data generated by our grasp synthesis pipeline can lead to improved performance for any of these vision-based grasp prediction systems.
Our system can also be used directly for vision-based grasp prediction, by running simulations with reconstructed objects (see Section 4.3).
Differentiable Grasping. We know of two works that have created differentiable grasp metrics in order to take advantage of gradient-based optimization for multi-finger grasp synthesis. [liu2020deep] formulates a differentiable version of the epsilon metric [ferrari1992planning] and uses it to synthesize grasps with the Shadow robotic hand. They formulate the epsilon metric computation as a semidefinite programming (SDP) problem. Sensitivity analysis on this problem can then provide the gradient of the solution with respect to the problem parameters, including gripper pose. They manually label potential contact points on the gripper and, since the gripper may not yet be in contact with the object, use an exponential weighting of points. In contrast, we are able to scale to tens of thousands of contact points. Liu et al. [liu2021synthesizing] formulate a differentiable force closure metric and use gradient-based optimization to synthesize grasps with the MANO [MANO:SIGGRAPHASIA:2017] hand model. Their formulation assumes zero friction and that the magnitude of all contact forces is uniform across contact points (although an error term allows both of these constraints to be slightly violated). Our method requires neither of these assumptions: the user can specify varying friction coefficients, and contact forces at different points are free to vary realistically. Their optimization problem involves finding a hand pose and a subset of candidate contact points on the hand that minimize an energy function. They find that the algorithm performs better with a smaller number of contact points and candidates. Selecting 3 contact points from the 773 candidate vertices of the MANO hand, it takes about 40 minutes to find 5 acceptable grasps. In contrast, our method is able to scale to tens of thousands of contact points while synthesizing an acceptable grasp in about 5 minutes.
Notably, both of these prior works aim to take an analytic metric (the epsilon metric [ferrari1992planning]) and make a differentiable variant. In contrast, we are presenting a differentiable simulation-based metric, which prior work on parallel-jaw grippers suggests will have greater physical fidelity [danielczuk2019reach, mousavian20196, eppner2021acronym] and better match human judgements [kappler2015leveraging] than analytic metrics.
Differentiable Physics. There has been significant progress in the development of differentiable physics engines [hu2019chainqueen, hu2019difftaichi, geilinger2020add, brax2021github, werling2021fast, qiao2021efficient, heiden2021neuralsim, heiden2021disect, xie2022shac]. However, certain limitations in recent approaches render them inadequate for contact-rich grasp synthesis. Brax [brax2021github] and the Tiny Differentiable Simulator [heiden2021neuralsim] only support collision primitives and cannot model general collisions between objects. Nimblephysics [werling2021fast] supports mesh-to-mesh collision, but cannot handle cases where the gradient of contact normals with respect to position is zero (e.g., on a mesh face). While its analytic computation of gradients is fast, Nimblephysics requires manually writing forward and backward passes in C++, and only runs on CPU. Our work presents a new class of differentiable physics simulators that addresses many of these shortcomings. Further, Grasp’D supports GPU parallelism, enabling us to scale to tens of thousands of contacts, effectively approximating surface contacts.
3 Grasp’D: Differentiable Contact-rich Grasp Synthesis
We present a method for solving the grasp synthesis problem (Figure 3). From an input object and hand model (represented respectively by a signed-distance function and an articulation chain with mesh links), we generate a physically-plausible stable grasp, as a base pose and joint angles of the hand. This is achieved by iterative gradient-based optimization over a metric computed by differentiable simulation. The final grasp is dependent on the pose initialization of the hand, so different grasps can be recovered by sampling different starting poses. We detail our method below, but first outline the challenges that motivate our design.
Non-smooth object geometry. When optimizing the location of contacts between a hand and a sphere, the gradient of contact normals with respect to contact positions is well-defined and continuous, allowing gradient-based optimization to smoothly adjust contact positions along the sphere surface. But most objects are not perfectly smooth. Discontinuities in surface normals (e.g., at the edges or corners of a cube) result in discontinuities in contact normals and their gradients with respect to contact positions. Gradient-based optimization cannot effectively optimize across these discontinuities (e.g., cannot follow the gradient to move contact locations from one face of a cube to another). We address this with a coarse-to-fine smoothing approach, optimizing against a smoothed and padded version of the object surface that gradually resolves to the true surface as optimization continues (see Section 3.2).
Contact sparsity. Of all possible contacts between the hand and object, only a sparse subset is active at any given time. If a particular point on the hand is inactive (not in contact with the object), then an infinitesimal perturbation of the hand pose will not change its status (make it touch the object). The gradient of the force applied by any inactive contact (with respect to hand pose) will be exactly zero. This means that gradient-based optimization cannot effectively create new contacts, since contacts that are not already active do not contribute to the gradient. We address this by allowing gradients to leak through the force computations of inactive contacts (see Section 3.3).
Rugged optimization landscape. When many contacts are active (i.e., hand touching the object), small changes to hand pose may result in large changes to contact forces and, subsequently, large changes to our grasp metric. This makes gradient-based optimization challenging. We address this with a problem relaxation inspired by Contact-Invariant Optimization [mordatch2012discovery, mordatch2012contact] (see Section 3.4).
3.1 Rigid body dynamics
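In the interest of speed and simplicity, we limit ourselves to simple rigid body dynamics. Let $q$ and $u$ be the joint and spatial coordinates, respectively, with first and second time derivatives $\dot{q}$, $\ddot{q}$, $\dot{u}$, $\ddot{u}$. Let $M$ be the mass matrix. The kinematic map $H$ maps joint coordinate time derivatives to spatial velocities as $\dot{u} = H\dot{q}$, and the spatial acceleration is related to contact and external forces ($f_c$ and $f_{ext}$) through the following motion equation: $M\ddot{u} = f_c + f_{ext}$, which yields the semi-implicit Euler update used for discrete time stepping [bender2014interactive]:
$$\dot{u}^{t+1} = \dot{u}^{t} + \Delta t\, M^{-1}\left(f_c + f_{ext}\right) \tag{1}$$
$$q^{t+1} = q^{t} + \Delta t\, H^{-1}\dot{u}^{t+1} \tag{2}$$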
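As a minimal sketch of the semi-implicit Euler update, consider the special case of a single point mass, where the kinematic map is the identity and the mass matrix is a scalar times the identity. The function name and arguments are hypothetical and are not taken from the Grasp’D code.

```python
def semi_implicit_euler_step(pos, vel, force, mass, dt):
    """One semi-implicit Euler step for a point mass.

    Assumes H = identity and M = mass * I (an illustrative simplification).
    The velocity is updated first from the current force; the position
    update then uses the *new* velocity.
    """
    new_vel = [v + dt * f / mass for v, f in zip(vel, force)]
    new_pos = [p + dt * v for p, v in zip(pos, new_vel)]
    return new_pos, new_vel


# Example: a 2 kg point moving along x, pushed along -z for one 0.1 s step.
pos, vel = semi_implicit_euler_step(
    [0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, -2.0], mass=2.0, dt=0.1
)
```

Using the updated velocity in the position step is what distinguishes this scheme from explicit Euler, which would advance the position with the old velocity.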
3.2 Object model with coarse-to-fine surface smoothing
SDF representation. For the purpose of collision detection, the hand is represented by a set of surface points $X = \{x_i\}$, and the object to grasp is represented by its Signed Distance Function (SDF), $\phi$ (similar to [fuhrmann2003distance, macklin2020local, bender2014continuous]). The SDF maps a spatial position $x \in \mathbb{R}^3$ to its distance to the closest point on the surface of the object, with a negative or positive sign for interior and exterior points, respectively [osher2006level]. The object surface can be recovered as the zero level-set of the SDF: $\{x \mid \phi(x) = 0\}$. The gradient of the SDF, $\nabla\phi(x)$, is always of unit magnitude, corresponds to the surface normal for $x$ on the object surface, and yields the closest point on the object as $x - \phi(x)\nabla\phi(x)$. SDF representations are well-suited to differentiable collision detection [macklin2020local], since contact forces can be written in terms of a penetration depth ($\phi(x)$) and normal direction ($\nabla\phi(x)$), for which gradients can be computed as $\nabla\phi(x)$ and $\nabla^2\phi(x)$, respectively.
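The SDF properties described above can be checked on a sphere, one of the few shapes with a closed-form SDF. This is an illustrative sketch only; the function names are hypothetical.

```python
import math


def sphere_sdf(x, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(x, center) - radius


def sphere_sdf_grad(x, center):
    """Unit-length SDF gradient (undefined exactly at the center)."""
    d = math.dist(x, center)
    return [(xi - ci) / d for xi, ci in zip(x, center)]


def closest_surface_point(x, center, radius):
    """Project x onto the surface by stepping -phi(x) along the gradient."""
    phi = sphere_sdf(x, center, radius)
    n = sphere_sdf_grad(x, center)
    return [xi - phi * ni for xi, ni in zip(x, n)]


origin = [0.0, 0.0, 0.0]
outside = closest_surface_point([0.0, 0.0, 3.0], origin, 1.0)
inside = closest_surface_point([0.0, 0.0, 0.5], origin, 1.0)
```

Both the exterior query (signed distance 2) and the interior query (signed distance -0.5) project to the same surface point on the unit sphere.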
Whereas primitive objects (e.g., a sphere or box) admit an analytic SDF, this is not the case for complex objects, for which an SDF representation is not readily available. We model the object to be grasped by a discretized SDF which we extract from ground truth meshes (easier to come by for most object sets [calli2017yale, chang2015shapenet]), yielding a 3D grid. Given a query point $x$, to compute $\phi(x)$ based on the grid, we first convert to local shape coordinates (where the object is in canonical pose: unrotated and centered at the origin), yielding $x_{local}$. If $x_{local}$ falls within the bounds of the grid, we map it to grid indices and compute $\phi(x)$ by tri-linear interpolation of neighbouring grid cells. If $x_{local}$ falls outside the grid, we clamp it to the grid bounds, yielding $x_{clamp}$, and compute $\phi(x) = \phi(x_{clamp}) + \lVert x_{local} - x_{clamp} \rVert$.
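A grid lookup of this kind could be sketched as follows. The nested-list grid layout, the argument names, and the choice to add the clamping distance to the interpolated value for out-of-bounds queries are assumptions made for illustration; the query is assumed to be in local shape coordinates already.

```python
import math


def grid_sdf(grid, origin, spacing, x):
    """Tri-linearly interpolate a discretized SDF stored as grid[i][j][k].

    origin is the position of grid[0][0][0]; spacing is the cell size.
    Out-of-bounds queries are clamped to the grid, and the distance from
    the query to the clamped point is added to the interpolated value.
    Each grid axis needs at least two samples.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    upper = [origin[a] + spacing * (dims[a] - 1) for a in range(3)]
    xc = [min(max(x[a], origin[a]), upper[a]) for a in range(3)]
    # Continuous grid coordinates of the clamped point.
    g = [(xc[a] - origin[a]) / spacing for a in range(3)]
    i0 = [min(int(g[a]), dims[a] - 2) for a in range(3)]
    t = [g[a] - i0[a] for a in range(3)]
    value = 0.0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = ((t[0] if di else 1.0 - t[0])
                          * (t[1] if dj else 1.0 - t[1])
                          * (t[2] if dk else 1.0 - t[2]))
                value += weight * grid[i0[0] + di][i0[1] + dj][i0[2] + dk]
    return value + math.dist(x, xc)


# A 3x3x3 grid sampling the plane SDF phi(x, y, z) = z on [-1, 1]^3.
plane = [[[-1.0 + k for k in range(3)] for _ in range(3)] for _ in range(3)]
inside_value = grid_sdf(plane, [-1.0, -1.0, -1.0], 1.0, [0.25, -0.5, 0.3])
outside_value = grid_sdf(plane, [-1.0, -1.0, -1.0], 1.0, [0.0, 0.0, 2.0])
```

Because tri-linear interpolation reproduces linear functions exactly, the in-bounds query recovers the plane distance; the out-of-bounds query is only an approximation in general, although it happens to be exact for a point directly above this plane.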
Coarse-to-fine smoothing. To successfully optimize contact locations over non-smooth object geometry we employ surface smoothing in a coarse-to-fine way. At the start of each optimization, we define the object surface not as the zero level-set of the SDF, but as the radius-$r$ level-set: $\{x \mid \phi(x) = r\}$, which gives a smoothed and padded version of the original surface. As optimization continues, we decrease $r$ on a linear schedule until it reaches $0$, yielding the original surface. This coarse-to-fine smoothing allows gradient-based optimization to effectively move contact points across discontinuities and prevents the optimization from quickly overfitting to local geometric features. We set $r$ to approximately 10cm at the start of each optimization. Details are in Appendix A.2.
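The coarse-to-fine schedule can be sketched in a few lines. The 10cm starting radius comes from the text; the function names and the exact interpolation of the linear schedule are illustrative assumptions (the paper defers details to its Appendix A.2).

```python
def smoothing_radius(step, num_steps, r_start=0.10):
    """Linearly decay the level-set radius from r_start (metres) to 0."""
    frac = min(step / max(num_steps - 1, 1), 1.0)
    return r_start * (1.0 - frac)


def padded_sdf(phi_value, radius):
    """Signed distance to the radius-r level set (the surface inflated by r)."""
    return phi_value - radius


# Radius at the start, middle, and end of a 101-step optimization.
schedule = [smoothing_radius(s, 101) for s in (0, 50, 100)]
```

Subtracting the radius from the SDF value is all that is needed to collide against the padded surface: a point 10cm outside the true surface lies exactly on the padded surface at the first step, and the offset surface of a sharp-edged object has rounded edges.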
3.3 Contact dynamics with leaky gradient
Contact forces. We use a primal (penalty-based) formulation of contact forces, which allows us to compute derivatives with autodiff [baydin2018automatic] and keep a consistent memory footprint. For a given point $x \in X$, the resultant contact force is
$$f_c = f_n + f_t \tag{3}$$
$$f_n = -k_n \min(\phi(x), 0)\, n \tag{4}$$
$$f_t = -\min\left(k_f \lVert v_t \rVert,\ \mu \lVert f_n \rVert\right) \frac{v_t}{\lVert v_t \rVert} \tag{5}$$
where $f_n$ is the normal component, proportional to penetration depth $\phi(x)$, and $f_t$ is the frictional component, computed using a Coulomb friction model. $k_n$ and $k_f$ are the normal and frictional stiffness coefficients, respectively, $\mu$ is the friction coefficient, and $v_t$ is the component of relative velocity between hand and object at the contact point that is tangent to the contact normal $n = \nabla\phi(x)$.
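A penalty-based contact force with Coulomb-capped friction might be sketched as below for a single point. The sign convention (the returned force pushes the penetrating point out along the SDF gradient), the function name, and the numeric values are assumptions for illustration, not the paper's implementation.

```python
import math


def contact_force(phi, normal, v_tangent, k_n, k_f, mu):
    """Penalty normal force plus Coulomb-capped friction at one point.

    phi: signed distance of the hand point (negative when penetrating).
    normal: unit SDF gradient at the point.
    v_tangent: tangential component of the relative velocity.
    """
    depth = min(phi, 0.0)
    f_n = [-k_n * depth * n for n in normal]
    f_n_mag = k_n * abs(depth)
    speed = math.sqrt(sum(v * v for v in v_tangent))
    if speed == 0.0 or f_n_mag == 0.0:
        return f_n  # no sliding, or no contact: no friction
    # Friction opposes sliding and is capped at mu times the normal force.
    f_t_mag = min(k_f * speed, mu * f_n_mag)
    f_t = [-f_t_mag * v / speed for v in v_tangent]
    return [a + b for a, b in zip(f_n, f_t)]


# 1 cm of penetration while sliding along +x at 1 m/s.
force = contact_force(-0.01, [0.0, 0.0, 1.0], [1.0, 0.0, 0.0],
                      k_n=1000.0, k_f=100.0, mu=0.5)
```

In this example the stiffness-based friction (100 N) exceeds the Coulomb limit (0.5 times the 10 N normal force), so the friction magnitude is capped at 5 N.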
Leaky gradients. At any one time, most possible hand-object contacts are inactive, a property we refer to as contact sparsity. Since an infinitesimal perturbation to hand pose will not activate these contacts (i.e., will not make them touch the object), the gradient of their contact forces with respect to hand pose is zero, i.e., $\partial f_c / \partial q = 0$. When the hand is not touching the object, all contacts are inactive and gradient-based optimization can get stuck in a plateau. We work around this by computing a leaky gradient for the normal force term. From equation (4), we have $\partial f_n / \partial \phi(x) = 0$ if $\phi(x) > 0$, but we instead set
$$\frac{\partial \min(\phi(x), 0)}{\partial \phi(x)} = \alpha \quad \text{if } \phi(x) > 0 \tag{6}$$
where $\alpha$ controls how much gradient leaks through the minimum. We set $\alpha = \dots$ in our experiments.
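The effect of the leaky gradient can be illustrated with a one-dimensional toy problem: a point at signed distance `phi` from a surface is moved by gradient descent so that its penalty force matches a target. Everything here (function names, stiffness, learning rate, and the leak value 0.1) is a hypothetical choice for illustration and is not taken from the paper's experiments.

```python
def leaky_min_zero(phi, alpha):
    """Return min(phi, 0) and a leaky derivative with respect to phi.

    The exact derivative is 1 for phi < 0 and 0 for phi > 0; the zero
    branch is replaced by alpha so inactive contacts still get a signal.
    """
    return min(phi, 0.0), (1.0 if phi <= 0.0 else alpha)


def force_loss_grad(phi, k_n, f_target, alpha):
    """d/dphi of (force - f_target)^2 with force = -k_n * min(phi, 0)."""
    depth, d_depth = leaky_min_zero(phi, alpha)
    force = -k_n * depth
    return 2.0 * (force - f_target) * (-k_n) * d_depth


def descend(phi, alpha, steps=400, lr=2.5e-5, k_n=100.0, f_target=1.0):
    """Plain gradient descent on the signed distance phi."""
    for _ in range(steps):
        phi -= lr * force_loss_grad(phi, k_n, f_target, alpha)
    return phi


stuck = descend(0.05, alpha=0.0)   # exact gradient: never reaches contact
moved = descend(0.05, alpha=0.1)   # leaky gradient: contact is created
```

With the exact gradient the point starts 5cm away and never moves, because the loss is flat there. With the leak, it drifts toward the surface and then settles at the penetration depth that produces the target force.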
3.4 Grasping metric and problem relaxation
Simulation setup. To compute the grasp metric, we simulate the rigid-body interaction between a hand and an object. The hand is kinematic (does not react to contact forces), while the object is dynamic (thus subject to contact forces). The simulator state is given by the configuration vector $q$ and its first and second time derivatives $\dot{q}, \ddot{q}$. $q$ is composed of hand and object components $q = (q_h, q_o)$ with corresponding spatial coordinates $u = (u_h, u_o)$. The object is always initialized with the same configuration $q_o$: unrotated and untranslated at the origin. Given a state $(q^t, \dot{q}^t, \ddot{q}^t)$, following equations (1) and (2), our simulator uses a semi-implicit Euler update scheme to compute subsequent state $(q^{t+1}, \dot{q}^{t+1}, \ddot{q}^{t+1})$.
Computing the grasp metric by simulation. To measure the quality of a candidate grasp $q_h$, we test its ability to withstand forces applied to the object. Given an initial state $(q^0, \dot{q}^0, \ddot{q}^0)$, we apply an initial velocity $\dot{u}_o^0$ to the object. The hand is kept static, with $\dot{u}_h = 0$. We run forward simulation to compute the object’s final velocity $\dot{u}_o^T$. A stable grasp will produce contact forces that resist the object velocity, so lower $\lVert \dot{u}_o^T \rVert$ indicates a more stable grasp. In fact, a stable grasp should be able to resist object velocities in any direction, so we perform multiple simulations with different initial velocities and average the results. This suggests the following basic grasp metric: for each of a set of $N$ simulations, indexed by $i$, we set a different initial object velocity, run the simulation, and record $\lVert \dot{u}_{o,i}^T \rVert$. Then, averaging, we have
$$L_{grasp} = \frac{1}{N} \sum_{i=1}^{N} \lVert \dot{u}_{o,i}^{T} \rVert \tag{7}$$
Since $L_{grasp}$ is a differentiable function of the output of a differentiable simulation, it is itself differentiable with respect to $q_h$, and we can compute loss gradients and use gradient-based optimization to find stable grasps.
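The averaging procedure can be sketched with a toy stand-in for the simulator: the object is a point mass, and the "grasp" is any function that maps the object's velocity to a resisting force. The function names, the damping-style force model, and the numeric values are hypothetical; a real simulator would compute the force from hand-object contact.

```python
import math


def final_speed(v0, force_fn, mass, dt, steps=1):
    """Simulate the object velocity forward and return its final speed."""
    v = list(v0)
    for _ in range(steps):
        f = force_fn(v)
        v = [vi + dt * fi / mass for vi, fi in zip(v, f)]
    return math.sqrt(sum(vi * vi for vi in v))


def grasp_metric(initial_velocities, force_fn, mass, dt):
    """Average final object speed over all test velocities (lower is better)."""
    speeds = [final_speed(v0, force_fn, mass, dt) for v0 in initial_velocities]
    return sum(speeds) / len(speeds)


# Six axis-aligned unit test velocities.
axes = [[1.0, 0.0, 0.0], [-1.0, 0.0, 0.0], [0.0, 1.0, 0.0],
        [0.0, -1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, -1.0]]


def no_grasp(v):
    return [0.0, 0.0, 0.0]


def damping_grasp(v):
    return [-5.0 * vi for vi in v]


def one_sided_grasp(v):
    # Resists motion along +x only.
    return [-5.0 * max(v[0], 0.0), 0.0, 0.0]


scores = [grasp_metric(axes, fn, mass=1.0, dt=0.1)
          for fn in (no_grasp, one_sided_grasp, damping_grasp)]
```

The one-sided stand-in scores only slightly better than no grasp at all, which is why the metric averages over several directions rather than testing a single one.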
Unfortunately, in practice, this basic procedure does not succeed. As explained at the beginning of Section 3, the grasp optimization landscape is extremely rugged, with sharp and narrow ridges, peaks, and valleys. Our leaky contact force gradients (see Section 3.3) provide some help in escaping plateaus, but once the hand is in contact with the object, small changes in hand configuration still cause large jumps in contact forces by making/breaking contacts and shifting contact normals. Differentiability alone does not resolve this issue.
Problem relaxation. Inspired by Contact-Invariant Optimization [mordatch2012discovery, mordatch2012contact], we relax the problem, making it more forgiving to gradient-based optimization. Specifically, we introduce additional desired or prescribed contact force variables. This allows us to model physics violations as a cost rather than a constraint. For each surface point on the hand $x_i \in X$, we introduce a 6-dimensional vector $\hat{f}_i$ representing the desired hand-object contact wrench arising from contact at $x_i$.
Our overall loss now has two components. The task loss $L_{task}$ measures whether the prescribed forces successfully resist initial object velocities. This is computed identically to the basic grasp metric of equation (7), except that instead of computing contact forces according to equations (3), (4) and (5), contact forces are simply set equal to the desired forces $\hat{f}_i$. The physics violation loss $L_{phys}$ measures whether the hand configuration actually provides the desired forces $\hat{f}_i$. It is computed as
$$L_{phys} = \sum_{i} \lVert \hat{f}_i - f_c(x_i) \rVert^2 \tag{8}$$
where $f_c(x_i)$ is the contact force arising from the hand pose according to equations (3), (4) and (5).
Intuitively, minimizing these losses corresponds to finding a set of desired forces (as close as possible to the actual contact forces arising from the current hand configuration) that complete the task, and finding a hand configuration that provides those forces. We expect problem formulations derived from and inspired by Contact-Invariant Optimization [mordatch2012discovery, mordatch2012contact] to be a fruitful area of research as they are made newly attractive by advances in differentiable simulation.
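A physics violation term of this kind can be sketched as a sum of squared differences between the desired and the simulated contact forces. The squared Euclidean norm and the function name are assumptions for illustration; the forces are plain lists of components, one list per contact point.

```python
def physics_violation_loss(desired, actual):
    """Sum over contact points of the squared desired-vs-actual mismatch."""
    return sum(
        sum((d - a) ** 2 for d, a in zip(f_des, f_act))
        for f_des, f_act in zip(desired, actual)
    )


# Two contact points: the hand provides half of the first desired force
# and none of the second.
desired_forces = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
actual_forces = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0]]
violation = physics_violation_loss(desired_forces, actual_forces)
```

Because the desired forces are free variables, the optimizer can reduce this term either by moving the hand so that the actual forces change, or by revising the desired forces themselves, which is what makes the relaxed problem more forgiving.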
Additional heuristic losses. We include some additional losses that improve the plausibility of resulting grasps. Most hand models have defined joint range limits. Let $q^{low}$ and $q^{up}$ be the lower and upper joint limits respectively. $L_{range}$ encourages hand joints to be near the middle of their ranges. $L_{limit}$ penalizes hand joints outside of their range. $L_{self}$ penalizes self intersections of the hand.
$$L_{range} = \sum_{j} \left(q_j - \tfrac{1}{2}\left(q_j^{low} + q_j^{up}\right)\right)^2 \tag{9}$$
$$L_{limit} = \sum_{j} \max\left(q_j^{low} - q_j,\ 0\right) + \max\left(q_j - q_j^{up},\ 0\right) \tag{10}$$
$$L_{self} = \sum_{i} \lVert f_{self}(x_i) \rVert \tag{11}$$
The hand is kinematic, so it is not subject to contact forces. However, we still compute forces arising from contact between the hand links, for use in this loss term, as $f_{self}$. We ignore contacts between neighbouring links in the chain. For the purpose of computing $f_{self}$, we represent each hand link as both a point set and an SDF and compute $f_{self}$ according to equations (3), (4), and (5).
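The two joint-angle penalties could be sketched as follows. The quadratic form of the mid-range term and the hinge form of the limit term are assumptions made for illustration, as are the function names.

```python
def range_loss(q, lower, upper):
    """Squared distance of each joint angle from the middle of its range."""
    return sum((qi - 0.5 * (lo + hi)) ** 2
               for qi, lo, hi in zip(q, lower, upper))


def limit_loss(q, lower, upper):
    """Total amount by which joint angles fall outside their limits."""
    return sum(max(lo - qi, 0.0) + max(qi - hi, 0.0)
               for qi, lo, hi in zip(q, lower, upper))


# Two joints limited to [-1, 1] rad: one centred, one 0.5 rad past its limit.
joint_angles = [0.0, 1.5]
lower_limits = [-1.0, -1.0]
upper_limits = [1.0, 1.0]
mid_range_penalty = range_loss(joint_angles, lower_limits, upper_limits)
out_of_range_penalty = limit_loss(joint_angles, lower_limits, upper_limits)
```

The limit term is exactly zero for any pose inside the joint ranges, so it behaves like a constraint, while the mid-range term acts as a soft preference everywhere.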
3.5 Optimization
We use the Modified Differential Multiplier Method [platt1987constrained], treating and as constraints, while minimizing , and . We update our parameters and using the Adamax [kingma2014adam] optimizer. Details of learning rates, and can be found in Appendix A.7.
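The Modified Differential Multiplier Method can be illustrated on a toy constrained problem unrelated to grasping: minimize $x^2 + y^2$ subject to $x + y = 1$. The method performs gradient descent on the variables and gradient ascent on a Lagrange multiplier, with a quadratic damping term on the constraint. The step size, damping weight, and plain gradient steps (rather than Adamax) are illustrative choices.

```python
def mdmm_minimize(steps=2000, lr=0.05, damping=1.0):
    """Minimize f = x^2 + y^2 subject to g = x + y - 1 = 0.

    Descends on (x, y) and ascends on the multiplier lam, using the
    damped Lagrangian f + lam * g + (damping / 2) * g^2.
    """
    x, y, lam = 0.0, 0.0, 0.0
    for _ in range(steps):
        g = x + y - 1.0
        # Gradient of the damped Lagrangian (dg/dx = dg/dy = 1).
        grad_x = 2.0 * x + lam + damping * g
        grad_y = 2.0 * y + lam + damping * g
        x -= lr * grad_x
        y -= lr * grad_y
        lam += lr * g  # ascent on the multiplier
    return x, y, lam


x_opt, y_opt, lam_opt = mdmm_minimize()
```

The iterates approach the constrained optimum at (0.5, 0.5) with multiplier -1. Without the damping term, the descent-ascent dynamics tend to oscillate around the constraint surface instead of settling onto it.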
4 Experiments
Our evaluations and analysis of Grasp’D answer the following questions:
4.1 Experimental setup
For each experiment, we synthesize grasps following the procedure described in Section 3. We compute the metric with simulations: each setting a different initial velocity on the object: , or m/s. Each simulation is run for a single timestep of length s.
Evaluation metrics. We compute evaluation metrics that measure contact, interpenetration, and grasp stability. Following [hasson2019learning], we report the contact area each grasp creates (CA), the volume of hand-object interpenetration (intersection volume, IV), and the ratio between contact area and intersection volume (CA/IV). We compute two analytic measures of stability, the Ferrari-Canny epsilon metric ($\epsilon$) [ferrari1992planning] and the volume metric (Vol), and one simulated measure: the simulation displacement (SD) metric introduced in [hasson2019learning].
Hand parameterization. We use a differentiable PyTorch layer [hasson2019learning] to compute the 773 vertices of the MANO hand [MANO:SIGGRAPHASIA:2017] model. The input is a set of weights for principal components extracted from the MANO dataset of human scans [MANO:SIGGRAPHASIA:2017]. We find that this PCA parameterization provides a useful prior for human-like hand poses. We use the maximum number of principal components (44).
Method | CA | IV | Vol | SD | ||
---|---|---|---|---|---|---|
Scale (Unit) | cm | |||||
ObMan [hasson2019learning] (top2) | ||||||
ObMan [hasson2019learning] (top5) | 1.05 | |||||
Grasp’D (top2) | 7.55 | 0.59 | ||||
Grasp’D (top5) | 41.4 | 7.55 | 5.02 | 1.46 |
4.2 Grasp synthesis with ShapeNet models
We compare to baseline grasps from the ObMan [hasson2019learning] dataset, which were generated with the GraspIt! [miller2004graspit] simulator by optimizing an analytic metric. We report metrics over the top-2 and top-5 grasps per scaled object, with ranking decided by simulation displacement for our method and by ObMan’s heuristic measure (detailed in Appendix C.2 of [hasson2019learning]) for theirs. Further details are given in Appendix A.6.
Data. We evaluate our approach to grasp synthesis by generating grasps with the MANO human hand [MANO:SIGGRAPHASIA:2017] model for 57 ShapeNet [chang2015shapenet] objects that span 8 categories (bottles, bowls, cameras, cans, cellphones, jars, knives, remote controls), and are each considered at 5 different scales (as in ObMan). See Appendix A for details of mesh pre-processing, initialization, simulation, and optimization.
Results. Results are presented in Table 2. Grasps generated by our method (both top-2 and top-5) have a contact area of around . This is higher than the area achieved with fingertip only grasps [brahmbhatt2019contactdb] and about higher than grasps from the ObMan dataset (top-2 or top-5). These contact-rich grasps achieve modest improvements in analytic measures of stability, and a significant reduction in simulation displacement ( for top-2 grasps). Visualizations of our generated grasps in Figure 1 confirm that these grasps achieve larger areas of contact by closely conforming to object surface geometry, whereas the analytically generated grasps largely make use of fingertip contact only. These higher contact grasps have accordingly higher interpenetration, but the ratio between contact area and intersection volume is similar to the ObMan baseline.
4.3 Grasp synthesis from RGB-D input of unknown objects
Setting. One possible application of our method is to direct grasp prediction from RGB-D images by simulation on reconstructed object models. Currently, our method is too slow to be used online (about 5 minutes per grasp), but as simulation speeds increase and recent works in implicit fields push reconstruction accuracy higher and higher, we believe that grasp prediction by simulation models will become increasingly viable. To validate the plausibility of using our method with reconstructed object models, we present results from running our system on meshes reconstructed from RGB-D inputs. We synthesize grasps based on RGB-D (with camera pose) inputs from the YCB object dataset [calli2017yale]. In addition to reconstructed meshes, the YCB dataset provides the original RGB-D captures the meshes are based on. Each object was captured from 5 different cameras at 120 different angles for a total of 600 images. To confirm that our method can work with reconstructions done under more realistic assumptions, we limit our reconstructions to using 5 different angles from 3 cameras (2.5% of captures).
Data. For a subset of the YCB objects, we generate Poisson surface reconstructions and use our method to synthesize MANO hand grasps. Since the inputs are from cameras with a known pose, the object reconstruction is in the world frame. Details are given in Appendix A.4.
Results. Our results confirm the viability of using simulation to synthesize grasps on reconstructed object models. Qualitative results are presented in Figure 4; additional results can be found in Appendix D. Although synthesis does not perform as well as with ground-truth models, plausible human grasps are discovered for many objects and the grasps appear well-aligned with the real-world object poses. Future work could take advantage of learning-based reconstruction methods to achieve grasp synthesis with fewer input images.
Method | CA | IV | Vol | SD | ||
---|---|---|---|---|---|---|
Scale/Unit | cm | |||||
Grasp’D | ||||||
Grasp’D w/o coarse-to-fine | ||||||
Grasp’D w/o problem relaxation |
4.4 Ablation study
We investigate the impact of our coarse-to-fine smoothing (Section 3.2), leaky contact force gradients (Section 3.3), and relaxed problem formulation (Section 3.4). We generate MANO hand grasps on 21 objects from the YCB dataset [calli2017yale]. Grasp’D w/o coarse-to-fine does not pad or smooth the object. Grasp’D w/o problem relaxation attempts to solve the problem without introducing additional force variables or a relaxed objective. This amounts to the “basic procedure” described in Section 3.4, i.e., directly optimize over hand pose to minimize and the heuristic losses.
Results. We adopt the same data as in Section 4.2. Table 3 presents the results. Our relaxed problem formulation is key to our method’s success, and without it, performance greatly degrades by all measures, with discovered grasps creating very little contact (low contact area and intersection volume). Coarse-to-fine smoothing has a modest impact, with all metrics comparable with or without smoothing, except for simulation displacement, which is about 25% higher without smoothing. We did not include a variant without leaky gradient, since this variant would never make contact with the object (if the hand is not touching the object at initialization, there will be no gradient to follow and optimization will immediately be stuck in a plateau).
5 Conclusions
We presented a simulation-based grasp synthesis pipeline capable of generating large datasets of plausible, high-contact grasps. By being differentiable, our simulator is amenable to gradient-based optimization, allowing us to produce high-quality grasps, even for multi-finger grippers, while scaling to thousands of dense contacts. Our experiments have shown that we outperform the existing classical grasping algorithm both quantitatively and qualitatively. Our approach is compatible with PyTorch and can be easily integrated into existing pipelines. More importantly, the produced grasps can directly benefit any vision pipeline that learns grasp prediction from synthetic data.
Grasp’D: Differentiable Contact-rich Grasp Synthesis for Multi-fingered Hands
Supplementary Material
Dylan Turpin Liquan Wang Eric Heiden Yun-Chun Chen Miles Macklin Stavros Tsogkas Sven Dickinson Animesh Garg
University of Toronto, Vector Institute, Nvidia, Samsung
dylanturpin@cs.toronto.edu
Overview
In this supplementary material, we provide additional details and results to complement the main paper. Specifically:
- We describe the details of our implementation and experimental setting (Appendix A).
- We provide additional results of our method applied to the YCB dataset [calli2017yale] with both a human MANO hand model [MANO:SIGGRAPHASIA:2017] and a robotic Allegro hand model (Appendix B).
- We provide visualizations of optimization trajectories for MANO hand grasps of YCB and ShapeNet objects, which show how grasps improve as optimization progresses (Appendix C).
- We provide additional results for the validation of grasp synthesis with RGB-D reconstruction presented in Section 4.3 of the main paper (Appendix D).
Appendix A Details of implementation and experiments
A.1 Dataset listings
A.2 Initialization and smoothing schedule
A.2.1 Initialization.
Since our grasp synthesis pipeline relies on gradient-based optimization, the final result depends on how the parameters are initialized, i.e., different initial hand poses will recover different final grasps. This is a useful quality in that it allows us to sample a variety of grasps for each object by sampling different starting poses. The force variables are always initialized to zero. We employ a simple heuristic (adapted from [brahmbhatt2019contactgrasp]) to initialize the hand pose . We set all hand joints to their fully open position. To find an initial rotation and position for the hand base link, we uniformly sample an approach point on the object surface and a roll angle around the approach vector. We use an approach vector opposing the object surface normal at the approach point, and set the hand rotation such that the palm’s normal is aligned with the approach vector. Finally we apply the sampled roll around the approach vector. We set the hand position so that the palm’s center is at a cm distance from the approach point along the approach vector.
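The initialization heuristic above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function names, the choice of the local z-axis as the palm normal, and the `standoff` value used in the example are assumptions (the actual offset distance is not given in the text here).

```python
import numpy as np

def rotation_about_axis(axis, angle):
    """Rodrigues' formula: rotation by `angle` radians about `axis`."""
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def rotation_aligning(a, b):
    """Rotation matrix that takes direction `a` onto direction `b`."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v, c = np.cross(a, b), float(np.dot(a, b))
    if np.linalg.norm(v) < 1e-8:
        if c > 0.0:
            return np.eye(3)
        # Antiparallel: rotate by pi about any axis perpendicular to `a`.
        helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        return rotation_about_axis(np.cross(a, helper), np.pi)
    return rotation_about_axis(v, np.arctan2(np.linalg.norm(v), c))

def initial_palm_pose(surface_point, surface_normal, roll, standoff,
                      palm_normal_local=(0.0, 0.0, 1.0)):
    """Return (rotation, palm center) for one sampled approach point and roll."""
    # The approach vector opposes the outward surface normal.
    approach = -surface_normal / np.linalg.norm(surface_normal)
    # Align the palm normal with the approach vector, then roll about it.
    align = rotation_aligning(np.asarray(palm_normal_local, dtype=float), approach)
    rotation = rotation_about_axis(approach, roll) @ align
    # Back the palm off from the approach point along the approach vector.
    palm_center = surface_point - standoff * approach
    return rotation, palm_center

# Hypothetical sample: a point on top of an object, with placeholder roll and standoff.
R, center = initial_palm_pose(np.array([0.0, 0.0, 1.0]), np.array([0.0, 0.0, 1.0]),
                              roll=0.3, standoff=0.1)
```

In practice the surface point would be drawn uniformly from the object surface and the roll uniformly from [0, 2π), so that repeated calls yield different starting poses and hence different final grasps.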
A.2.2 Coarse-to-fine smoothing schedule.
We set the initial value for the coarse-to-fine smoothing radius to the distance between the object and the closest point on the hand less cm. The radius is then decreased to on a linear schedule over the first 5,000 steps of a 7,000 step optimization and remains at for the final 2,000 steps. The early steps of the optimization find a rough pose for the hand (where on the object to grasp and an approximate finger configuration) and the later steps optimize over fine-grained geometry, allowing the discovery of grasps that conform closely to detailed surface geometry.
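A linear schedule of this kind can be written in a few lines. The sketch below takes the initial and final radii as parameters, since their values are not recoverable from the text here; the numbers in the example are placeholders, and `smoothing_radius` is a hypothetical name.

```python
def smoothing_radius(step, initial_radius, final_radius, decay_steps=5000):
    """Linearly interpolate the radius over `decay_steps`, then hold it constant."""
    if step >= decay_steps:
        return final_radius
    t = step / decay_steps
    return initial_radius + t * (final_radius - initial_radius)

# Placeholder radii; a 7,000-step run decays for 5,000 steps and holds for 2,000.
radii = [smoothing_radius(s, initial_radius=0.1, final_radius=0.0)
         for s in (0, 2500, 5000, 6999)]
```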
A.3 Mesh processing
We use a discretized SDF representation as described in Section 3.2 of the main paper. Computing the SDF involves some preprocessing. For the experiments in Sections 4.2 and 4.4 (on complete ShapeNet and YCB meshes respectively), the input is a mesh from the relevant dataset. For the RGB-D reconstruction experiment in Section 4.3, the input is a reconstructed mesh (see Appendix A.4 for details of the reconstruction pipeline). To compute the sign of the SDF at a given query point, we must determine whether that point is inside or outside the object. This is more straightforward if the mesh consists of a single closed surface, so we first run ManifoldPlus [huang2020manifoldplus] to compute a watertight mesh. Next, a () grid of points is evenly sampled over the mesh bounding box (padded by cm) and the signed distance of each point to the mesh is computed using libigl [jacobson2017libigl].
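The grid-sampling step can be illustrated with a small sketch. To keep it self-contained, an analytic sphere stands in for the mesh-based signed distance query (which the paper computes with libigl); the function names, grid resolution, and padding below are placeholders, not the values used in the experiments.

```python
import numpy as np

def sample_sdf_grid(signed_distance, bbox_min, bbox_max, padding, resolution):
    """Evaluate a signed distance function on a regular grid over a padded box."""
    lo = np.asarray(bbox_min, dtype=float) - padding
    hi = np.asarray(bbox_max, dtype=float) + padding
    axes = [np.linspace(lo[i], hi[i], resolution) for i in range(3)]
    X, Y, Z = np.meshgrid(*axes, indexing="ij")
    points = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)
    values = signed_distance(points).reshape(resolution, resolution, resolution)
    return axes, values

def sphere_sdf(points, radius=1.0):
    """Stand-in for a mesh SDF: negative inside the sphere, positive outside."""
    return np.linalg.norm(points, axis=1) - radius

axes, grid = sample_sdf_grid(sphere_sdf, [-1, -1, -1], [1, 1, 1],
                             padding=0.5, resolution=5)
```

Padding the bounding box ensures that grid cells just outside the object carry valid positive distances, which matters when contact points approach the surface from outside.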
A.4 Reconstruction pipeline
We describe the RGB-D reconstruction pipeline used in Section 4.3. The YCB object dataset includes RGB-D captures for each object. The object is placed on a spinning platter surrounded by 5 cameras and is captured at each of 120 different angles as the platter is rotated in 3 degree increments. We take 15 of these depth images (captures from the first, third and fifth camera at 5 angles in degree increments). We run the code provided alongside the YCB dataset in order to register the depth maps and combine them into a single world frame point cloud. We create a Poisson reconstruction [kazhdan2006poisson] of this point cloud using the Open3D library [Zhou2018] with a depth of . The resulting mesh is still incomplete because the bottom of the object is not visible (since it is the contact surface between the table and the object). We use PyMeshFix [sullivan2019pyvista] to complete this and any other remaining holes in the mesh.
A.5 Simulation details
We run each simulation for a single timestep of length seconds. For the MANO hand model, all vertices are used as contact locations. For the Allegro hand model, we sample 3000 points on the surface to use as contact locations. In all experiments we set the normal stiffness to , frictional stiffness to , and the friction coefficient to . For the leaky gradient (described in Section 3.3 of the main paper) we set the proportion of gradient that leaks through non-colliding contact forces to . Note that the above applies to simulation during grasp optimization. When we compute simulation displacement for evaluation purposes, we do not use our own simulator, but instead use PyBullet [coumans2021] (details in Appendix A.6).
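The role of the leak proportion can be illustrated with a simplified sketch. It assumes a linear penalty model for the normal force magnitude, which may differ from the contact model in the main paper, and the stiffness and leak values in the example are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def normal_force_and_leaky_grad(phi, stiffness, leak):
    """Penalty force magnitude and a leaky derivative with respect to phi.

    phi is the signed distance of each contact point to the object
    (negative means penetration). The force itself is unchanged: it is zero
    for non-colliding points. Only the derivative is modified, so that a
    fraction `leak` of the colliding-case gradient still reaches points
    that are not yet in contact.
    """
    phi = np.asarray(phi, dtype=float)
    colliding = phi < 0.0
    force = np.where(colliding, -stiffness * phi, 0.0)
    grad = np.where(colliding, -stiffness, -leak * stiffness)
    return force, grad

# One penetrating point and one separated point (placeholder parameters).
force, grad = normal_force_and_leaky_grad([-0.01, 0.02], stiffness=100.0, leak=0.1)
```

With `leak` set to zero, the second point would receive no gradient at all, which is the plateau that leaves a non-touching hand with no direction to move in.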
A.6 Evaluation details
We evaluate grasps in terms of their contact patterns and stability.
A.6.1 Interpenetration volume
is the volume of the intersection between the hand and the object. Lower values are better (since in reality the hand cannot penetrate the hard object). We compute this by voxelizing the hand (with 1mm resolution) and querying the object’s SDF at each voxel position to decide if each hand voxel is overlapping the object or not.
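The voxel-counting computation can be sketched as below, with an analytic half-space standing in for the object SDF and a small box standing in for the voxelized hand. Both stand-ins and the function name are assumptions for illustration; lengths are in centimetres, so 1 mm voxels have side 0.1.

```python
import numpy as np

def interpenetration_volume(hand_voxel_centers, object_sdf, voxel_size):
    """Count hand voxels whose centers lie inside the object, times voxel volume."""
    inside = object_sdf(hand_voxel_centers) < 0.0
    return float(np.count_nonzero(inside)) * voxel_size ** 3

# Stand-in "hand": a solid 2 cm cube voxelized at 1 mm (0.1 cm) resolution.
voxel_size = 0.1
ticks = (np.arange(20) + 0.5) * voxel_size
X, Y, Z = np.meshgrid(ticks, ticks, ticks, indexing="ij")
hand_voxels = np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

# Stand-in "object": the half-space x < 1 cm, whose SDF is simply x - 1.
def half_space_sdf(points):
    return points[:, 0] - 1.0

volume_cm3 = interpenetration_volume(hand_voxels, half_space_sdf, voxel_size)
```

Half of the cube overlaps the half-space, so the result is close to 4 cm³. Contact area follows the same pattern when only surface voxels of the hand are supplied and each counted voxel contributes its face area instead of its volume.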
A.6.2 Contact area
is the area of surface contact (in ) between the hand and the object. This is computed similarly to interpenetration volume, except that only the hand surface is voxelized (i.e., the hand is treated as an empty shell, not a solid volume).
A.6.3 Contact area to interpenetration volume ratio.
Interpenetration can be avoided by simply avoiding contact with the object entirely, so there is a trade-off between interpenetration volume and the other metrics. To capture the amount of interpenetration, conditional on the amount of contact, we report the ratio of contact area to interpenetration volume.
A.6.4 Epsilon (Ferrari-Canny) metric
measures grasp stability using the magnitude of the smallest force that can break a grasp. A more stable grasp can withstand larger forces, so a larger force will be needed to break the grasp. This quantity is equivalent to the size of the largest origin-centered ball contained in the Grasp Wrench Space (GWS [ferrari1992planning]). The GWS is the space of wrenches the contacts induced by the grasp can withstand, assuming that the total hand-object wrench will be a linear combination of the wrenches at each contact with coefficients summing to . Under a Coulomb friction model, the possible wrenches at each contact are defined by a friction cone, which we approximate by a pyramid.
A.6.5 Volume metric
is an alternate measure of stability that considers all the possible forces a grasp can withstand (instead of just the magnitude of the smallest force that breaks the grasp). The volume metric is simply the volume of the GWS.
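Both analytic metrics can be read off the convex hull of the contact wrenches. The sketch below assumes the wrenches have already been computed (for example, from the edges of each contact's friction pyramid) and uses SciPy's `ConvexHull`; the function name is hypothetical, and the 3-D cube in the example is only a readable stand-in for a 6-D wrench set.

```python
import numpy as np
from scipy.spatial import ConvexHull

def gws_metrics(wrenches):
    """Return (epsilon, volume) of the convex hull of a set of wrenches."""
    hull = ConvexHull(np.asarray(wrenches, dtype=float))
    # Each row of hull.equations is [unit outward normal, offset], with
    # normal . x + offset <= 0 for every point x inside the hull.
    offsets = hull.equations[:, -1]
    # The distance from the origin to a facet plane is -offset. The smallest
    # such distance is the radius of the largest origin-centered ball; if the
    # origin lies outside the hull, some offset is positive and epsilon is 0.
    epsilon = max(0.0, float(np.min(-offsets)))
    return epsilon, float(hull.volume)

# Stand-in wrench set: the corners of the cube [-1, 1]^3.
corners = np.array([[x, y, z] for x in (-1, 1) for y in (-1, 1) for z in (-1, 1)])
epsilon, volume = gws_metrics(corners)
```

For the cube, the closest facet is at distance 1 from the origin and the hull volume is 8, which shows how the two metrics differ: epsilon depends only on the weakest direction, while the volume aggregates over all directions.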
A.6.6 Simulation displacement
is a simulation-based, rather than analytic, measure of stability. We use GANHand’s implementation [corona2020ganhand] of a simulation displacement metric in PyBullet [coumans2021] to measure grasp stability by checking how far the object is displaced from its initial pose when the grasp is applied. We use the default physics parameters provided by GANHand except for setting the friction coefficient to (instead of the seemingly high default of ). Whereas we optimize our grasps in our own custom simulator, this metric is computed in a widely used third-party simulator (PyBullet), with a different collision detector, contact model and time stepping scheme. This avoids giving ourselves an unfair advantage by optimizing and evaluating with the same contact model and physics engine, which the baselines we compare to did not have access to.
A.6.7 True signed distances.
Whenever a metric relies on the object SDF (e.g., to compute contact forces or to determine if a voxel is intersecting the object or not), we compute that SDF with libigl [jacobson2017libigl] using the ground truth mesh instead of a discrete grid approximation of the SDF.
A.6.8 Evaluating on top grasps.
For our method, we report metrics for the top 2 and top 5 grasps (ordered by simulation displacement; details below) for each object. For the ObMan dataset, we report the top 2 and top 5 grasps for each object (ordered by their heuristic measure, described below). The ObMan generation procedure uses the GraspIt! simulator to synthesize grasps by optimizing (with simulated annealing) over an analytic metric. Many grasps for each object are generated by running about 70k annealing steps. The top 2 grasps are then selected according to a heuristic measure (see Appendix C.2 of [hasson2019learning]) which encourages palm and phalange contact. This heuristic was explicitly added to compensate for the bias of analytic synthesis towards fingertip-only grasps. To test our own method, we generate 10 grasps for each object, each using 7,000 optimization steps. We report these metrics over the top 2 and top 5 grasps with the lowest simulation displacement.
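The selection step for our method amounts to sorting candidate grasps by simulation displacement and keeping the first few. A minimal sketch, with a hypothetical record layout and made-up displacement values used purely as an example:

```python
def top_k_by_displacement(grasps, k):
    """Keep the k grasps with the lowest simulation displacement."""
    return sorted(grasps, key=lambda g: g["sim_displacement"])[:k]

# Made-up candidates for one object; lower displacement means a more stable grasp.
candidates = [
    {"id": 0, "sim_displacement": 2.4},
    {"id": 1, "sim_displacement": 0.3},
    {"id": 2, "sim_displacement": 1.1},
    {"id": 3, "sim_displacement": 0.7},
]
top2_ids = [g["id"] for g in top_k_by_displacement(candidates, 2)]
```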
A.7 Optimization details
We used the Adamax [zhang2018improved] optimizer to update the hand pose parameters (with a learning rate of ) and force parameters (with a learning rate of ). Some objectives are more important than others, so they are treated as constraints to satisfy rather than costs to minimize. Specifically, we use the Modified Differential Multiplier Method [platt1987constrained], treating and as constraints, while minimizing , and . We set and . Damping is set to for all constraints. During MANO hand experiments, we do not use the joint limit loss or joint limit constraint, as these limits appear to be well-handled implicitly by the PCA parameterization. Similarly, we do not compute the self-intersection loss for the MANO hand, yet recover grasps with low self-intersection due to the hand parameterization. All losses are enabled for the Allegro hand.
A.8 Timing
On a mobile Nvidia RTX 2070, generating a MANO hand grasp for a YCB object (by taking 7,000 optimizer steps) takes about 5 minutes. The MANO hand has only 773 vertices, so the memory footprint of the simulation is limited and three grasps can be synthesized in parallel, reducing average grasp synthesis time to about 2 minutes. While not yet approaching realtime performance, this is comparable to the speed of analytic synthesis with the GraspIt! simulator [miller2004graspit], which takes around 5 minutes [corona2020ganhand] to synthesize a grasp when using the eigengrasp planner with simulated annealing as for the ObMan dataset [hasson2019learning].
Appendix B YCB results
Appendix C Optimization trajectories
Appendix D Additional RGB-D results
We provide additional qualitative results of applying our method to objects reconstructed from RGB-D images in the YCB dataset. Figure 9 shows 3 synthesized grasps for each object visualized from 2 different viewpoints. We also provide quantitative results for our RGB-D experiment (Section 4.3 of the main paper) in Table 7.
Input | CA | IV | CA/IV | SD |
---|---|---|---|---|
GT-Mesh | 42.6 | 2.83 | 15.1 | 0.41 |
RGB-D |
Appendix E Training on synthesized data
Fine-tuning Grasping Field [karunratanakul2020grasping] with data generated by Grasp’D improves performance on unseen YCB objects more than fine-tuning with additional GraspIt! [miller2004graspit] data. Table 8 displays the result of fine-tuning a pre-trained Grasping Field network with additional data synthesized by either the GraspIt! simulator or Grasp’D. The network is first trained for 1400 epochs on the ObMan dataset [hasson2019learning] (of synthetic GraspIt! grasps) and then fine-tuned for 100 epochs on 1000 new grasps (of ShapeNet [chang2015shapenet] objects already included in the ObMan dataset) before final testing on 8 objects from the YCB set. Fine-tuning with Grasp’D data results in significantly higher-contact grasps. This comes with a slight increase in intersection volume, but the ratio of contact area to intersection volume is improved, as is the simulation displacement.
Data source | CA | IV | SD | |
---|---|---|---|---|
GraspIt! [miller2004graspit] | 9.44 | |||
Grasp’D | 21.00 | 1.78 | 2.88 |
Acknowledgements. DT was supported in part by a Vector research grant. The authors appreciate the support of NSERC, Vector Institute and Samsung AI. AG was also supported by NSERC Discovery Grant, NSERC Exploration Grant, CIFAR AI Chair, XSeed Discovery Grant from University of Toronto.