Grasp’D: Differentiable Contact-rich
Grasp Synthesis for Multi-fingered Hands

Dylan Turpin^{1,2,3}, Liquan Wang^{1,2,3}, Eric Heiden^{3}, Yun-Chun Chen^{1,2}, Miles Macklin^{3}, Stavros Tsogkas^{4}, Sven Dickinson^{1,2,4}, Animesh Garg^{1,2,3}

^{1} University of Toronto, ^{2} Vector Institute, ^{3} Nvidia, ^{4} Samsung

dylanturpin@cs.toronto.edu

Abstract

The study of hand-object interaction requires generating viable grasp poses for high-dimensional multi-finger models, often relying on analytic grasp synthesis, which tends to produce brittle and unnatural results. This paper presents Grasp’D, an approach for grasp synthesis with differentiable contact simulation, from both known models and visual inputs. We use gradient-based methods as an alternative to sampling-based grasp synthesis, which fails without simplifying assumptions, such as pre-specified contact locations and eigengrasps. Such assumptions limit grasp discovery and, in particular, exclude high-contact power grasps. In contrast, our simulation-based approach allows for stable, efficient, physically realistic, high-contact grasp synthesis, even for gripper morphologies with many degrees of freedom. We identify and address challenges in making grasp simulation amenable to gradient-based optimization, such as non-smooth object surface geometry, contact sparsity, and a rugged optimization landscape. Grasp’D compares favorably to analytic grasp synthesis on human and robotic hand models, and resultant grasps achieve over 4× denser contact, leading to significantly higher grasp stability. Video and code available at: graspd-eccv22.github.io.

Keywords:

Multi-finger grasping, grasp synthesis, vision-based grasping

Figure 1: Multi-finger grasp synthesis with differentiable simulation. Analytically synthesized grasps, such as those in ObMan [hasson2019learning], which is based on GraspIt! [miller2004graspit], plan sparse contacts at the fingertips. Our method (Grasp’D) discovers stable, contact-rich grasps that conform to detailed object surface geometry. Grasp’D creates larger contact areas that better match the contact distribution of real human grasps.

1 Introduction

Humans use their hands to interact with objects of varying shape, size, and material thousands of times throughout a single day. Despite being effortless – almost instinctive – these interactions employ a complex visuomotor system, with components that correspond to dedicated areas of computer vision research. Visual inputs from the environment are processed in our brain to recognize objects of interest (object recognition [viola2001rapid, dalal2005histograms, felzenszwalb2009object, girshick2014rich, duan2019centernet]), identify modes of interaction to achieve a certain function (affordance prediction [brahmbhatt2019contactdb, do2018affordancenet, lau2016tactile, porzi2016learning, roy2016multi]), and position our hand(s) in a way that enables that function (pose estimation [hamer2009tracking, supanvcivc2018depth, zimmermann2017learning, baek2019pushing, boukhayma20193d, ge20193d], grasping [kokic2020learning, fang2018tog-ijrr, turpin2021gift]). Proficiency in this task comes from accumulated experience in interacting with the same object over time, and readily extends to new categories or different instances of the same category.

This is an intriguing observation: humans can leverage accumulated knowledge from previous interactions, to quickly infer how to successfully manipulate an unknown object, purely from visual input. Granting machines the same ability to directly translate visual cues into plausible grasp predictions can have significant practical implications in the way robotic manipulators interact with novel objects [saxena2006robotic, fang2018tog-ijrr] or in virtual environments in AR/VR [de2017human, gammieri2017coupling].

Grasp prediction has previously been considered in the context of computer vision [yang2015grasp, nakamura2017complexities, huang2015we, heumer2007grasp] and robotics [pirk2017understanding]. It amounts to predicting the base pose (position and rotation) and joint angles of a robotic or human hand that is stably grasping a given object. This prediction is usually conditioned on visual inputs, such as RGB(D) images, point clouds, etc., and is typically performed online for real-time applications. Predicting grasps from visual inputs can be naturally posed as a learning problem, using paired visual data with their respective grasp annotations. However, capturing and annotating human grasps is laborious and not applicable to robotic grasping, so researchers often rely on datasets of synthetically generated grasps instead (see Table 1 for a list of recent works). Consequently, high-quality datasets of plausible, diverse grasps are crucial for any modern vision system performing grasp prediction, motivating the development of better methods for grasp synthesis.

Grasp synthesis assumes that the complete object geometry (e.g., mesh) is known, and is usually achieved by optimizing over a grasping metric which can be computed analytically or through simulation. Analytic metrics are handcrafted measures of a grasp’s quality. For example, the epsilon metric [ferrari1992planning] measures the magnitude of the smallest force that can break a grasp, computed as a function of the contact positions and normals that the grasp induces. While analytic metrics can be computationally faster, they often transfer poorly to the real world. Simulation-based metrics [eppner2021acronym, kappler2015leveraging, zhou20176dof] measure grasp quality by running a simulation to test grasp effectiveness, e.g., by shaking the object and checking whether it is dropped. These can achieve a higher degree of physical fidelity, but require more computation. In both cases, optimization is usually black box, as neither the analytic metric nor the simulator is differentiable. Black box optimization can find good grasps in a reasonable number of steps as long as the search space is low-dimensional, e.g., when searching the pose space of parallel-jaw grippers [eppner2021acronym, veres2017integrated, depierre2018jacquard, mousavian20196, eppner2019billion]. However, when the number of degrees of freedom becomes larger, as in the case of multi-finger grippers, black box optimization over a grasping metric (whether analytic or simulation-based) becomes infeasible. Simplifying assumptions can be made to reduce the dimensionality of the search space, but they often reduce the plausibility of generated grasps.

| Year | Name | Hand Model(s) | Analytic (A) or Human Capture (HC) |
| --- | --- | --- | --- |
| 2019 | ObMan [hasson2019learning] | MANO | A (GraspIt! [miller2004graspit]) |
| 2019 | ContactDB [brahmbhatt2019contactdb] | MANO | HC |
| 2020 | Hope-net [doosti2020hope] | MANO | A (ObMan [hasson2019learning]) |
| 2020 | UniGrasp [shao2020unigrasp] | Various | A (FastGrasp [pokorny2013classical]) |
| 2020 | ContactPose [brahmbhatt2020contactpose] | MANO | HC |
| 2020 | GANHand [corona2020ganhand] | MANO | Other (manual) |
| 2020 | Grasping Field [karunratanakul2020grasping] | MANO | A (ObMan) |
| 2020 | GRAB [taheri2020grab] | MANO | HC |
| 2021 | Multi-Fin GAN [lundell2020multi] | Barrett | A (GraspIt!) |
| 2021 | DDGC [lundell2021ddgc] | Barrett | A (GraspIt!) |
| 2021 | Contact-Consistency [jiang2021hand] | MANO | A (ObMan) |

Table 1: Modern vision-based grasp prediction for multi-finger hands relies on datasets created by human capture or analytic synthesis. Human capture is expensive and does not address the need for robotic grasp datasets. Analytic synthesis is only practical under significant limiting assumptions that exclude key grasp types [corona2020ganhand, hasson2019learning].

To address these shortcomings, we propose Grasp’D, a grasp synthesis pipeline based on differentiable simulation which can generate contact-rich grasps that realistically conform to object surface geometry without any simplifying assumptions. A metric based on differentiable simulation admits gradient-based optimization, which is sample-efficient, even in high-dimensional spaces, and affords all the benefits of simulation-based metrics, i.e., physical plausibility, scalability, and extensibility. Differentiable grasping simulation, however, also presents new challenges. Non-smooth object geometry (e.g., at the edges or corners of a cube) results in discontinuities in the contact forces and, subsequently, our grasping metric, complicating gradient-based optimization. Adding to that, if the hand and the object are not touching, small perturbations to the hand pose do not generate any additional force, resulting in vanishing gradients. Finally, the optimization landscape is rugged: once the hand is touching the object, small changes to the hand pose may result in large changes to contact forces (and our metric), making optimization challenging.

We address these challenges as follows: (1) At the start of each optimization, we simulate contact between the hand and a smoothed, padded version of the object surface that gradually resolves to the true, detailed surface geometry, using a coarse-to-fine approach. This smoothing softens discontinuities in surface normals, allowing gradient-based optimization to smoothly move from one continuous surface area to another. This is enabled by our signed-distance function (SDF) approach to collision detection, which lets us freely recover a rounded object surface as the radius $r$ level set of the SDF. (2) We allow gradients to leak through force computations for contact points that are not yet in collision, introducing a biased gradient that can be followed to create new contacts. The intuition behind this choice is similar to the one for using LeakyReLU activations to prevent the phenomenon of “dying neurons” in deep neural networks [maas2013rectifier]. (3) Inspired by Contact-Invariant Optimization (CIO) [mordatch2012discovery, mordatch2012contact], we relax the problem formulation by introducing additional force variables that allow physics violations to be treated as a cost rather than a constraint. In effect, this decomposes the problem into finding contact forces that solve the task (of keeping the object stably in place) and finding a hand pose that provides those forces. We evaluate our method on synthetic object models from ShapeNet [chang2015shapenet] and object meshes reconstructed from the YCB RGB-D dataset [calli2017yale]. Experimental results show that our method generates contact-rich grasps with physical realism and with favorable performance against an existing analytic method [hasson2019learning].

Figure 1 displays example grasps generated by our method side-by-side with grasps from [hasson2019learning]. Because we do not make assumptions about contact locations or reduce the dimensionality of the search space, our method can discover contact-rich grasps that are more stable and more plausible than the fingertip-only grasps usually discovered by analytic synthesis. The same procedure works equally well for robotic hands. Figure 2 displays snapshots of an optimization trajectory for an Allegro hand. As optimization progresses and our simulated metric decreases, the grasp becomes increasingly stable, plausible, and high-contact.

Summary of contributions.

We propose a differentiable simulation-based protocol for generating synthetic grasps from visual data. Unlike other simulation-based approaches, our method can scale to tens of thousands of dense contacts, and discover plausible, contact-rich grasps, without any simplifying assumptions.
We address challenges arising from the differentiable nature of our scheme, using a coarse-to-fine SDF collision detection approach, defining leaky gradients for contact points that are not yet in collision, and integrating physics violations as additional terms to our cost function.
We show that our method finds grasps with better stability, lower interpenetration, and higher contact area when compared to analytic grasp synthesis baselines, and justify our design choices through extensive evaluations.

Figure 2: Our method can synthesize grasps for both human and robotic hands, such as the four-finger Allegro hand in this figure. After hand initialization, we run gradient-based optimization to iteratively improve the grasp, in terms of stability and contact area. We include additional examples in Appendix B.

2 Related Work

Grasp synthesis. Although analytic metrics have been successfully applied to parallel-jaw gripper grasp synthesis (based on grasp wrench space analysis [ferrari1992planning, miller2004graspit, goldfeder2009columbia], robust grasp wrench space analysis [weisz2012pose, mahler2017dex], or caging [rodriguez2012caging, mahler2016energy]), more recent works [depierre2018jacquard, kappler2015leveraging, mousavian20196, eppner2021acronym] have focused on simulation-based synthesis. While they are more computationally costly, simulation-based metrics for parallel-jaw grasps better align with human judgement [kappler2015leveraging] and with real world performance [mahler2019learning, danielczuk2019reach, mousavian20196, eppner2021acronym]. In contrast to parallel-jaw grippers, multi-finger grasp synthesis is still largely analytic, with many recent works in multi-finger robotic grasping [shao2020unigrasp, lundell2020multi, lundell2021ddgc], grasp affordance prediction [karunratanakul2020grasping], and hand-object pose estimation [hasson2019learning, doosti2020hope, jiang2021hand] relying on datasets of analytically synthesized grasps (see Table 1). Notably, [lundell2020multi, lundell2021ddgc, karunratanakul2020grasping, hasson2019learning, doosti2020hope, jiang2021hand] all use datasets synthesized with the GraspIt! [miller2004graspit] simulator, which is widely used for both multi-finger robotic and human grasp synthesis. The ObMan dataset [hasson2019learning] for hand-object pose estimation (also used in [karunratanakul2020grasping, jiang2021hand]) is constructed by performing grasp synthesis with the MANO hand [MANO:SIGGRAPHASIA:2017] in the GraspIt! Eigengrasp planner, and rendering the synthesized grasps against realistic backgrounds. The GraspIt! Eigengrasp planner optimizes analytic metrics based on grasp wrench space analysis. 
Dimensionality reduction [ciocarlie2007dexterous] in the hand joint space, or using pre-specified contact locations for each hand link can be used to make the problem more tractable, but this limits the space of discoverable grasps and requires careful tuning. Our approach can successfully operate in the full grasp space, eschewing such simplifying assumptions while excelling in terms of physical fidelity over analytic synthesis for multi-finger grippers.

Human grasp capture. To estimate human grasps from visual inputs, existing methods train models on large-scale datasets [brahmbhatt2019contactdb, brahmbhatt2020contactpose, hampali2020honnotate, garcia2018first, tzionas2016capturing]. Collecting these datasets puts humans in a lab with precise, calibrated cameras, lidar, and special gloves for accurately capturing human grasp poses. A human in the loop may also be needed for collecting annotations. All these requirements make the data collection process expensive and laborious. In addition, the captured grasps are only appropriate for human hands and not for robotic ones (which are important for many applications [chen2022system, allshire2021transferring]). Some works [lakshmipathy2022contact, brahmbhatt2019contactgrasp] aim to transfer human grasps to robotic hands by matching contact patterns, but these suffer from important limitations, since the same contacts may not be achievable by human and robotic hands, given differences in their morphology and articulation constraints (e.g., see Fig. 8 of [lakshmipathy2022contact]). Our method provides a procedural way of generating high quality grasps for any type of hand – human or robotic.

Vision-based grasp prediction. Whereas grasp synthesis is useful for generating grasps when full object geometry is available (i.e., a mesh or complete SDF is given), practical scenarios require predicting grasps from visual input. GANHand [corona2020ganhand] learns to predict human grasp affordances (as poses of a MANO [MANO:SIGGRAPHASIA:2017] hand model) from input RGBD images using GANs. Since analytically synthesized datasets do not include many high-contact grasps, the authors also released the YCB Affordance dataset of 367 fine-grained grasps of the YCB object set [calli2017yale], created by manually setting MANO hand joint angles in the GraspIt! simulator’s GUI. Rather than predicting joint angles, Grasping Field [karunratanakul2020grasping] takes an implicit approach to grasp representation by learning to jointly predict signed distances for the MANO hand and the object to be grasped. For parallel-jaw grippers, most recent works [mahler2019learning, mousavian20196, sundermeyer2021contact, jiang2021synergies] learn from simulation-based datasets (e.g., [kappler2015leveraging, eppner2021acronym]). In contrast, multi-finger grasp prediction systems are still trained on either analytically synthesized datasets or datasets of captured human grasps (see Table 1). [lundell2020multi, lundell2021ddgc, karunratanakul2020grasping, hasson2019learning, doosti2020hope, jiang2021hand] all use analytically synthesized datasets from the GraspIt! simulator [miller2004graspit], whereas [brahmbhatt2019contactdb, brahmbhatt2020contactpose, taheri2020grab] use datasets of captured human grasps. [grady2021contactopt, jiang2021hand] use captured human grasps to train a contact model, then refine grasps at test-time by optimizing hand pose to match predicted contacts. The higher quality training data generated by our grasp synthesis pipeline can lead to improved performance for any of these vision-based grasp prediction systems.
Our system can also be used directly for vision-based grasp prediction, by running simulations with reconstructed objects (see Section 4.3).

Differentiable Grasping. We know of two works that have created differentiable grasp metrics in order to take advantage of gradient-based optimization for multi-finger grasp synthesis. [liu2020deep] formulates a differentiable version of the epsilon metric [ferrari1992planning] and uses it to synthesize grasps with the Shadow robotic hand. They formulate the epsilon metric computation as a semidefinite programming (SDP) problem. Sensitivity analysis on this problem can then provide the gradient of the solution with respect to the problem parameters, including gripper pose. They manually label $45$ potential contact points on the gripper. In contrast, we are able to scale to tens of thousands of contact points. Since the gripper may not yet be in contact with the object, they use an exponential weighting of points. Liu et al. [liu2021synthesizing] formulate a differentiable force closure metric and use gradient-based optimization to synthesize grasps with the MANO [MANO:SIGGRAPHASIA:2017] hand model. Their formulation assumes zero friction and that the magnitude of all contact forces is uniform across contact points (although an error term allows both of these constraints to be slightly violated). Our method requires neither of these assumptions: the user can specify varying friction coefficients, and contact forces at different points are free to vary realistically. Their optimization problem involves finding a hand pose and a subset of candidate contact points on the hand that minimize an energy function. They find that the algorithm performs better with a smaller number of contact points and candidates. Selecting 3 contact points from the 773 candidate vertices of the MANO hand, it takes about 40 minutes to find 5 acceptable grasps. In contrast, our method is able to scale to tens of thousands of contact points while synthesizing an acceptable grasp in about 5 minutes.
Notably, both of these prior works aim to take an analytic metric (the epsilon metric [ferrari1992planning]) and make a differentiable variant. In contrast, we are presenting a differentiable simulation-based metric, which prior work on parallel-jaw grippers suggests will have greater physical fidelity [danielczuk2019reach, mousavian20196, eppner2021acronym] and better match human judgements [kappler2015leveraging] than analytic metrics.

Differentiable Physics. There has been significant progress in the development of differentiable physics engines [hu2019chainqueen, hu2019difftaichi, geilinger2020add, brax2021github, werling2021fast, qiao2021efficient, heiden2021neuralsim, heiden2021disect, xie2022shac]. However, certain limitations in recent approaches render them inadequate for contact-rich grasp synthesis. Brax [brax2021github] and the Tiny Differentiable Simulator [heiden2021neuralsim] only support collision primitives and cannot model general collisions between objects. Nimblephysics [werling2021fast] supports mesh-to-mesh collision, but cannot handle cases where the gradient of contact normals with respect to position is zero (e.g., on a mesh face). While its analytic computation of gradients is fast, Nimblephysics requires manually writing forward and backward passes in C++, and only runs on CPU. Our work presents a new class of differentiable physics simulators that addresses many of these shortcomings. Further, Grasp’D supports GPU parallelism, enabling us to scale to tens of thousands of contacts, effectively approximating surface contacts.

Figure 3: Method overview. Grasp’D takes as input the discretized SDF of an object (computed from a mesh or reconstructed from RGB-D) and synthesizes a stable grasp that can hold the object static as we vary the object’s initial velocity. We optimize jointly over a hand pose $u^{(T)}$ and the stabilizing forces $\hat{f}_{c}$ provided by its contacts.

3 Grasp’D: Differentiable Contact-rich Grasp Synthesis

We present a method for solving the grasp synthesis problem (Figure 3). From an input object and hand model (represented respectively by a signed-distance function and an articulation chain with mesh links), we generate a physically-plausible stable grasp, as a base pose and joint angles of the hand. This is achieved by iterative gradient-based optimization over a metric computed by differentiable simulation. The final grasp is dependent on the pose initialization of the hand, so different grasps can be recovered by sampling different starting poses. We detail our method below, but first outline the challenges that motivate our design.

Non-smooth object geometry. When optimizing the location of contacts between a hand and a sphere, the gradient of contact normals with respect to contact positions is well-defined and continuous, allowing gradient-based optimization to smoothly adjust contact positions along the sphere surface. But most objects are not perfectly smooth. Discontinuities in surface normals (e.g., at the edges or corners of a cube) result in discontinuities in contact normals and their gradients with respect to contact positions. Gradient-based optimization cannot effectively optimize across these discontinuities (e.g., cannot follow the gradient to move contact locations from one face of a cube to another). We address this with a coarse-to-fine smoothing approach, optimizing against a smoothed and padded version of the object surface that gradually resolves to the true surface as optimization continues (see Section 3.2).

Contact sparsity. Of all possible contacts between the hand and object, only a sparse subset is active at any given time. If a particular point on the hand is inactive (not in contact with the object), then an infinitesimal perturbation of the hand pose will not change its status (make it touch the object). The gradient of the force applied by any inactive contact (with respect to hand pose) will be exactly zero. This means that gradient-based optimization cannot effectively create new contacts, since contacts that are not already active do not contribute to the gradient. We address this by allowing gradients to leak through the force computations of inactive contacts (see Section 3.3).

Rugged optimization landscape. When many contacts are active (i.e., hand touching the object), small changes to hand pose may result in large changes to contact forces and, subsequently, large changes to our grasp metric. This makes gradient-based optimization challenging. We address this with a problem relaxation inspired by Contact-Invariant Optimization [mordatch2012discovery, mordatch2012contact] (see Section 3.4).

3.1 Rigid body dynamics

In the interest of speed and simplicity, we limit ourselves to simple rigid body dynamics. Let $q$ and $u$ be the joint and spatial coordinates, respectively, with first and second time derivatives $\dot{q}$, $\ddot{q}$, $\dot{u}$, $\ddot{u}$. Let $M$ be the mass matrix. The kinematic map $H$ maps joint coordinate time derivatives to spatial velocities as $\dot{q} = H(q) u$, and is related to contact and external forces ($f_{c}$ and $f_{ext}$) through the following motion equation: $H M H^{\top} \ddot{q} = f_{c} + f_{ext}$, which yields the semi-implicit Euler update used for discrete time stepping [bender2014interactive]:

	$\dot{q}^{(t + 1)}$	$\leftarrow \dot{q}^{(t)} + Δt \, M^{-1} (f_{c} + f_{ext})$		(1)
	$q^{(t + 1)}$	$\leftarrow q^{(t)} + Δt \, \dot{q}^{(t + 1)}.$		(2)
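
As a minimal sketch of the update in equations (1) and (2), the Python snippet below steps a single point mass. It assumes a diagonal mass matrix (passed as a list of inverse masses) and a total force that already sums $f_{c}$ and $f_{ext}$; the function name and the gravity example are illustrative and not part of Grasp’D.

```python
def semi_implicit_euler_step(q, q_dot, force, inv_mass, dt):
    """One step of Eqs. (1)-(2), assuming a diagonal mass matrix.

    The velocity is updated first; the position update then uses the NEW
    velocity, which is what makes the scheme semi-implicit.
    """
    q_dot_next = [v + dt * w * f for v, w, f in zip(q_dot, inv_mass, force)]
    q_next = [p + dt * v for p, v in zip(q, q_dot_next)]
    return q_next, q_dot_next


# Illustrative use: a unit point mass released from rest under gravity.
q0, q_dot0 = [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]
gravity = [0.0, 0.0, -9.81]  # stands in for f_c + f_ext
q1, q_dot1 = semi_implicit_euler_step(
    q0, q_dot0, gravity, inv_mass=[1.0, 1.0, 1.0], dt=0.01
)
```

Because the position update reads the freshly updated velocity, the object already moves during the first step even though it starts at rest.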

3.2 Object model with coarse-to-fine surface smoothing

SDF representation. For the purpose of collision detection, the hand is represented by a set of surface points $X_{h}$, and the object to grasp is represented by its Signed Distance Function (SDF), $ϕ(x)$ (similar to [fuhrmann2003distance, macklin2020local, bender2014continuous]). The SDF maps a spatial position $x \in \mathbb{R}^{3}$ to its distance to the closest point on the surface of the object, with a negative or positive sign for interior and exterior points, respectively [osher2006level]. The object surface can be recovered as the zero level-set of the SDF: $\{x \mid ϕ(x) = 0\}$. The gradient of the SDF $\nabla ϕ(x)$ is always of unit magnitude, corresponds to the surface normal for $x$ on the object surface, and yields the closest point on the object as $x - ϕ(x) \nabla ϕ(x)$. SDF representations are well-suited to differentiable collision detection [macklin2020local], since contact forces can be written in terms of a penetration depth ($ϕ$) and normal direction ($\nabla ϕ$), for which gradients can be computed as $\nabla ϕ$ and $\nabla^{2} ϕ$, respectively.
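
The SDF properties above can be checked on a primitive with an analytic SDF. The sketch below uses a sphere, chosen here purely for illustration; the helper names are hypothetical.

```python
import math


def sphere_sdf(x, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return math.dist(x, center) - radius


def sphere_sdf_grad(x, center):
    """Gradient of the sphere SDF: a unit vector pointing away from the center.

    Undefined at the center itself (division by zero), as for any SDF at
    points that are equidistant from several surface points.
    """
    d = math.dist(x, center)
    return [(xi - ci) / d for xi, ci in zip(x, center)]


def closest_surface_point(x, center, radius):
    """Project x onto the surface via x - phi(x) * grad_phi(x)."""
    phi = sphere_sdf(x, center, radius)
    grad = sphere_sdf_grad(x, center)
    return [xi - phi * gi for xi, gi in zip(x, grad)]


center, radius = [0.0, 0.0, 0.0], 1.0
x = [0.0, 0.0, 3.0]
phi_x = sphere_sdf(x, center, radius)                # 2.0: two units outside
projected = closest_surface_point(x, center, radius)  # [0.0, 0.0, 1.0]
```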

Whereas primitive objects (e.g., a sphere or box) admit an analytic SDF, this is not the case for complex objects, for which an SDF representation is not readily available. We model the object to be grasped by a discretized SDF which we extract from ground truth meshes (easier to come by for most object sets [calli2017yale, chang2015shapenet]), yielding a 3D grid. Given a query point $x$ , to compute $ϕ (x)$ based on the grid, we first convert $x$ to local shape coordinates (where the object is in canonical pose: unrotated and centered at the origin), yielding $x_{local}$ . If $x_{local}$ falls within the bounds of the grid, we map it to grid indices and compute $ϕ (x_{local})$ by tri-linear interpolation of neighbouring grid cells. If $x_{local}$ falls outside the grid, we clamp it to the grid bounds, yielding $x_{clamp}$ , and compute $ϕ (x) := ϕ (x_{clamp}) + ∥ x - x_{clamp} ∥$ .
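
The grid lookup described above can be sketched as follows. This is an illustrative pure-Python version, assuming a cubic grid with uniform spacing `h`, lower corner `lo`, and a query point already expressed in local shape coordinates; the grid here is sampled from an analytic sphere rather than extracted from a mesh.

```python
import math


def make_grid(sdf, lo, h, n):
    """Sample an SDF on an n x n x n grid with lower corner lo and spacing h."""
    return [[[sdf([lo[0] + i * h, lo[1] + j * h, lo[2] + k * h])
              for k in range(n)] for j in range(n)] for i in range(n)]


def trilinear(grid, lo, h, n, x):
    """Tri-linear interpolation at a point x that lies within the grid bounds."""
    idx, frac = [], []
    for a in range(3):
        t = (x[a] - lo[a]) / h
        i0 = min(int(math.floor(t)), n - 2)  # keep i0 + 1 a valid index
        idx.append(i0)
        frac.append(t - i0)
    value = 0.0
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((frac[0] if di else 1.0 - frac[0]) *
                     (frac[1] if dj else 1.0 - frac[1]) *
                     (frac[2] if dk else 1.0 - frac[2]))
                value += w * grid[idx[0] + di][idx[1] + dj][idx[2] + dk]
    return value


def grid_sdf(grid, lo, h, n, x_local):
    """phi(x): interpolate at the clamped point, then add the clamping distance.

    Inside the grid the clamped point equals x_local, so the extra term is 0.
    """
    hi = [l + (n - 1) * h for l in lo]
    x_clamp = [min(max(xa, la), ha) for xa, la, ha in zip(x_local, lo, hi)]
    return trilinear(grid, lo, h, n, x_clamp) + math.dist(x_local, x_clamp)


lo, h, n = [-1.0, -1.0, -1.0], 0.25, 9
grid = make_grid(lambda p: math.dist(p, [0.0, 0.0, 0.0]) - 0.5, lo, h, n)
inside = grid_sdf(grid, lo, h, n, [0.625, 0.0, 0.0])  # within the grid
outside = grid_sdf(grid, lo, h, n, [3.0, 0.0, 0.0])   # clamped to the bound
```

The world-to-local transform is omitted; in a full pipeline the query point would first be mapped into the object's canonical frame.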

Coarse-to-fine smoothing. To successfully optimize contact locations over non-smooth object geometry, we employ surface smoothing in a coarse-to-fine way. At the start of each optimization, we define the object surface not as the zero level-set of the SDF, but as the radius $r$ level-set: $\{x \mid ϕ(x) = r\}$ with $r > 0$, which gives a smoothed and padded version of the original surface. As optimization continues, we decrease $r$ on a linear schedule until it reaches $0$, yielding the original surface. This coarse-to-fine smoothing allows gradient-based optimization to effectively move contact points across discontinuities and prevents the optimization from quickly overfitting to local geometric features. We set $r$ to approximately 10cm at the start of each optimization. Details are in Appendix A.2.
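
A hedged sketch of the coarse-to-fine idea: subtracting $r$ from the SDF moves the zero level-set out to the radius $r$ level-set, and $r$ decays linearly with the optimization step. The schedule function, the box dimensions, and the step counts below are assumptions for illustration; the actual schedule is specified in Appendix A.2.

```python
import math


def padding_radius(step, num_steps, r0=0.10):
    """Linearly decay the padding radius from r0 (metres) to 0 over num_steps."""
    return r0 * max(0.0, 1.0 - step / num_steps)


def padded_sdf(phi, r):
    """Return a function that is zero on the radius-r level set {x | phi(x) = r}."""
    return lambda x: phi(x) - r


def box_sdf(x, half=(0.05, 0.05, 0.05)):
    """Exact SDF of an axis-aligned box centred at the origin (half-extents in m)."""
    d = [abs(xi) - hi for xi, hi in zip(x, half)]
    outside = math.sqrt(sum(max(di, 0.0) ** 2 for di in d))
    inside = min(max(d), 0.0)
    return outside + inside


# At step 0 the padded surface sits 10 cm outside the box and its edges are
# rounded: this point is 10 cm from a box edge, so it lies on the padded surface.
coarse = padded_sdf(box_sdf, padding_radius(0, 100))
on_rounded_edge = coarse([0.11, 0.13, 0.0])

# By the final step the padding has vanished and the true surface is recovered.
fine = padded_sdf(box_sdf, padding_radius(100, 100))
on_true_face = fine([0.05, 0.0, 0.0])
```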

3.3 Contact dynamics with leaky gradient

Contact forces. We use a primal (penalty-based) formulation of contact forces, which allows us to compute derivatives with autodiff [baydin2018automatic] and keep a consistent memory footprint. For a given point $x \in X_{h}$ , the resultant contact force is

$f_{c}$	$= f_{n} + f_{t}$	(3)
$f_{n}$	$= k_{n} \min(ϕ(x), 0) \nabla ϕ(x)$	(4)
$f_{t}$	$= - \min(k_{f} \| v_{t} \|, μ \| f_{n} \|) v_{t},$	(5)

where $f_{n}$ is the normal component, proportional to penetration depth $ϕ (x)$ , and $f_{t}$ is the frictional component, computed using a Coulomb friction model. $k_{n}$ and $k_{f}$ are the normal and frictional stiffness coefficients, respectively, $μ$ is the friction coefficient, and $v_{t}$ is the component of relative velocity between hand and object at the contact point $x$ that is tangent to the contact normal $\nabla ϕ (x)$ .
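
The penalty force at a single hand point can be sketched directly from equations (3) to (5). The snippet transcribes the equations as written; the stiffness and friction values are placeholders chosen for the example, not values from the paper.

```python
import math


def norm(a):
    """Euclidean norm of a vector given as a list."""
    return math.sqrt(sum(ai * ai for ai in a))


def contact_force(phi, grad_phi, v_rel, k_n=1000.0, k_f=10.0, mu=0.5):
    """Contact force f_c = f_n + f_t at one hand surface point.

    phi      : SDF value at the point (negative when penetrating)
    grad_phi : unit SDF gradient at the point (contact normal)
    v_rel    : relative hand-object velocity at the point
    """
    # Eq. (4): normal force, non-zero only when the point penetrates (phi < 0).
    depth = min(phi, 0.0)
    f_n = [k_n * depth * g for g in grad_phi]

    # Tangential part of the relative velocity (normal component removed).
    v_dot_n = sum(v * g for v, g in zip(v_rel, grad_phi))
    v_t = [v - v_dot_n * g for v, g in zip(v_rel, grad_phi)]

    # Eq. (5), transcribed as written: the scale is capped by mu * |f_n|.
    scale = min(k_f * norm(v_t), mu * norm(f_n))
    f_t = [-scale * v for v in v_t]

    # Eq. (3): total contact force.
    return [a + b for a, b in zip(f_n, f_t)]


normal = [0.0, 0.0, 1.0]
active = contact_force(-0.01, normal, [0.1, 0.0, -0.2])   # 1 cm penetration
inactive = contact_force(0.02, normal, [0.1, 0.0, -0.2])  # 2 cm away
```

For the inactive point both terms vanish, since the friction scale is capped by $μ \| f_{n} \| = 0$.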

Leaky gradients. At any one time, most possible hand-object contacts are inactive – a property we refer to as contact sparsity. Since an infinitesimal perturbation to hand pose will not activate these contacts (i.e., will not make them touch the object), the gradient of their contact forces with respect to hand pose is zero, i.e., $\partial f_{c} / \partial q = \partial f_{c} / \partial \dot{q} = \partial f_{c} / \partial \ddot{q} = 0$. When the hand is not touching the object, all contacts are inactive and gradient-based optimization can get stuck in a plateau. We work around this by computing a leaky gradient for the normal force term. From equation (4), we have $\frac{\partial \| f_{n} \|}{\partial q} = 0$ if $ϕ(x) \geq 0$, but we instead set

(6)

where $α \in [0, 1]$ controls how much gradient leaks through the minimum. We set $α = 0.1$ in our experiments.
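
One way to realize such a leaky gradient is sketched below. It is an assumption-laden illustration and not necessarily the exact form of equation (6): the forward value of the normal force magnitude is left exact, while the derivative used for optimization is scaled by $α$, rather than zeroed, for inactive contacts. The sketch assumes $\| \nabla ϕ \| = 1$, so that $\| f_{n} \| = k_{n} \max(-ϕ, 0)$, and takes $\partial ϕ / \partial q$ as a given input.

```python
def normal_force_magnitude_leaky(phi, dphi_dq, k_n=1000.0, alpha=0.1):
    """Return |f_n| and a leaky surrogate for d|f_n|/dq.

    Forward pass: |f_n| = k_n * max(-phi, 0), exact for a unit SDF gradient.
    Backward pass (assumed form): the true derivative w.r.t. phi is -k_n when
    phi < 0 and 0 otherwise; the zero is replaced by -alpha * k_n so that
    inactive contacts still contribute a (biased) gradient.
    """
    magnitude = k_n * max(-phi, 0.0)
    dmag_dphi = -k_n if phi < 0.0 else -alpha * k_n
    grad = [dmag_dphi * d for d in dphi_dq]
    return magnitude, grad


# A hand point hovering 2 cm above the surface; moving along +z increases phi.
dphi_dq = [0.0, 0.0, 1.0]
mag_inactive, grad_inactive = normal_force_magnitude_leaky(0.02, dphi_dq)

# The same point after penetrating by 1 cm.
mag_active, grad_active = normal_force_magnitude_leaky(-0.01, dphi_dq)
```

In the inactive case the force is still zero, but the surrogate gradient is non-zero and indicates that moving the point toward the surface would eventually produce force, which is the signal needed to create new contacts.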

3.4 Grasping metric and problem relaxation

Simulation setup. To compute the grasp metric, we simulate the rigid-body interaction between a hand and an object. The hand is kinematic (does not react to contact forces), while the object is dynamic (thus subject to contact forces). The simulator state is given by the configuration vector $q$ and its first and second time derivatives $\dot{q}, \ddot{q}$. $q$ is composed of hand and object components $q = (q_{h}, q_{o})$ with corresponding spatial coordinates $u = (u_{h}, u_{o})$. The object is always initialized with the same configuration $q_{o}^{(0)}$: unrotated and untranslated at the origin. Given a state $q^{(t)}$, following equations (1) and (2), our simulator uses a semi-implicit Euler update scheme to compute the subsequent state $q^{(t + 1)}$.

Computing the grasp metric by simulation. To measure the quality of a candidate grasp $q_{h}$, we test its ability to withstand forces applied to the object. Given an initial state $q^{(0)} = (q_{h}, q_{o}^{(0)})$, we apply an initial velocity $\dot{q}_{o}^{(0)}$ to the object. The hand is kept static, with $\dot{u}_{h} = 0$. We run forward simulation to compute the object’s final velocity $\dot{u}_{o}^{(T)}$. A stable grasp will produce contact forces that resist the object velocity, so lower $\| \dot{u}_{o}^{(T)} \|$ indicates a more stable grasp. In fact, a stable grasp should be able to resist object velocities in any direction, so we perform multiple simulations with different initial velocities and average the results. This suggests the following basic grasp metric: for each of $M$ simulations, indexed by $m \in \{1, \dots, M\}$, we set a different initial object velocity, run the simulation, and record $L_{m} = \| \dot{u}_{o}^{(T)} \|$. Then, averaging, we have

L_{grasp} = M \sum m = 1 \frac{L_{m}}{M} .

(7)

Since $L_{grasp}$ is a differentiable function of the output of a differentiable simulation, it is itself differentiable with respect to $q_{h}$ , and we can compute loss gradients $\partial L_{grasp} / \partial q_{h}$ and use gradient-based optimization to find stable grasps.
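The averaging in equation (7) can be sketched as follows. The `toy_simulate` function is a hypothetical stand-in for the differentiable simulator (it simply removes a fixed fraction of the initial velocity) and is only there to make the example runnable.

```python
import math
from typing import Callable, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def grasp_metric(simulate: Callable[[Vec3], Vec3], initial_velocities: Sequence[Vec3]) -> float:
    """Average final object speed over M simulations (lower means more stable)."""
    losses = []
    for v0 in initial_velocities:
        v_final = simulate(v0)  # final object velocity after forward simulation
        losses.append(math.sqrt(sum(c * c for c in v_final)))
    return sum(losses) / len(losses)


def toy_simulate(v0: Vec3, damping: float = 0.5) -> Vec3:
    """Hypothetical stand-in: contact removes a fixed fraction of the initial velocity."""
    return tuple((1.0 - damping) * c for c in v0)
```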

Unfortunately, in practice, this basic procedure does not succeed. As explained at the beginning of Section 3, the grasp optimization landscape is extremely rugged, with sharp and narrow ridges, peaks, and valleys. Our leaky contact force gradients (see Section 3.3) provide some help in escaping plateaus, but once the hand is in contact with the object, small changes in hand configuration still cause large jumps in contact forces by making/breaking contacts and shifting contact normals. Differentiability alone does not resolve this issue.

Problem relaxation. Inspired by Contact-Invariant Optimization [mordatch2012discovery, mordatch2012contact], we relax the problem, making it more forgiving to gradient-based optimization. Specifically, we introduce additional desired or prescribed contact force variables. This allows us to model physics violations as a cost rather than a constraint. For each surface point on the hand $x^{i} \in X_{h}$ , we introduce a 6-dimensional vector $\hat{f}_{c}^{i}$ representing the desired hand-object contact wrench arising from contact at $x^{i}$ .

Our overall loss now has two components. The task loss $L_{task} (\hat{f}_{c})$ measures whether the prescribed forces $\hat{f}_{c}$ successfully resist initial object velocities. This is computed identically to the previous $L_{grasp}$ , except that instead of computing contact forces according to equations (3), (4) and (5), contact forces are simply set equal to $\hat{f}_{c}$ . The physics violation loss $L_{phys} (q_{h}, \hat{f}_{c})$ measures whether the hand configuration $q_{h}$ actually provides the desired forces $\hat{f}_{c}$ . It is computed as

L_{phys} (q_{h}, \hat{f}_{c}) = ∥ f_{c} (q_{h}) - \hat{f}_{c} ∥,

(8)

where $f_{c} (q_{h})$ is the contact force arising from the hand pose $q_{h}$ according to equations (3), (4) and (5).

Intuitively, minimizing these losses corresponds to finding a set of desired forces (as close as possible to the actual contact forces arising from the current hand configuration) that complete the task, and finding a hand configuration that provides those forces. We expect problem formulations derived from and inspired by Contact-Invariant Optimization [mordatch2012discovery, mordatch2012contact] to be a fruitful area of research as they are made newly attractive by advances in differentiable simulation.
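A minimal sketch of the physics violation loss in equation (8) is given below. It assumes the norm is taken over the wrenches of all contact points stacked into one vector; the reduction over contacts is not spelled out in this section, so that choice and the function name are assumptions.

```python
import math
from typing import Sequence

Wrench = Sequence[float]  # 6 numbers: force (3) followed by torque (3)


def physics_violation_loss(actual: Sequence[Wrench], desired: Sequence[Wrench]) -> float:
    """L_phys: Euclidean norm of the stacked difference between actual and desired wrenches."""
    squared = 0.0
    for f, f_hat in zip(actual, desired):
        squared += sum((a - b) ** 2 for a, b in zip(f, f_hat))
    return math.sqrt(squared)
```

The loss is zero exactly when the hand configuration produces the prescribed wrenches, which is the condition under which the relaxed problem coincides with the original one.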

Additional heuristic losses. We include some additional losses that improve the plausibility of resulting grasps. Most hand models have defined joint range limits. Let $q_{h}^{low}$ and $q_{h}^{up}$ be the lower and upper joint limits respectively. $L_{range}$ encourages hand joints to be near the middle of their ranges. $L_{limit}$ penalizes hand joints outside of their range. $L_{inter}$ penalizes self-intersections of the hand.

$L_{range} (q_{h})$	$= ∥ q_{h} - \frac{q_{h}^{up} + q_{h}^{low}}{2} ∥$	(9)
$L_{limit} (q_{h})$	$= \max (q_{h} - q_{h}^{up}, 0) + \max (q_{h}^{low} - q_{h}, 0)$	(10)
$L_{inter} (q_{h})$	$= ∥ f_{link} ∥ .$	(11)

The hand is kinematic, so it is not subject to contact forces. However, we still compute forces arising from contact between the hand links, for use in this loss term, as $f_{link}$ . We ignore contacts between neighbouring links in the chain. For the purpose of computing $f_{link}$ , we represent each hand link as both a point set and an SDF and compute $f_{link}$ according to equations (3), (4), and (5).
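Equations (9) and (10) can be sketched in a few lines. Equation (10) is written per joint, so the sketch sums the per-joint violations into a scalar; that reduction is an assumption, as are the function names.

```python
import math
from typing import Sequence


def range_loss(q: Sequence[float], q_low: Sequence[float], q_up: Sequence[float]) -> float:
    """L_range: distance of the joint vector from the midpoint of the joint ranges."""
    return math.sqrt(sum((x - 0.5 * (u + l)) ** 2 for x, l, u in zip(q, q_low, q_up)))


def limit_loss(q: Sequence[float], q_low: Sequence[float], q_up: Sequence[float]) -> float:
    """L_limit: total amount by which joints exceed their upper or lower limits."""
    return sum(max(x - u, 0.0) + max(l - x, 0.0) for x, l, u in zip(q, q_low, q_up))
```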

3.5 Optimization

We use the Modified Differential Multiplier Method [platt1987constrained], treating $L_{task} < C_{task}$ and $L_{limit} < C_{limit}$ as constraints, while minimizing $L_{phys}$ , $L_{range}$ and $L_{inter}$ . We update our parameters $\hat{f}_{c}$ and $q_{h}$ using the Adamax [kingma2014adam] optimizer. Details of learning rates, $C_{task}$ and $C_{limit}$ can be found in Appendix A.7.
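To give a feel for the Modified Differential Multiplier Method, here is a toy sketch on a one-dimensional problem: minimize $x^{2}$ subject to $x - 1 = 0$. It uses an equality constraint and plain gradient steps rather than the inequality constraints and Adamax updates described above, so it illustrates the mechanism only; the problem, step size, and function name are assumptions.

```python
from typing import Tuple


def mdmm_minimize(steps: int = 2000, lr: float = 0.05, damping: float = 1.0) -> Tuple[float, float]:
    """Toy problem: minimize f(x) = x**2 subject to g(x) = x - 1 = 0."""
    x, lam = 0.0, 0.0
    for _ in range(steps):
        g = x - 1.0
        # Gradient of f + lam * g + (damping / 2) * g**2 with respect to x.
        grad_x = 2.0 * x + lam + damping * g
        x_new = x - lr * grad_x  # gradient descent on the parameter
        lam = lam + lr * g       # gradient ascent on the multiplier
        x = x_new
    return x, lam
```

The parameter descends while the multiplier ascends, and the damping term stabilizes the resulting saddle-point dynamics. On this toy problem the iterates approach $x = 1$ with multiplier $-2$.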

4 Experiments

Our evaluations and analysis of Grasp’D answer the following questions:

How well does Grasp’D perform compared to analytic methods? (Section 4.2)
Can Grasp’D generalize to objects reconstructed from real-world RGBD images? (Section 4.3)
How much do coarse-to-fine SDF collision and the problem relaxation contribute to final performance? (Section 4.4)

4.1 Experimental setup

For each experiment, we synthesize grasps following the procedure described in Section 3. We compute the metric with $M = 3$ simulations, each setting a different initial velocity on the object: $(0, 0, 0)$ , $(0.01, 0.01, 0.01)$ or $(- 0.01, - 0.01, - 0.01)$ m/s. Each simulation is run for a single timestep of length $1 \times 10^{- 5}$ s.

Evaluation metrics. We compute evaluation metrics that measure contact and grasp stability. Following [hasson2019learning], we report contact area (CA), intersection volume (IV, the volume of hand-object interpenetration), and the ratio between contact area and intersection volume ( $\frac{C A}{I V}$ ). For stability, we compute two analytic measures, the Ferrari-Canny metric (epsilon, $ϵ$ ) [ferrari1992planning] and the volume metric (Vol), and one simulated measure: the simulation displacement (SD) metric introduced in [hasson2019learning].

Hand parameterization. We use a differentiable PyTorch layer [hasson2019learning] to compute the 773 vertices of the MANO hand [MANO:SIGGRAPHASIA:2017] model. The input is a set of weights for principal components extracted from the MANO dataset of human scans [MANO:SIGGRAPHASIA:2017]. We find that this PCA parameterization provides a useful prior for human-like hand poses. We use the maximum number of principal components (44).

Method	CA $↑$	IV $↓$	$\frac{C A}{I V}$ $↑$	$ϵ$ $↑$	Vol $↑$	SD $↓$
Scale (Unit)	${cm}^{2}$	${cm}^{3}$	${cm}^{- 1}$	$\times 10^{- 1}$	$\times 10^{1}$	cm
ObMan [hasson2019learning] (top2)	$9.4$	$1.28$	$7.37$	$4.70$	$1.36$	$1.95$
ObMan [hasson2019learning] (top5)	$7.8$	1.05	$7.37$	$4.52$	$1.36$	$2.22$
Grasp’D (top2)	$43.0$	$5.70$	7.55	$5.01$	$1.44$	0.59
Grasp’D (top5)	41.4	$5.48$	7.55	5.02	1.46	$1.04$

Table 2: Experimental results. We synthesize MANO hand grasps for ShapeNet objects. Our grasps achieve over $4 \times$ denser contact (as measured by contact surface area, CA) than an analytic synthesis baseline [hasson2019learning], leading to significantly higher grasp stability (about $3 \times$ lower simulation displacement, SD, for top-2 grasps). Higher contact does result in higher interpenetration, but we keep a similar ratio of contact area to interpenetration volume.

4.2 Grasp synthesis with ShapeNet models

We compare to baseline grasps from the ObMan [hasson2019learning] dataset, whose grasps were generated with the GraspIt! [miller2004graspit] simulator using an analytic metric. We report metrics over the top-2 and top-5 grasps per scaled object, with ranking decided by simulation displacement for our method and by ObMan’s heuristic measure (detailed in Appendix C.2 of [hasson2019learning]) for theirs. Further details are in Appendix A.6.

Data. We evaluate our approach to grasp synthesis by generating grasps with the MANO human hand [MANO:SIGGRAPHASIA:2017] model for 57 ShapeNet [chang2015shapenet] objects that span 8 categories (bottles, bowls, cameras, cans, cellphones, jars, knives, remote controls), and are each considered at 5 different scales (as in ObMan). See Appendix A for details of mesh pre-processing, initialization, simulation, and optimization.

Results. Results are presented in Table 2. Grasps generated by our method (both top-2 and top-5) have a contact area of around $42 {cm}^{2}$ . This is higher than the $\sim 20 {cm}^{2}$ area achieved with fingertip-only grasps [brahmbhatt2019contactdb] and more than $4 \times$ higher than grasps from the ObMan dataset (top-2 or top-5). These contact-rich grasps achieve modest improvements in analytic measures of stability, and a significant reduction in simulation displacement ( $\sim 3 \times$ for top-2 grasps). Visualizations of our generated grasps in Figure 1 confirm that these grasps achieve larger areas of contact by closely conforming to object surface geometry, whereas the analytically generated grasps largely make use of fingertip contact only. These higher contact grasps have accordingly higher interpenetration, but the ratio between contact area and intersection volume is similar to the ObMan baseline.

Figure 4: Grasp synthesis from RGB-D. We use RGB-D captures from the YCB dataset [calli2017yale] (top row) to reconstruct object models from which we synthesize grasps (bottom row). Our method can synthesize plausible grasps not just from ground truth object models, but also from imperfect reconstructions.

4.3 Grasp synthesis from RGB-D input of unknown objects

Setting. One possible application of our method is to direct grasp prediction from RGB-D images by simulation on reconstructed object models. Currently, our method is too slow to be used online (about 5 minutes per grasp), but as simulation speeds increase and recent works in implicit fields push reconstruction accuracy higher and higher, we believe that grasp prediction by simulation models will become increasingly viable. To validate the plausibility of using our method with reconstructed object models, we present results from running our system on meshes reconstructed from RGB-D inputs. We synthesize grasps based on RGB-D (with camera pose) inputs from the YCB object dataset [calli2017yale]. In addition to reconstructed meshes, the YCB dataset provides the original RGB-D captures the meshes are based on. Each object was captured from 5 different cameras at 120 different angles for a total of 600 images. To confirm that our method can work with reconstructions done under more realistic assumptions, we limit our reconstructions to using 5 different angles from 3 cameras (2.5% of captures).

Data. For a subset of the YCB objects, we generate Poisson surface reconstructions and use our method to synthesize MANO hand grasps. Since the inputs are from cameras with a known pose, the object reconstruction is in the world frame. Details are in Appendix A.4.

Results. Our results confirm the viability of using simulation to synthesize grasps on reconstructed object models. Qualitative results are presented in Figure 4; additional results can be found in Appendix D. Although synthesis does not perform as well as with ground-truth models, plausible human grasps are discovered for many objects and the grasps appear well-aligned with the real-world object poses. Future work could take advantage of learning-based reconstruction methods to achieve grasp synthesis with fewer input images.

Method	CA $↑$	IV $↓$	$\frac{C A}{I V}$ $↑$	$ϵ$ $↑$	Vol $↑$	SD $↓$
Scale/Unit	${cm}^{2}$	${cm}^{3}$	${cm}^{- 1}$	$\times 10^{- 1}$	$\times 10^{1}$	cm
Grasp’D	$42.6$	$2.83$	$15.1$	$2.38$	$20.6$	$0.41$
Grasp’D w/o coarse-to-fine	$43.2$	$2.84$	$15.2$	$2.37$	$20.7$	$0.55$
Grasp’D w/o problem relaxation	$6.1$	$0.40$	$15.2$	$0.52$	$4.0$	$3.82$

Table 3: Ablation study. We validate our design choices with an ablation study. Our relaxed problem formulation has a large positive impact on all metrics. The quantitative impact of coarse-to-fine smoothing is more limited, but we observe a qualitative difference in grasps generated with and without smoothing.

4.4 Ablation study

We investigate the impact of our coarse-to-fine smoothing (Section 3.2), leaky contact force gradients (Section 3.3), and relaxed problem formulation (Section 3.4). We generate MANO hand grasps on 21 objects from the YCB dataset [calli2017yale]. Grasp’D w/o coarse-to-fine does not pad or smooth the object. Grasp’D w/o problem relaxation attempts to solve the problem without introducing additional force variables or a relaxed objective. This amounts to the “basic procedure” described in Section 3.4, i.e., directly optimize over hand pose to minimize $L_{grasp}$ and the heuristic losses.

Results. Table 3 presents the results. Our relaxed problem formulation is key to our method’s success, and without it, performance greatly degrades by all measures, with discovered grasps creating very little contact (low contact area and intersection volume). Coarse-to-fine smoothing has a modest impact, with all metrics comparable with or without smoothing, except for simulation displacement, which is about 34% higher without smoothing (0.55 cm versus 0.41 cm in Table 3). We did not include a variant without leaky gradients, since this variant would never make contact with the object (if the hand is not touching the object at initialization, there will be no gradient to follow and optimization will immediately be stuck in a plateau).

5 Conclusions

We presented a simulation-based grasp synthesis pipeline capable of generating large datasets of plausible, high-contact grasps. By being differentiable, our simulator is amenable to gradient-based optimization, allowing us to produce high-quality grasps, even for multi-finger grippers, while scaling to thousands of dense contacts. Our experiments have shown that we outperform the existing classical grasping algorithm both quantitatively and qualitatively. Our approach is compatible with PyTorch and can be easily integrated into existing pipelines. More importantly, the produced grasps can directly benefit any vision pipeline that learns grasp prediction from synthetic data.

Supplementary Material


Overview

In this supplementary material, we provide additional details and results to complement the main paper. Specifically:

We describe the details of our implementation and experimental setting (Appendix A).
We provide additional results of our method applied to the YCB dataset [calli2017yale] with both a human MANO hand model [MANO:SIGGRAPHASIA:2017] and a robotic Allegro hand model (Appendix B).
We provide visualizations of optimization trajectories for MANO hand grasps of YCB and ShapeNet objects, which show how grasps improve as optimization progresses (Appendix C).
We provide additional results for the validation of grasp synthesis with RGB-D reconstruction presented in Section 4.3 of the main paper (Appendix D).

Appendix A Details of implementation and experiments

A.1 Dataset listings

Table 4 - ShapeNet object listing (for the main experiment in Section 4.2).
Table 5 - YCB object listing (for the RGB-D experiment in Section 4.3).
Table 6 - YCB object listing (for the ablation experiment in Section 4.4).

Category ID	Shape ID
2876657	1071fa4cddb2da2fc8724d5673a063a6
2876657	109d55a137c042f5760315ac3bf2c13e
2876657	10dff3c43200a7a7119862dbccbaa609
2876657	10f709cecfbb8d59c2536abb1e8e5eab
2876657	114509277e76e413c8724d5673a063a6
2876657	1349b2169a97a0ff54e1b6f41fdd78a
2876657	134c723696216addedee8d59893c8633
2880940	12ddb18397a816c8948bef6886fb4ac
2880940	13e879cb517784a63a4b07a265cff347
2880940	154ab09c67b9d04fb4971a63df4b1d36
2880940	18529eba21e4be8b5cc4957a8e7226be
2880940	188281000adddc9977981b941eb4f5d1
2880940	1a0a2715462499fbf9029695a3277412
2880940	1b4d7803a3298f8477bdcb8816a3fac9
2942699	1298634053ad50d36d07c55cf995503e
2942699	147183af1ba4e97b8a94168388287ad5
2942699	15e72ce7a8a328d1fd9cfa6c7f5305bc
2942699	17a010f0ade4d1fd83a3e53900c6cbba
2942699	1967344f80da29618d342172201b8d8c
2942699	1ab3abb5c090d9b68e940c4e64a94e1e
2942699	1cc93f96ad5e16a85d3f270c1c35f1c7
2946921	100c5aee62f1c9b9f54f8416555967
2946921	10c9a321485711a88051229d056d81db
2946921	11c785813efc4b8630eaaf40a8a562c1
2946921	129880fda38f3f2ba1ab68e159bfb347
2946921	147901ede668deb7d8d848cc867b0bc8
2946921	17ef524ca4e382dd9d2ad28276314523
2946921	19fa6044dd31aa8e9487fa707cec1558
2992529	1101db09207b39c244f01fc4278d10c1
2992529	1105c21040f11b4aec5c418afd946fad
2992529	112cdf6f3466e35fa36266c295c27a25
2992529	113303df7880cd71226bc3b9ce9ff2a1
2992529	11e925e3ea180b583388c2584b2f0f90
2992529	11f7613cae7d973fd7e59c29eb25f02f
2992529	128bb46234d7250721844676433a0aca
3593526	10af6bdfd126209faaf0ad030fc37d94
3593526	1168c9e9db2c1c5066639e628d6519b6
3593526	117843347cde5b502b18a5129db1b7d0
3593526	1252b0fc818969ebca2ed12df13a916a
3593526	12d643221a3edaa4ab361b6be63163da
3593526	12ec19e85b31e274725f67267e31c89
3593526	133dc38c1316d9515dc3653f8341633a
3624134	102982a2159226c2cc34b900bb2492e
3624134	118141f7d22bc46eaeb7b7328341827a
3624134	11c987c9a34457e48c2fa4fb6bd3e62
3624134	135f75a374a1e22c46cb8dd27ae7fcd
3624134	13bf5728b1f3b6cfadd1691b2083e9e7
3624134	13d183a44f143ca8c842482418ab083d
3624134	1460eded8006b10139c78a1e40e247f3
4074963	1941c37c6db30e481ef53acb6e05e27a
4074963	1aa78ce410bdbcd92530f02db7e9157e
4074963	2053bdd83749adcc1e5c09d9fe5c0c76
4074963	226078581cd4efd755c5278938766a05
4074963	240456647fbca47396d8609ec76a915b
4074963	25182f6e03375c9e7b6fd5468f603b31
4074963	259539bd48513e3410d32c800df6e3dd

Table 4: For experiment 1 (comparison to ObMan) in Section 4.2 of the main paper, we use the following ShapeNet objects.

Object ID	Object Name
001	chips_can
002	master_chef_can
005	tomato_soup_can
006	mustard_bottle
008	pudding_box
010	potted_meat_can
021	bleach_cleanser
035	power_drill

Table 5: For experiment 2 (RGBD reconstruction) we use the following YCB objects.

Object ID	Object Name
002	master_chef_can
003	cracker_box
004	sugar_box
005	tomato_soup_can
006	mustard_bottle
007	tuna_fish_can
008	pudding_box
009	gelatin_box
010	potted_meat_can
011	banana
019	pitcher_base
021	bleach_cleanser
024	bowl
025	mug
035	power_drill
036	wood_block
037	scissors
040	large_marker
051	large_clamp
052	extra_large_clamp
061	foam_brick

Table 6: For experiment 3 (ablation study) in Section 4.4 of the main paper, we use the following YCB objects.

A.2 Initialization and smoothing schedule

A.2.1 Initialization.

Since our grasp synthesis pipeline relies on gradient-based optimization, the final result depends on how the parameters are initialized, i.e., different initial hand poses will recover different final grasps. This is a useful quality in that it allows us to sample a variety of grasps for each object by sampling different starting poses. The force variables $\hat{f}_{c}$ are always initialized to zero. We employ a simple heuristic (adapted from [brahmbhatt2019contactgrasp]) to initialize the hand pose $q_{h}^{(0)}$ . We set all hand joints to their fully open position. To find an initial rotation and position for the hand base link, we uniformly sample an approach point $a$ on the object surface and a roll angle $θ$ around the approach vector. We use an approach vector opposing the object surface normal at the approach point, and set the hand rotation such that the palm’s normal is aligned with the approach vector. Finally, we apply the sampled roll $θ$ around the approach vector. We set the hand position so that the palm’s center is at a $10$ cm distance from the approach point along the approach vector.
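The geometric part of this initialization heuristic can be sketched as below. The sketch assumes the palm starts outside the object, so the 10 cm offset is taken against the approach direction (that is, along the outward surface normal). Sampling the approach point on the mesh and building the full rotation are left out, and the function name is hypothetical.

```python
import math
import random
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


def initial_palm_pose(
    point: Vec3, normal: Vec3, offset_m: float = 0.10, rng: Optional[random.Random] = None
) -> Tuple[Vec3, Vec3, float]:
    """Return (palm position, approach vector, roll angle) for one sampled approach point."""
    rng = rng or random.Random()
    length = math.sqrt(sum(c * c for c in normal))
    unit_normal = tuple(c / length for c in normal)
    # The approach vector opposes the outward surface normal.
    approach = tuple(-c for c in unit_normal)
    # Assumption: back the palm off from the surface, against the approach direction.
    position = tuple(p - offset_m * a for p, a in zip(point, approach))
    roll = rng.uniform(0.0, 2.0 * math.pi)  # roll about the approach vector
    return position, approach, roll
```

Calling this function repeatedly with different sampled points and rolls yields different starting poses, which is how a variety of final grasps is obtained for one object.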

A.2.2 Coarse-to-fine smoothing schedule.

We set the initial value for the coarse-to-fine smoothing radius $r$ to the distance between the object and the closest point on the hand, minus $1$ cm. The radius is then decreased to $0$ on a linear schedule over the first 5,000 steps of a 7,000 step optimization and remains at $0$ for the final 2,000 steps. The early steps of the optimization find a rough pose for the hand (where on the object to grasp and an approximate finger configuration) and the later steps optimize over fine-grained geometry, allowing the discovery of grasps that conform closely to detailed surface geometry.
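The schedule described above amounts to a linear ramp followed by a constant. A minimal sketch, with a hypothetical function name and the initial radius passed in by the caller:

```python
def smoothing_radius(step: int, r_init: float, decay_steps: int = 5000) -> float:
    """Linear decay from r_init to 0 over decay_steps, then 0 for the remaining steps."""
    if step >= decay_steps:
        return 0.0
    return r_init * (1.0 - step / decay_steps)
```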

A.3 Mesh processing

We use a discretized SDF representation as described in Section 3.2 of the main paper. Computing the SDF involves some preprocessing. For the experiments in Sections 4.2 and 4.4 (on complete ShapeNet and YCB meshes respectively), the input is a mesh from the relevant dataset. For the RGB-D reconstruction experiment in Section 4.3, the input is a reconstructed mesh (see Appendix A.4 for details of the reconstruction pipeline). To compute the sign of the SDF at a given query point, we must determine whether that point is inside or outside the object. This is more straightforward if the mesh consists of a single closed surface, so we first run ManifoldPlus [huang2020manifoldplus] to compute a watertight mesh. Next, a ( $256 \times 256 \times 256$ ) grid of points is evenly sampled over the mesh bounding box (padded by $1$ cm) and the signed distance of each point to the mesh is computed using libigl [jacobson2017libigl].

A.4 Reconstruction pipeline

We describe the RGB-D reconstruction pipeline used in Section 4.3. The YCB object dataset includes RGB-D captures for each object. The object is placed on a spinning platter surrounded by 5 cameras and is captured at each of 120 different angles as the platter is rotated in 3 degree increments. We take 15 of these depth images (captures from the first, third and fifth camera at 5 angles in $72$ degree increments). We run the code provided alongside the YCB dataset in order to register the depth maps and combine them into a single world frame point cloud. We create a Poisson reconstruction [kazhdan2006poisson] of this point cloud using the Open3D library [Zhou2018] with a depth of $5$ . The resulting mesh is still incomplete because the bottom of the object is not visible (since it is the contact surface between the table and the object). We use PyMeshFix [sullivan2019pyvista] to complete this and any other remaining holes in the mesh.
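The capture subsampling can be made concrete with a small sketch that enumerates the selected (camera, angle) pairs. The camera numbering 1 to 5 and the function name are assumptions; the mapping to dataset file names is not specified here and is left out.

```python
from typing import List, Sequence, Tuple


def select_captures(
    cameras: Sequence[int] = (1, 3, 5), n_views: int = 5, angle_step_deg: int = 72
) -> List[Tuple[int, int]]:
    """Enumerate (camera index, turntable angle in degrees) for the captures that are kept."""
    return [(cam, k * angle_step_deg) for cam in cameras for k in range(n_views)]
```

With the defaults this gives 15 captures out of the 600 available (5 cameras at 120 angles), i.e. 2.5% of the data, and every selected angle is a multiple of the 3 degree turntable increment.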

A.5 Simulation details

We run each simulation for a single timestep of length $1 \times 10^{- 5}$ seconds. For the MANO hand model, all $773$ vertices are used as contact locations. For the Allegro hand model, we sample $\sim$ 3000 points on the surface to use as contact locations. In all experiments we set the normal stiffness to $k_{n} = 1 \times 10^{6}$ , frictional stiffness to $k_{f} = 1 \times 10^{8}$ , and the friction coefficient to $μ = 0.8$ . For the leaky gradient (described in Section 3.3 of the main paper) we set the proportion of gradient that leaks through non-colliding contact forces to $α = 0.1$ . Note that the above applies to simulation during grasp optimization. When we compute simulation displacement for evaluation purposes, we do not use our own simulator, but instead use PyBullet [coumans2021] (details in Appendix A.6).

A.6 Evaluation details

We evaluate grasps in terms of their contact patterns and stability.

A.6.1 Interpenetration volume

Interpenetration volume is the volume of the intersection between the hand and the object. Lower values are better (since in reality the hand cannot penetrate the hard object). We compute this by voxelizing the hand (with 1 mm resolution) and querying the object’s SDF at each voxel position to decide if each hand voxel is overlapping the object or not.
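The voxel-counting computation can be sketched as follows. The sphere SDF is a hypothetical stand-in for the object's signed distance function, and the hand voxel centers are assumed to be given by a separate voxelization step.

```python
import math
from typing import Callable, Iterable, Tuple

Vec3 = Tuple[float, float, float]


def sphere_sdf(p: Vec3, radius: float = 1.0) -> float:
    """Hypothetical stand-in object: signed distance to an origin-centered sphere."""
    return math.sqrt(sum(c * c for c in p)) - radius


def interpenetration_volume(
    hand_voxel_centers: Iterable[Vec3], object_sdf: Callable[[Vec3], float], voxel_size: float
) -> float:
    """Count hand voxels whose center lies inside the object and multiply by the voxel volume."""
    inside = sum(1 for p in hand_voxel_centers if object_sdf(p) < 0.0)
    return inside * voxel_size ** 3
```

With `voxel_size` given in centimetres (0.1 for the 1 mm resolution mentioned above), the result is in cubic centimetres.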

A.6.2 Contact area

Contact area is the area of surface contact (in ${cm}^{2}$ ) between the hand and the object. This is computed similarly to interpenetration volume, except that only the hand surface is voxelized (i.e., the hand is treated as an empty shell, not a solid volume).

A.6.3 Contact area to interpenetration volume ratio.

Interpenetration can be avoided by simply avoiding contact with the object entirely, so there is a trade off between interpenetration volume and the other metrics. To capture the amount of interpenetration, conditional on the amount of contact, we report the ratio of contact area to interpenetration.

A.6.4 $ϵ$ (Ferrari-Canny) metric

The $ϵ$ (Ferrari-Canny) metric measures grasp stability using the magnitude of the smallest force that can break a grasp. A more stable grasp can withstand larger forces, so a larger force will be needed to break the grasp. This quantity is equivalent to the size of the largest origin-centered ball contained in the Grasp Wrench Space (GWS [ferrari1992planning]). The GWS is the space of wrenches the contacts induced by the grasp can withstand, assuming that the total hand-object wrench will be a linear combination of the wrenches at each contact with coefficients summing to $1$ . Under a Coulomb friction model, the possible wrenches at each contact are defined by a friction cone, which we approximate by a pyramid.
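The pyramid approximation of a friction cone can be sketched by generating its edge directions. The number of edges is not stated here, so the default of four is an assumption, as is the function name; the normal and tangents are assumed to be orthonormal.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def friction_pyramid_edges(
    normal: Vec3, tangent1: Vec3, tangent2: Vec3, mu: float, n_edges: int = 4
) -> List[Vec3]:
    """Edge forces of a pyramid inscribed in the Coulomb cone ||f_t|| <= mu * f_n."""
    edges = []
    for k in range(n_edges):
        theta = 2.0 * math.pi * k / n_edges
        c, s = math.cos(theta), math.sin(theta)
        # Unit normal component plus a tangential component of magnitude mu.
        edges.append(
            tuple(n + mu * (c * t1 + s * t2) for n, t1, t2 in zip(normal, tangent1, tangent2))
        )
    return edges
```

Each edge force, together with the torque it induces about the object center, contributes one wrench to the set whose convex hull approximates the GWS.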

A.6.5 Volume metric

The volume metric is an alternate measure of stability that considers all the possible forces a grasp can withstand (instead of just the magnitude of the smallest force that breaks the grasp). It is simply the volume of the GWS.

A.6.6 Simulation displacement

Simulation displacement is a simulation-based, rather than analytic, measure of stability. We use GANHand’s implementation [corona2020ganhand] of a simulation displacement metric in PyBullet [coumans2021] to measure grasp stability by checking how far the object is displaced from its initial pose when the grasp is applied. We use the default physics parameters provided by GANHand except for setting the friction coefficient to $1.2$ (instead of the seemingly high default of $3.0$ ). Whereas we train (i.e., optimize) our grasps in our own custom simulator, this metric is computed in a widely used third-party simulator (PyBullet), with a different collision detector, contact model and time stepping scheme. This avoids giving ourselves an unfair advantage by optimizing and evaluating with the same contact model and physics engine, to which the baselines we compare against did not have access.

A.6.7 True signed distances.

Whenever a metric relies on the object SDF (e.g., to compute contact forces or to determine if a voxel is intersecting the object or not), we compute that SDF with libigl [jacobson2017libigl] using the ground truth mesh instead of a discrete grid approximation of the SDF.

A.6.8 Evaluating on top $k$ grasps.

For our method, we report metrics for the top 2 and top 5 grasps (ordered by simulation displacement; details below) for each object. For the ObMan dataset, we report metrics for the top 2 and top 5 grasps for each object (ordered by their heuristic measure, described below). The ObMan generation procedure uses the GraspIt! simulator to synthesize grasps by optimizing (with simulated annealing) over an analytic metric. Many grasps for each object are generated by running about 70k annealing steps. The top 2 grasps are then selected according to a heuristic measure (see Appendix C.2 of [hasson2019learning]) which encourages palm and phalange contact. This heuristic was explicitly added to compensate for the bias of analytic synthesis towards fingertip-only grasps. To test our own method, we generate 10 grasps for each object, each using 7,000 optimization steps. We report metrics over the top 2 and top 5 grasps with the lowest simulation displacement.

A.7 Optimization details

We used the Adamax [zhang2018improved] optimizer to update the hand pose parameters $q_{h}^{(0)}$ (with a learning rate of $3 \times 10^{- 3}$ ) and force parameters $\hat{f}_{c}$ (with a learning rate of $1 \times 10^{- 2}$ ). Some objectives are more important than others, so they are treated as constraints to satisfy rather than costs to minimize. Specifically, we use the Modified Differential Multiplier Method [platt1987constrained], treating $L_{task} < C_{task}$ and $L_{limit} < C_{limit}$ as constraints, while minimizing $L_{phys}$ , $L_{range}$ and $L_{inter}$ . We set $C_{task} = 1 \times 10^{- 4}$ and $C_{limit} = 1 \times 10^{- 4}$ . Damping is set to $1.0$ for all constraints. During MANO hand experiments, we do not use the joint limit loss $L_{limit}$ or joint limit constraint, as these limits appear to be well-handled implicitly by the PCA parameterization. Similarly, we do not compute the self-intersection loss for the MANO hand, yet recover grasps with low self-intersection due to the hand parameterization. All losses are enabled for the Allegro hand.

A.8 Timing

On a mobile Nvidia RTX 2070, generating a MANO hand grasp for a YCB object (by taking 7,000 optimizer steps) takes about 5 minutes. The MANO hand has only 773 vertices, so the memory footprint of the simulation is limited and three grasps can be synthesized in parallel, reducing average grasp synthesis time to about 2 minutes. While not yet approaching realtime performance, this is comparable to the speed of analytic synthesis with the GraspIt! simulator [miller2004graspit], which takes around 5 minutes [corona2020ganhand] to synthesize a grasp when using the eigengrasp planner with simulated annealing as for the ObMan dataset [hasson2019learning].

Appendix B YCB results

We provide additional examples of applying our method to objects from the YCB dataset with both the Allegro robotic hand and the MANO human hand models (see Figures 5 and 6 respectively).

Appendix C Optimization trajectories

To visualize how grasps evolve as optimization progresses, we render hand poses at regular intervals throughout optimization trajectories for 10 grasps (5 for objects from YCB, 5 for objects from ShapeNet – see Figure 7 and Figure 8 respectively).

Appendix D Additional RGB-D results

We provide additional qualitative results of applying our method to objects reconstructed from RGB-D images in the YCB dataset. Figure 9 shows 3 synthesized grasps for each object visualized from 2 different viewpoints. We also provide quantitative results for our RGB-D experiment (Section 4.3 of the main paper) in Table 7.

Input	CA $↑$	IV $↓$	$\frac{C A}{I V}$ $↑$	SD $↓$
GT-Mesh	42.6	2.83	15.1	0.41
RGB-D	$25.46$	$7.82$	$3.26$	$5.68$

Table 7: RGB-D experiment quantitative results. Performance with reconstructions is poorer than with ground truth object models, but the resulting grasps are still visually plausible (see Figure 4 of the main paper, Figure 9 of the supplemental) and metrics are comparable to Grasping Field [karunratanakul2020grasping] and GANHand [corona2020ganhand]. Small errors in reconstruction may produce large errors in grasp synthesis, a drawback future work might address by optimizing over the reconstruction alongside the grasp.

Appendix E Training on synthesized data

Fine-tuning Grasping Field [karunratanakul2020grasping] with data generated by Grasp’D improves performance on unseen YCB objects more than fine-tuning with additional GraspIt! [miller2004graspit] data. Table 8 displays the result of fine-tuning a pre-trained Grasping Field network with additional data synthesized by either the GraspIt! simulator or Grasp’D. The network is first trained for 1400 epochs on the ObMan dataset [hasson2019learning] (of synthetic GraspIt! grasps) and then fine-tuned for 100 epochs on 1000 new grasps (of ShapeNet [chang2015shapenet] objects already included in the ObMan dataset) before final testing on 8 objects from the YCB set. Fine-tuning with Grasp’D data results in significantly higher-contact grasps. This comes with an increase in intersection volume, but the ratio of contact area to intersection is improved, as is the simulation displacement.

Data source	CA $↑$	IV $↓$	$\frac{C A}{I V}$ $↑$	SD $↓$
GraspIt! [miller2004graspit]	$15.78$	9.44	$1.67$	$3.30$
Grasp’D	21.00	$11.23$	1.78	2.88

Table 8: Fine-tuning Grasping Field [karunratanakul2020grasping] with data generated by Grasp’D improves performance on unseen YCB objects better than additional GraspIt! [miller2004graspit] data.

Figure 5: Robotic grasping with the four-fingered Allegro hand. Our method works equally well with robotic and human hand models. We visualise grasps of three YCB objects [calli2017yale] with the four-fingered Allegro robotic hand. We can recover a variety of grasps for each object by sampling different initial hand poses (which gradient-based optimization takes to different final grasps). See Appendix A.2 for details of initialization.

Figure 6: Synthesized MANO hand grasps of YCB objects. Our method generates contact-rich grasps for objects from the YCB dataset. These qualitative results are drawn from the ablation study in Section 4.4 of the main paper, specifically with all features turned on, corresponding to the row labelled Grasp’D in Table 3.

Figure 7: Optimization trajectories for MANO hand grasps of YCB objects. Grasps improve as optimization progresses (from left to right in the figure). We visualize the optimization paths that result in the final grasps in Figure 6. The leftmost column shows the initial hand pose (see Appendix 0.A.2 for details of initialization) and the rightmost column shows the final grasp. Initially, the hand is not even in contact with the object, but as optimization continues the grasp gains contact and becomes more plausible and more stable.

Figure 8: Optimization trajectories for MANO hand grasps of ShapeNet objects. Grasps improve as optimization progresses (from left to right in the figure). We visualize the optimization paths that result in the final grasps in the second row of Figure 1 of the main paper.

Figure 9: Grasp synthesis from RGB-D. We use RGB-D captures from the YCB dataset [calli2017yale] to reconstruct object models, from which we synthesize grasps (see Section 4.3 of the main paper for details). Our method can synthesize plausible grasps not just from ground truth object models, but also from imperfect reconstructions.

Acknowledgements. DT was supported in part by a Vector research grant. The authors appreciate the support of NSERC, Vector Institute and Samsung AI. AG was also supported by an NSERC Discovery Grant, an NSERC Exploration Grant, a CIFAR AI Chair, and an XSeed Discovery Grant from the University of Toronto.
