Many robotic tasks involving some form of 3D visual perception greatly benefit from a complete knowledge of the working environment. However, robots often have to cope with unstructured environments, and their onboard vision sensors can only provide incomplete information due to limited workspaces, clutter, or object self-occlusion. In recent years, deep learning architectures for shape completion have begun to gain traction as an effective means of inferring a complete 3D object representation from partial visual data. However, most existing state-of-the-art methods provide a fixed output resolution in the form of a voxel grid, which is strictly tied to the size of the neural network's output stage. While this is sufficient for some tasks, such as obstacle avoidance in navigation, grasping and manipulation require finer resolutions, and simply scaling up the neural network output is computationally expensive. In this paper, we address this limitation with an object shape completion method based on an implicit 3D representation that provides a confidence value for each reconstructed point. As a second contribution, we propose a gradient-based method for efficiently sampling such an implicit function at arbitrary resolutions at inference time. We experimentally validate our approach by comparing the reconstructed shapes against ground truth and by deploying the shape completion algorithm in a robotic grasping pipeline. In both cases, we compare the results against state-of-the-art shape completion methods.
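To make the second contribution concrete, the minimal sketch below shows one way gradient-based sampling of an implicit function at an arbitrary resolution could look. The `sphere_sdf` stand-in, the optimizer, and all hyperparameters are assumptions for illustration, not the paper's actual network or settings.

```python
import torch

def sphere_sdf(points, radius=0.5):
    # Stand-in for the learned implicit function; any differentiable
    # occupancy/confidence network could be substituted here (assumption).
    return points.norm(dim=-1) - radius

def refine_surface_samples(implicit_fn, points, level=0.0, steps=20, lr=0.05):
    """Pull query points onto the `level` iso-surface of an implicit function
    by gradient descent on the squared residual, so the surface can be
    sampled at any desired density without enlarging the network."""
    points = points.clone().requires_grad_(True)
    optimizer = torch.optim.SGD([points], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        residual = implicit_fn(points) - level
        (residual ** 2).sum().backward()
        optimizer.step()
    return points.detach()

if __name__ == "__main__":
    coarse = torch.rand(1024, 3) * 2.0 - 1.0      # arbitrary number of seed samples
    surface = refine_surface_samples(sphere_sdf, coarse)
    print(surface.shape, sphere_sdf(surface).abs().mean().item())
```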
Real-world robotic grasping can be done robustly if a complete 3D Point Cloud Data (PCD) of an object is available. However, in practice, PCDs are often incomplete when objects are viewed from few and sparse viewpoints before the grasping action, leading to the generation of wrong or inaccurate grasp poses. We propose a novel grasping strategy, named 3DSGrasp, that predicts the missing geometry from the partial PCD to produce reliable grasp poses. Our proposed PCD completion network is a Transformer-based encoder-decoder network with an Offset-Attention layer. Our network is inherently invariant to object pose and point permutation, and generates PCDs that are geometrically consistent and properly completed. Experiments on a wide range of partial PCDs show that 3DSGrasp outperforms the best state-of-the-art method on PCD completion tasks and largely improves the grasping success rate in real-world scenarios. The code and dataset will be made available upon acceptance.
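For readers unfamiliar with the Offset-Attention idea, a minimal PyTorch sketch follows; the layer sizes and the exact normalization are assumptions and simplified relative to the paper's completion network.

```python
import torch
import torch.nn as nn

class OffsetAttention(nn.Module):
    """Simplified offset-attention block in the spirit of point cloud
    transformers: self-attention, then the offset (input minus attended
    features) is re-projected and added back as a residual."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim, bias=False)
        self.k = nn.Linear(dim, dim, bias=False)
        self.v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())

    def forward(self, x):                        # x: (B, N, dim) point features
        q, k, v = self.q(x), self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(1, 2) / x.shape[-1] ** 0.5, dim=-1)
        attended = attn @ v                       # (B, N, dim)
        offset = x - attended                     # the "offset" signal
        return x + self.proj(offset)              # residual update

if __name__ == "__main__":
    feats = torch.randn(2, 128, 64)               # 2 clouds, 128 points, 64-dim features
    print(OffsetAttention(64)(feats).shape)
```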
We propose the Multi-View Performer (MVP), a new architecture for 3D shape completion from a series of temporally sequential views. MVP accomplishes this task by using linear-attention Transformers called Performers. Our model allows the current observation of the scene to attend to previous observations for more accurate shape filling. The history of past observations is compressed via a compact associative memory that approximates modern continuous Hopfield memory, but is, crucially, of size independent of the history length. We compare our model with several baselines for shape completion over time, demonstrating the generalization gains provided by MVP. To the best of our knowledge, MVP is the first multi-view voxel reconstruction method that does not require registration of multiple depth views, as well as the first causal-Transformer-based model for 3D shape completion.
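The constant-size associative memory behind causal linear attention can be illustrated with a short sketch; the `feature_map` below is a simple positive stand-in for the Performer's random-feature kernel and is not the model used in the paper.

```python
import numpy as np

def feature_map(x):
    # Simple positive feature map standing in for the Performer's
    # random-feature kernel approximation (assumption).
    return np.maximum(x, 0.0) + 1e-6

class CausalLinearAttentionMemory:
    """Constant-size associative memory for causal linear attention:
    storage does not grow with the number of past observations."""
    def __init__(self, dim):
        self.S = np.zeros((dim, dim))   # running sum of phi(k) v^T
        self.z = np.zeros(dim)          # running sum of phi(k)

    def update(self, key, value):
        phi_k = feature_map(key)
        self.S += np.outer(phi_k, value)
        self.z += phi_k

    def read(self, query):
        phi_q = feature_map(query)
        return (phi_q @ self.S) / (phi_q @ self.z + 1e-6)

if __name__ == "__main__":
    mem = CausalLinearAttentionMemory(dim=16)
    for _ in range(100):                          # 100 past views compressed into a fixed state
        mem.update(np.random.randn(16), np.random.randn(16))
    print(mem.read(np.random.randn(16)).shape)    # (16,)
```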
Being able to grasp objects is a fundamental component of most robotic manipulation systems. In this paper, we present a new approach to simultaneously reconstruct a mesh and a dense grasp quality map of an object from a depth image. At the core of our approach is a novel camera-centric object representation called the "object shell" which is composed of an observed "entry image" and a predicted "exit image". We present an image-to-image residual ConvNet architecture in which the object shell and a grasp-quality map are predicted as separate output channels. The main advantage of the shell representation and the corresponding neural network architecture, ShellGrasp-Net, is that the input-output pixel correspondences in the shell representation are explicitly represented in the architecture. We show that this coupling yields superior generalization capabilities for object reconstruction and accurate grasp quality estimation implicitly considering the object geometry. Our approach yields an efficient dense grasp quality map and an object geometry estimate in a single forward pass. Both of these outputs can be used in a wide range of robotic manipulation applications. With rigorous experimental validation, both in simulation and on a real setup, we show that our shell-based method can be used to generate precise grasps and the associated grasp quality with over 90% accuracy. Diverse grasps computed on shell reconstructions allow the robot to select and execute grasps in cluttered scenes with more than 93% success rate.
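Because the shell representation keeps the pixel correspondences between the entry and exit images, both can be back-projected with the same camera intrinsics to obtain a shell point cloud. The toy sketch below illustrates that step; the intrinsics and depth values are made up, and the real method predicts the exit image with a ConvNet rather than assuming it.

```python
import numpy as np

def backproject_shell(entry_depth, exit_depth, fx, fy, cx, cy):
    """Back-project entry/exit depth images (the 'object shell') into a
    camera-frame point cloud; pixel correspondences are shared between
    the two images, as in the shell representation."""
    h, w = entry_depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    points = []
    for depth in (entry_depth, exit_depth):
        valid = depth > 0
        z = depth[valid]
        x = (us[valid] - cx) * z / fx
        y = (vs[valid] - cy) * z / fy
        points.append(np.stack([x, y, z], axis=-1))
    return np.concatenate(points, axis=0)

if __name__ == "__main__":
    entry = np.full((48, 64), 0.6)                # toy planar "entry" surface at 0.6 m
    exit_ = np.full((48, 64), 0.7)                # toy "exit" surface at 0.7 m
    cloud = backproject_shell(entry, exit_, fx=55.0, fy=55.0, cx=32.0, cy=24.0)
    print(cloud.shape)                            # (2 * 48 * 64, 3)
```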
Recent 3D-based manipulation methods either directly predict the grasp pose using 3D neural networks, or solve for the grasp pose using similar objects retrieved from shape databases. However, the former faces generalizability challenges when tested with new robot arms or unseen objects, and the latter assumes that similar objects exist in the databases. We hypothesize that recent 3D modeling methods provide a path towards building a digital replica of the evaluation scene that affords physical simulation and supports robust manipulation algorithm learning. We propose to reconstruct high-quality meshes from real-world point clouds using a state-of-the-art neural surface reconstruction method (the Real2Sim step). Because most simulators take meshes for fast simulation, the reconstructed meshes enable grasp pose label generation without human effort. The generated labels can be used to train a grasp network that performs robustly in the real evaluation scene (the Sim2Real step). In synthetic and real experiments, we show that the Real2Sim2Real pipeline performs better than baseline grasp networks trained on a large dataset and than a grasp sampling method with retrieval-based reconstruction. The benefit of the Real2Sim2Real pipeline comes from 1) decoupling scene modeling and grasp sampling into sub-problems, and 2) the fact that both sub-problems can be solved with sufficiently high quality using recent 3D learning algorithms and mesh-based physical simulation techniques.
Grasp learning has become an exciting and important topic in robotics. Just a few years ago, the problem of grasping novel objects from unstructured piles of clutter was considered a serious research challenge. Now, it is a capability that is quickly becoming incorporated into industrial supply chain automation. How did that happen? What is the current state of the art in robotic grasp learning, what are the different methodological approaches, and what machine learning models are used? This review attempts to give an overview of the current state of the art of grasp learning research.
The ability to successfully grasp objects is crucial in robotics, as it enables several interactive downstream applications. To this end, most approaches either compute the full 6D pose of the object of interest or learn to predict a set of grasp points. While the former does not scale well to multiple object instances or classes, the latter requires large annotated datasets and is hampered by poor generalization to new geometries. To overcome these shortcomings, we propose to teach a robot how to grasp an object with a simple and short human demonstration. Hence, our approach neither requires many annotated images nor is it restricted to specific geometries. We first present a small sequence of RGB-D images showing a human-object interaction. This sequence is then leveraged to build the associated hand and object meshes that represent the depicted interaction. Subsequently, we complete the missing parts of the reconstructed object shape and estimate the relative transformation between the reconstruction and the visible object in the scene. Finally, we transfer the a-priori knowledge of the relative pose between the object and the human hand, together with the current object pose estimate in the scene, into the grasping instructions required by the robot. Exhaustive evaluations with Toyota's Human Support Robot (HSR) in real and synthetic environments demonstrate the applicability of our proposed methodology and its advantages over previous approaches.
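The final grasp-transfer step boils down to composing the demonstrated hand-object relative pose with the estimated object pose in the scene; a minimal sketch with hypothetical 4x4 homogeneous transforms follows (the numbers are invented for illustration).

```python
import numpy as np

def pose_matrix(rotation, translation):
    """Assemble a 4x4 homogeneous transform from a 3x3 rotation and a
    3-vector translation."""
    T = np.eye(4)
    T[:3, :3] = rotation
    T[:3, 3] = translation
    return T

def transfer_grasp(T_scene_object, T_object_grasp):
    """Map a grasp demonstrated in the object frame into the current scene
    by composing it with the estimated object pose."""
    return T_scene_object @ T_object_grasp

if __name__ == "__main__":
    # Hypothetical values: object rotated 90 degrees about z in the scene,
    # grasp demonstrated 10 cm above the object origin.
    Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)
    T_scene_object = pose_matrix(Rz, [0.4, 0.1, 0.0])
    T_object_grasp = pose_matrix(np.eye(3), [0.0, 0.0, 0.1])
    print(transfer_grasp(T_scene_object, T_object_grasp))
```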
Grasping is the process of picking up an object by applying forces and torques at a set of contacts. Recent advances in deep learning methods have allowed rapid progress in robotic object grasping. We systematically surveyed the publications of the last decade, with a particular interest in approaches that grasp objects using all 6 degrees of freedom of the end-effector pose. Our review found four common methodologies for robotic grasping: sampling-based approaches, direct regression, reinforcement learning, and exemplar approaches. Furthermore, we found two "supporting methods" around grasping that use deep learning to support the grasping process: shape approximation and affordances. We have distilled the publications found in this systematic review (85 papers) into ten key takeaways that we consider crucial for future robotic grasping and manipulation research. An online version of the survey is available at https://rhys-newbury.github.io/projects/6dof/
Generating grasp poses is a crucial component for any robot object manipulation task. In this work, we formulate the problem of grasp generation as sampling a set of grasps using a variational autoencoder and assess and refine the sampled grasps using a grasp evaluator model. Both Grasp Sampler and Grasp Refinement networks take 3D point clouds observed by a depth camera as input. We evaluate our approach in simulation and real-world robot experiments. Our approach achieves 88% success rate on various commonly used objects with diverse appearances, scales, and weights. Our model is trained purely in simulation and works in the real world without any extra steps. The video of our experiments can be found here.
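A rough sketch of the sampler-plus-evaluator loop is given below, using toy stand-in networks (`ToyGraspDecoder`, `ToyGraspEvaluator`) rather than the paper's point-cloud-conditioned models; refinement is shown as gradient ascent on the evaluator score, which is one plausible reading of the refinement stage, not its exact implementation.

```python
import torch
import torch.nn as nn

class ToyGraspDecoder(nn.Module):
    """Stand-in for a VAE decoder mapping latent codes to 6-DoF grasps
    (translation + axis-angle rotation); the real model also conditions
    on the observed point cloud."""
    latent_dim = 8
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(self.latent_dim, 6)
    def forward(self, z):
        return self.net(z)

class ToyGraspEvaluator(nn.Module):
    """Stand-in for a grasp evaluator returning a success score per grasp."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(6, 32), nn.ReLU(), nn.Linear(32, 1))
    def forward(self, grasps):
        return torch.sigmoid(self.net(grasps)).squeeze(-1)

def sample_and_refine(decoder, evaluator, num_samples=64, steps=10, lr=0.01):
    grasps = decoder(torch.randn(num_samples, decoder.latent_dim)).detach()
    grasps.requires_grad_(True)
    opt = torch.optim.Adam([grasps], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-evaluator(grasps).sum()).backward()     # gradient ascent on the score
        opt.step()
    return grasps.detach()

if __name__ == "__main__":
    refined = sample_and_refine(ToyGraspDecoder(), ToyGraspEvaluator())
    print(refined.shape)                          # (64, 6)
```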
Nowadays, robots play an increasingly important role in our daily lives. In human-centered environments, robots often encounter piles of objects, packed items, or isolated objects. A robot must therefore be able to grasp and manipulate different objects in various situations to help humans with everyday tasks. In this paper, we propose a multi-view deep learning approach for robust object grasping in human-centric domains. In particular, our approach takes the point cloud of an arbitrary object as input and then generates orthographic views of the given object. The obtained views are finally used to estimate a pixel-wise grasp synthesis for each object. We train the model end-to-end using a small object-grasping dataset and test it on simulated and real-world data without any further fine-tuning. To evaluate the performance of the proposed approach, we performed extensive sets of experiments in three scenarios, including isolated objects, packed items, and piles of objects. Experimental results show that our approach performed very well in all simulation and real-robot scenarios and is able to achieve reliable closed-loop grasping of novel objects across various scene configurations.
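The orthographic-view generation step can be sketched as projecting the point cloud into axis-aligned depth images; the version below is a simplified assumption of that step (a single view, nearest-point rasterization), not the paper's exact rendering.

```python
import numpy as np

def orthographic_depth_view(points, axis=2, resolution=64):
    """Render a point cloud into an orthographic depth image along one
    axis (a simplified version of a multi-view projection step)."""
    other = [i for i in range(3) if i != axis]
    uv = points[:, other]
    depth_vals = points[:, axis]
    # Normalize the in-plane coordinates to pixel indices.
    mins, maxs = uv.min(axis=0), uv.max(axis=0)
    pix = ((uv - mins) / (maxs - mins + 1e-9) * (resolution - 1)).astype(int)
    image = np.full((resolution, resolution), np.inf)
    for (u, v), d in zip(pix, depth_vals):
        image[v, u] = min(image[v, u], d)         # keep the closest point per pixel
    image[np.isinf(image)] = 0.0
    return image

if __name__ == "__main__":
    cloud = np.random.rand(2048, 3)
    print(orthographic_depth_view(cloud).shape)   # (64, 64)
```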
We introduce a neural implicit representation for grasps of objects from multiple robotic hands. Different grasps across multiple robotic hands are encoded into a shared latent space. Each latent vector is learned to decode the 3D shape of the object and the 3D shape of the robotic hand in terms of the signed distance functions of the two shapes. In addition, a distance metric in the latent space is learned to preserve the similarity between grasps across different robotic hands, where the similarity of grasps is defined according to the contact regions of the robotic hands. This property enables us to transfer grasps between different grippers, including a human hand, and such grasp transfer has the potential to share grasping skills between robots and to enable robots to learn grasping skills from humans. Furthermore, the encoded signed distance functions of objects and grasps in our implicit representation can be used for 6D object pose estimation and grasp contact optimization from partial point clouds, which enables robotic grasping in the real world.
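A minimal sketch of decoding a shared grasp latent into two signed distance functions (object and hand surface) is shown below; the network widths and the decoder structure are assumptions, and the similarity-preserving metric learning is not shown.

```python
import torch
import torch.nn as nn

class DualSDFDecoder(nn.Module):
    """Sketch of a shared grasp latent decoded into two signed distance
    functions: one for the object surface and one for the hand surface."""
    def __init__(self, latent_dim=32, hidden=128):
        super().__init__()
        def head():
            return nn.Sequential(
                nn.Linear(latent_dim + 3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))
        self.object_sdf = head()
        self.hand_sdf = head()

    def forward(self, latent, query_points):
        # latent: (B, latent_dim), query_points: (B, N, 3)
        z = latent.unsqueeze(1).expand(-1, query_points.shape[1], -1)
        x = torch.cat([z, query_points], dim=-1)
        return self.object_sdf(x).squeeze(-1), self.hand_sdf(x).squeeze(-1)

if __name__ == "__main__":
    dec = DualSDFDecoder()
    obj_d, hand_d = dec(torch.randn(4, 32), torch.randn(4, 256, 3))
    print(obj_d.shape, hand_d.shape)              # (4, 256) (4, 256)
```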
Robots operating in human-centered environments need to know which objects exist in the scene, as well as how to grasp and manipulate various objects in different situations in order to help humans with everyday tasks. Object recognition and grasping are therefore two key functionalities for such robots. State-of-the-art approaches tackle object recognition and grasping as two separate problems, even though both use visual input. Furthermore, the robot's knowledge is fixed after the training phase; if the robot then encounters new object categories, it must be retrained from scratch to incorporate the new information without catastrophic interference. To address this problem, we propose a deep learning architecture with an augmented memory capacity to handle open-ended object recognition and grasping simultaneously. In particular, our approach takes multi-views of an object as input and jointly estimates pixel-wise grasp configurations as well as a deep rotation-invariant representation as output. The obtained representation is then used for open-ended object recognition through a meta-active learning technique. We demonstrate the ability of our approach to grasp never-seen-before objects and to rapidly learn new object categories on site from very few examples, in both simulation and the real world.
Point cloud completion is a generation and estimation problem derived from partial point clouds, and it plays a vital role in applications of 3D computer vision. Progress in deep learning (DL) has impressively improved the capability and robustness of point cloud completion. However, the quality of completed point clouds still needs to be further enhanced to meet practical requirements. Therefore, this work aims to conduct a comprehensive survey of the various methods, including point-based, convolution-based, graph-based, and generative-model-based approaches. This survey summarizes comparisons among these methods to provoke further research insights. In addition, it sums up the commonly used datasets and illustrates the applications of point cloud completion. Finally, we also discuss possible research trends in this rapidly expanding field.
Manipulating volumetric deformable objects in the real world, such as plush toys and pizza dough, poses significant challenges due to infinite shape variations, non-rigid motions, and partial observability. We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects based on structured implicit neural representations. ACID integrates two new techniques: implicit representations for action-conditional dynamics, and geodesics-based contrastive learning. To represent deformable dynamics from partial RGB-D observations, we learn implicit representations of occupancy and flow-based forward dynamics. To accurately identify state changes under large non-rigid deformations, we learn a correspondence embedding field through a novel geodesics-based contrastive loss. To evaluate our approach, we develop a simulation framework for manipulating complex deformable shapes in realistic scenes and a benchmark containing over 17,000 action trajectories with six types of plush toys and 78 variants. Our model achieves the best performance in geometry, correspondence, and dynamics prediction over existing approaches. The ACID dynamics model is successfully employed in goal-conditioned deformable manipulation tasks, resulting in a 30% increase in task success rate over the strongest baseline. Furthermore, we apply the simulation-trained ACID model directly to real-world objects and show success in manipulating them into target configurations. For more results and information, please visit https://b0ku1.github.io/acid/.
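The geodesics-based contrastive idea can be illustrated with a toy margin loss that pulls together embeddings of geodesically close surface points and pushes apart distant ones; the thresholds and margin below are assumptions, not the paper's actual loss.

```python
import torch

def geodesic_contrastive_loss(embeddings, geodesic_dist, pos_thresh=0.1,
                              neg_thresh=0.5, margin=1.0):
    """Toy contrastive objective over point embeddings: pairs that are
    close in geodesic distance on the object surface are pulled together,
    distant pairs are pushed apart by at least `margin`."""
    pairwise = torch.cdist(embeddings, embeddings)        # (N, N) embedding distances
    pos_mask = geodesic_dist < pos_thresh
    neg_mask = geodesic_dist > neg_thresh
    pos_loss = (pairwise[pos_mask] ** 2).mean()
    neg_loss = (torch.clamp(margin - pairwise[neg_mask], min=0.0) ** 2).mean()
    return pos_loss + neg_loss

if __name__ == "__main__":
    emb = torch.randn(128, 16, requires_grad=True)        # per-point embeddings
    geo = torch.rand(128, 128)                             # toy geodesic distances
    print(geodesic_contrastive_loss(emb, geo).item())
```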
Simultaneous object recognition and pose estimation are two key functionalities for robots to safely interact with humans and their environments. Although both object recognition and pose estimation use visual input, most state-of-the-art approaches tackle them as two separate problems, since the former needs a view-invariant representation while object pose estimation necessitates a view-dependent description. Nowadays, multi-view Convolutional Neural Network (MVCNN) approaches show state-of-the-art classification performance. Although MVCNN object recognition has been widely explored, there has been very little research on multi-view object pose estimation, and even less on addressing both problems simultaneously. The poses of the virtual cameras in MVCNN approaches are usually predefined, which limits the application of such approaches. In this paper, we propose an approach capable of handling object recognition and pose estimation simultaneously. In particular, we develop a deep, object-agnostic entropy estimation model capable of predicting the best viewpoints of a given 3D object. The obtained views of the object are then fed to the network to simultaneously predict the pose and the category label of the target object. Experimental results show that the views obtained from such positions are descriptive enough to achieve good accuracy scores. Furthermore, we designed a real-life drink-serving scenario to demonstrate how well the proposed approach works in a real robot task. Code is available online at: github.com/subhadityamukherjee/more_mvcnn
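As a loose illustration of viewpoint selection, the sketch below scores candidate depth views by histogram entropy and picks the maximum; the paper instead trains a learned entropy estimation model, so this hand-crafted heuristic is only a stand-in.

```python
import numpy as np

def view_entropy(depth_image, bins=32):
    """Shannon entropy of a depth-image histogram; a simple proxy for how
    informative a candidate viewpoint is."""
    valid = depth_image[depth_image > 0]
    hist, _ = np.histogram(valid, bins=bins)
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_best_view(candidate_views):
    """Pick the candidate depth view with maximum entropy."""
    scores = [view_entropy(v) for v in candidate_views]
    return int(np.argmax(scores)), scores

if __name__ == "__main__":
    views = [np.random.rand(64, 64) for _ in range(8)]    # toy candidate views
    idx, scores = select_best_view(views)
    print("best view:", idx)
```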
The success of Transformers in natural language processing has recently attracted attention in the computer vision field. Owing to their ability to learn long-range dependencies, Transformers have been used as a replacement for the widely adopted convolution operators. This replacement has proven successful in numerous tasks, where several state-of-the-art methods rely on Transformers for better learning. In computer vision, the 3D field has likewise witnessed an increased use of Transformers to boost 3D convolutional neural networks and multi-layer perceptron networks. Although many surveys have focused on Transformers in vision in general, 3D vision requires special attention due to the differences in data representation and processing compared to 2D vision. In this work, we present a systematic and thorough review of more than 100 Transformer methods for different 3D vision tasks, including classification, segmentation, detection, completion, pose estimation, and others. We discuss Transformer design in 3D vision, which allows such methods to process data with various 3D representations. For each application, we highlight the key properties and contributions of the Transformer-based methods. To assess the competitiveness of these methods, we compare their performance to common non-Transformer methods on 12 3D benchmarks. We conclude the survey by discussing open directions and challenges for Transformers in 3D vision. In addition to the presented papers, we aim to frequently update the latest relevant papers along with their corresponding implementations at: https://github.com/lahoud/3d-vision-transformers.
We formulate grasp learning as a neural field and present Neural Grasp Distance Fields (NGDF). Here, the input is a 6D pose of a robot end effector and output is a distance to a continuous manifold of valid grasps for an object. In contrast to current approaches that predict a set of discrete candidate grasps, the distance-based NGDF representation is easily interpreted as a cost, and minimizing this cost produces a successful grasp pose. This grasp distance cost can be incorporated directly into a trajectory optimizer for joint optimization with other costs such as trajectory smoothness and collision avoidance. During optimization, as the various costs are balanced and minimized, the grasp target is allowed to smoothly vary, as the learned grasp field is continuous. In simulation benchmarks with a Franka arm, we find that joint grasping and planning with NGDF outperforms baselines by 63% execution success while generalizing to unseen query poses and unseen object shapes. Project page: https://sites.google.com/view/neural-grasp-distance-fields.
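The way a learned grasp distance can enter a trajectory optimizer as one cost among others can be sketched as follows; `toy_grasp_field` replaces the learned NGDF, and the cost weights and waypoint parameterization are assumptions for illustration.

```python
import torch

def toy_grasp_field(pose):
    # Stand-in for the learned grasp distance field: distance of the final
    # end-effector pose to a fixed "grasp manifold" point (assumption).
    target = torch.tensor([0.5, 0.0, 0.3, 0.0, 0.0, 0.0])
    return torch.norm(pose - target)

def optimize_trajectory(T=20, steps=200, lr=0.05, w_smooth=1.0, w_grasp=5.0):
    """Jointly minimize trajectory smoothness and the grasp distance of the
    final waypoint, so the grasp target can shift smoothly during optimization."""
    traj = torch.zeros(T, 6, requires_grad=True)          # 6-DoF waypoints
    opt = torch.optim.Adam([traj], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        smooth = ((traj[1:] - traj[:-1]) ** 2).sum()
        grasp = toy_grasp_field(traj[-1])
        (w_smooth * smooth + w_grasp * grasp).backward()
        opt.step()
    return traj.detach()

if __name__ == "__main__":
    traj = optimize_trajectory()
    print(traj[-1])                                       # final pose near the grasp manifold
```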
In this paper, we present a Transformer-based architecture, namely TF-Grasp, for robotic grasp detection. The developed TF-Grasp framework has two elaborate designs that make it well suited for visual grasping tasks. The first key design is that we adopt local window attention to capture local contextual information and the detailed features of graspable objects. We then apply cross-window attention to model long-range dependencies between distant pixels. Object knowledge, environmental configurations, and the relationships between different visual entities are aggregated for subsequent grasp detection. The second key design is that we build a hierarchical encoder-decoder architecture with skip connections, delivering shallow features from the encoder to the decoder to enable multi-scale feature fusion. Owing to the powerful attention mechanism, TF-Grasp can simultaneously capture local information (e.g., the contours of objects) and model long-range relationships such as those between distinct visual concepts in clutter. Extensive computational experiments demonstrate that TF-Grasp achieves superior results versus state-of-the-art convolutional models, attaining accuracies of 97.99% and 94.6% on the Cornell and Jacquard grasping datasets, respectively. Real-world experiments with a 7-DoF Franka Emika Panda robot also demonstrate its capability of grasping unseen objects in a variety of scenarios. The code and pre-trained models will be available at https://github.com/wangshaosun/grasp-transformer
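A minimal sketch of the local window attention stage (self-attention restricted to non-overlapping windows of a feature map) is given below; the window size, head count, and feature dimensions are assumptions, and the cross-window stage and hierarchical decoder are omitted.

```python
import torch
import torch.nn as nn

def window_partition(x, win):
    """Split a (B, H, W, C) feature map into non-overlapping win x win
    windows, flattened into token sequences."""
    b, h, w, c = x.shape
    x = x.view(b, h // win, win, w // win, win, c)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, win * win, c)

class LocalWindowAttention(nn.Module):
    """Self-attention restricted to local windows, a simplified stand-in
    for the local attention stage of a windowed grasp transformer."""
    def __init__(self, dim, heads=4, win=4):
        super().__init__()
        self.win = win
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                         # x: (B, H, W, C)
        b, h, w, c = x.shape
        tokens = window_partition(x, self.win)
        out, _ = self.attn(tokens, tokens, tokens)
        out = out.view(b, h // self.win, w // self.win, self.win, self.win, c)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)

if __name__ == "__main__":
    feat = torch.randn(1, 16, 16, 32)
    print(LocalWindowAttention(32)(feat).shape)   # (1, 16, 16, 32)
```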
We present a unified and compact representation for object rendering, 3D reconstruction, and grasp pose prediction that can be inferred from a single image within a few seconds. We achieve this by leveraging recent advances in the Neural Radiance Field (NeRF) literature that learn category-level priors and fine-tune on novel objects with minimal data and time. Our insight is that we can learn a compact shape representation and extract meaningful additional information from it, such as grasping poses. We believe this to be the first work to retrieve grasping poses directly from a NeRF-based representation using a single viewpoint (RGB-only), rather than going through a secondary network and/or representation. When compared to prior art, our method is two to three orders of magnitude smaller while achieving comparable performance at view reconstruction and grasping. Accompanying our method, we also propose a new dataset of rendered shoes for training a sim-2-real NeRF method with grasping poses for different widths of grippers.