Bridging the 'reality gap' that separates simulated robotics from experiments on hardware could accelerate robotic research through improved data availability. This paper explores domain randomization, a simple technique for training models on simulated images that transfer to real images by randomizing rendering in the simulator. With enough variability in the simulator, the real world may appear to the model as just another variation. We focus on the task of object localization, which is a stepping stone to general robotic manipulation skills. We find that it is possible to train a real-world object detector that is accurate to 1.5 cm and robust to distractors and partial occlusions using only data from a simulator with non-realistic random textures. To demonstrate the capabilities of our detectors, we show they can be used to perform grasping in a cluttered environment. To our knowledge, this is the first successful transfer of a deep neural network trained only on simulated RGB images (without pre-training on real images) to the real world for the purpose of robotic control.
translated by 谷歌翻译
在非结构化环境中工作的机器人必须能够感知和解释其周围环境。机器人技术领域基于深度学习模型的主要障碍之一是缺乏针对不同工业应用的特定领域标记数据。在本文中,我们提出了一种基于域随机化的SIM2REAL传输学习方法,用于对象检测,可以自动生成任意大小和对象类型的标记的合成数据集。随后,对最先进的卷积神经网络Yolov4进行了训练,以检测不同类型的工业对象。通过提出的域随机化方法,我们可以在零射击和单次转移的情况下分别缩小现实差距,分别达到86.32%和97.38%的MAP50分数,其中包含190个真实图像。在GEFORCE RTX 2080 TI GPU上,数据生成过程的每图像少于0.5 s,培训持续约12H,这使其方便地用于工业使用。我们的解决方案符合工业需求,因为它可以通过仅使用1个真实图像进行培训来可靠地区分相似的对象类别。据我们所知,这是迄今为止满足这些约束的唯一工作。
translated by 谷歌翻译
Figure 1: A five-fingered humanoid hand trained with reinforcement learning manipulating a block from an initial configuration to a goal configuration using vision for sensing.
translated by 谷歌翻译
We present a retrospective on the state of Embodied AI research. Our analysis focuses on 13 challenges presented at the Embodied AI Workshop at CVPR. These challenges are grouped into three themes: (1) visual navigation, (2) rearrangement, and (3) embodied vision-and-language. We discuss the dominant datasets within each theme, evaluation metrics for the challenges, and the performance of state-of-the-art models. We highlight commonalities between top approaches to the challenges and identify potential future directions for Embodied AI research.
translated by 谷歌翻译
We introduce MegaPose, a method to estimate the 6D pose of novel objects, that is, objects unseen during training. At inference time, the method only assumes knowledge of (i) a region of interest displaying the object in the image and (ii) a CAD model of the observed object. The contributions of this work are threefold. First, we present a 6D pose refiner based on a render&compare strategy which can be applied to novel objects. The shape and coordinate system of the novel object are provided as inputs to the network by rendering multiple synthetic views of the object's CAD model. Second, we introduce a novel approach for coarse pose estimation which leverages a network trained to classify whether the pose error between a synthetic rendering and an observed image of the same object can be corrected by the refiner. Third, we introduce a large-scale synthetic dataset of photorealistic images of thousands of objects with diverse visual and shape properties and show that this diversity is crucial to obtain good generalization performance on novel objects. We train our approach on this large synthetic dataset and apply it without retraining to hundreds of novel objects in real images from several pose estimation benchmarks. Our approach achieves state-of-the-art performance on the ModelNet and YCB-Video datasets. An extensive evaluation on the 7 core datasets of the BOP challenge demonstrates that our approach achieves performance competitive with existing approaches that require access to the target objects during training. Code, dataset and trained models are available on the project page: https://megapose6d.github.io/.
translated by 谷歌翻译
Recent 3D-based manipulation methods either directly predict the grasp pose using 3D neural networks, or solve the grasp pose using similar objects retrieved from shape databases. However, the former faces generalizability challenges when testing with new robot arms or unseen objects; and the latter assumes that similar objects exist in the databases. We hypothesize that recent 3D modeling methods provides a path towards building digital replica of the evaluation scene that affords physical simulation and supports robust manipulation algorithm learning. We propose to reconstruct high-quality meshes from real-world point clouds using state-of-the-art neural surface reconstruction method (the Real2Sim step). Because most simulators take meshes for fast simulation, the reconstructed meshes enable grasp pose labels generation without human efforts. The generated labels can train grasp network that performs robustly in the real evaluation scene (the Sim2Real step). In synthetic and real experiments, we show that the Real2Sim2Real pipeline performs better than baseline grasp networks trained with a large dataset and a grasp sampling method with retrieval-based reconstruction. The benefit of the Real2Sim2Real pipeline comes from 1) decoupling scene modeling and grasp sampling into sub-problems, and 2) both sub-problems can be solved with sufficiently high quality using recent 3D learning algorithms and mesh-based physical simulation techniques.
translated by 谷歌翻译
We describe a learning-based approach to handeye coordination for robotic grasping from monocular images. To learn hand-eye coordination for grasping, we trained a large convolutional neural network to predict the probability that task-space motion of the gripper will result in successful grasps, using only monocular camera images and independently of camera calibration or the current robot pose. This requires the network to observe the spatial relationship between the gripper and objects in the scene, thus learning hand-eye coordination. We then use this network to servo the gripper in real time to achieve successful grasps. To train our network, we collected over 800,000 grasp attempts over the course of two months, using between 6 and 14 robotic manipulators at any given time, with differences in camera placement and hardware. Our experimental evaluation demonstrates that our method achieves effective real-time control, can successfully grasp novel objects, and corrects mistakes by continuous servoing.
translated by 谷歌翻译
在这项工作中,我们通过利用3D Suite Blender生产具有6D姿势的合成RGBD图像数据集来提出数据生成管道。提出的管道可以有效地生成大量的照片现实的RGBD图像,以了解感兴趣的对象。此外,引入了域随机化技术的集合来弥合真实数据和合成数据之间的差距。此外,我们通过整合对象检测器Yolo-V4微型和6D姿势估计算法PVN3D来开发实时的两阶段6D姿势估计方法,用于时间敏感的机器人应用。借助提出的数据生成管道,我们的姿势估计方法可以仅使用没有任何预训练模型的合成数据从头开始训练。在LineMod数据集评估时,与最先进的方法相比,所得网络显示出竞争性能。我们还证明了在机器人实验中提出的方法,在不同的照明条件下从混乱的背景中抓住家用物体。
translated by 谷歌翻译
机器人社区已经开始严重依赖越来越逼真的3D模拟器,以便在大量数据上进行大规模培训机器人。但是,一旦机器人部署在现实世界中,仿真差距以及现实世界的变化(例如,灯,物体位移)导致错误。在本文中,我们介绍了SIM2Realviz,这是一种视觉分析工具,可以帮助专家了解并减少机器人EGO-POSE估计任务的这种差距,即使用训练型模型估计机器人的位置。 Sim2Realviz显示了给定模型的详细信息以及在模拟和现实世界中的实例的性能。专家可以识别在给定位置影响模型预测的环境差异,并通过与模型假设的直接交互来探索来解决它。我们详细介绍了工具的设计,以及与对平均偏差的回归利用以及如何解决的案例研究以及如何解决,以及模型如何被诸如自行车等地标的消失的扰动。
translated by 谷歌翻译
深度学习的兴起导致机器人研究中的范式转变,有利于需要大量数据的方法。在物理平台上生成这样的数据集是昂贵的。因此,最先进的方法在模拟中学习,其中数据生成快速以及廉价并随后将知识转移到真实机器人(SIM-to-Real)。尽管变得越来越真实,但所有模拟器都是基于模型的施工,因此不可避免地不完善。这提出了如何修改模拟器以促进学习机器人控制政策的问题,并克服模拟与现实之间的不匹配,通常称为“现实差距”。我们对机器人学的SIM-Teal研究提供了全面的审查,专注于名为“域随机化”的技术,这是一种从随机仿真学习的方法。
translated by 谷歌翻译
基于视觉的机器人组装是一项至关重要但具有挑战性的任务,因为与多个对象的相互作用需要高水平的精度。在本文中,我们提出了一个集成的6D机器人系统,以感知,掌握,操纵和组装宽度,以紧密的公差。为了提供仅在现成的RGB解决方案的情况下,我们的系统建立在单眼6D对象姿势估计网络上,该估计网络仅使用合成图像训练,该图像利用了基于物理的渲染。随后,提出了姿势引导的6D转换以及无碰撞组装来构建具有任意初始姿势的任何设计结构。我们的新型3轴校准操作通过解开6D姿势估计和机器人组件进一步提高了精度和鲁棒性。定量和定性结果都证明了我们提出的6D机器人组装系统的有效性。
translated by 谷歌翻译
仿真最近已成为深度加强学习,以安全有效地从视觉和预防性投入获取一般和复杂的控制政策的关键。尽管它与环境互动直接关系,但通常认为触觉信息通常不会被认为。在这项工作中,我们展示了一套针对触觉机器人和加强学习量身定制的模拟环境。提供了一种简单且快速的模拟光学触觉传感器的方法,其中高分辨率接触几何形状表示为深度图像。近端策略优化(PPO)用于学习所有考虑任务的成功策略。数据驱动方法能够将实际触觉传感器的当前状态转换为对应的模拟深度图像。此策略在物理机器人上实时控制循环中实现,以演示零拍摄的SIM-TO-REAL策略转移,以触摸感的几个物理交互式任务。
translated by 谷歌翻译
We present a new dataset for 6-DoF pose estimation of known objects, with a focus on robotic manipulation research. We propose a set of toy grocery objects, whose physical instantiations are readily available for purchase and are appropriately sized for robotic grasping and manipulation. We provide 3D scanned textured models of these objects, suitable for generating synthetic training data, as well as RGBD images of the objects in challenging, cluttered scenes exhibiting partial occlusion, extreme lighting variations, multiple instances per image, and a large variety of poses. Using semi-automated RGBD-to-model texture correspondences, the images are annotated with ground truth poses accurate within a few millimeters. We also propose a new pose evaluation metric called ADD-H based on the Hungarian assignment algorithm that is robust to symmetries in object geometry without requiring their explicit enumeration. We share pre-trained pose estimators for all the toy grocery objects, along with their baseline performance on both validation and test sets. We offer this dataset to the community to help connect the efforts of computer vision researchers with the needs of roboticists.
translated by 谷歌翻译
机器人仿真一直是数据驱动的操作任务的重要工具。但是,大多数现有的仿真框架都缺乏与触觉传感器的物理相互作用的高效和准确模型,也没有逼真的触觉模拟。这使得基于触觉的操纵任务的SIM转交付仍然具有挑战性。在这项工作中,我们通过建模接触物理学来整合机器人动力学和基于视觉的触觉传感器的模拟。该触点模型使用机器人最终效应器上的模拟接触力来告知逼真的触觉输出。为了消除SIM到真实传输差距,我们使用现实世界数据校准了机器人动力学,接触模型和触觉光学模拟器的物理模拟器,然后我们在零摄像机上演示了系统的有效性 - 真实掌握稳定性预测任务,在各种对象上,我们达到平均准确性为90.7%。实验揭示了将我们的模拟框架应用于更复杂的操纵任务的潜力。我们在https://github.com/cmurobotouch/taxim/tree/taxim-robot上开放仿真框架。
translated by 谷歌翻译
视觉感知任务通常需要大量的标记数据,包括3D姿势和图像空间分割掩码。创建此类培训数据集的过程可能很难或耗时,可以扩展到一般使用的功效。考虑对刚性对象的姿势估计的任务。在大型公共数据集中接受培训时,基于神经网络的深层方法表现出良好的性能。但是,将这些网络调整为其他新颖对象,或针对不同环境的现有模型进行微调,需要大量的时间投资才能产生新标记的实例。为此,我们提出了ProgressLabeller作为一种方法,以更有效地以可扩展的方式从彩色图像序列中生成大量的6D姿势训练数据。 ProgressLabeller还旨在支持透明或半透明的对象,以深度密集重建的先前方法将失败。我们通过快速创建一个超过1M样品的数据集来证明ProgressLabeller的有效性,我们将其微调一个最先进的姿势估计网络,以显着提高下游机器人的抓地力。 ProgressLabeller是https://github.com/huijiezh/progresslabeller的开放源代码。
translated by 谷歌翻译
Policy search methods can allow robots to learn control policies for a wide range of tasks, but practical applications of policy search often require hand-engineered components for perception, state estimation, and low-level control. In this paper, we aim to answer the following question: does training the perception and control systems jointly end-toend provide better performance than training each component separately? To this end, we develop a method that can be used to learn policies that map raw image observations directly to torques at the robot's motors. The policies are represented by deep convolutional neural networks (CNNs) with 92,000 parameters, and are trained using a guided policy search method, which transforms policy search into supervised learning, with supervision provided by a simple trajectory-centric reinforcement learning method. We evaluate our method on a range of real-world manipulation tasks that require close coordination between vision and control, such as screwing a cap onto a bottle, and present simulated comparisons to a range of prior policy search methods.
translated by 谷歌翻译
The accurate detection and grasping of transparent objects are challenging but of significance to robots. Here, a visual-tactile fusion framework for transparent object grasping under complex backgrounds and variant light conditions is proposed, including the grasping position detection, tactile calibration, and visual-tactile fusion based classification. First, a multi-scene synthetic grasping dataset generation method with a Gaussian distribution based data annotation is proposed. Besides, a novel grasping network named TGCNN is proposed for grasping position detection, showing good results in both synthetic and real scenes. In tactile calibration, inspired by human grasping, a fully convolutional network based tactile feature extraction method and a central location based adaptive grasping strategy are designed, improving the success rate by 36.7% compared to direct grasping. Furthermore, a visual-tactile fusion method is proposed for transparent objects classification, which improves the classification accuracy by 34%. The proposed framework synergizes the advantages of vision and touch, and greatly improves the grasping efficiency of transparent objects.
translated by 谷歌翻译
Recently, the use of synthetic training data has been on the rise as it offers correctly labelled datasets at a lower cost. The downside of this technique is that the so-called domain gap between the real target images and synthetic training data leads to a decrease in performance. In this paper, we attempt to provide a holistic overview of how to use synthetic data for object detection. We analyse aspects of generating the data as well as techniques used to train the models. We do so by devising a number of experiments, training models on the Dataset of Industrial Metal Objects (DIMO). This dataset contains both real and synthetic images. The synthetic part has different subsets that are either exact synthetic copies of the real data or are copies with certain aspects randomised. This allows us to analyse what types of variation are good for synthetic training data and which aspects should be modelled to closely match the target data. Furthermore, we investigate what types of training techniques are beneficial towards generalisation to real data, and how to use them. Additionally, we analyse how real images can be leveraged when training on synthetic images. All these experiments are validated on real data and benchmarked to models trained on real data. The results offer a number of interesting takeaways that can serve as basic guidelines for using synthetic data for object detection. Code to reproduce results is available at https://github.com/EDM-Research/DIMO_ObjectDetection.
translated by 谷歌翻译
Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Leveraging just a simple RGB camera, training with Phone2Proc shows massive improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav performance across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter.
translated by 谷歌翻译
深度学习的关键批评之一是,需要大量昂贵且难以获得的训练数据,以便培训具有高性能和良好的概率功能的模型。专注于通过场景坐标回归(SCR)的单眼摄像机姿势估计的任务,我们描述了一种新的方法,用于相机姿势估计(舞蹈)网络的域改编,这使得培训模型无需访问目标任务上的任何标签。舞蹈需要未标记的图像(没有已知的姿势,订购或场景坐标标签)和空间的3D表示(例如,扫描点云),这两者都可以使用现成的商品硬件最少的努力来捕获。舞蹈渲染从3D模型标记的合成图像,通过应用无监督的图像级域适应技术(未配对图像到图像转换)来桥接合成和实图像之间的不可避免的域间隙。在实际图像上进行测试时,舞蹈培训的SCR模型在成本的一小部分中对其完全监督的对应物(在两种情况下使用PNP-RANSAC进行最终姿势估算的情况下)进行了相当的性能。我们的代码和数据集可以在https://github.com/jacklangerman/dance获得
translated by 谷歌翻译