Humans form mental images of 3D scenes to support counterfactual imagination, planning, and motor control. Our ability to predict the appearance and affordances of a scene from previously unobserved viewpoints helps us perform manipulation tasks (e.g., 6-DoF kitting) with a level of ease that is currently out of reach for existing robot learning frameworks. In this work, we aim to build artificial systems that can analogously plan actions on top of imagined images. To this end, we introduce Mental Imagery for Robotic Affordances (MIRA), an action reasoning framework that optimizes actions with novel-view synthesis and affordance prediction in the loop. Given a set of 2D RGB images, MIRA builds a consistent 3D scene representation, through which we synthesize novel orthographic views amenable to pixel-wise affordance prediction for action optimization. We show how this optimization process enables us to generalize to unseen out-of-plane rotations in 6-DoF robotic manipulation tasks given a limited number of demonstrations, paving the way toward machines that autonomously learn to understand the world around them for planning actions.
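As an illustration of the action-optimization loop described above, here is a minimal PyTorch sketch: candidate camera poses are rendered into orthographic views from the learned scene representation, each view is scored pixel-wise by an affordance network, and the best (view, pixel) pair defines the action. `render_orthographic` and `affordance_net` are hypothetical placeholders, not MIRA's actual interfaces.

```python
# Minimal sketch of MIRA-style action optimization: render candidate
# orthographic views from a learned scene representation, score each pixel
# with an affordance network, and pick the best (view, pixel) pair.
import torch

def optimize_action(scene_repr, candidate_poses, render_orthographic, affordance_net):
    """Return the camera pose and pixel with the highest predicted affordance."""
    best = (None, None, -float("inf"))
    for pose in candidate_poses:                       # SE(3) poses covering out-of-plane rotations
        rgb_d = render_orthographic(scene_repr, pose)  # (1, 4, H, W) orthographic RGB-D view
        scores = affordance_net(rgb_d)                 # (1, H, W) per-pixel affordance logits
        score, flat_idx = scores.flatten().max(dim=0)
        if score.item() > best[2]:
            h, w = divmod(flat_idx.item(), scores.shape[-1])
            best = (pose, (h, w), score.item())
    return best  # pose + pixel jointly define a 6-DoF pick/place action
```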
In this paper, we examine the problem of visibility-aware robot navigation among movable obstacles (VANAMO). A variant of the well-known NAMO robotic planning problem, VANAMO puts additional visibility constraints on robot motion and object movability. This new problem formulation lifts the restrictive assumption that the map is fully visible and the object positions are fully known. We provide a formal definition of the VANAMO problem and propose the Look and Manipulate Backchaining (LaMB) algorithm for solving such problems. LaMB has a simple vision-based API that makes it more easily transferable to real-world robot applications and scales to large 3D environments. To evaluate LaMB, we construct a set of tasks that illustrate the complex interplay between visibility and object movability that can arise in mobile-base manipulation problems in unknown environments. We show that LaMB outperforms NAMO and visibility-aware motion planning approaches, as well as simple combinations of them, on complex manipulation problems with partial observability.
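Since the abstract only describes LaMB at a high level, the following is a speculative sketch of what look-and-manipulate backchaining could look like: visibility subgoals are pursued before motion, and blocking objects spawn relocation subgoals. The `domain` object and all of its methods are hypothetical stand-ins for the paper's vision-based API, not the authors' actual interface.

```python
# Schematic sketch of a look-and-manipulate backchaining loop in the spirit of LaMB.

def achieve(goal, domain, depth=0, max_depth=10):
    """Recursively satisfy `goal`, backchaining on visibility and blockage subgoals."""
    if depth > max_depth or domain.satisfied(goal):
        return domain.satisfied(goal)
    if not domain.visible(goal):                        # visibility constraint on the goal region
        achieve(domain.look_subgoal(goal), domain, depth + 1, max_depth)
    path = domain.plan_path(goal)
    if path is None:                                    # motion blocked by a movable obstacle
        obstacle = domain.blocking_object(goal)
        achieve(domain.clear_subgoal(obstacle), domain, depth + 1, max_depth)
        path = domain.plan_path(goal)
    return domain.execute(path) if path is not None else False
```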
Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.
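One common way to realize a return-conditional diffusion policy is to sample with classifier-free guidance toward a high target return; whether this matches the paper's exact recipe is an assumption. Below is a minimal PyTorch sketch with an illustrative noise schedule; `denoiser(x, t, cond)` is a hypothetical noise-prediction network.

```python
# Minimal sketch of return-conditioned trajectory sampling with
# classifier-free guidance, circumventing dynamic programming.
import torch

@torch.no_grad()
def sample_plan(denoiser, horizon, dim, target_return, steps=100, guidance=1.2):
    betas = torch.linspace(1e-4, 0.02, steps)               # toy noise schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(1, horizon, dim)                         # start from Gaussian noise
    cond = torch.tensor([[target_return]])                   # condition on a high return
    for t in reversed(range(steps)):
        tt = torch.full((1,), t, dtype=torch.long)
        eps_u = denoiser(x, tt, None)                         # unconditional prediction
        eps_c = denoiser(x, tt, cond)                         # return-conditioned prediction
        eps = eps_u + guidance * (eps_c - eps_u)              # classifier-free guidance
        x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)     # DDPM posterior noise
    return x                                                  # sampled (state, action) trajectory
```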
Can continuous diffusion models bring the same performance breakthrough to natural language that they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion, a continuous diffusion mechanism that operates on token embeddings and allows us to learn flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable to those produced by standard autoregressive language models, while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.
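A minimal sketch of one training step for embedding-space diffusion with self-conditioning, under the assumption that the model regresses clean embeddings and is fed its own first-pass estimate half the time; `denoiser(x_t, t, x_self)` and the cosine schedule are illustrative, not the paper's implementation.

```python
# Minimal sketch of a self-conditioned embedding-diffusion training step:
# tokens are embedded, noised, and a denoiser predicts the clean embeddings,
# optionally conditioned on its own previous estimate.
import torch
import torch.nn.functional as F

def sed_loss(denoiser, embed, tokens, steps=1000):
    x0 = embed(tokens)                                    # (B, L, D) continuous token embeddings
    t = torch.randint(0, steps, (tokens.shape[0],))
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / steps) ** 2   # toy cosine schedule
    a = alpha_bar.view(-1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise          # forward diffusion in embedding space
    # Self-conditioning: half the time, feed the model its own first-pass estimate.
    x_self = torch.zeros_like(x0)
    if torch.rand(()) < 0.5:
        with torch.no_grad():
            x_self = denoiser(x_t, t, x_self)
    x0_hat = denoiser(x_t, t, x_self)
    return F.mse_loss(x0_hat, x0)                         # regress the clean embeddings
```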
The ability to reason about changes in the environment is crucial for robots operating over extended periods of time. Agents are expected to capture changes during operation so that actions can be taken to ensure the smooth progression of the working session. However, varying viewpoints and accumulated localization errors make it easy for robots to falsely detect changes in the surrounding world due to low observation overlap and drifting object associations. In this paper, based on the recently proposed category-level Neural Descriptor Fields (NDFs), we develop an object-level online change detection approach that is robust to partially overlapping observations and noisy localization results. Leveraging the shape completion capability and SE(3)-equivariance of NDFs, we represent objects with compact shape codes that encode full object shapes from partial observations. The objects are then organized in a spatial tree structure based on object centers recovered from NDFs, enabling fast queries of object neighborhoods. By associating objects via shape code similarity and comparing local object-neighbor spatial layouts, our approach is robust to low observation overlap and localization noise. We conduct experiments on synthetic and real-world sequences and achieve improved change detection results compared with multiple baseline methods. Project webpage: https://yilundu.github.io/ndf_change
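A minimal sketch of the association step described above: objects are summarized by NDF shape codes, indexed by their recovered centers in a spatial tree, and flagged as changed when no sufficiently similar code is found nearby. `encode_shape` is a hypothetical stand-in for the NDF encoder, and the local-layout comparison from the abstract is omitted for brevity.

```python
# Minimal sketch of shape-code-based object association for change detection.
import numpy as np
from scipy.spatial import cKDTree

def detect_changes(prev_objs, curr_objs, encode_shape, radius=0.5, sim_thresh=0.8):
    """prev_objs / curr_objs: lists of (center (3,), partial point cloud)."""
    prev_centers = np.array([c for c, _ in prev_objs])
    prev_codes = np.array([encode_shape(pts) for _, pts in prev_objs])
    tree = cKDTree(prev_centers)                          # spatial tree for neighborhood queries
    changed = []
    for center, pts in curr_objs:
        code = encode_shape(pts)
        idxs = tree.query_ball_point(center, r=radius)    # candidate matches near this center
        sims = [
            float(code @ prev_codes[i] /
                  (np.linalg.norm(code) * np.linalg.norm(prev_codes[i]) + 1e-8))
            for i in idxs
        ]
        if not sims or max(sims) < sim_thresh:            # no similar object nearby -> change
            changed.append(center)
    return changed
```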
Human perception reliably identifies the movable and immovable parts of 3D scenes, and completes the 3D structure of objects and background from incomplete observations. We learn this skill not from labeled examples, but simply by observing objects move. In this work, we propose an approach that observes unlabeled multi-view videos at training time and learns to map a single image observation of a complex scene, such as a street with cars, to a 3D neural scene representation that decomposes the scene into movable and immovable parts while plausibly completing its 3D structure. We parameterize the movable and immovable scene parts separately with 2D neural ground plans. These ground plans are 2D grids aligned with the ground plane that can be locally decoded into 3D neural radiance fields. Our model is trained self-supervised via neural rendering. We demonstrate that, using simple heuristics, the structure of our representation enables a variety of downstream tasks in street-scale 3D scenes, such as extraction of object-centric 3D representations, novel view synthesis, instance segmentation, and 3D bounding box prediction, highlighting its value as a backbone for data-efficient 3D scene understanding. The disentanglement further enables scene editing via object manipulation, such as deletion, insertion, and rigid-body motion.
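The ground-plan idea can be sketched as a 2D feature grid that a 3D query point samples at its (x, y) location before an MLP decodes the feature and height into density and color. Grid resolution, extents, and network sizes below are illustrative assumptions, not the paper's values.

```python
# Minimal sketch of decoding a 2D neural ground plan into a radiance field.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundPlanField(nn.Module):
    def __init__(self, feat_dim=32, res=128, extent=50.0):
        super().__init__()
        self.plan = nn.Parameter(torch.zeros(1, feat_dim, res, res))  # ground-aligned feature grid
        self.extent = extent
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 4),                                        # (density, r, g, b)
        )

    def forward(self, xyz):                                           # xyz: (N, 3) world points
        uv = (xyz[:, :2] / self.extent).clamp(-1, 1)                  # normalize (x, y) to grid coords
        feat = F.grid_sample(self.plan, uv.view(1, -1, 1, 2), align_corners=True)
        feat = feat.view(self.plan.shape[1], -1).t()                  # (N, feat_dim)
        out = self.mlp(torch.cat([feat, xyz[:, 2:3]], dim=-1))        # condition on height z
        density, rgb = F.softplus(out[:, :1]), torch.sigmoid(out[:, 1:])
        return density, rgb
```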
In this paper, we address the challenging problem of 3D concept grounding (i.e., segmenting and learning visual concepts) by looking at RGB-D images and reasoning over paired questions and answers. Existing visual reasoning approaches typically utilize supervision to extract 2D segmentation masks on which concepts are grounded. In contrast, humans are capable of grounding concepts on the underlying 3D representation of images. However, traditionally inferred 3D representations (e.g., point clouds, voxel grids, and meshes) cannot flexibly capture continuous 3D features, making it challenging to ground concepts to 3D regions based on the language description of the object being referred to. To address both issues, we propose to leverage the continuous, differentiable nature of neural fields to segment and learn concepts. Specifically, each 3D coordinate in a scene is represented as a high-dimensional descriptor. Concept grounding can then be performed by computing the similarity between the descriptor vector of a 3D coordinate and the vector embedding of a language concept, which enables segmentation and concept learning to be jointly learned on neural fields in a differentiable fashion. As a result, both 3D semantic and instance segmentations can emerge directly from question-answering supervision, using a set of defined neural operators on top of the neural field (e.g., filtering and counting). Experimental results show that our proposed framework outperforms unsupervised and language-mediated segmentation models on semantic and instance segmentation tasks, and outperforms existing models on the challenging 3D-aware visual reasoning tasks. Furthermore, our framework generalizes well to unseen shape categories and real scans.
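A minimal sketch of grounding a concept in a descriptor field and of two neural operators acting on the resulting soft masks; cosine similarity, the sigmoid temperature, and the overlap threshold are illustrative choices rather than the paper's exact formulation, and `descriptor_field` is a hypothetical placeholder.

```python
# Minimal sketch of concept grounding on a neural descriptor field,
# plus filter and count operators over soft 3D masks.
import torch
import torch.nn.functional as F

def ground_concept(descriptor_field, coords, concept_embedding, temperature=0.1):
    """Soft mask over 3D coordinates indicating where a concept holds."""
    desc = descriptor_field(coords)                                   # (N, D) per-point descriptors
    sim = F.cosine_similarity(desc, concept_embedding.expand_as(desc), dim=-1)
    return torch.sigmoid(sim / temperature)                           # (N,) soft segmentation mask

def op_filter(mask_a, mask_b):
    return mask_a * mask_b                                            # points satisfying both concepts

def op_count(instance_masks, concept_mask, overlap_thresh=0.5):
    """Count object instances whose region mostly satisfies the concept."""
    overlaps = torch.stack([(m * concept_mask).sum() / (m.sum() + 1e-8) for m in instance_masks])
    return int((overlaps > overlap_thresh).sum())
```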
Deep learning excels at complex pattern recognition tasks such as image classification and object recognition. However, it struggles with tasks that require non-trivial reasoning, such as algorithmic computation. Humans are able to solve such tasks through iterative reasoning, spending more time thinking about harder problems. Most existing neural networks, however, exhibit a fixed computational budget determined by the network architecture, preventing additional computation on harder tasks. In this work, we present a new framework for iterative reasoning with neural networks. We train a neural network to parameterize an energy landscape over all outputs, and implement each step of iterative reasoning as an energy minimization step toward a minimum-energy solution. By formulating reasoning as an energy minimization problem, we can adjust the underlying computational budget for harder problems, which induce more complex energy landscapes, by running a more elaborate optimization procedure. We empirically show that our iterative reasoning approach solves algorithmic reasoning tasks more accurately and with better generalization in both graph and continuous domains. Finally, we show that our approach can recursively solve algorithmic problems that require nested reasoning.
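The iterative-reasoning idea reduces to gradient descent on a learned energy over candidate outputs, with the step count acting as the adjustable computation budget. A minimal sketch, assuming `energy_net(x, y)` is a hypothetical scalar-valued network:

```python
# Minimal sketch of iterative reasoning as energy minimization:
# inference runs gradient descent on the output y of a learned energy.
import torch

def infer(energy_net, x, y_dim, steps=20, step_size=0.1):
    """Find a low-energy output for input x by gradient descent on y."""
    y = torch.randn(x.shape[0], y_dim, requires_grad=True)           # initial guess
    for _ in range(steps):                                           # more steps -> more "thinking"
        energy = energy_net(x, y).sum()
        grad, = torch.autograd.grad(energy, y)
        y = (y - step_size * grad).detach().requires_grad_(True)     # one energy-minimization step
    return y.detach()
```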
Large text-guided diffusion models, such as DALLE-2, are able to generate stunning photorealistic images from natural language descriptions. While such models are highly flexible, they struggle to understand the composition of certain concepts, for instance confusing the attributes of different objects or the relations between objects. In this paper, we propose an alternative, structured approach for compositional generation with diffusion models. An image is generated by composing a set of diffusion models, each of which models a certain component of the image. To do this, we interpret diffusion models as energy-based models, in which the data distributions defined by the energy functions can be explicitly combined. The proposed method can generate scenes at test time that are substantially more complex than those seen during training, composing sentence descriptions, object relations, human facial attributes, and even generalizing to new combinations that are rarely seen in the real world. We further illustrate how our approach can be used to compose pre-trained text-guided diffusion models and generate photorealistic images containing all the details described in the input descriptions, including the binding of certain object attributes that DALLE-2 has been shown to struggle with. These results point to the effectiveness of the proposed method in promoting structured generalization for visual generation. Project page: https://energy-based-model.github.io/compositional-visual-generation-with-composable-diffusion-models/
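The composition operator itself can be sketched compactly: for a conjunction of concepts, each concept's deviation from the unconditional noise prediction is added back in at every denoising step. `denoiser(x, t, c)` is a hypothetical noise-prediction network; the combined prediction would then be plugged into a standard sampling loop such as the one sketched for the return-conditioned policy above.

```python
# Minimal sketch of composing diffusion models as energy-based models:
# a conjunction of concepts combines per-concept noise predictions
# relative to an unconditional one.
import torch

def composed_eps(denoiser, x, t, concepts, weight=1.0):
    """Noise prediction for the conjunction of several concept conditions."""
    eps_uncond = denoiser(x, t, None)
    deltas = [denoiser(x, t, c) - eps_uncond for c in concepts]       # per-concept deviations
    return eps_uncond + weight * torch.stack(deltas).sum(dim=0)       # AND over concepts
```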
Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.
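A minimal sketch of the two mechanisms the abstract names, reinterpreted as planning: conditioning by inpainting (clamping the current and goal states at every step) and classifier-guided sampling (nudging the trajectory along a value gradient). `denoiser` and `value_fn` are hypothetical networks and the noise schedule is illustrative.

```python
# Minimal sketch of planning by iteratively denoising trajectories with
# state inpainting and value-gradient (classifier) guidance.
import torch

def plan(denoiser, value_fn, start, goal, horizon, steps=100, guide_scale=0.1):
    dim = start.shape[-1]
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas, alpha_bars = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(1, horizon, dim)
    for t in reversed(range(steps)):
        x[:, 0], x[:, -1] = start, goal                           # inpaint current and goal states
        x_in = x.detach().requires_grad_(True)
        grad, = torch.autograd.grad(value_fn(x_in).sum(), x_in)   # value (classifier) gradient
        tt = torch.full((1,), t, dtype=torch.long)
        with torch.no_grad():
            eps = denoiser(x, tt)
            x = (x - betas[t] / (1 - alpha_bars[t]).sqrt() * eps) / alphas[t].sqrt()
            x = x + guide_scale * grad                            # steer toward high predicted value
            if t > 0:
                x = x + betas[t].sqrt() * torch.randn_like(x)
    x[:, 0], x[:, -1] = start, goal
    return x                                                      # planned trajectory
```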