Humans form mental images of 3D scenes to support counterfactual imagination, planning, and motor control. Our ability to predict the appearance and affordances of a scene from previously unobserved viewpoints helps us perform manipulation tasks (e.g., 6-DoF kitting) with a level of ease that is currently out of reach for existing robot learning frameworks. In this work, we aim to build artificial systems that can analogously plan actions on top of imagined images. To this end, we introduce Mental Imagery for Robotic Affordances (MIRA), an action reasoning framework that optimizes actions with novel-view synthesis and affordance prediction in the loop. Given a set of 2D RGB images, MIRA builds a consistent 3D scene representation, through which we synthesize novel orthographic views amenable to pixel-wise affordance prediction for action optimization. We illustrate how this optimization process enables us to generalize to unseen out-of-plane rotations for 6-DoF robotic manipulation tasks given a limited number of demonstrations, paving the way toward machines that autonomously learn to understand the world around them for planning actions.
In this paper, we examine the problem of visibility-aware robot navigation among movable obstacles (VANAMO). A variant of the well-known NAMO robotic planning problem, VANAMO puts additional visibility constraints on robot motion and object movability. This new problem formulation lifts the restrictive assumption that the map is fully visible and the object positions are fully known. We provide a formal definition of the VANAMO problem and propose the Look and Manipulate Backchaining (LaMB) algorithm for solving such problems. LaMB has a simple vision-based API that makes it more easily transferable to real-world robot applications and scales to large 3D environments. To evaluate LaMB, we construct a set of tasks that illustrate the complex interplay between visibility and object movability that can arise in mobile base manipulation problems in unknown environments. We show that LaMB outperforms NAMO and visibility-aware motion planning approaches, as well as simple combinations of them, on complex manipulation problems with partial observability.
Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.
Can continuous diffusion models bring the same performance breakthrough to natural language that they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion, a continuous diffusion mechanism that operates on token embeddings and allows us to learn flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models, while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.
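The ability to reason about changes in the environment is crucial for robots that operate over long periods of time. Agents are expected to capture changes during operation so that they can take actions to ensure the smooth progress of the working session. However, varying viewing angles and accumulated localization errors make it easy for robots to falsely detect changes in the surrounding world, owing to low observation overlap and drifted object associations. In this paper, building on the recently proposed category-level Neural Descriptor Fields (NDFs), we develop an object-level online change detection approach that is robust to partially overlapping observations and noisy localization results. Using the shape completion capability and SE(3)-equivariance of NDFs, we represent objects with compact shape codes that encode full object shapes from partial observations. The objects are then organized in a spatial tree structure, based on object centers recovered from NDFs, for fast queries of object neighborhoods. By associating objects via shape code similarity and comparing local object-neighbor spatial layouts, our proposed approach demonstrates robustness to low observation overlap and localization noise. We conduct experiments on both synthetic and real-world sequences and achieve improved change detection results compared with multiple baseline methods. Project webpage: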
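The association-then-layout idea in the abstract above can be illustrated with a small toy. This is only a sketch under stated assumptions, not the authors' method: the shape codes are hand-picked vectors rather than NDF outputs, a brute-force search stands in for the spatial tree, and the median-of-distance-changes rule, the function names (`associate`, `changed_objects`), and the thresholds are all hypothetical. It relies on the fact that pairwise distances between object centers are unaffected by a rigid drift of the whole scene.

```python
import math
from statistics import median

def cosine(a, b):
    """Cosine similarity between two shape codes."""
    dot = sum(p * q for p, q in zip(a, b))
    return dot / (math.sqrt(sum(p * p for p in a)) * math.sqrt(sum(q * q for q in b)))

def associate(prev, curr, min_sim=0.9):
    """Match each current object to the most similar previous object by shape code.
    Hypothetical greedy rule; brute force replaces the spatial tree."""
    matches = {}
    for cid, cobj in curr.items():
        best = max(prev, key=lambda pid: cosine(prev[pid]["code"], cobj["code"]))
        if cosine(prev[best]["code"], cobj["code"]) >= min_sim:
            matches[cid] = best
    return matches

def changed_objects(prev, curr, tol=0.05):
    """Flag objects whose distances to most matched neighbours changed (assumed rule)."""
    matches = associate(prev, curr)
    changed = set()
    for cid, pid in matches.items():
        deltas = [
            abs(math.dist(curr[cid]["center"], curr[oc]["center"])
                - math.dist(prev[pid]["center"], prev[op]["center"]))
            for oc, op in matches.items() if oc != cid
        ]
        if deltas and median(deltas) > tol:
            changed.add(cid)
    return changed

# Toy data: four objects with made-up shape codes.
prev = {
    "bowl": {"center": (0.0, 0.0, 0.0), "code": (1.0, 0.0, 0.0)},
    "mug":  {"center": (1.0, 0.0, 0.0), "code": (0.0, 1.0, 0.0)},
    "box":  {"center": (0.0, 1.0, 0.0), "code": (0.0, 0.0, 1.0)},
    "can":  {"center": (1.0, 1.0, 0.0), "code": (1.0, 1.0, 0.0)},
}
drift = (0.3, -0.2, 0.0)  # simulated localization drift applied to every center
moved = {"bowl": (0.0, 0.0, 0.0), "mug": (1.0, 0.0, 0.0),
         "box": (0.0, 1.0, 0.0), "can": (3.0, 1.0, 0.0)}  # only the can moved
curr = {
    "obj_" + name: {
        "center": tuple(c + d for c, d in zip(moved[name], drift)),
        "code": prev[name]["code"],
    }
    for name in prev
}
result = changed_objects(prev, curr)
```

In this toy, the uniform drift leaves the three static objects unflagged because most of their neighbour distances are unchanged, while the displaced object differs from every neighbour.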
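Deep learning has excelled at complex pattern recognition tasks such as image classification and object recognition. However, it struggles with tasks that require nontrivial reasoning, such as algorithmic computation. Humans are able to solve such tasks through iterative reasoning, spending more time thinking about harder tasks. Most existing neural networks, however, exhibit a fixed computational budget controlled by the network architecture, which prevents additional computational processing on harder tasks. In this work, we present a new framework for iterative reasoning with neural networks. We train a neural network to parameterize an energy landscape over all outputs, and implement each step of the iterative reasoning as an energy minimization step to find a minimal-energy solution. By formulating reasoning as an energy minimization problem, for harder problems that lead to more complex energy landscapes we can adjust our underlying computational budget by running a more complex optimization procedure. We empirically illustrate that our iterative reasoning approach can solve algorithmic reasoning tasks in both graph and continuous domains more accurately and with better generalization. Finally, we illustrate that our approach can recursively solve algorithmic problems that require nested reasoning.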
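The idea of treating each reasoning step as an energy minimization step can be sketched with a toy example. The energy below is hand-written rather than learned (an assumption made purely for illustration), and the names `energy`, `grad_energy`, and `iterative_reasoning` are hypothetical. The task is to output the sum of the inputs; the energy is zero exactly at the correct answer, and the number of gradient steps plays the role of the computational budget.

```python
def energy(y, x):
    """Hand-written energy (assumption): zero exactly when y equals sum(x)."""
    return (y - sum(x)) ** 2

def grad_energy(y, x):
    """Derivative of the energy with respect to the candidate answer y."""
    return 2.0 * (y - sum(x))

def iterative_reasoning(x, steps, step_size=0.1, y0=0.0):
    """Run `steps` gradient-descent steps on the energy, starting from y0."""
    y = y0
    for _ in range(steps):
        y -= step_size * grad_energy(y, x)
    return y

x = [1.0, 2.0, 3.0]
few = iterative_reasoning(x, steps=5)     # small computational budget
many = iterative_reasoning(x, steps=100)  # larger computational budget
```

With this step size the error shrinks by a constant factor per step, so the larger budget ends at a lower energy than the smaller one. A learned energy would replace `energy` and `grad_energy`, with the gradient typically obtained by automatic differentiation.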
Model-based reinforcement learning methods often use learning only for the purpose of estimating an approximate dynamics model, offloading the rest of the decision-making work to classical trajectory optimizers. While conceptually simple, this combination has a number of empirical shortcomings, suggesting that learned models may not be well-suited to standard trajectory optimization. In this paper, we consider what it would look like to fold as much of the trajectory optimization pipeline as possible into the modeling problem, such that sampling from the model and planning with it become nearly identical. The core of our technical approach lies in a diffusion probabilistic model that plans by iteratively denoising trajectories. We show how classifier-guided sampling and image inpainting can be reinterpreted as coherent planning strategies, explore the unusual and useful properties of diffusion-based planning methods, and demonstrate the effectiveness of our framework in control settings that emphasize long-horizon decision-making and test-time flexibility.
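Planning by iterative refinement with inpainting-style conditioning, as described in the abstract above, can be sketched in a few lines. This is a toy under loud assumptions, not the paper's model: the "denoiser" is a hand-written smoothing step rather than a learned diffusion network, there is no noise schedule, the state is one-dimensional, and the names `denoise_step` and `plan` are hypothetical. What it does show is the conditioning mechanism: the known start and goal states are clamped at every refinement step, just as known pixels are clamped in image inpainting.

```python
import random

def denoise_step(traj):
    """Stand-in denoiser (assumption): pull each interior state toward its neighbours.
    The first and last states are left untouched."""
    out = list(traj)
    for i in range(1, len(traj) - 1):
        out[i] = 0.5 * traj[i] + 0.25 * (traj[i - 1] + traj[i + 1])
    return out

def plan(start, goal, horizon, steps, seed=0):
    """Refine a random trajectory while clamping the known start and goal states."""
    rng = random.Random(seed)
    traj = [rng.gauss(0.0, 1.0) for _ in range(horizon)]  # begin from pure noise
    for _ in range(steps):
        traj[0], traj[-1] = start, goal  # inpainting-style conditioning
        traj = denoise_step(traj)
    traj[0], traj[-1] = start, goal
    return traj

trajectory = plan(start=0.0, goal=7.0, horizon=8, steps=500)
```

Because the stand-in denoiser only smooths, the result settles on a straight line between the clamped endpoints. A learned denoiser would instead pull the trajectory toward behaviour seen in the training data while the same clamping keeps it consistent with the start and goal.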
