While reinforcement learning (RL) has become a more popular approach for robotics, designing sufficiently informative reward functions for complex tasks has proven to be extremely difficult due their inability to capture human intent and policy exploitation. Preference based RL algorithms seek to overcome these challenges by directly learning reward functions from human feedback. Unfortunately, prior work either requires an unreasonable number of queries implausible for any human to answer or overly restricts the class of reward functions to guarantee the elicitation of the most informative queries, resulting in models that are insufficiently expressive for realistic robotics tasks. Contrary to most works that focus on query selection to \emph{minimize} the amount of data required for learning reward functions, we take an opposite approach: \emph{expanding} the pool of available data by viewing human-in-the-loop RL through the more flexible lens of multi-task learning. Motivated by the success of meta-learning, we pre-train preference models on prior task data and quickly adapt them for new tasks using only a handful of queries. Empirically, we reduce the amount of online feedback needed to train manipulation policies in Meta-World by 20$\times$, and demonstrate the effectiveness of our method on a real Franka Panda Robot. Moreover, this reduction in query-complexity allows us to train robot policies from actual human users. Videos of our results and code can be found at https://sites.google.com/view/few-shot-preference-rl/home.
translated by 谷歌翻译
Ensemble learning serves as a straightforward way to improve the performance of almost any machine learning algorithm. Existing deep ensemble methods usually naively train many different models and then aggregate their predictions. This is not optimal in our view from two aspects: i) Naively training multiple models adds much more computational burden, especially in the deep learning era; ii) Purely optimizing each base model without considering their interactions limits the diversity of ensemble and performance gains. We tackle these issues by proposing deep negative correlation classification (DNCC), in which the accuracy and diversity trade-off is systematically controlled by decomposing the loss function seamlessly into individual accuracy and the correlation between individual models and the ensemble. DNCC yields a deep classification ensemble where the individual estimator is both accurate and negatively correlated. Thanks to the optimized diversities, DNCC works well even when utilizing a shared network backbone, which significantly improves its efficiency when compared with most existing ensemble systems. Extensive experiments on multiple benchmark datasets and network structures demonstrate the superiority of the proposed method.
translated by 谷歌翻译
Many practical applications, such as recommender systems and learning to rank, involve solving multiple similar tasks. One example is learning of recommendation policies for users with similar movie preferences, where the users may still rank the individual movies slightly differently. Such tasks can be organized in a hierarchy, where similar tasks are related through a shared structure. In this work, we formulate this problem as a contextual off-policy optimization in a hierarchical graphical model from logged bandit feedback. To solve the problem, we propose a hierarchical off-policy optimization algorithm (HierOPO), which estimates the parameters of the hierarchical model and then acts pessimistically with respect to them. We instantiate HierOPO in linear Gaussian models, for which we also provide an efficient implementation and analysis. We prove per-task bounds on the suboptimality of the learned policies, which show a clear improvement over not using the hierarchical model. We also evaluate the policies empirically. Our theoretical and empirical results show a clear advantage of using the hierarchy over solving each task independently.
translated by 谷歌翻译
Inferring reward functions from human behavior is at the center of value alignment - aligning AI objectives with what we, humans, actually want. But doing so relies on models of how humans behave given their objectives. After decades of research in cognitive science, neuroscience, and behavioral economics, obtaining accurate human models remains an open research topic. This begs the question: how accurate do these models need to be in order for the reward inference to be accurate? On the one hand, if small errors in the model can lead to catastrophic error in inference, the entire framework of reward learning seems ill-fated, as we will never have perfect models of human behavior. On the other hand, if as our models improve, we can have a guarantee that reward accuracy also improves, this would show the benefit of more work on the modeling side. We study this question both theoretically and empirically. We do show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we are also able to identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data.
translated by 谷歌翻译
Offline reinforcement learning (RL) promises the ability to learn effective policies solely using existing, static datasets, without any costly online interaction. To do so, offline RL methods must handle distributional shift between the dataset and the learned policy. The most common approach is to learn conservative, or lower-bound, value functions, which underestimate the return of out-of-distribution (OOD) actions. However, such methods exhibit one notable drawback: policies optimized on such value functions can only behave according to a fixed, possibly suboptimal, degree of conservatism. However, this can be alleviated if we instead are able to learn policies for varying degrees of conservatism at training time and devise a method to dynamically choose one of them during evaluation. To do so, in this work, we propose learning value functions that additionally condition on the degree of conservatism, which we dub confidence-conditioned value functions. We derive a new form of a Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability. By conditioning on confidence, our value functions enable adaptive strategies during online evaluation by controlling for confidence level using the history of observations thus far. This approach can be implemented in practice by conditioning the Q-function from existing conservative algorithms on the confidence. We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence. Finally, we empirically show that our algorithm outperforms existing conservative offline RL algorithms on multiple discrete control domains.
translated by 谷歌翻译
Tourette Syndrome (TS) is a behavior disorder that onsets in childhood and is characterized by the expression of involuntary movements and sounds commonly referred to as tics. Behavioral therapy is the first-line treatment for patients with TS, and it helps patients raise awareness about tic occurrence as well as develop tic inhibition strategies. However, the limited availability of therapists and the difficulties for in-home follow up work limits its effectiveness. An automatic tic detection system that is easy to deploy could alleviate the difficulties of home-therapy by providing feedback to the patients while exercising tic awareness. In this work, we propose a novel architecture (T-Net) for automatic tic detection and classification from untrimmed videos. T-Net combines temporal detection and segmentation and operates on features that are interpretable to a clinician. We compare T-Net to several state-of-the-art systems working on deep features extracted from the raw videos and T-Net achieves comparable performance in terms of average precision while relying on interpretable features needed in clinical practice.
translated by 谷歌翻译
机器人的感知目前处于在有效的潜在空间中运行的现代方法与数学建立的经典方法之间的跨道路,并提供了可解释的,可信赖的结果。在本文中,我们引入了卷积的贝叶斯内核推理(Convbki)层,该层在可分离的卷积层中明确执行贝叶斯推断,以同时提高效率,同时保持可靠性。我们将层应用于3D语义映射的任务,在该任务中,我们可以实时学习激光雷达传感器信息的语义几何概率分布。我们根据KITTI数据集的最新语义映射算法评估我们的网络,并通过类似的语义结果证明了延迟的提高。
translated by 谷歌翻译
扩散模型是图像产生和似然估计的最新方法。在这项工作中,我们将连续的时间扩散模型推广到任意的Riemannian流形,并得出了可能性估计的变异框架。在计算上,我们提出了计算可能性估计中需要的黎曼分歧的新方法。此外,在概括欧几里得案例时,我们证明,最大化该变异的下限等效于Riemannian得分匹配。从经验上讲,我们证明了Riemannian扩散模型在各种光滑的歧管上的表达能力,例如球体,Tori,双曲线和正交组。我们提出的方法在所有基准测试基准上实现了新的最先进的可能性。
translated by 谷歌翻译
th骨海星(COTS)爆发是大屏障礁(GBR)珊瑚损失的主要原因,并且正在进行实质性的监视和控制计划,以将COTS人群管理至生态可持续的水平。在本文中,我们在边缘设备上介绍了基于水下的水下数据收集和策展系统,以进行COTS监视。特别是,我们利用了基于深度学习的对象检测技术的功能,并提出了一种资源有效的COTS检测器,该检测器在边缘设备上执行检测推断,以帮助海上专家在数据收集阶段进行COTS识别。初步结果表明,可以将改善计算效率的几种策略(例如,批处理处理,帧跳过,模型输入大小)组合在一起,以在Edge硬件上运行拟议的检测模型,资源消耗较低,信息损失较低。
translated by 谷歌翻译
连续归一化流(CNF)是一类生成模型,可以通过求解普通的微分方程(ODE)将先验分布转换为模型分布。我们建议通过最大程度地减少概率路径差异(PPD)来训练CNF,这是CNF产生的概率密度路径与目标概率密度路径之间的新型差异家族。 PPD是使用对数质量保护公式制定的,该公式是线性的一阶部分微分方程,将对数目标概率和CNF的定义向量场进行配方。 PPD比现有方法具有多个关键好处:它避免了在迭代中解决颂歌的需求,很容易应用于歧管数据,比例到高维度,并与大型目标路径兼容,该目标路径在有限的时间内插值纯噪声和数据。从理论上讲,PPD显示为结合经典概率差异。从经验上讲,我们表明,通过最小化PPD实现最新的CNF在现有的低维歧管基准上获得了最新的可能性和样品质量,并且是生成模型以扩展到中度高维歧管的第一个示例。
translated by 谷歌翻译