Meta-gradients provide a general approach for optimizing the meta-parameters of reinforcement learning (RL) algorithms. Estimation of meta-gradients is central to the performance of these meta-algorithms, and has been studied in the setting of MAML-style short-horizon meta-RL problems. In this context, prior work has investigated the estimation of the Hessian of the RL objective, as well as tackling the problem of credit assignment to pre-adaptation behavior by making a sampling correction. However, we show that Hessian estimation, implemented for example by DiCE and its variants, always adds bias and can also add variance to the meta-gradient estimate. Meanwhile, meta-gradient estimation has been studied less in the important long-horizon setting, where backpropagation through the full inner optimization trajectory is not feasible. We study the bias and variance tradeoffs arising from truncated backpropagation and sampling correction, and additionally compare to evolution strategies, a recently popular alternative for the long-horizon setting. While prior work implicitly chooses points in this bias-variance space, we disentangle the sources of bias and variance and present an empirical study that relates existing estimators to one another.
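The bias introduced by truncated backpropagation can be made concrete on a toy problem. The sketch below is an illustrative assumption, not the paper's setup: the inner loss is a simple quadratic, the meta-parameter is the inner learning rate, and we backpropagate through only the last K of T inner SGD steps. On this problem the truncated estimate recovers exactly K/T of the full meta-gradient, i.e. truncation introduces a pure bias.

```python
def truncated_meta_grad(eta, theta0, T, K):
    """Meta-gradient dJ/d(eta) of J = L(theta_T) with L(x) = 0.5 * x**2,
    under inner SGD theta_{t+1} = theta_t - eta * grad L(theta_t),
    backpropagating through only the last K of T inner steps."""
    # Forward pass: record the inner optimization trajectory.
    thetas = [theta0]
    for _ in range(T):
        thetas.append((1.0 - eta) * thetas[-1])
    # Reverse pass: accumulate d(theta_{t+1})/d(eta) = -theta_t through the
    # last K steps, treating theta_{T-K} as a constant (the truncation).
    g = thetas[T]  # dJ/d(theta_T) = theta_T
    meta_grad = 0.0
    for t in range(T - 1, T - 1 - K, -1):
        meta_grad += g * (-thetas[t])  # path through eta at step t
        g *= (1.0 - eta)               # path back to theta_t
    return meta_grad

full = truncated_meta_grad(0.1, 2.0, T=50, K=50)
trunc = truncated_meta_grad(0.1, 2.0, T=50, K=10)
# On this quadratic, truncating to K steps scales the meta-gradient by K/T:
# trunc == (10 / 50) * full, a deterministic bias with no added variance.
```

Real long-horizon meta-RL objectives are of course not quadratic, so the bias is not a clean scalar factor there, but the mechanism (dropped dependence of early inner steps on the meta-parameter) is the same.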
Using a model of the environment and a value function, an agent can construct many estimates of a state's value by unrolling the model for different lengths and bootstrapping with its value function. Our key insight is that one can treat this set of value estimates as a type of ensemble, which we call an \emph{implicit value ensemble} (IVE). Consequently, the discrepancy between these estimates can be used as a proxy for the agent's epistemic uncertainty; we term this signal \emph{model-value inconsistency}, or \emph{self-inconsistency} for short. Unlike prior work, which estimates uncertainty by training an ensemble of many models and/or value functions, this approach requires only the single model and value function that are already being learned in most model-based reinforcement learning algorithms. We provide empirical evidence in both tabular and function approximation settings from pixels that self-inconsistency is useful (i) as a signal for exploration, (ii) for acting safely under distribution shift, and (iii) for robustifying value-based planning with a model.
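A minimal sketch of the idea on a deterministic toy chain (the model, rewards, and value table below are illustrative assumptions): roll the model forward for k = 0..K steps, bootstrap each rollout with the value function, and use the spread of the resulting k-step estimates as the self-inconsistency signal.

```python
def k_step_estimates(s, model, reward, value, gamma, K):
    """Implicit value ensemble: v_k(s) = sum_{t<k} gamma^t r_t + gamma^k V(s_k)
    for k = 0..K, from a deterministic model under a fixed policy."""
    estimates, ret, discount = [], 0.0, 1.0
    for _ in range(K + 1):
        estimates.append(ret + discount * value[s])
        ret += discount * reward[s]
        discount *= gamma
        s = model[s]  # deterministic next state under the model
    return estimates

def self_inconsistency(estimates):
    """Standard deviation of the implicit value ensemble."""
    mean = sum(estimates) / len(estimates)
    return (sum((e - mean) ** 2 for e in estimates) / len(estimates)) ** 0.5

# 3-state cycle with reward 1 everywhere; the value table is deliberately
# wrong at state 2, so rollouts passing through it disagree and the
# inconsistency signal is high.
model = {0: 1, 1: 2, 2: 0}
reward = {0: 1.0, 1: 1.0, 2: 1.0}
value = {0: 10.0, 1: 10.0, 2: 0.0}
ive = k_step_estimates(0, model, reward, value, gamma=0.9, K=3)
print(self_inconsistency(ive))  # large: the estimates disagree
```

If the value table were correct everywhere (V = 10 at each state of this cycle), all k-step estimates would coincide and the signal would be zero.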
One of the main challenges in model-based reinforcement learning (RL) is deciding which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increases, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that increase in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE, or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit, when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero can be understood as minimizing an upper bound on this loss. We leverage this connection to propose a modification to MuZero and show that it can improve performance in practice.
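The order-$k$ VE condition can be checked directly on a small tabular MDP. The sketch below is a simplification of the paper's formalism with made-up numbers: it tests whether $k$ applications of a policy's Bellman operator under the model match $k$ applications under the environment, for every policy/function pair in the given sets.

```python
def bellman(P, r, gamma, v):
    """One application of the Bellman operator: (T v)(s) = r(s) + gamma * sum_s' P(s'|s) v(s')."""
    n = len(v)
    return [r[s] + gamma * sum(P[s][t] * v[t] for t in range(n)) for s in range(n)]

def order_k_value_equivalent(env, model, policies, functions, gamma, k, tol=1e-9):
    """Check that T_pi^k v agrees between environment and model for all
    pi in `policies` and v in `functions` (the order-k VE condition)."""
    for pi in policies:
        for v in functions:
            ve, vm = v, v
            for _ in range(k):
                ve = bellman(env[pi]['P'], env[pi]['r'], gamma, ve)
                vm = bellman(model[pi]['P'], model[pi]['r'], gamma, vm)
            if any(abs(a - b) > tol for a, b in zip(ve, vm)):
                return False
    return True

# A 2-state environment and a structurally different candidate model
# (illustrative numbers). Against the single function v = 0, both Bellman
# operators just return the rewards, so the model is order-1 VE; enlarging
# the function set, or increasing k, exposes the difference.
env = {'pi': {'P': [[0.5, 0.5], [0.5, 0.5]], 'r': [1.0, 0.0]}}
model = {'pi': {'P': [[1.0, 0.0], [0.0, 1.0]], 'r': [1.0, 0.0]}}
print(order_k_value_equivalent(env, model, ['pi'], [[0.0, 0.0]], 0.9, k=1))  # True
print(order_k_value_equivalent(env, model, ['pi'], [[0.0, 0.0], [1.0, 0.0]], 0.9, k=1))  # False
print(order_k_value_equivalent(env, model, ['pi'], [[0.0, 0.0]], 0.9, k=2))  # False: larger k shrinks the class
```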
In many real-world settings, a team of agents must coordinate their behaviour while acting in a decentralised way. At the same time, it is often possible to train the agents in a centralised fashion in a simulated or laboratory setting, where global state information is available and communication constraints are lifted. Learning joint action-values conditioned on extra state information is an attractive way to exploit centralised learning, but the best strategy for then extracting decentralised policies is unclear. Our solution is QMIX, a novel value-based method that can train decentralised policies in a centralised end-to-end fashion. QMIX employs a network that estimates joint action-values as a complex non-linear combination of per-agent values that condition only on local observations. We structurally enforce that the joint action-value is monotonic in the per-agent values, which allows tractable maximisation of the joint action-value in off-policy learning, and guarantees consistency between the centralised and decentralised policies. We evaluate QMIX on a challenging set of StarCraft II micromanagement tasks, and show that QMIX significantly outperforms existing value-based multi-agent reinforcement learning methods.
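The monotonicity constraint can be illustrated with a tiny hand-built mixer. The weights and Q-values below are made-up numbers, and the real QMIX mixer is a state-conditioned hypernetwork rather than fixed weights; the point of the sketch is only that non-negative mixing weights make each agent's greedy action also maximise the joint value, so decentralised argmaxes agree with the centralised one.

```python
import itertools

def mix(per_agent_qs, weights, bias):
    """Monotonic mixer: non-negative weights guarantee Q_tot is
    non-decreasing in each agent's Q-value."""
    assert all(w >= 0 for w in weights)
    return sum(w * q for w, q in zip(weights, per_agent_qs)) + bias

# Per-agent Q-values over 3 actions each (illustrative numbers).
q1 = [0.2, 1.5, -0.3]
q2 = [0.9, 0.1, 2.0]
weights, bias = [0.7, 1.3], 0.5

# Decentralised greedy actions, chosen from local Q-values only...
greedy = (max(range(3), key=q1.__getitem__), max(range(3), key=q2.__getitem__))
# ...coincide with the centralised argmax over all joint actions.
joint_best = max(itertools.product(range(3), range(3)),
                 key=lambda a: mix((q1[a[0]], q2[a[1]]), weights, bias))
print(greedy == joint_best)  # True: monotonicity makes these agree
```

With negative mixing weights this consistency can break, which is exactly why QMIX constrains the mixing-network weights to be non-negative.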
Many real-world problems, such as network packet routing and the coordination of autonomous vehicles, are naturally modelled as cooperative multi-agent systems. There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for such systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
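The counterfactual baseline is easy to state in code. The sketch below uses toy Q-values and a toy policy rather than the paper's learned critic: agent $a$'s advantage is its joint Q-value minus the expectation of that Q-value over $a$'s own actions, with all other agents' actions held fixed.

```python
def counterfactual_advantage(Q, joint_action, agent, policy):
    """COMA-style advantage:
    A^a(s, u) = Q(s, u) - sum_{u'} pi^a(u') Q(s, (u^{-a}, u')).
    `Q` maps joint-action tuples to values for the current state."""
    u = tuple(joint_action)
    baseline = 0.0
    for alt, prob in enumerate(policy):
        alt_u = u[:agent] + (alt,) + u[agent + 1:]  # swap only agent's action
        baseline += prob * Q[alt_u]
    return Q[u] - baseline

# Two agents with two actions each; illustrative joint Q-values for one state.
Q = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 3.0, (1, 1): 2.0}
pi_agent0 = [0.5, 0.5]  # agent 0's policy over its two actions
adv = counterfactual_advantage(Q, (1, 0), agent=0, policy=pi_agent0)
print(adv)  # 3.0 - (0.5*1.0 + 0.5*3.0) = 1.0
```

Because only agent 0's action varies while agent 1's stays fixed at 0, this baseline isolates agent 0's contribution, which is the credit-assignment role it plays in COMA.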
Previous work has shown the potential of deep learning to predict renal obstruction using kidney ultrasound images. However, these image-based classifiers have been trained with the goal of single-visit inference in mind. We compare methods from video action recognition (i.e. convolutional pooling, LSTM, TSM) to adapt single-visit convolutional models to handle multiple-visit inference. We demonstrate that incorporating images from a patient's past hospital visits provides only a small benefit for the prediction of obstructive hydronephrosis. Thus, while including prior ultrasounds is somewhat beneficial, prediction based on the latest ultrasound alone is sufficient for patient risk stratification.
Applying deep learning concepts from image detection and graph theory has greatly advanced protein-ligand binding affinity prediction, a challenge with enormous ramifications for both drug discovery and protein engineering. We build upon these advances by designing a novel deep learning architecture consisting of a 3-dimensional convolutional neural network utilizing channel-wise attention and two graph convolutional networks utilizing attention-based aggregation of node features. HAC-Net (Hybrid Attention-Based Convolutional Neural Network) obtains state-of-the-art results on the PDBbind v.2016 core set, the most widely recognized benchmark in the field. We extensively assess the generalizability of our model using multiple train-test splits, each of which maximizes differences between either protein structures, protein sequences, or ligand extended-connectivity fingerprints. Furthermore, we perform 10-fold cross-validation with a similarity cutoff between SMILES strings of ligands in the training and test sets, and also evaluate the performance of HAC-Net on lower-quality data. We envision that this model can be extended to a broad range of supervised learning problems related to structure-based biomolecular property prediction. All of our software is available as open source at https://github.com/gregory-kyro/HAC-Net/.
In recent years several learning approaches to point goal navigation in previously unseen environments have been proposed. They vary in the representations of the environments, problem decomposition, and experimental evaluation. In this work, we compare the state-of-the-art Deep Reinforcement Learning based approaches with a Partially Observable Markov Decision Process (POMDP) formulation of the point goal navigation problem. We adapt the POMDP sub-goal framework proposed by [1] and modify the component that estimates frontier properties by using partial semantic maps of indoor scenes built from images' semantic segmentation. In addition to the well-known completeness of the model-based approach, we demonstrate that it is robust and efficient in that it leverages informative, learned properties of the frontiers compared to an optimistic frontier-based planner. We also demonstrate its data efficiency compared to the end-to-end deep reinforcement learning approaches. We compare our results against an optimistic planner, ANS and DD-PPO on the Matterport3D dataset using the Habitat Simulator. We show comparable, though slightly worse, performance than the SOTA DD-PPO approach, yet with far less data.
State-of-the-art language models are often accurate on many question-answering benchmarks with well-defined questions. Yet, in real settings questions are often unanswerable without asking the user for clarifying information. We show that current SotA models often do not ask the user for clarification when presented with imprecise questions and instead provide incorrect answers or "hallucinate". To address this, we introduce CLAM, a framework that first uses the model to detect ambiguous questions, and if an ambiguous question is detected, prompts the model to ask the user for clarification. Furthermore, we show how to construct a scalable and cost-effective automatic evaluation protocol using an oracle language model with privileged information to provide clarifying information. We show that our method achieves a 20.15 percentage point accuracy improvement over SotA on a novel ambiguous question-answering data set derived from TriviaQA.
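The CLAM control flow is simple to express. In the sketch below the four callables (`classify_ambiguity`, `ask_clarification`, `answer`, `get_user_reply`) are stand-ins of our own invention for prompted language-model calls and the user (or the oracle model during evaluation); the real framework implements each via prompting.

```python
def clam_answer(question, classify_ambiguity, ask_clarification, answer, get_user_reply):
    """CLAM-style pipeline: detect ambiguity, optionally ask the user for
    clarification, then answer the (possibly clarified) question."""
    if classify_ambiguity(question):
        clarifying_q = ask_clarification(question)
        reply = get_user_reply(clarifying_q)
        question = f"{question} (Clarification: {reply})"
    return answer(question)

# Toy stand-ins for the model and user, purely for illustration:
# a question is "ambiguous" if it contains an unresolved pronoun.
is_ambiguous = lambda q: "he" in q.split()
clarify = lambda q: "Who do you mean by 'he'?"
answerer = lambda q: f"ANSWER[{q}]"
user = lambda cq: "Alan Turing"

print(clam_answer("When was he born?", is_ambiguous, clarify, answerer, user))
print(clam_answer("When was Turing born?", is_ambiguous, clarify, answerer, user))
```

An unambiguous question flows straight to the answering call, while an ambiguous one acquires a clarification before answering, mirroring the two paths described above.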
It is known that neural networks have the problem of being over-confident when directly using the output label distribution to generate uncertainty measures. Existing methods mainly resolve this issue by retraining the entire model to impose the uncertainty quantification capability so that the learned model can achieve desired performance in accuracy and uncertainty prediction simultaneously. However, training the model from scratch is computationally expensive and may not be feasible in many situations. In this work, we consider a more practical post-hoc uncertainty learning setting, where a well-trained base model is given, and we focus on the uncertainty quantification task at the second stage of training. We propose a novel Bayesian meta-model to augment pre-trained models with better uncertainty quantification abilities, which is effective and computationally efficient. Our proposed method requires no additional training data and is flexible enough to quantify different uncertainties and easily adapt to different application settings, including out-of-domain data detection, misclassification detection, and trustworthy transfer learning. We demonstrate our proposed meta-model approach's flexibility and superior empirical performance on these applications over multiple representative image classification benchmarks.
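The post-hoc setting can be sketched with a deliberately simplified stand-in: instead of the paper's Bayesian meta-model, the toy below fits a tiny logistic regression on frozen base-model outputs (here, a synthetic max-softmax-confidence feature) to predict misclassification. The data, feature, and meta-model choice are all assumptions for illustration; only the workflow (base model frozen, small second-stage model trained to quantify uncertainty) reflects the setting described above.

```python
import math
import random

def train_meta_model(features, errors, lr=0.5, epochs=1000):
    """Fit a tiny logistic 'meta-model' p(error | base-model features) by
    stochastic gradient descent; the base model itself is never touched."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, errors):
            p = 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(meta, x):
    w, b = meta
    return 1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))

# Synthetic second-stage data: the frozen base model's max softmax
# probability, where low-confidence predictions tend to be the wrong ones.
random.seed(0)
feats = [[random.uniform(0.5, 1.0)] for _ in range(200)]
errs = [1 if f[0] < 0.7 else 0 for f in feats]
meta = train_meta_model(feats, errs)
# The meta-model ranks low-confidence inputs as more likely to be errors,
# without any retraining of the base model.
print(predict(meta, [0.55]) > predict(meta, [0.95]))
```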